Extraction of table data with usual and unusual structure in the PDF. #1087
Replies: 3 comments 2 replies
-
Hi @Sylvester-Anthony, and thanks for the kind words about pdfplumber. When you say, "is there a way to get all this detected text in any form," would something as simple as
-
Thank you for the suggestion, @jsvine! I have one more hiccup: page 42 of the PDF has a tabular structure similar to page 43. The data on page 43 gets detected, while the data on page 42 is not considered a table when using page.extract_tables(). Any idea what could be causing this, and are there some settings I could look at?
-
The major difference I see is that on page 43 the tables are detected, albeit in parts rather than as a whole, whereas the table on page 42 is not detected at all. My plan was to extract the content of pages where tables are detected, but that misses page 42.
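One thing that might be worth trying here is a fallback: call page.extract_tables() with its defaults first, and only if nothing is found, retry with the "text" strategies, which infer cell boundaries from word alignment instead of drawn lines. This is a minimal sketch, not a confirmed fix: I have not checked why page 42 is missed, and the helper name `extract_tables_with_fallback` and the `TEXT_SETTINGS` constant are made up for illustration.

```python
# Hypothetical helper: try the default (line-based) table detection first,
# then fall back to text-alignment-based detection if nothing is found.
# Assumption: the undetected table has no ruling lines, which is unverified.

TEXT_SETTINGS = {
    "vertical_strategy": "text",    # infer column edges from word alignment
    "horizontal_strategy": "text",  # infer row edges from word alignment
}


def extract_tables_with_fallback(page, fallback_settings=TEXT_SETTINGS):
    """Return (tables, strategy_used) for a pdfplumber page-like object."""
    tables = page.extract_tables()
    if tables:
        return tables, "default"
    # Nothing found with the defaults, so retry with the fallback settings.
    tables = page.extract_tables(table_settings=fallback_settings)
    return tables, "text"


# Example usage (requires pdfplumber and a local copy of the PDF):
#
#   import pdfplumber
#   with pdfplumber.open("report.pdf") as pdf:
#       page = pdf.pages[41]  # pages are 0-indexed, so this is page 42
#       tables, strategy = extract_tables_with_fallback(page)
```

The "text" strategies can over-segment ordinary paragraphs into spurious tables, so the result is best checked page by page rather than trusted blindly.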
-
Hello @jsvine, love the work with pdfplumber. I have been experimenting with extracting table data from PDFs. The problem is that the PDFs have both properly structured tables and tables that are not properly structured. I was able to extract all data from the table, but the detection of columns was very irregular, and my dataframe was something like this:
As you can see, I am getting the data from the columns, but the content after the first set of rows is not completely available.
My end goal is to get all the data in full from the table; it need not exclusively be a dataframe. I saw this piece of code in other discussions, where the full text gets detected:
And the result was that all the text was detected, which is exactly my end goal:
So my question is: is there a way to get all this detected text in any form? The end goal is to get all the data possible from a variety of PDFs, not just one table. Experimenting with the various page settings works for one table but doesn't work for the other.
The PDF is:
2022 Sustainability Report_NYSE_WM_2022.pdf
Looking forward to your suggestions. Thank you!
-Sylvester
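To make "all the detected text in any form" concrete, here is a rough sketch of one possible approach: take the word dictionaries that page.extract_words() returns (each has "text", "x0", and "top" keys, among others) and group them into rows by vertical position, without relying on table detection at all. The function names `words_to_rows` and `extract_rows` and the 3-point tolerance are assumptions for illustration, not part of pdfplumber.

```python
# Hypothetical helpers: rebuild rows of text from word positions.
# Only the "text", "x0" and "top" keys of each word dict are used.


def words_to_rows(words, y_tolerance=3):
    """Group word dicts into rows; each row is a list of strings, left to right."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # A word joins the current row if it sits close to that row's top edge.
        if rows and abs(word["top"] - rows[-1]["top"]) <= y_tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"top": word["top"], "words": [word]})
    return [
        [w["text"] for w in sorted(row["words"], key=lambda w: w["x0"])]
        for row in rows
    ]


def extract_rows(pdf_path, page_number):
    """Open a PDF and return the text rows of one page (1-indexed)."""
    import pdfplumber  # imported here so words_to_rows works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        return words_to_rows(page.extract_words())
```

This keeps every word on the page, at the cost of losing column boundaries; splitting each row into cells would need an extra step based on the gaps between "x0" values.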