Extraction of table data with usual and unusual structure in the PDF. #1087

Sylvester-Anthony · 2024-02-02T15:57:17Z

Sylvester-Anthony
Feb 2, 2024

Hello @jsvine, love the work on pdfplumber. I have been experimenting with extracting table data from PDFs. The problem is that the PDFs contain both properly structured tables and tables that are not properly structured. I was able to extract all the data from the table, but column detection was very irregular, and my dataframe looked something like this:

As you can see, I am getting the data from the columns, but the content after the first set of rows is not completely available.

My end goal is to get all the data in full from the table; it needn't exclusively be a dataframe. I saw this piece of code in another discussion, where the full text gets detected:

(
    im.reset()
    .draw_rects(p0.extract_words(keep_blank_chars=False))
    .draw_rects(p0.extract_words(keep_blank_chars=True), stroke="blue", fill=None)
)

The result was that all the text got detected, which is exactly my end goal:

So my question is: is there a way to get all this detected text in any form? The end goal is to get all the data possible from a variety of PDFs, not just one table. Experimenting with the various page settings works for one table, but it doesn't work for the other.

The PDF is:
2022 Sustainability Report_NYSE_WM_2022.pdf

Looking forward to your suggestions. Thank you!

-Sylvester

jsvine · 2024-02-10T23:08:37Z

jsvine
Feb 10, 2024
Maintainer

Hi @Sylvester-Anthony, and thanks for the kind words about pdfplumber. Unfortunately, I'm not sure I fully understand your question.

When you say, "is there a way to get all this detected text in any form," would something as simple as page.extract_text(layout=True) suffice? Or do you mean a more structured form?
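
To make the "more structured form" option concrete, here is a minimal sketch rather than anything built into pdfplumber. The helper name `group_words_into_lines` and its `y_tolerance` default are hypothetical, chosen for illustration. It assumes only that each word is a dict with `text`, `x0`, and `top` keys, which is the shape `page.extract_words()` returns.

```python
def group_words_into_lines(words, y_tolerance=3):
    """Group word dicts into visual lines by their `top` coordinate.

    Hypothetical helper: `words` is assumed to look like the output of
    pdfplumber's page.extract_words(), i.e. dicts with "text", "x0", "top".
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Join the current line if the word is vertically close to its first word.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right and keep only the text.
    return [[w["text"] for w in sorted(line, key=lambda w: w["x0"])] for line in lines]


# Usage sketch (needs pdfplumber and a local copy of the PDF; the path is a placeholder):
# import pdfplumber
# with pdfplumber.open("report.pdf") as pdf:
#     page = pdf.pages[0]
#     print(page.extract_text(layout=True))  # plain text with layout preserved
#     rows = group_words_into_lines(page.extract_words())
```

The grouping is deliberately naive: it only clusters by vertical position, so it yields rows of words, not table cells.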


Sylvester-Anthony · 2024-02-14T08:09:33Z

Sylvester-Anthony
Feb 14, 2024
Author

Thank you for the suggestion, @jsvine! I've hit one more hiccup: if you look at page 42 of the PDF, it has a tabular structure similar to page 43. The data on page 43 gets detected, while the data on page 42 is not considered a table when using page.extract_tables(). Any idea what could be causing this, and are there settings I could look at?

1 reply

jsvine Feb 14, 2024
Maintainer

My first suggestion would be to examine the results of page.to_image().debug_tablefinder() for both those pages. Do you see any major differences?
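
One way to run that comparison side by side is sketched below. `summarize_page` is a hypothetical helper, not part of pdfplumber. It assumes the object passed in behaves like a pdfplumber page, with `find_tables()` and an `edges` list whose items carry an "orientation" key. The file name is a placeholder, and the page numbers are assumed to be 1-based positions in the file.

```python
def summarize_page(page):
    """Return counts that help compare how the table finder sees a page.

    Assumes `page` behaves like a pdfplumber Page: it has `find_tables()`
    and an `edges` list whose items may carry an "orientation" key.
    """
    edges = page.edges
    return {
        "tables": len(page.find_tables()),
        "h_edges": sum(1 for e in edges if e.get("orientation") == "h"),
        "v_edges": sum(1 for e in edges if e.get("orientation") == "v"),
    }


# Usage sketch (pdf.pages is 0-indexed, so page 42 is index 41):
# import pdfplumber
# with pdfplumber.open("report.pdf") as pdf:
#     for number in (42, 43):
#         page = pdf.pages[number - 1]
#         print(number, summarize_page(page))
#         page.to_image(resolution=100).debug_tablefinder().save(f"debug_page_{number}.png")
```

A page with many edges but zero detected tables would suggest the lines are present in the PDF but are not being combined into cells.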

Sylvester-Anthony · 2024-02-23T06:44:31Z

Sylvester-Anthony
Feb 23, 2024
Author

The major difference I see is that on page 43 the tables are detected, albeit in parts rather than as a whole, while the table on page 42 is not detected at all. My plan was to extract the content of pages where tables are detected, but that leaves out page 42.
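
A rough sketch of that per-page approach, with a fallback so that pages like 42 are not dropped entirely. `extract_page_content` is a hypothetical helper name. It assumes the object passed in behaves like a pdfplumber page, with `extract_tables()` and `extract_text(layout=True)`.

```python
def extract_page_content(page):
    """Return ("tables", tables) when tables are found, else ("text", text).

    Assumes `page` behaves like a pdfplumber Page.
    """
    tables = page.extract_tables()
    if tables:
        return "tables", tables
    # No table detected: fall back to the laid-out text of the whole page.
    return "text", page.extract_text(layout=True)
```

The fallback text is unstructured, so it would still need post-processing to recover rows and columns.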

1 reply

jsvine Mar 11, 2024
Maintainer

Thanks for flagging this @Sylvester-Anthony — unfortunately, it looks like you've run into a bug, which I've now opened an issue for: #1110

To grab those horizontal lines in the meantime, you can try:

horizontal_lines = (
    pdfplumber.table.merge_edges(
        pdfplumber.utils.filter_edges(page.edges, "h"),
        snap_x_tolerance=0,
        snap_y_tolerance=0,
        join_x_tolerance=-1,
        join_y_tolerance=0,
    )
)
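
One possible follow-up, offered as an assumption and not as the fix for the bug above: feed the y positions of those merged lines back to the table finder as explicit row separators. The helpers `row_boundaries` and `explicit_row_settings` are hypothetical. The settings keys (`horizontal_strategy`, `explicit_horizontal_lines`, `vertical_strategy`) are pdfplumber table settings, but the choice of "text" for the columns is a guess that may not suit this PDF.

```python
def row_boundaries(horizontal_edges, precision=1):
    """Return the sorted, de-duplicated `top` coordinates of horizontal edges."""
    return sorted({round(edge["top"], precision) for edge in horizontal_edges})


def explicit_row_settings(horizontal_edges):
    """Build table settings that use the edges' y positions as row separators."""
    return {
        "horizontal_strategy": "explicit",
        "explicit_horizontal_lines": row_boundaries(horizontal_edges),
        # Assumption: columns are inferred from text alignment.
        "vertical_strategy": "text",
    }


# Usage sketch, with `page` and `horizontal_lines` as in the snippet above:
# table = page.extract_table(explicit_row_settings(horizontal_lines))
```

The "explicit" strategy needs at least two lines, so this only makes sense when the merged edges cover more than one y position.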
