Extracting Text returns empty/None for a similar table #1099

lcabrera07 · 2024-02-23T06:55:56Z

lcabrera07
Feb 23, 2024

Hi, thank you for providing a Python PDF extraction library!

I am working on PDFs that contain an index I would like to extract. Extraction works on an index whose table lines are fully drawn, but it doesn't seem to work on an index whose table lines don't intersect. I have tried several variations of the config using the lines_strict strategy, but it still returns text only from the index with fully drawn lines. I had to use lines_strict because I think it's mistakenly perceiving the thicker vertical line as a table.

This image below shows two pdfs with the tool's debug image, the left without intersecting lines and the right with intersecting lines.
I got the left by using lines_strict and intersection_tolerance = 10 passed into extract_tables(). The right is what I get without any config passed into extract_tables().

I'm not sure why the horizontal lines are ignored in the left PDF, but I think that is acceptable if the entire table is treated as 3 columns, meaning 3 cells. My next guess is that the text may be too far apart to extract. I continue to get [['', '', '']].
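For readers following along, the setup described above can be sketched roughly as follows. The setting names (vertical_strategy, horizontal_strategy, intersection_tolerance) are pdfplumber table settings, but the exact values, the helper names, and the file paths are assumptions for illustration, not the poster's actual script.

```python
# Illustrative settings modeled on the post; the poster's exact config is not shown.
STRICT_SETTINGS = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
    "intersection_tolerance": 10,
}


def table_is_blank(table):
    """Return True when every cell of an extracted table is None or whitespace."""
    return all((cell or "").strip() == "" for row in table for cell in row)


def extract_non_blank_tables(page, settings=STRICT_SETTINGS):
    """Run page.extract_tables() with the settings and drop all-blank results."""
    tables = page.extract_tables(settings)
    return [table for table in tables if not table_is_blank(table)]


def debug_and_extract(path, image_path="debug.png"):
    """Hypothetical driver: save the table-finder debug image, then extract."""
    import pdfplumber  # third-party; imported lazily so the helpers above stay importable

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        # Draws the lines, intersections and cells found with these settings.
        page.to_image().debug_tablefinder(STRICT_SETTINGS).save(image_path)
        return extract_non_blank_tables(page)
```

A result such as [['', '', '']] is filtered out by table_is_blank, which makes it easier to tell "cells were found but contain no text" apart from "no table was found at all".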

The pdf on the right works fine and extracts what I need using extract_tables(). I get the following [['&', '15th 14:3,5,6\n16 116:21\n200:14\n16th 187:15\n17011...

Is this a config issue? I attempted to use horizontal/vertical strategy with text but the text is not aligned correctly so I think it's even harder.

Thanks in advance.

jsvine · 2024-03-02T23:14:31Z

jsvine
Mar 2, 2024
Maintainer

Hi @lcabrera07, and thanks for your kind words about pdfplumber. Each PDF has a different internal structure, so even two PDFs that look the same externally may require a different strategy. Can you provide a copy of the PDFs you're trying to extract?

0 replies

lcabrera07 · 2024-03-06T03:28:56Z

lcabrera07
Mar 6, 2024
Author

I am attaching two PDFs: one that I can extract text from and one that I cannot. I'll also attach the debug images that pdfplumber outputs. One note: when I say it cannot, I mean that the method extract_tables() returns an empty array with four empty string elements. I am using the configuration {"intersection_tolerance": 7, "text_keep_blank_chars": True} on both PDFs.
indexThatPdfPlumberCannotReadTextFrom.pdf
indexThatPdfPlumberCanReadTextFrom.pdf

3 replies

mkl-public Mar 6, 2024

Analyzing indexThatPdfPlumberCannotReadTextFrom.pdf, it turns out that the only static contents of that file are the cell borders; all the text is contained in Watermark (!) annotations.

Text extractors (both for unstructured and structured - tabular - text) usually only extract the text from the static content. And if they look at text contained in annotations at all, they are likely to avoid Watermark annotations as watermarks usually only disturb text extraction.

I don't see why one would put that main text into an annotation, let alone a Watermark one, unless one wants to prevent that text from being extracted.
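A quick way to check for this situation from pdfplumber is to compare the number of characters on a page with the annotation subtypes found there. The sketch below assumes that each entry of page.annots carries the raw annotation dictionary under a "data" key and that its "Subtype" is either a plain string or a pdfminer literal with a .name attribute; the helper names are hypothetical.

```python
from collections import Counter


def annotation_subtype(annot):
    """Best-effort subtype name (e.g. "Watermark") of an annotation dict.

    Assumption: the raw PDF annotation dictionary is stored under annot["data"].
    """
    subtype = (annot.get("data") or {}).get("Subtype")
    name = getattr(subtype, "name", subtype)  # unwrap a pdfminer literal if present
    if isinstance(name, bytes):
        name = name.decode("latin-1")
    return name


def count_annotation_subtypes(annots):
    """Count annotations per subtype name."""
    return Counter(annotation_subtype(annot) for annot in annots)


def report(path):
    """Hypothetical driver: print char count and annotation subtypes per page."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            subtypes = dict(count_annotation_subtypes(page.annots))
            print(page.page_number, len(page.chars), subtypes)
```

A page that reports zero chars but many Watermark annotations would match the analysis above.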

lcabrera07 Mar 7, 2024
Author

Thanks for the response. This is interesting. I did notice that the entire text content is removed when I redact a similar PDF that also doesn't return text with extract_tables(). Would I need another PDF library that is geared more toward OCR to extract text from PDFs like these?

mkl-public Mar 7, 2024

Actually, I cannot speak for pdfplumber here (I'm mostly interested in interesting examples to analyze and test in other contexts).
What you should look for is an annotation flattening feature. Flattening annotations means adding them to the static content and removing their interactivity. So after flattening the annotations, pdfplumber's table recognition should work properly.
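As one possible route, pikepdf exposes a Pdf.flatten_annotations() method. The sketch below is an assumption about how it could be applied here; the thread does not name a flattening tool, whether this handles the Watermark annotations in the attached file is untested, and the function names and paths are hypothetical.

```python
def flatten_annotations(pdf, out_path):
    """Flatten annotations on an open pikepdf.Pdf and save the result.

    `pdf` only needs flatten_annotations() and save(), which pikepdf.Pdf provides.
    """
    pdf.flatten_annotations()  # pikepdf's default mode flattens all annotations
    pdf.save(out_path)
    return out_path


def flatten_file(in_path, out_path):
    """Hypothetical driver: open with pikepdf, flatten, and write a new file."""
    import pikepdf  # third-party; imported lazily

    with pikepdf.open(in_path) as pdf:
        return flatten_annotations(pdf, out_path)
```

The flattened output file could then be opened with pdfplumber and passed through extract_tables() as before.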

lcabrera07 · 2024-03-07T02:01:03Z

lcabrera07
Mar 7, 2024
Author

Here is another question...

A PDF I have (attached below) is returning character arrays instead of words (see the image below). I have tried adjusting the config with different values of {x_tolerance=3, y_tolerance=3, x_tolerance_ratio}, but nothing changes when I use extract_tables(). Am I not using the correct configuration values for words?
I could just be confusing the configuration passed into the debug's extract_words() with the one passed into extract_tables().

wordsAsArrays.pdf
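On the configuration question: in pdfplumber, extract_words() takes x_tolerance and y_tolerance as keyword arguments, while extract_tables() takes a table-settings dict in which text-extraction options carry a text_ prefix (as with text_keep_blank_chars earlier in the thread). A minimal sketch of that mapping follows; the helper name is hypothetical, and it only shows where the values go, not that they resolve this particular PDF.

```python
def table_settings_from_word_kwargs(word_kwargs, base=None):
    """Copy extract_words()-style kwargs into a table-settings dict.

    Each key gets the "text_" prefix that table settings use for
    text-extraction options, e.g. x_tolerance -> text_x_tolerance.
    """
    settings = dict(base or {})
    for key, value in word_kwargs.items():
        settings[f"text_{key}"] = value
    return settings


word_kwargs = {"x_tolerance": 3, "y_tolerance": 3}
table_settings = table_settings_from_word_kwargs(
    word_kwargs, base={"intersection_tolerance": 7}
)
# With a pdfplumber page in hand, the two calls would then look like:
#   page.extract_words(**word_kwargs)
#   page.extract_tables(table_settings)
```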

1 reply

jsvine Mar 11, 2024
Maintainer

Re. the words-as-arrays: This seems to be caused by the PDF inserting a bunch of space (" ") characters between the text:

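import pdfplumber
pdf = pdfplumber.open("wordsAsArrays.pdf")
page = pdf.pages[0]
im = page.to_image()
# Outline every character that is a single space
im.reset().draw_rects([ c for c in page.chars if c["text"] == " " ])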

You can fix this by using page.filter:

without_spaces = page.filter(lambda obj: obj.get("text") != " ")
im2 = without_spaces.to_image()
im2.draw_rects(without_spaces.extract_words())
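Since the original question was about extract_tables(), the same filter can presumably be applied before table extraction as well. This is a sketch under that assumption; the function names are hypothetical, and the predicate is the one from the snippet above.

```python
def keep_non_space(obj):
    """Keep every page object except those whose text is exactly one space."""
    return obj.get("text") != " "


def extract_tables_without_spaces(page, table_settings=None):
    """Filter out lone-space characters, then run table extraction.

    `page` is expected to be a pdfplumber page; page.filter() returns a
    derived page that offers the same extraction methods.
    """
    filtered = page.filter(keep_non_space)
    return filtered.extract_tables(table_settings)
```

Objects without a "text" key, such as lines and rects, pass the predicate unchanged, so the cell borders used by the table finder are preserved.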

A more general observation: These indices are fairly different from traditional tables. Some of pdfplumber's lower-level methods (like the .extract_words(...) you note, as well as page.lines/page.rects) might be more useful for writing custom logic to extract the indices.
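As a rough illustration of what such custom logic might look like, the sketch below splits words into columns using the x positions of vertical rules and then groups each column's words into text lines. It operates on word dicts with the "text", "x0" and "top" keys that extract_words() returns; the function names, the tolerance, and the idea of taking column boundaries from page.lines are assumptions, not a method proposed in the thread.

```python
from bisect import bisect_right


def split_into_columns(words, boundaries):
    """Assign each word to a column based on sorted x positions of vertical rules."""
    columns = [[] for _ in range(len(boundaries) + 1)]
    for word in words:
        columns[bisect_right(boundaries, word["x0"])].append(word)
    return columns


def group_words_into_lines(words, y_tolerance=3):
    """Cluster words whose "top" values lie within y_tolerance into text lines."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


def index_columns(words, boundaries, y_tolerance=3):
    """Return one list of text lines per column."""
    return [
        group_words_into_lines(column, y_tolerance)
        for column in split_into_columns(words, boundaries)
    ]
```

With a real page, words could come from the filtered page's extract_words() and boundaries from the sorted x0 values of the vertical entries in page.lines, both of which would need checking against the actual PDFs.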
