Extracting Text returns empty/None for a similar table #1099
-
Hi @lcabrera07, and thanks for your kind words about
-
I am attaching two PDFs: one I can extract text from and one I cannot. I'll also attach the debug images that pdfplumber outputs. To be clear, by "cannot" I mean that extract_tables() returns an array containing four empty strings rather than the cell text. I am using the configuration {"intersection_tolerance": 7, "text_keep_blank_chars": True} on both PDFs.
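When comparing the two PDFs, it may help to check programmatically whether a result is "empty" in this sense. The helper below is a minimal sketch: the name is_empty_table is hypothetical (it is not part of pdfplumber), and it assumes a single table in the list-of-rows shape that extract_tables() returns, where each cell is a string or None.

```python
def is_empty_table(table):
    """Return True if every cell in the table is None, empty, or whitespace.

    `table` is assumed to be one table as returned by extract_tables():
    a list of rows, each row a list of cells (str or None).
    """
    return all(
        not (cell or "").strip()
        for row in table
        for cell in row
    )


# Shape reported for the failing PDF: one row of four empty strings.
failing = [["", "", "", ""]]
# A table with at least one non-empty cell.
working = [["&", "15th"]]

print(is_empty_table(failing), is_empty_table(working))
```

Running the same check on both PDFs with identical settings makes it easier to tell "table found but no text captured" apart from "no table found at all" (an empty outer list).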
-
Here is another question. A PDF I have (attached below) is returning arrays of individual characters instead of words (see the image below). I have tried adjusting the config with different values of {x_tolerance=3, y_tolerance=3, x_tolerance_ratio}, but nothing changes when I call extract_tables(). Am I using the wrong configuration keys for word extraction?
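One thing that may be worth checking: as far as the pdfplumber README describes it, table settings use text_-prefixed keys (for example text_x_tolerance and text_y_tolerance) rather than the bare x_tolerance / y_tolerance names that extract_words() accepts. The sketch below shows how such settings could be passed; the helper names are hypothetical, the tolerance values are placeholders, and whether larger tolerances actually merge the characters in this particular PDF is untested.

```python
def build_table_settings(x_tol=3, y_tol=3):
    """Build a table-settings dict using the text_-prefixed tolerance keys.

    The values are illustrative assumptions, not recommended defaults.
    """
    return {
        "text_x_tolerance": x_tol,
        "text_y_tolerance": y_tol,
    }


def extract_tables_from_page(pdf_path, table_settings, page_index=0):
    """Open the PDF and run extract_tables() on one page with the settings."""
    # Imported here so the settings helper can be used without pdfplumber.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(table_settings=table_settings)


# Example: try a wider horizontal tolerance so nearby characters may group.
settings = build_table_settings(x_tol=6, y_tol=3)
print(settings)
```

A call such as extract_tables_from_page("example.pdf", settings) would then apply these settings; "example.pdf" is a placeholder path.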
-
Hi, thank you for providing a Python PDF extraction library!
I am working on PDFs that have an index I would like to extract. Extraction works on an index with well-defined lines, but it doesn't seem to work on an index whose table lines don't intersect. I have tried several variations of the config, including the lines_strict strategy, but it still returns text only from the index with defined lines. I had to use lines_strict because I think pdfplumber is mistakenly treating the thicker vertical line as a table.
The image below shows the tool's debug output for two PDFs: the left one without intersecting lines and the right one with intersecting lines.
I got the left result by passing lines_strict and intersection_tolerance = 10 into extract_tables(). The right result is what I get without passing any config into extract_tables().
I'm not sure why the horizontal lines are ignored in the left PDF, but I think that would be acceptable if the entire table were treated as 3 columns, meaning 3 cells. My next guess is that the text may be too far apart to extract. I continue to get [['', '', '']].
The pdf on the right works fine and extracts what I need using extract_tables(). I get the following [['&', '15th 14:3,5,6\n16 116:21\n200:14\n16th 187:15\n17011...
Is this a config issue? I also tried setting the horizontal/vertical strategy to text, but the text is not aligned consistently, so I think that approach is even harder.
Thanks in advance.
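For anyone reproducing this, one possible diagnostic is to check whether the text actually falls inside the table region that pdfplumber detects. The sketch below is only an illustration under assumptions: the function name is hypothetical, and applying lines_strict to both the vertical and horizontal strategy is a guess at the configuration described above. It uses find_tables() to locate tables, then crops the page to each table's bounding box and extracts the text found there.

```python
# Settings approximating those described in the post (an assumption:
# the post does not say which strategy key lines_strict was applied to).
LINES_STRICT_SETTINGS = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
    "intersection_tolerance": 10,
}


def text_inside_detected_tables(pdf_path, page_index=0, table_settings=None):
    """Return (bbox, text) pairs for each table detected on one page.

    Empty text for a bbox suggests the detected region does not cover
    the characters, rather than a text-tolerance problem.
    """
    # Imported here so the settings dict can be inspected without pdfplumber.
    import pdfplumber

    settings = table_settings or LINES_STRICT_SETTINGS
    results = []
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        for table in page.find_tables(table_settings=settings):
            cropped = page.crop(table.bbox)
            results.append((table.bbox, cropped.extract_text()))
    return results
```

If the cropped text is non-empty while extract_tables() still yields [['', '', '']], the cell boundaries rather than the text settings would be the next thing to inspect in the debug image.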