Difficulty extracting 'cells' from PDF without edges #379
Replies: 4 comments 2 replies
-
Thanks for sharing the PDF, @alexreg. I don't think there is any straightforward, table-settings-only way of extracting the data as tables from this PDF. That is because the data is not really represented as a table; it is laid out as a two-column page. I would recommend you try the following table settings:
{
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "snap_tolerance": 5,
    "explicit_vertical_lines": [Decimal(page.width) * Decimal('0.1'), Decimal(page.width) * Decimal('0.5'), Decimal(page.width) * Decimal('0.9')],
    "intersection_x_tolerance": 10,
    "keep_blank_chars": True,
}
Since there are no lines separating the columns, I am using explicit vertical lines placed at fixed fractions of the page width. After extracting the table, you may apply some postprocessing logic to combine the rows correctly. Another alternative is to crop the page into two halves (left and right), extract the text from each half, and apply some postprocessing logic to get it into the form you want.
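For the cropping alternative, here is a minimal sketch. The helper names half_bboxes and extract_two_columns are hypothetical, and it assumes the page's coordinates start at (0, 0) and that the column gutter sits at the horizontal midpoint, which may not hold for every PDF. It only uses pdfplumber.open, page.crop and extract_text.

```python
def half_bboxes(page_width, page_height, split=0.5):
    """Return (left, right) bounding boxes as (x0, top, x1, bottom).

    Hypothetical helper: assumes the page origin is (0, 0) and that
    `split` is the fraction of the width where the column gutter sits.
    """
    if not 0 < split < 1:
        raise ValueError("split must be strictly between 0 and 1")
    mid = page_width * split
    left = (0, 0, mid, page_height)
    right = (mid, 0, page_width, page_height)
    return left, right


def extract_two_columns(pdf_path, page_number=0, split=0.5):
    """Crop one page into left/right halves and return the text of each."""
    # Imported lazily so half_bboxes can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        left_bbox, right_bbox = half_bboxes(page.width, page.height, split)
        # extract_text() may return None for an empty region, hence the "or".
        left_text = page.crop(left_bbox).extract_text() or ""
        right_text = page.crop(right_bbox).extract_text() or ""
    return left_text, right_text
```

Calling extract_two_columns("your.pdf") would give you the two columns as separate strings; any row-combining logic still has to be applied afterwards, and the split fraction may need adjusting for your layout.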
-
Hi @alexreg. I think @samkit-jain's answer is a reasonable one. That said, I'd also like to suggest another approach. As @samkit-jain noted, what you have isn't really a table in a strict sense, and so
-
Thanks very much, @samkit-jain and @jsvine. They both seem like good solutions. My best previous attempt was quite similar to @samkit-jain's, but missed a couple of settings that prevented it from working. What I'll probably do is have a play with both these solutions and see which is more reliable. Cheers!
-
Small side question: the accents are getting extracted as separate chars; how can I fix this?
-
This is the PDF I'm working with. It's proving rather troublesome to even get any table cells extracted from it. I've tried different values for horizontal_strategy and vertical_strategy to no avail.