extract_tables: Text in two vertically aligned cells is combined and separated with newline #1178
dev-guy
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
Without access to the PDF itself, it may prove difficult to provide suggestions. But as a first step, could you run |
This is a snapshot of a portion of a table that is problematic. It is on a page in landscape format.
Other rows in the table are handled correctly. I am unable to post the PDF for data protection reasons.
`extract_tables()` is called with no arguments. It returns a list that contains the following list: `['Moisture content\nPolysorbate 80', None, '✓', ], [None, None, 'NT', ]`
The two leftmost cells have been combined into one cell with the words in each cell concatenated and separated by a newline.
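As a stopgap while the root cause is unknown, the merged text can be split again after extraction. The following is only a minimal sketch: `split_merged_cells` is a hypothetical helper, not part of pdfplumber, and it assumes that a cell containing a newline directly above a `None` cell is always an unwanted merge. That assumption fails for cells that legitimately contain multi-line text above an empty cell, so it should be checked against the real table.

```python
from typing import List, Optional

Row = List[Optional[str]]


def split_merged_cells(rows: List[Row]) -> List[Row]:
    """Push text after the first newline down into an empty (None) cell below.

    Hypothetical post-processing helper; assumes newline + None below
    means two cells were merged by the extractor.
    """
    fixed = [list(row) for row in rows]  # copy so the input is left untouched
    for upper, lower in zip(fixed, fixed[1:]):
        for col in range(min(len(upper), len(lower))):
            cell = upper[col]
            if isinstance(cell, str) and "\n" in cell and lower[col] is None:
                first, rest = cell.split("\n", 1)
                upper[col] = first
                lower[col] = rest
    return fixed


rows = [
    ["Moisture content\nPolysorbate 80", None, "✓"],
    [None, None, "NT"],
]
print(split_merged_cells(rows))
```

This treats the symptom only; it does not explain why the horizontal line between the two cells was not used as a row boundary.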
A similar issue occurs with the following cells but they appear in the middle of the table:
I tried changing `snap_tolerance` and `join_tolerance` to no avail.
I tried changing `text_y_tolerance`. Values >= 17 do help, but there are other parts of the table that suffer from the same problem. Increasing `text_y_tolerance` further doesn't fix them. Values much greater than 17 (say 50) cause near-garbage to be returned.
Does anyone know why a horizontal line between cells would be ignored?
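One way to investigate is to check whether the visible rule between the two cells is present in the page's edge objects at all. The sketch below assumes edges are dicts with `x0`, `x1`, `top` and `bottom` keys, which is the shape pdfplumber uses for entries in `page.edges`. `horizontal_edges_between` is a hypothetical helper, and the `min_length` and `y_tol` defaults are assumptions that should be matched to the table settings actually in use (for example `edge_min_length`).

```python
from typing import Dict, Iterable, List

Edge = Dict[str, float]


def horizontal_edges_between(
    edges: Iterable[Edge],
    top: float,
    bottom: float,
    min_length: float = 3.0,  # assumed threshold; compare with edge_min_length
    y_tol: float = 1.0,       # assumed: how flat an edge must be to count as horizontal
) -> List[Edge]:
    """Return horizontal edges whose y position lies within [top, bottom]."""
    found = []
    for e in edges:
        is_horizontal = abs(e["bottom"] - e["top"]) <= y_tol
        long_enough = (e["x1"] - e["x0"]) >= min_length
        in_band = top <= e["top"] <= bottom
        if is_horizontal and long_enough and in_band:
            found.append(e)
    return sorted(found, key=lambda e: e["top"])


# Made-up sample data in the same shape as pdfplumber edge dicts.
sample_edges = [
    {"x0": 10, "x1": 200, "top": 100, "bottom": 100},  # row border
    {"x0": 10, "x1": 200, "top": 130, "bottom": 130},  # row border
    {"x0": 10, "x1": 11, "top": 115, "bottom": 115},   # too short to count
    {"x0": 10, "x1": 10, "top": 100, "bottom": 130},   # vertical edge
]
print(horizontal_edges_between(sample_edges, 105, 125))

# With pdfplumber this could be called as, for example:
#   horizontal_edges_between(page.edges, upper_word["bottom"], lower_word["top"])
# where upper_word and lower_word are entries from page.extract_words().
```

An empty result for the band between the two affected cells would suggest, though not prove, that the rule is not exposed as an edge the default line-based strategy can use, or that it is shorter than the minimum edge length.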
Default `extract_tables()` options: