extract_tables: Text in two vertically aligned cells is combined and separated with newline #1178

dev-guy · 2024-07-27T16:14:51Z


This is a snapshot of a portion of a table that is problematic. It is on a page in landscape format.

Other rows in the table are handled correctly. I am unable to post the PDF for data protection reasons.

extract_tables() is called with no arguments. It returns a list that contains the following list:

['Moisture content\nPolysorbate 80', None, '✓', ], [None, None, 'NT', ]

The two leftmost cells have been combined into one cell with the words in each cell concatenated and separated by a newline.

A similar issue occurs with the following cells but they appear in the middle of the table:

I tried changing snap_tolerance and join_tolerance to no avail.

I tried changing text_y_tolerance. Values >= 17 do help but there are other parts of the table that suffer from the same problem. Increasing text_y_tolerance further doesn't fix them. Values much greater than 17 (say 50) cause near-garbage to be returned.

Does anyone know why a horizontal line between cells would be ignored?

Default extract_tables() options:

{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}
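
A note on the `None` entries above: as far as I understand pdfplumber's behavior, a `None` axis-specific tolerance falls back to the matching base value (for example, `text_y_tolerance` falls back to `text_tolerance`). The sketch below only illustrates that assumed rule. `resolve_tolerances` and `DEFAULTS` are hypothetical names for illustration, not part of pdfplumber.

```python
# Hypothetical helper illustrating the assumed fallback rule; this is not pdfplumber code.
DEFAULTS = {
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}


def resolve_tolerances(settings):
    """Return a copy in which each None *_x_/_y_ tolerance takes its base value."""
    resolved = dict(settings)
    for key, value in settings.items():
        for axis in ("_x_tolerance", "_y_tolerance"):
            if key.endswith(axis) and value is None:
                # e.g. "text_y_tolerance" -> "text_tolerance"
                base = key[: -len(axis)] + "_tolerance"
                resolved[key] = settings[base]
    return resolved


# Overriding only the vertical text tolerance, as tried in the post above.
effective = resolve_tolerances({**DEFAULTS, "text_y_tolerance": 17})
```

Under that assumption, setting `text_y_tolerance` to 17 leaves `text_x_tolerance` and both intersection tolerances at their base value of 3.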

jsvine · 2024-08-02T20:58:09Z

Maintainer

Without access to the PDF itself, it may prove difficult to provide suggestions. But as a first step, could you run page.to_image().debug_tablefinder() (i.e., with the default settings) and share as much as possible of the resulting image?
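
For anyone following along, a minimal sketch of that debugging step might look like the following. The helper name `save_tablefinder_debug`, the output filename, and the resolution are assumptions for illustration. The pdfplumber calls used are `pdfplumber.open`, `page.to_image`, `debug_tablefinder`, and `save`.

```python
from pathlib import Path


def save_tablefinder_debug(pdf_path, page_number=0,
                           out_path="tablefinder_debug.png", resolution=150):
    """Render one page with the table-finder overlay and save it as an image."""
    # Third-party dependency, imported lazily so the helper can be defined without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]  # zero-based index into pdf.pages
        image = page.to_image(resolution=resolution)
        image.debug_tablefinder()  # no arguments, i.e. the default table settings
        image.save(out_path)
    return Path(out_path)
```

Calling `save_tablefinder_debug("example.pdf")` (a hypothetical path) would write the overlay for the first page. The overlay marks the lines, intersections, and cells that the table finder detected, so a horizontal rule that is being ignored should show up as a missing line between the two merged cells.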
