Keep blank lines at the start of cell #587
Hi @jhonatan-lopes, I appreciate your interest in the library. It seems you are using the default table extraction strategy. Have you tried using the following table settings?
{
"vertical_strategy": "lines",
"horizontal_strategy": "text"
}
That would make sure that the "CONCRETE" text and the "compressive" text come out in the same row.
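For reference, here is a minimal sketch of how these settings might be passed to the extraction call. The helper name extract_rows is hypothetical, and page is assumed to be a pdfplumber Page (anything with a compatible extract_table method will do); the file path in the usage comment is a placeholder.

```python
# Settings from the reply above: ruled lines delimit columns,
# text alignment delimits rows.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
}


def extract_rows(page, settings=TABLE_SETTINGS):
    """Return the largest table on `page` as a list of rows.

    `page` is assumed to be a pdfplumber Page. extract_table returns
    None when no table is found, so fall back to an empty list.
    """
    rows = page.extract_table(table_settings=settings)
    return rows or []


# Usage (assumed file name; page 2 of the PDF is index 1):
# import pdfplumber
# with pdfplumber.open("example.pdf") as pdf:
#     rows = extract_rows(pdf.pages[1])
```

Whether this keeps the two lines in the same row depends on how the text is laid out in the PDF, so the result is worth checking visually.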
Hi @samkit-jain, thank you for taking the time to provide support for this library. I really appreciate it.
Yes, I have tried using … If there is no way of extracting those characters, I could try parsing it once with default settings to get the proper header and another time with …
I should point out that I need to repeat this process for several tables on different pages of a PDF, and again for several different PDFs, so I would like to keep the method as general as possible.
Hello,
I am trying to parse the second table (from the top) from page 2 of this PDF.
I can extract the table via the following:
and convert it to a pandas DataFrame through:
My issue is with the first lines of the content. As seen in the screenshot below, the line "Compressive strength of cubes..." should be aligned with "CONCRETE - hardened".
However, the table is parsed without any extra characters or newlines to indicate that, so the line "Compressive strength of cubes..." ends up aligned with "PHYSICAL PROPERTIES" in the DataFrame/CSV.
Using im.reset().draw_rects(page.chars), some characters appear in the empty spaces, but these are not parsed by the table extraction algorithm. Is there a way of forcing the algorithm to parse those spaces?
As an alternative, do you know a way of parsing the table content separately from the header, i.e. with different table_settings? If so, I could use vertical_strategy = text just for the body of the table, but it messes up the header.
Many thanks
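To illustrate the alternative described above, here is a rough sketch of extracting the header and the body with different settings by cropping the page into two regions first. This is only an assumption about how it could be done, not a confirmed solution: extract_header_and_body and split_y are hypothetical names, and split_y (the distance from the top of the page at which the header ends) would have to be found separately for each table.

```python
def extract_header_and_body(page, split_y, header_settings, body_settings):
    """Extract the region above `split_y` and the region below it
    with separate table settings, then concatenate the rows.

    `page` is assumed to be a pdfplumber Page. `split_y` must lie
    inside the page's bounding box, otherwise the crop may fail.
    """
    x0, top, x1, bottom = page.bbox

    # Header region: from the top of the page down to split_y.
    header_page = page.crop((x0, top, x1, split_y))
    header = header_page.extract_table(table_settings=header_settings) or []

    # Body region: from split_y down to the bottom of the page.
    body_page = page.crop((x0, split_y, x1, bottom))
    body = body_page.extract_table(table_settings=body_settings) or []

    # The two parts may have different column counts; no alignment is done here.
    return header + body


# Usage (all values are placeholders):
# rows = extract_header_and_body(
#     page,
#     split_y=300,
#     header_settings={},                            # defaults for the header
#     body_settings={"vertical_strategy": "text"},   # text-based columns for the body
# )
```

Because the two regions are parsed independently, the header and body rows may need to be reconciled afterwards before building the DataFrame.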