Keep blank lines at the start of cell #587

jhonatan-lopes · 2022-01-20T17:27:23Z

Hello,

I am trying to parse the second table (from the top) from page 2 of this PDF.

I can extract the table via the following:

import pdfplumber

with pdfplumber.open('0013Testing-Multiple.pdf') as pdf:
    page = pdf.pages[1]  # page 2 (pages are zero-indexed)
    table = page.find_tables()[1].extract()  # second table from the top

and convert it to a pandas DataFrame with:

import pandas as pd

cols = [col.replace('\n', '') for col in table[0]]
df = pd.DataFrame(table[1:], columns=cols)
df.to_csv('Table_0013Testing-Multiple.csv')

My issue is with the first rows of the content. As shown in the screenshot below, the line "Compressive strength of cubes..." should be aligned with "CONCRETE - hardened".

However, the table is parsed without any extra characters or newlines to indicate that, so the line "Compressive strength of cubes..." ends up aligned with "PHYSICAL PROPERTIES" in the DataFrame/CSV.

Using im.reset().draw_rects(page.chars), I can see that some characters appear in the empty spaces, but the table extraction algorithm does not parse them:

Is there a way of forcing the algorithm to parse those spaces?

As an alternative, do you know a way of parsing the table content separately from the header, i.e. with different table_settings? If so, I could use vertical_strategy = text just for the body of the table, but it messes up the header.

Many thanks

samkit-jain · 2022-01-21T19:08:20Z

Collaborator

Hi @jhonatan-lopes, I appreciate your interest in the library. It seems you are using the default table extraction strategy. Have you tried using the following table settings?

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "text"
}

That should make sure that the "CONCRETE" text and the "Compressive" text end up in the same row.
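For reference, a minimal sketch of how such settings could be passed to the table finder. The helper name `extract_table_rows` and its defaults are hypothetical; `page` is assumed to be a pdfplumber page such as `pdf.pages[1]` from the original post. Note that the table index is an assumption too: a different strategy may detect a different number of tables on the page.

```python
# Settings suggested above: ruling lines for columns, text alignment for rows.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
}


def extract_table_rows(page, table_index=1, table_settings=None):
    """Return the rows of one table found with the given settings.

    `page` is assumed to be a pdfplumber page. `table_index` may need
    adjusting, since the number of detected tables depends on the settings.
    """
    settings = table_settings or TABLE_SETTINGS
    tables = page.find_tables(table_settings=settings)
    if table_index >= len(tables):
        raise IndexError(
            f"only {len(tables)} table(s) found with these settings"
        )
    return tables[table_index].extract()
```

With the file from the original post this would be called as `extract_table_rows(pdf.pages[1])` inside the `with pdfplumber.open(...) as pdf:` block.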

0 replies

jhonatan-lopes · 2022-01-21T19:35:57Z

Author

Hi @samkit-jain, thank you for taking the time to provide support for this library. I really appreciate it.

Yes, I have tried using "horizontal_strategy": "text" in the table settings, but in that case I end up with a messy header:

If there is no way of extracting those characters, I could parse the table once with the default settings to get the proper header, and a second time with "horizontal_strategy": "text" to get the core content while discarding the messy header, but this doubles the work.

I should also note that I need to repeat this process for several tables on different pages of a PDF, and for several different PDFs, so I would like to keep the method as general as possible.
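One way the two-pass idea above might look is sketched below. It is only a sketch under several assumptions: the helper name `extract_header_and_body` and its parameters are hypothetical, `page` is assumed to be a pdfplumber page, and the table found with the default settings is assumed to expose `bbox` and `rows` (each row with its own `bbox`). The header comes from the default parse; the page is then cropped to the area below the header and only that area is parsed with the text-based row strategy, so the messy header never has to be discarded by hand.

```python
# Settings for the body only; the header is taken from the default parse.
BODY_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
}


def extract_header_and_body(page, table_index=1, n_header_rows=1,
                            body_settings=None):
    """Parse header rows with default settings and the body with text rows.

    Bounding boxes are (x0, top, x1, bottom) tuples.
    """
    settings = body_settings or BODY_SETTINGS

    # Pass 1: default settings give a clean header.
    table = page.find_tables()[table_index]
    header = table.extract()[:n_header_rows]

    # The body starts where the last header row ends.
    x0, _top, x1, bottom = table.bbox
    header_bottom = table.rows[n_header_rows - 1].bbox[3]

    # Pass 2: parse only the region below the header.
    body_page = page.crop((x0, header_bottom, x1, bottom))
    body = body_page.extract_table(table_settings=settings) or []
    return header, body
```

Because the helper takes the table index and the number of header rows as arguments, the same call could be looped over several pages and files, though both values would still need checking per document layout.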

1 reply

samkit-jain Jan 23, 2022
Collaborator

What if instead of parsing twice, you do some post-processing and merge the first 3-4 rows that constitute the header?
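A minimal sketch of that post-processing step, assuming the extracted table is a list of rows whose cells are strings or `None`, as returned by `extract()`. The helper name `merge_header_rows` is hypothetical, and the number of header rows (3 or 4 here) has to be supplied by the caller.

```python
def merge_header_rows(table, n_header_rows):
    """Collapse the first `n_header_rows` rows into a single header row.

    Cells are merged column by column: empty or None cells are skipped and
    the remaining fragments are joined with a single space.
    Returns (header, body).
    """
    if not 1 <= n_header_rows <= len(table):
        raise ValueError("n_header_rows must be between 1 and len(table)")

    header_rows = table[:n_header_rows]
    body = table[n_header_rows:]
    n_cols = max(len(row) for row in header_rows)

    header = []
    for col in range(n_cols):
        parts = []
        for row in header_rows:
            cell = row[col] if col < len(row) else None
            if cell and cell.strip():
                # Normalise embedded newlines and repeated spaces.
                parts.append(" ".join(cell.split()))
        header.append(" ".join(parts))
    return header, body
```

The result can be fed into the DataFrame construction from the original post, for example `header, body = merge_header_rows(table, 3)` followed by `pd.DataFrame(body, columns=header)`.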
