Problem with table extraction #589

IrinaOganezova · 2022-01-25T12:38:58Z

The lines have different lengths. Extracting the table from the PDF with the following parameters

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "intersection_tolerance": 15,
    "snap_tolerance": 3,
}

results in

0 None 24
1 None

Please help with a solution.

IrinaOganezova (Author) · 2022-01-25T13:19:54Z

https://www.dgo.gov.pt/execucaoorcamental/SintesedaExecucaoOrcamentalMensal/2006/Janeiro/0106-bol.pdf

page 23

jsvine (Maintainer) · 2022-01-27T00:45:55Z

Hi @IrinaOganezova, and thanks for providing the PDF. Parsing this in an automated way might be tricky (though possible, if you programmatically identify the location of the columns and headers), but a more manual approach like this could work:

table_settings = {
    "horizontal_strategy": "text",
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": [ 70, 280, 325, 370, 410, 450, 488 ]
}

That will identify a table that contains your data, although you might also want to clean it up by first using page.crop(...).
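For reference, here is a minimal sketch of how those settings and a crop might be combined. The helper names (`build_table_settings`, `vertical_band`, `extract_rows`) are illustrative and not part of pdfplumber, the `top` and `bottom` values are placeholders that would have to be measured on the actual page, and treating "page 23" as `pdf.pages[22]` is an assumption about how the page was counted.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

# x-coordinates of the column boundaries, copied from the settings above.
COLUMN_EDGES = [70, 280, 325, 370, 410, 450, 488]

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def build_table_settings(column_edges: Sequence[float]) -> Dict[str, Any]:
    """Table settings: rows inferred from text, columns given explicitly."""
    return {
        "horizontal_strategy": "text",
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": list(column_edges),
    }


def vertical_band(page_bbox: BBox, top: float, bottom: float) -> BBox:
    """Full-width crop box between `top` and `bottom`, clamped to the page."""
    x0, page_top, x1, page_bottom = page_bbox
    return (x0, max(top, page_top), x1, min(bottom, page_bottom))


def extract_rows(
    pdf_path: str,
    page_number: int,
    top: float,
    bottom: float,
    column_edges: Sequence[float] = COLUMN_EDGES,
) -> Optional[List[List[Optional[str]]]]:
    """Crop away everything above/below the table, then extract it."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is zero-indexed
        cropped = page.crop(vertical_band(page.bbox, top, bottom))
        return cropped.extract_table(build_table_settings(column_edges))
```

Calling `extract_rows` with the path to the downloaded PDF, `page_number=23`, and `top`/`bottom` values that bracket the table should return a list of rows, or `None` if no table is found. Cropping only vertically keeps the explicit column lines inside the cropped area.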

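As for the automated route mentioned above, one possible way to locate the columns programmatically is to look for horizontal whitespace gaps between words. The sketch below is only an illustration of that idea, not something taken from pdfplumber itself: the function name `column_edges_from_words` and the `min_gap` threshold are hypothetical, and it assumes word dictionaries with `x0` and `x1` keys, which is the shape `page.extract_words()` returns.

```python
from typing import Dict, Iterable, List


def column_edges_from_words(
    words: Iterable[Dict[str, float]], min_gap: float = 8.0
) -> List[float]:
    """Guess column boundaries from the horizontal gaps between words.

    Words closer together than `min_gap` are merged into one occupied
    interval; a boundary is placed in the middle of every remaining gap,
    plus one at each outer edge.
    """
    spans = sorted((float(w["x0"]), float(w["x1"])) for w in words)
    if not spans:
        return []

    # Merge overlapping or nearly touching spans into occupied intervals.
    merged = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])

    edges = [merged[0][0]]
    for left, right in zip(merged, merged[1:]):
        edges.append((left[1] + right[0]) / 2)  # midpoint of the gap
    edges.append(merged[-1][1])
    return edges


# Synthetic word positions (not taken from the PDF in this thread).
sample_words = [
    {"x0": 70, "x1": 120},
    {"x0": 75, "x1": 200},
    {"x0": 300, "x1": 320},
    {"x0": 305, "x1": 325},
    {"x0": 400, "x1": 440},
]
edges = column_edges_from_words(sample_words)
```

The resulting list could be passed as `"explicit_vertical_lines"`. A heading or label that stretches across several columns would bridge the gaps and hide boundaries, so in practice the words would probably need to be restricted to the table body first, for example by cropping the page.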

IrinaOganezova (Author) · 2022-01-28T12:10:32Z

Thank you very much for the help!
