support for tagged pdfs? <StructTreeNode> #54

mllife · 2024-11-12T05:29:42Z

I have been working with PDFs for some time, but I recently came across tagged PDFs and read that they have a data structure, StructTreeNode. I would like to know whether you can add support for it, i.e. low-level handling of this case. My knowledge of tagged PDFs is limited.
So, I have a couple of questions:

Is it possible to dump it into an XML-like structure, so that it is easy for me to build a parser on top of it to extract tables and other important tagged structures?
Can I get bounding boxes for these structures from the StructTreeRoot itself, so that I can link them back to the PDF page, as we can do with PDF parsers?
Goal: to convert PDFs to simple text or a JSON structure while utilizing the information from tagging.
My introduction to tagged PDFs was this: https://accessible-pdf.info/en/basics/general/overview-of-the-pdf-tags/#table-elements
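
To make the XML question concrete, here is a minimal sketch of what such a dump could look like. It is not docling-parse code and does not read a PDF: it assumes the structure tree has already been loaded by some PDF library into plain Python dicts that keep the PDF key names (`/S` for the structure type, `/K` for the kids, `/MCID` for marked-content IDs, and the optional layout attribute `/BBox` under `/A`). The function name `struct_node_to_xml`, the `mcid` element, the `bbox` attribute, and the `example` data are all hypothetical. Note that `/BBox` is optional and many tagged files omit it; in that case a bounding box would have to be derived by locating each MCID in the page content stream, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET


def struct_node_to_xml(node, parent=None):
    """Convert one structure-tree node (a plain dict) into an XML element."""
    # "/S" holds the structure type (e.g. "/Table"); the tree root has no "/S".
    tag = str(node.get("/S", "/StructTreeRoot")).lstrip("/")
    elem = ET.Element(tag) if parent is None else ET.SubElement(parent, tag)

    # Assumption: "/A" is a single attribute dict. Real files may use an array.
    attrs = node.get("/A")
    if isinstance(attrs, dict) and "/BBox" in attrs:
        elem.set("bbox", " ".join(str(v) for v in attrs["/BBox"]))

    kids = node.get("/K", [])
    if not isinstance(kids, list):
        kids = [kids]  # "/K" may be a single kid instead of an array
    for kid in kids:
        if isinstance(kid, int):
            # A bare integer is a marked-content ID on the element's page.
            ET.SubElement(elem, "mcid").text = str(kid)
        elif isinstance(kid, dict) and kid.get("/Type") == "/MCR":
            # Marked-content reference dictionary.
            ET.SubElement(elem, "mcid").text = str(kid["/MCID"])
        elif isinstance(kid, dict) and "/S" in kid:
            # Nested structure element.
            struct_node_to_xml(kid, elem)
        # Anything else (e.g. object references) is skipped in this sketch.
    return elem


# Hypothetical, hand-written tree standing in for a parsed tagged PDF.
example = {
    "/Type": "/StructTreeRoot",
    "/K": {
        "/S": "/Table",
        "/A": {"/O": "/Layout", "/BBox": [72, 500, 540, 700]},
        "/K": [
            {"/S": "/TR", "/K": [{"/S": "/TH", "/K": 0}, {"/S": "/TH", "/K": 1}]},
            {"/S": "/TR", "/K": [
                {"/S": "/TD", "/K": {"/Type": "/MCR", "/MCID": 2}},
                {"/S": "/TD", "/K": [3]},
            ]},
        ],
    },
}

root = struct_node_to_xml(example)
xml_text = ET.tostring(root, encoding="unicode")
```

Custom structure types (mapped through the role map in real files) may not be valid XML names, so a real exporter would need to sanitise the tag names first.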

PeterStaar-IBM · 2024-11-12T06:30:41Z

@mllife Yes, we could add this as extra info. However, docling generally identifies the tags via visual models.

mllife · 2024-11-12T09:20:27Z

Yes, that ML model works, but sometimes PDFs have built-in tags that come directly from the vendors/distributors and are always accurate, and there is no way to use them programmatically (as far as I know). If you add this feature, this will be the only library to do it. "pdfalyzer" (https://github.com/michelcrypt4d4mus/pdfalyzer) is one tool for analysing PDFs, but it does not allow reading from the structure and dumping it into a usable format, e.g. tables to CSV, or bounding boxes taken from the tags themselves. docling-parse is a unique project because you are building it from scratch, so I have some hope.
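
As an illustration of the "tables to CSV" idea, the sketch below walks a tagged `Table` element (`TR` rows containing `TH`/`TD` cells, optionally grouped under `THead`/`TBody`/`TFoot`) and writes the cells as CSV. It rests on the same assumption of plain dicts with PDF key names, plus a hypothetical `mcid_text` mapping from marked-content ID to the text of that marked-content sequence, which a parser would have to build from the page content stream. MCIDs are only unique within one page, so the mapping here assumes the table sits on a single page. All function names and sample data are made up for the example.

```python
import csv
import io


def _as_list(value):
    """Return "/K" contents as a list, whether it was a single kid or an array."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def cell_text(cell, mcid_text):
    """Collect the text of a cell by following its MCIDs and nested elements."""
    parts = []
    for kid in _as_list(cell.get("/K")):
        if isinstance(kid, int):
            parts.append(mcid_text.get(kid, ""))
        elif isinstance(kid, dict) and "/S" in kid:
            parts.append(cell_text(kid, mcid_text))  # e.g. a /P inside a /TD
    return " ".join(p for p in parts if p)


def table_to_rows(table, mcid_text):
    """Turn a tagged /Table element into a list of rows of cell strings."""
    rows = []
    for kid in _as_list(table.get("/K")):
        if not isinstance(kid, dict):
            continue
        if kid.get("/S") == "/TR":
            rows.append([
                cell_text(c, mcid_text)
                for c in _as_list(kid.get("/K"))
                if isinstance(c, dict) and c.get("/S") in ("/TH", "/TD")
            ])
        elif kid.get("/S") in ("/THead", "/TBody", "/TFoot"):
            rows.extend(table_to_rows(kid, mcid_text))  # row groups
    return rows


# Hypothetical table element and MCID-to-text mapping for a single page.
table = {"/S": "/Table", "/K": [
    {"/S": "/THead", "/K": {"/S": "/TR", "/K": [
        {"/S": "/TH", "/K": 0}, {"/S": "/TH", "/K": 1}]}},
    {"/S": "/TBody", "/K": [{"/S": "/TR", "/K": [
        {"/S": "/TD", "/K": 2}, {"/S": "/TD", "/K": {"/S": "/P", "/K": 3}}]}]},
]}
mcid_text = {0: "Name", 1: "Qty", 2: "Widget", 3: "3"}

rows = table_to_rows(table, mcid_text)
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
csv_text = buf.getvalue()
```

The sketch ignores row and column spans and nested tables, which a real converter would have to handle.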

PeterStaar-IBM · 2024-11-12T09:25:37Z

I agree, it is a good idea to add. I know there are also "annotations" and metadata we could use in principle. I don't consider it the highest priority, but it definitely would be nice to have in the medium term (by end of year).

mllife · 2024-11-12T12:32:29Z

> I agree, it is a good idea to add. I know there are also "annotations" and metadata we could use in principle. I don't consider it the highest priority, but it definitely would be nice to have in the medium term (by end of year).

I can definitely say I will be the first one to test it and provide you with feedback on it. Thanks a ton in advance.

PeterStaar-IBM · 2024-11-12T13:13:27Z

@mllife Could you then please provide some examples where you know it is present? I would not know where to search for it.

mllife · 2024-11-13T05:11:00Z

Sorry, I can't share any of these files, but there is a way to create them using the "accessibility features" in Foxit PDF Pro (a free trial is available): https://www.foxit.com/pdf-editor/advanced-editing/ (https://www.youtube.com/watch?v=Oub-mmPXASk). Table tagging is done automatically. I don't think it is always 100% correct, but it works on most PDFs (it should be sufficient to create some examples to test). Also, if you export any styled Word 2013+ file to PDF, it should be tagged automatically. Let me know if this is helpful.

mllife · 2024-11-18T05:34:37Z

Hello @PeterStaar-IBM, any update on this?

PeterStaar-IBM · 2024-11-18T06:55:10Z

Yes, I looked into it, but I have found very few documents that use this; hence, it is not a priority right now.

PeterStaar-IBM added the ice-box label Nov 18, 2024
