support for tagged pdfs? <StructTreeNode> #54

Open
mllife opened this issue Nov 12, 2024 · 8 comments
@mllife

mllife commented Nov 12, 2024

I have been working with PDFs for some time and recently came across tagged PDFs. I read that they carry a data structure, StructTreeNode, and I would like to know whether you can add support for it, i.e. low-level handling of this case. My knowledge of tagged PDFs is limited.
I have a couple of questions:

Is it possible to dump the structure into an XML-like format, so that I can easily build a parser on top of it to extract tables and other important tagged structures?
Can I get bounding boxes for these structures from the StructTreeRoot itself, so that I can link them back to the PDF page, as we can with PDF parsers?
Goal - to convert PDFs to simple text or a JSON structure while using the information from the tagging.
My intro to tagged PDFs was this - https://accessible-pdf.info/en/basics/general/overview-of-the-pdf-tags/#table-elements
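
To make the first question concrete, here is a minimal sketch of what an XML dump could look like. It does not use docling-parse or any PDF library: `sample_tree` is a hypothetical, hand-written stand-in for an already-parsed structure tree, where each node carries its structure type under `S` and its children under `K` (the key names the PDF specification uses for structure elements), and integer children stand in for marked-content IDs (MCIDs). The function name `struct_to_xml` and the `mcids` attribute are assumptions made for illustration, not an existing API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-parsed structure tree. Each dict mimics a PDF
# structure element: "S" is the structure type, "K" holds the kids.
# Integer kids stand in for marked-content IDs (MCIDs).
sample_tree = {
    "S": "Document",
    "K": [
        {"S": "H1", "K": [0]},
        {"S": "Table", "K": [
            {"S": "TR", "K": [
                {"S": "TH", "K": [1]},
                {"S": "TD", "K": [2]},
            ]},
        ]},
    ],
}


def struct_to_xml(node):
    """Recursively convert a structure-element dict into an XML element."""
    elem = ET.Element(node["S"])
    mcids = []
    for kid in node.get("K", []):
        if isinstance(kid, dict):
            # Nested structure element: recurse and attach as a child.
            elem.append(struct_to_xml(kid))
        else:
            # Leaf reference to marked content on a page.
            mcids.append(str(kid))
    if mcids:
        elem.set("mcids", " ".join(mcids))
    return elem


xml_text = ET.tostring(struct_to_xml(sample_tree), encoding="unicode")
print(xml_text)
```

On the second question, as far as I understand the tree mostly points at page content through MCIDs rather than storing coordinates, so bounding boxes would presumably still have to come from matching those IDs against the parsed page content.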

@PeterStaar-IBM
Contributor

@mllife Yes, we could add this as extra info. However, docling generally identifies these tags via visual models.

@mllife
Author

mllife commented Nov 12, 2024

Yes, the ML model works, but some PDFs come with built-in tags directly from the vendors/distributors. These tags are always accurate, and there is no way to use them programmatically (as far as I know). If you add this feature, this will be the only library to do it. "pdfalyzer" (https://github.com/michelcrypt4d4mus/pdfalyzer) is one tool for analysing PDFs, but it does not allow reading the structure and dumping it into a usable format, e.g. tables to CSV, or bboxes from the tags themselves. docling-parse is a unique project because you are building it from scratch, so I have some hope.

@PeterStaar-IBM
Contributor

I agree, it is a good idea to add. I know there are also "annotations" and metadata we could use in principle. I don't consider it the highest priority, but it would definitely be nice to have in the medium term (by end of year).

@mllife
Author

mllife commented Nov 12, 2024

I can definitely say I will be the first one to test it and give you feedback on it. Thanks a ton in advance.

@PeterStaar-IBM
Contributor

@mllife Can you then please provide some examples where you know the tags are present? I would not know where to search for them.

@mllife
Author

mllife commented Nov 13, 2024

Sorry, I can't share any of these files, but there is a way to create them using the "accessibility features" in Foxit pdf pro (a free trial is available): https://www.foxit.com/pdf-editor/advanced-editing/ (https://www.youtube.com/watch?v=Oub-mmPXASk). Table tagging is done automatically. I think it's not always 100% correct, but it works on most PDFs (it should be sufficient to create some examples to test). Also, if you export any stylised Word 2013+ file to PDF, it should be tagged automatically. Let me know if this is helpful.
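
For sifting through a folder of candidate files, a rough check along these lines might help. It is only a heuristic sketch using the Python standard library: it scans the raw bytes for the `/StructTreeRoot` key, so it will miss files whose catalog is stored in a compressed object stream, and it says nothing about how good the tags are. The function name `may_have_struct_tree` is hypothetical.

```python
from pathlib import Path


def may_have_struct_tree(path):
    """Heuristic: True if the raw file bytes mention /StructTreeRoot.

    False negatives are possible when the document catalog lives in a
    compressed object stream, so treat False as "unknown", not "untagged".
    """
    data = Path(path).read_bytes()
    return b"/StructTreeRoot" in data


# Example use (the file name is a placeholder):
# print(may_have_struct_tree("exported_from_word.pdf"))
```

A proper check would parse the document catalog with a real PDF parser and look at both the structure tree root and the mark information there.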

@mllife
Author

mllife commented Nov 18, 2024

Hello @PeterStaar-IBM, any update on this?

@PeterStaar-IBM
Contributor

Yes, I looked into it, but I have found very few documents that use this; hence, it is not a priority right now.
