support for tagged pdfs? <StructTreeNode> #54
Comments
@mllife Yes, we could add this as extra info. However, docling generally identifies the tags via visual models.
Yes, the ML model works, but some PDFs have built-in tags that come directly from the vendors/distributors and are always accurate, and there is no way to utilise them programmatically (as far as I know). If you add this feature, this will be the only library to do it. "pdfalyzer" (https://github.com/michelcrypt4d4mus/pdfalyzer) is one tool for analysing the structure, but it does not allow reading from the structure and dumping it into a usable format, for example tables to CSV, or getting bboxes from the tags themselves. docling-parse is a unique project because you are building it from scratch, so I have some hope.
I agree, it is a good idea to add. I know there are also "annotations" and metadata we could use in principle. I don't consider it the highest priority, but it definitely would be nice to have in the medium term (by end of year).
I can definitely say I will be the first one to test it and give you feedback on it. Thanks a ton in advance.
@mllife Can you please provide me with some examples where you know it is there? I would not know where to search for it.
Sorry, I can't share any of these files, but there is a way to create them using the "accessibility features" in Foxit PDF Pro (a trial is available for free): https://www.foxit.com/pdf-editor/advanced-editing/ (https://www.youtube.com/watch?v=Oub-mmPXASk). Table tagging is done automatically. I don't think it is always 100% correct, but it works on most PDFs, which should be sufficient to create some examples to test. Also, if you export any stylised Word 2013+ file to PDF, it should be tagged automatically. Let me know if this is helpful.
Hello @PeterStaar-IBM, any update on this?
Yes, I looked into it, but I have found very few documents that use this; hence, it is not a priority right now.
I have been working with PDFs for some time, but I recently came across tagged PDFs and read that they have a data structure, StructTreeNode. I would like to know if you can add support for it, i.e. low-level handling of this case. My knowledge of tagged PDFs is limited.
So, I have a couple of questions:
Is it possible to dump it into an XML-like structure, so it is easy for me to create a parser on top of it to extract tables and other important tagged structures?
Can I get bounding boxes for these structures from the StructTreeRoot itself, so I can link them back to the PDF page, as we can do with PDF parsers?
Goal: to convert PDFs to simple text or a JSON structure while utilizing the information from tagging.
My intro to tagged pdfs was this - https://accessible-pdf.info/en/basics/general/overview-of-the-pdf-tags/#table-elements
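To make the first question concrete, here is a rough sketch of the kind of XML dump I have in mind. It is not docling-parse code and it does not read a PDF: it assumes some PDF library has already loaded the structure tree into nested dicts, where "S" is the tag name and "K" the kids, as in the PDF structure element dictionaries. The function name `struct_to_xml`, the extra `page` and `bbox` fields, and the sample table are all made up for illustration; whether a bbox can be obtained for every element is exactly what the second question asks.

```python
import xml.etree.ElementTree as ET


def struct_to_xml(node):
    """Turn one structure element (a dict) into an XML element, recursively."""
    elem = ET.Element(node["S"])  # "S" holds the tag name, e.g. "Table"
    # "page" and "bbox" are hypothetical extras the caller would have to fill in
    if "page" in node:
        elem.set("page", str(node["page"]))
    if "bbox" in node:
        elem.set("bbox", " ".join(str(v) for v in node["bbox"]))
    kids = node.get("K", [])
    if not isinstance(kids, list):
        kids = [kids]  # a single kid may be stored without a list
    last_child = None
    for kid in kids:
        if isinstance(kid, dict):
            # nested structure element
            last_child = struct_to_xml(kid)
            elem.append(last_child)
        elif last_child is None:
            # text that comes before any child element
            elem.text = (elem.text or "") + str(kid)
        else:
            # text that follows a child element
            last_child.tail = (last_child.tail or "") + str(kid)
    return elem


# Made-up sample: a two-row table with already-resolved cell text
tree = {
    "S": "Table",
    "page": 0,
    "bbox": [72, 500, 540, 700],
    "K": [
        {"S": "TR", "K": [{"S": "TH", "K": "Name"}, {"S": "TH", "K": "Qty"}]},
        {"S": "TR", "K": [{"S": "TD", "K": "Bolt"}, {"S": "TD", "K": "12"}]},
    ],
}

xml_text = ET.tostring(struct_to_xml(tree), encoding="unicode")
print(xml_text)
```

With a dump like this, a parser on top could walk the `TR` elements and write each row to CSV, and the `page`/`bbox` attributes (if available) would give the link back to the PDF page.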