Utility functions for reading PageXML files
poetry add pagexml-tools
pip install pagexml-tools
PageXML-tools contains functions for parsing PageXML files and for a range of analysis tasks.
There is a tutorial that demonstrates the physical document model API.
PageXML-tools contains basic functionality for parsing a PageXML file into a document model that represents the content of the file. The HTR/OCR process that generates PageXML recognises text in an image of a physical document.
from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"
page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
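The loop above only visits lines attached directly to top-level text regions. Since a text region can contain sub-text regions, a recursive walk may be needed to reach every line. The following is a minimal sketch, not part of the library: `iter_lines` is a hypothetical helper, and it assumes that nested regions are exposed under the same `text_regions` attribute that the page document uses. Stand-in objects replace a parsed page so that the sketch runs without a PageXML file.

```python
from types import SimpleNamespace


def iter_lines(container):
    """Yield every line below a page or text region, depth first.

    Hypothetical helper: assumes containers expose `lines` and
    `text_regions` attributes, either of which may be missing or empty.
    """
    # Lines attached directly to this container (a page may have none).
    for line in getattr(container, "lines", None) or []:
        yield line
    # Assumption: nested regions use the same attribute name as the page.
    for region in getattr(container, "text_regions", None) or []:
        yield from iter_lines(region)


# Stand-in objects that mimic the attributes used in the example above.
inner = SimpleNamespace(
    id="tr-1-1",
    lines=[SimpleNamespace(id="l-2", text="second line")],
    text_regions=[],
)
outer = SimpleNamespace(
    id="tr-1",
    lines=[SimpleNamespace(id="l-1", text="first line")],
    text_regions=[inner],
)
page = SimpleNamespace(id="page-1", text_regions=[outer])

line_ids = [line.id for line in iter_lines(page)]
print(line_ids)
```

With a real file, the same helper could be called on the object returned by `parse_pagexml_file`, provided the attribute assumption holds for that document.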
In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:
- reading sets of PageXML files from an archive (tar, zip) file (tutorial),
- searching in text (keyword in context, keywords or fuzzy search),
- reading and working with tables (table processing),
- classifying physical document types in a large set of PageXML documents (tutorial),
- checking the quality of the HTR/OCR process (tutorial),
- comparing subsets (tutorial),
- identifying document sections in sequences of PageXML documents (tutorial),
- turning text lines into running text (tutorial),
- supporting different reading orders (tutorial),
- reinterpreting and restructuring text regions and lines (tutorial),
- turning physical structure into logical structure.
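To give an idea of what the keyword-in-context search listed above involves, here is a minimal plain-Python sketch. It does not use the library's own search functionality: `keyword_in_context` is a hypothetical helper that relies only on the `id` and `text` attributes of lines shown earlier, and it assumes that a line without recognised text has `text` set to `None`.

```python
import re
from types import SimpleNamespace


def keyword_in_context(lines, keyword, width=20):
    """Return (line_id, context) pairs for case-insensitive keyword matches.

    Hypothetical helper: `width` is the number of characters of context
    kept on each side of a match.
    """
    if not keyword:
        return []
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for line in lines:
        # Assumption: lines without recognised text have text None.
        text = line.text or ""
        for match in pattern.finditer(text):
            start, end = match.span()
            context = text[max(0, start - width):end + width]
            hits.append((line.id, context))
    return hits


# Stand-in lines that mimic the `id` and `text` attributes of parsed lines.
lines = [
    SimpleNamespace(id="l-1", text="The ship sailed from Amsterdam"),
    SimpleNamespace(id="l-2", text=None),
    SimpleNamespace(id="l-3", text="amsterdam, 4 May"),
]

hits = keyword_in_context(lines, "Amsterdam", width=5)
for line_id, context in hits:
    print(line_id, repr(context))
```

Matching line by line keeps each hit tied to a line ID, so a hit can be traced back to its coordinates on the page. The trade-off is that a keyword split across two lines is missed, which is one reason to turn lines into running text first.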