
This is a pipeline for language detection on the Globalise HTR output.

Usage

You will need a Linux/BSD/macOS system with make, cargo, rustc, and a recent enough python3. On Ubuntu/Debian Linux these can be installed with apt install make cargo python3.
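
You can verify that the required tools are available by checking their versions:

$ make --version
$ cargo --version
$ python3 --version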

First, clone this git repository. Then, from the repository root, run the following to check and install the necessary dependencies:

$ make setup

This will check the necessary dependencies, create a Python virtual environment with globalise-tools, and compile and install lingua-cli and lexmatch, which are used for language detection.

Activate the Python virtual environment that was created in the above step:

$ source env/bin/activate
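
After activation, the virtual environment's interpreter and tools take precedence on your PATH; you can confirm this with:

$ which python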

Next, extract data from the Globalise page XMLs. Pass the path to your page XMLs, which should be grouped in directories corresponding to inventory numbers (a sketch of the expected layout follows the command below). This will create many *-lines.tsv files that serve as the input for the actual pipeline:

$ make fromxml PAGEXML_PATH=/path/to/globalise-pagexml
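
For illustration, the expected input layout looks roughly like this; the inventory numbers and file names here are hypothetical, and only the grouping of page XML files into one directory per inventory number matters:

/path/to/globalise-pagexml/
├── 1120/
│   ├── page1.xml
│   └── page2.xml
└── 1121/
    └── ...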

Then run the actual language detection pipeline as follows:

$ make all

Note: processing may take a long time if done serially, so it is strongly recommended to make use of multiple CPU cores by passing -j 20 (for example, for 20 cores) to make to speed up the process.
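
For example, to run the whole pipeline on 20 cores:

$ make -j 20 all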

The main results will be in pages.lang.tsv. Secondary results will be in nondutch-pages.lang.tsv (all pages that include a language other than Dutch) and unknown-pages.lang.tsv (all pages whose language could not be identified).
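
To take a quick first look at the results, standard shell tools suffice, for example:

$ head pages.lang.tsv
$ wc -l nondutch-pages.lang.tsv unknown-pages.lang.tsv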