This project develops, trains, and evaluates a series of ML models to predict the Character Error Rate (CER) of OCR-ed text documents. The best-performing model is a Support Vector Regression (SVR) model that estimates CER from lexical features of the text.
- Feature Extraction: The program extracts various features from text files, such as readability scores, lexical diversity, and character frequency deviations.
- Model Prediction: Using a pre-trained model, the program predicts the CER based on the extracted features.
- Output: The results are saved to a CSV file or displayed in the terminal.
- Readability Scores: Flesch Reading Ease and Flesch-Kincaid Grade.
- Lexical Diversity: Using the LexicalRichness package.
- Character Frequencies: Squared and absolute deviations from expected frequencies.
- Paired Letter Analysis: Emphasizes common OCR misrecognitions.
- Percentages: Alphabetic, numeric, and punctuation character percentages.
- Misspelled Words: Percentage of misspelled words and interaction with lexical diversity.
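As a rough illustration of the character-level features listed above, the sketch below computes character-class percentages and letter-frequency deviations using only the standard library. The function names, the `EXPECTED_FREQ` table (approximate values for a few English letters), and the choice to count whitespace in the denominator are assumptions made for this example; the project's own implementation may differ.

```python
import string
from collections import Counter

# Assumed, approximate reference frequencies for a few English letters.
# Illustrative only; the project's actual table is not shown in this README.
EXPECTED_FREQ = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070, "n": 0.067}


def character_percentages(text):
    """Return alphabetic, numeric, and punctuation shares of all characters (0-100)."""
    total = len(text)  # assumption: whitespace counts toward the total
    if total == 0:
        return {"percent_alphabetic": 0.0, "percent_numeric": 0.0, "percent_punctuation": 0.0}
    alpha = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    punct = sum(ch in string.punctuation for ch in text)
    return {
        "percent_alphabetic": 100.0 * alpha / total,
        "percent_numeric": 100.0 * digits / total,
        "percent_punctuation": 100.0 * punct / total,
    }


def letter_deviations(text, expected=EXPECTED_FREQ):
    """Return (squared, absolute) deviations of observed letter shares from expected ones."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    n = len(letters)
    diffs = []
    for letter, expected_share in expected.items():
        observed_share = counts[letter] / n if n else 0.0
        diffs.append(observed_share - expected_share)
    squared = sum(d * d for d in diffs)
    absolute = sum(abs(d) for d in diffs)
    return squared, absolute
```

Heavily corrupted OCR output tends to push both deviation values up, because misrecognized glyphs distort the letter distribution away from what clean text in the same language would show.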
- fk_ocr: Flesch-Kincaid score
- substitution_hhi: Substitution Herfindahl-Hirschman Index (paired letter analysis)
- percent_numeric: Percentage of numeric characters
- squared_letter_devs: Squared deviations from expected letter frequencies
- misspelled_interaction: Interaction between lexical complexity and percentage of misspelled words
- flesch_ocr: Flesch Reading Ease score
- absolute_letter_devs: Absolute deviations from expected letter frequencies
- percent_punctuation: Percentage of punctuation characters
- percent_alphabetic: Percentage of alphabetic characters
- percent_misspelled: Percentage of misspelled words
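Two of these features may be less familiar, so here is a minimal sketch of how they could be computed. A Herfindahl-Hirschman Index is the sum of squared shares across categories; how the project defines the categories for `substitution_hhi` is not stated here, so the `CONFUSABLE_PATTERNS` list and the counting approach are hypothetical. The interaction term is shown as a simple product, which is the usual form of an interaction feature but is likewise an assumption.

```python
from collections import Counter

# Hypothetical examples of character sequences that OCR engines often confuse.
# The project's actual pair list is not given in this README.
CONFUSABLE_PATTERNS = ["rn", "cl", "vv", "li"]


def herfindahl_hirschman_index(counts):
    """Sum of squared shares: 1/k for k equally common categories, 1.0 for a single one."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((count / total) ** 2 for count in counts.values())


def substitution_hhi(text, patterns=CONFUSABLE_PATTERNS):
    """Concentration of confusable-pattern occurrences in the text (assumed definition)."""
    counts = Counter({pattern: text.count(pattern) for pattern in patterns})
    return herfindahl_hirschman_index(counts)


def misspelled_interaction(lexical_diversity, percent_misspelled):
    """Interaction term, assumed here to be the product of the two features."""
    return lexical_diversity * percent_misspelled
```

Under this reading, a value near 1.0 means a single pattern accounts for almost all matches, while lower values mean the matches are spread across several patterns.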
To set up the environment:
- Download the predict_cer folder from Dropbox.
- Navigate to that folder locally.
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
- Install all dependencies:
pip install -r requirements.txt
You can run the program in two ways:
To print out the prediction for a single file in the terminal:
python predict_cer.py <absolute path to OCR text file>
The program will print the predicted CER of the file to your terminal.
To generate a CSV of prediction data for a directory of text files:
python predict_cer.py <absolute path to the directory containing text files>
The program will output a CSV named predicted_cer_results.csv in your working directory.
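The two modes described above could be dispatched roughly as follows. This is a sketch rather than the actual contents of predict_cer.py: the `predict_cer` function is a placeholder standing in for feature extraction plus the trained model, and the `*.txt` filter and the CSV column names are assumptions.

```python
import csv
from pathlib import Path


def predict_cer(text):
    """Placeholder (hypothetical): stands in for feature extraction plus the trained model."""
    return 0.0


def read_text(path):
    """Read a file as UTF-8, replacing undecodable bytes, which are common in OCR output."""
    return path.read_text(encoding="utf-8", errors="replace")


def predict_path(target, output_csv="predicted_cer_results.csv"):
    """Print the CER for a single file, or write a CSV for every .txt file in a directory."""
    target = Path(target)
    if target.is_file():
        cer = predict_cer(read_text(target))
        print(f"{target.name}: predicted CER = {cer:.4f}")
        return [(target.name, cer)]
    if target.is_dir():
        # Assumption: only files ending in .txt are treated as OCR text.
        rows = [(p.name, predict_cer(read_text(p))) for p in sorted(target.glob("*.txt"))]
        with open(output_csv, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["filename", "predicted_cer"])  # assumed column names
            writer.writerows(rows)
        return rows
    raise FileNotFoundError(f"No such file or directory: {target}")


def main(argv):
    """Entry point taking the command-line arguments after the script name."""
    if len(argv) != 1:
        raise SystemExit("usage: python predict_cer.py <path to file or directory>")
    predict_path(argv[0])
```

A script built this way would call `main(sys.argv[1:])` at the bottom, so the same function serves both the single-file and the directory invocation.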