This repository contains code and some accompanying documentation to support analysis of languages used in manuscripts. The scripts also allow you to create a simple website using GitHub Pages.
The code reads CSV files that describe the texts in a manuscript. Each text has an optional title, a language, and an indication of where it starts and ends. The code produces a new CSV file that contains the number of folio sides for each text, correcting for sides that contain multiple texts.
It depends on pandas for working with CSV files and doing the analyses. The included Jupyter notebook is the first proof of concept.
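The correction for shared sides can be illustrated with a minimal sketch. It assumes that a side shared by several texts is divided equally among them. That rule, the function name and the sample titles are illustrative assumptions, not taken from the scripts. It uses only the standard library rather than pandas.

```python
from collections import Counter
from fractions import Fraction


def sides_per_text(texts):
    """Return the number of sides per title.

    texts: list of (title, first_side, last_side) tuples, where the
    sides are inclusive ordinal numbers (1 = first side of the book).
    Assumption: a side shared by n texts counts as 1/n for each text.
    """
    # Count how many texts occupy each side.
    occupancy = Counter()
    for _title, first, last in texts:
        occupancy.update(range(first, last + 1))

    # Give each text its share of every side it occupies.
    totals = {}
    for title, first, last in texts:
        share = sum(Fraction(1, occupancy[side]) for side in range(first, last + 1))
        totals[title] = totals.get(title, 0) + share
    return totals


# Invented sample data: the two texts meet on side 3.
texts = [("Psalter", 1, 3), ("Calendar", 3, 4)]
print(sides_per_text(texts))  # Psalter: 5/2 sides, Calendar: 3/2 sides
```

Using `Fraction` avoids rounding noise when a side is split three ways; converting to `float` at the end is enough for reporting.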
This programme tries to auto-detect the input character encoding using the chardet library and writes output using the UTF-8 encoding.
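As a rough sketch of this approach, the following reads a file as bytes, asks chardet for its best guess, and writes text back as UTF-8. The function names and the UTF-8 fallback are assumptions made for this example; the scripts may handle detection differently.

```python
from pathlib import Path

try:
    import chardet  # third-party library named in the text above
except ImportError:  # keeps this sketch usable when chardet is missing
    chardet = None


def read_text_guessing_encoding(path, fallback="utf-8"):
    """Decode a file using chardet's guess, or `fallback` if there is none."""
    raw = Path(path).read_bytes()
    # chardet.detect returns a dict; its "encoding" value can be None.
    guess = chardet.detect(raw)["encoding"] if chardet else None
    return raw.decode(guess or fallback)


def write_utf8(path, text):
    """Write text to a file, always encoded as UTF-8."""
    Path(path).write_text(text, encoding="utf-8")
```

Detection is a statistical guess, so short files or files with few non-ASCII characters may be misidentified. Checking a few decoded rows by eye is worthwhile.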
The code doesn't need to be installed; downloading it to a suitable location is enough. You may need to install dependencies.
To prevent interference with existing Python 3 installations, you should create a virtualenv for this project. This example creates a virtualenv in ~/virtualenvs/ called manuscript-stats and activates it:
virtualenv ~/virtualenvs/manuscript-stats
source ~/virtualenvs/manuscript-stats/bin/activate
In your git tree, clone the repository and install the required dependencies:
git clone https://github.com/LeidenUniversityLibrary/manuscript-stats.git
cd manuscript-stats/
pip install -r requirements.txt
The main script is LanguageAnalysis.py. It reads the main overview of manuscripts and their metadata from manuscripts.csv and the lists of contents for each manuscript from individual files named contents_XXX.csv, where XXX is the manuscript identifier used in the overview.
This script expects all input files to be in data/input/.
python LanguageAnalysis.py
When run, this command prints the name of the file it is working on and creates a normalised version in data/output/, if the operations succeed. If something went wrong, the error is printed.
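The behaviour described above (print the file name, write a normalised copy, print the error on failure) can be sketched as follows. This is an illustration under assumptions, not the code of LanguageAnalysis.py: the function names are hypothetical, "normalising" is reduced to stripping whitespace and dropping empty rows, files are found by name pattern, and the input is assumed to be UTF-8 already.

```python
import csv
from pathlib import Path


def normalise_rows(rows):
    """Hypothetical normalisation: strip whitespace, drop fully empty rows."""
    cleaned = [{key: (value or "").strip() for key, value in row.items()} for row in rows]
    return [row for row in cleaned if any(row.values())]


def process_all(input_dir="data/input", output_dir="data/output"):
    """Normalise every contents_*.csv file and return the names that succeeded."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    succeeded = []
    for source in sorted(Path(input_dir).glob("contents_*.csv")):
        print(source.name)  # report the file being worked on
        try:
            with source.open(newline="", encoding="utf-8") as handle:
                reader = csv.DictReader(handle)
                rows = normalise_rows(reader)
                fieldnames = reader.fieldnames
            if not fieldnames:
                raise ValueError("no header row found")
            with (out / source.name).open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(rows)
        except Exception as error:
            # A failure in one file should not stop the others.
            print(f"Error in {source.name}: {error}")
        else:
            succeeded.append(source.name)
    return succeeded
```

Calling `process_all()` with its defaults would read from data/input/ and write to data/output/, matching the directories named in the text.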
It also creates (in data/output/):
- all_manuscripts.csv: the main output that is a combination of the input manuscripts.csv and the calculated language coverage for each manuscript
- all_contents.csv: a concatenation of all normalised contents_XXX.csv files with page ranges converted to ordinal numbers and the total number of pages for each text calculated
- all_languages.csv: for each manuscript, this contains one row per language and the number of pages for that language
- all_langs_pivot.csv: a pivoted version of the all_languages.csv table with columns for absolute and relative numbers of pages in French, English, Latin and other languages
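To make the pivoted table more concrete, here is a minimal sketch of turning per-language rows into one record per manuscript with absolute and relative page counts. The column names (`French_pages`, `French_share` and so on), the function name and the sample values are assumptions for illustration. The scripts use pandas for this step, while the sketch uses only the standard library.

```python
from collections import defaultdict

MAIN_LANGUAGES = ("French", "English", "Latin")


def pivot_languages(rows):
    """Pivot (manuscript, language, pages) rows into one record per manuscript.

    Languages outside MAIN_LANGUAGES are added together as "Other".
    """
    absolute = defaultdict(lambda: defaultdict(float))
    for manuscript, language, pages in rows:
        column = language if language in MAIN_LANGUAGES else "Other"
        absolute[manuscript][column] += float(pages)

    pivot = {}
    for manuscript, counts in absolute.items():
        total = sum(counts.values())
        record = {}
        for column in (*MAIN_LANGUAGES, "Other"):
            pages = counts.get(column, 0.0)
            record[f"{column}_pages"] = pages  # absolute number
            record[f"{column}_share"] = pages / total if total else 0.0  # relative number
        pivot[manuscript] = record
    return pivot


# Invented sample rows in the shape of all_languages.csv.
rows = [("MS1", "French", 6), ("MS1", "Latin", 3), ("MS1", "Dutch", 1)]
print(pivot_languages(rows)["MS1"])
```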
The normalise_owners.py script reads files named owners_XXX.csv, where again XXX is the manuscript identifier, and creates both normalised copies and a concatenation of all normalised files named all_owners.csv. The information about owners is not currently used in analyses, but it is presented in the web version of the results.
This script reads input files from data/input/ and writes to data/output/.
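The concatenation step might look roughly like the sketch below. It takes the manuscript identifier from each file name and adds it as a column, so that rows remain traceable in the combined file. The `manuscript` column name and the function name are assumptions; the real script may structure its output differently.

```python
import csv
import re
from pathlib import Path

OWNERS_NAME = re.compile(r"^owners_(?P<identifier>.+)\.csv$")


def concatenate_owners(input_dir, output_file):
    """Combine all owners_XXX.csv files into one CSV; return the row count."""
    combined = []
    fieldnames = ["manuscript"]  # assumed name for the identifier column
    for source in sorted(Path(input_dir).glob("owners_*.csv")):
        match = OWNERS_NAME.match(source.name)
        if not match:
            continue
        with source.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Collect every column seen, keeping first-seen order.
                for name in row:
                    if name not in fieldnames:
                        fieldnames.append(name)
                combined.append({**row, "manuscript": match.group("identifier")})

    with Path(output_file).open("w", newline="", encoding="utf-8") as handle:
        # restval fills columns that a given input file did not have.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(combined)
    return len(combined)
```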
The convert_for_web.py script reads the all_manuscripts.csv, all_contents.csv and all_owners.csv files and generates a Markdown file for each manuscript with all structured data included in the YAML metadata. Using Jekyll (for example, as provided by GitHub Pages) you can create a web presentation of the manuscripts and the analysis results.
The script reads the files from data/output/ and creates the Markdown files in docs/_details/.
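A Markdown file with YAML metadata, as Jekyll expects it, starts with a block between two `---` lines. The sketch below writes such a file for one manuscript. The field names, the function name and the use of JSON-encoded values are assumptions for illustration. It relies on YAML 1.2 being designed as a superset of JSON; a dedicated YAML library would be the more robust choice.

```python
import json
from pathlib import Path


def write_manuscript_page(manuscript, details_dir="docs/_details"):
    """Write one Markdown file whose front matter holds the manuscript data.

    manuscript: dict with simple string keys, including "identifier"
    (a hypothetical field name used here for the file name).
    """
    lines = ["---"]
    for key, value in manuscript.items():
        # JSON-encode each value so strings are quoted and lists are valid.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    lines.append("")  # end the file with a newline

    target = Path(details_dir) / f"{manuscript['identifier']}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(lines), encoding="utf-8")
    return target
```

Nested data, such as the list of contents for a manuscript, can be passed as a list of dicts and ends up as a flow-style YAML sequence.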
In the development of these scripts, we encountered a few things about the input files that may require extra attention.
- When you use Microsoft Excel to prepare the CSV files, make sure the files don't have empty lines.
- Ranges of Roman numerals are handled by converting them to their Arabic equivalents and adding 100000, so that the numbers don't clash with folia numbered in Arabic numerals. This doesn't work when a manuscript has multiple ranges of Roman numerals, but we are not aware of any in this dataset.
- Do not mix folio numbering (1r, 3v) with pagination (1, 6) in one file.
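The numbering points above can be illustrated with a small conversion sketch. The 100000 offset for Roman numerals comes from the text. The mapping of folio sides to ordinals (1r to 1, 1v to 2, and so on) and all function names are assumptions made for this example. The sketch also shows why mixing systems is a problem: folio 3v and page 6 map to the same ordinal.

```python
import re

ROMAN_OFFSET = 100000  # offset named in the text above
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
FOLIO = re.compile(r"^(\d+)([rv])$")


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'xiv' to an integer."""
    total = 0
    previous = 0
    # Read right to left: a smaller value before a larger one is subtracted.
    for char in reversed(numeral.lower()):
        if char not in ROMAN_VALUES:
            raise ValueError(f"not a Roman numeral: {numeral!r}")
        value = ROMAN_VALUES[char]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def location_to_ordinal(location):
    """Map '3v' (folio side), '6' (page) or 'xii' (Roman page) to an ordinal."""
    text = location.strip().lower()
    folio = FOLIO.match(text)
    if folio:
        number, side = int(folio.group(1)), folio.group(2)
        # Assumed mapping: recto is the odd side, verso the even side.
        return 2 * number - 1 if side == "r" else 2 * number
    if text.isdecimal():
        return int(text)
    return roman_to_int(text) + ROMAN_OFFSET


print(location_to_ordinal("3v"))   # 6
print(location_to_ordinal("xii"))  # 100012
```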
Created by Ben Companjen at the Centre for Digital Scholarship, Leiden University Libraries. Copyright 2018, Leiden University Libraries.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.