This package provides set-valued POS taggers. The code relies on existing probabilistic taggers such as CoreNLP and TreeTagger, and additionally provides two simple taggers. Information about the baseline can be found in my thesis.
- data contains the data
- examples and scripts contain usage files
- setpos contains the implementation
Disclaimer: The code probably doesn't run without modifications on Windows. It should work on any standard Linux distribution.
Install Python package:
$ pip install .
- Download TreeTagger and place the binaries tree-tagger and train-tree-tagger in the setpos/tagger/treetagger folder. Make sure the executable flag is set. This code is tested with version 3.2.2.
- Install java version 11 (for CoreNLP)
- Install swig-3 (for hyperopt)
- Install Python package:
  $ pip install .[extra]
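To catch a missing binary or a missing executable flag before training, a small check like the following may help. It is only a sketch and not part of the package: the function name `check_treetagger_binaries` is hypothetical, and the default folder is the one named in the steps above, taken relative to the repository root.

```python
import os
from pathlib import Path

# Binary names as listed in the installation steps above.
REQUIRED_BINARIES = ("tree-tagger", "train-tree-tagger")


def check_treetagger_binaries(folder="setpos/tagger/treetagger"):
    """Return a list of problems found; an empty list means all looks fine.

    Hypothetical helper, not part of setpos.
    """
    problems = []
    for name in REQUIRED_BINARIES:
        path = Path(folder) / name
        if not path.is_file():
            problems.append(f"{path}: missing")
        elif not os.access(path, os.X_OK):
            problems.append(f"{path}: not executable (try chmod +x)")
    return problems


if __name__ == "__main__":
    # Print one line per problem; no output means both binaries are usable.
    for problem in check_treetagger_binaries():
        print(problem)
```

Run it from the repository root after copying the binaries; it does not verify the TreeTagger version.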
The CoreNLP tagger is provided as a patched version.
The patch and the packed jar are in setpos/tagger/corenlp; the patch is applied to this version.
- The patch changes the following:
  - CoreNLP writes the posterior probabilities into debug files (needed for POS tagging)
  - An additional command line option allows modifying the deterministic tag expansion [thesis, 5.5.3]
The data stems from Intergramm, which in turn includes texts that originally stem from the ReN project and have been adapted to the Intergramm tagging guidelines. The corpus consists of historic Middle Low German texts. The versions provided here have slight modifications, such as orthographic unification.
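import logging

import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# MostFrequentTag and CoreNLPTagger are imported as alternatives to TreeTagger.
from setpos.tagger import MostFrequentTag, CoreNLPTagger, TreeTagger
from setpos.data.split import load

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

    # Load tokens, gold tags and the group label of every token.
    toks, tags, groups = load()
    # Use the first leave-one-group-out split: one group is held out for testing.
    train, test = next(LeaveOneGroupOut().split(toks, tags, groups))

    clf = TreeTagger()
    clf.fit(toks[train], tags[train])

    # Show the first 20 test tokens next to their set-valued predictions.
    result = pd.DataFrame([toks[test][:20, 1].tolist(), clf.setpredict(toks[test][:20])], index=['token', 'tag']).T
    print(result)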
           token                                                tag
0     stadtrecht                                        {"FM": 1.0}
1   braunschweig  {"NE": 0.946357, "NA": 0.025582, "ADJD": 0.011...
2           1227                     {"OA": 0.5348, "XY": 0.458823}
3      blankline                                   {"$.": 0.995565}
4        SWelich  {"OA": 0.839456, "DIA": 0.087804, "ADJA": 0.03...
5       vo+eghet  {"NA": 0.636112, "VVFIN.*": 0.182379, "NE": 0....
6           enen             {"DIART": 0.934728, "CARDA": 0.062113}
7         richte                                        {"NA": 1.0}
...
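Each entry in the `tag` column maps candidate tags to probabilities. The sketch below shows one possible way to post-process such an entry: rank the candidates and flag tokens whose prediction contains more than one tag. It assumes the entries are either dicts or JSON strings, as the double quotes in the printout suggest; this is an assumption about the return type of `setpredict`, and the helper names `to_ranked_tags` and `is_ambiguous` are hypothetical, not part of setpos.

```python
import json


def to_ranked_tags(prediction):
    """Return (tag, probability) pairs sorted by descending probability.

    Accepts a dict or a JSON string; which one `setpredict` actually
    returns is an assumption here.
    """
    if isinstance(prediction, str):
        prediction = json.loads(prediction)
    return sorted(prediction.items(), key=lambda item: item[1], reverse=True)


def is_ambiguous(prediction):
    """True if the predicted set contains more than one candidate tag."""
    return len(to_ranked_tags(prediction)) > 1


# Two entries copied from the example output above.
print(to_ranked_tags('{"OA": 0.5348, "XY": 0.458823}'))
print(is_ambiguous({"FM": 1.0}))
```

Tokens flagged as ambiguous are the ones where a single-tag prediction would be least reliable, so they are natural candidates for manual review.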
@article{heid2020reliable,
  title={Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction},
  author={Stefan Heid and Marcel Wever and Eyke Hüllermeier},
  year={2020},
  eprint={2008.01377},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
I want to thank my supervisors and co-authors Marcel Wever and Prof. Eyke Hüllermeier for their helpful feedback during the thesis.