Set-Valued Part-of-Speech Tagging

This package provides set-valued POS taggers. The code relies on existing probabilistic taggers such as CoreNLP and TreeTagger, and additionally provides two simple taggers. Information about the baseline can be found in my thesis.

Overview

data contains the corpus data
examples and scripts contain usage files
setpos contains the implementation

Installation

Disclaimer: The code probably doesn't run without modifications on Windows. It should work on any standard Linux distribution.

Simple

Install Python package:
```
$ pip install .
```

Complete

Download TreeTagger and place the binaries tree-tagger and train-tree-tagger in the setpos/tagger/treetagger folder. Make sure the executable flag is set on both. The code was tested with TreeTagger version 3.2.2.
Install Java version 11 (for CoreNLP)
Install swig-3 (for hyperopt)
Install Python package:
```
$ pip install .[extra]
```
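
Before running the taggers, it can help to confirm that the external prerequisites listed above are in place. The following is a minimal sketch, not part of setpos: the helper name `missing_prerequisites` is hypothetical, the folder and binary names are taken from the steps above, and it assumes the script is started from the repository root. It only checks the TreeTagger binaries and the presence of a `java` executable, not the Java version or swig.

```python
import os
import shutil
from pathlib import Path

# Folder and binary names come from the installation steps above.
# The helper itself is illustrative and not part of the setpos package.
TREETAGGER_DIR = Path("setpos/tagger/treetagger")
TREETAGGER_BINARIES = ("tree-tagger", "train-tree-tagger")


def missing_prerequisites(tagger_dir=TREETAGGER_DIR):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for name in TREETAGGER_BINARIES:
        binary = Path(tagger_dir) / name
        if not binary.is_file():
            problems.append(f"{binary} not found")
        elif not os.access(binary, os.X_OK):
            problems.append(f"{binary} is not executable (try chmod +x)")
    # CoreNLP needs a Java runtime; this only checks that one is on PATH.
    if shutil.which("java") is None:
        problems.append("java not found on PATH (needed for CoreNLP)")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("WARNING:", problem)
```

If the script prints nothing, both binaries are present and executable and a `java` command was found.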

CoreNLP

The CoreNLP tagger is provided as a patched version. The patch and the packed jar are in setpos/tagger/corenlp; the patch is applied to this version.

The patch changes the following:

CoreNLP writes the posterior probabilities into debug files (needed for POS tagging)
An additional command-line option modifies the deterministic tag expansion [thesis, 5.5.3]

Data

License: CC-BY 4.0

The data stems from the Intergramm corpus, which in turn includes texts that originate from the ReN project and have been adapted to the Intergramm tagging guidelines. The corpus consists of historical Middle Low German texts. The versions provided here contain slight modifications such as orthographic unification.

Usage

```
import logging

from sklearn.model_selection import LeaveOneGroupOut
import pandas as pd

from setpos.tagger import MostFrequentTag, CoreNLPTagger, TreeTagger
from setpos.data.split import load

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

    # Load tokens, tags and the group label of each token.
    toks, tags, groups = load()
    # Use the first leave-one-group-out split: one group is held out for testing.
    train, test = next(LeaveOneGroupOut().split(toks, tags, groups))

    clf = TreeTagger()
    clf.fit(toks[train], tags[train])

    # Predict tag sets for the first 20 test tokens and show them next to the tokens.
    sample = toks[test][:20]
    result = pd.DataFrame(
        [sample[:, 1].tolist(), clf.setpredict(sample)],
        index=['token', 'tag'],
    ).T

    print(result)
```

```
           token                                                tag
0     stadtrecht                                        {"FM": 1.0}
1   braunschweig  {"NE": 0.946357, "NA": 0.025582, "ADJD": 0.011...
2           1227                     {"OA": 0.5348, "XY": 0.458823}
3      blankline                                   {"$.": 0.995565}
4        SWelich  {"OA": 0.839456, "DIA": 0.087804, "ADJA": 0.03...
5       vo+eghet  {"NA": 0.636112, "VVFIN.*": 0.182379, "NE": 0....
6           enen             {"DIART": 0.934728, "CARDA": 0.062113}
7         richte                                        {"NA": 1.0}
...
```
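
Each entry in the tag column above maps candidate tags to probabilities. The sketch below shows one way such an entry could be post-processed. It assumes an entry is either a JSON string (as the printed output suggests) or an already parsed mapping; which of the two `setpredict` actually returns is not confirmed here. The helper names `parse_prediction`, `most_probable_tag` and `covers` are hypothetical and not part of setpos.

```python
import json


def parse_prediction(prediction):
    """Return a {tag: probability} dict from a JSON string or a mapping."""
    if isinstance(prediction, str):
        return json.loads(prediction)
    return dict(prediction)


def most_probable_tag(prediction):
    """Return the single tag with the highest probability in the set."""
    tag_probs = parse_prediction(prediction)
    return max(tag_probs, key=tag_probs.get)


def covers(prediction, tag):
    """Return True if the given tag is contained in the predicted set."""
    return tag in parse_prediction(prediction)


# Entry for the token "1227", copied from the output above.
example = '{"OA": 0.5348, "XY": 0.458823}'
print(sorted(parse_prediction(example)))  # ['OA', 'XY']
print(most_probable_tag(example))         # OA
print(covers(example, "XY"))              # True
```

Checking `covers` against reference tags over a test set gives the fraction of tokens whose correct tag lies inside the predicted set, which can be read alongside the average set size.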

Citation

@article{heid2020reliable,
    title={Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction},
    author={Stefan Heid and Marcel Wever and Eyke Hüllermeier},
    year={2020},
    eprint={2008.01377},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgement

I want to thank my supervisors and co-authors Marcel Wever and Prof. Eyke Hüllermeier for their helpful feedback during the thesis.
