We use the term Language of Data for the distinctive grammar of the short textual labels that typically appear in structured data. For example:
Name | Full_addr | Type | Notes |
---|---|---|---|
Pizza Rio | Via della Resistenza, 9/A, 38123 Trento, Italy | pizzeria | take-away possible |
La Bigoudène | 18 rue Vauban, 29200 Brest, France | pancake restaurant | Closed Permanently |
Both headers and data values in the dataset above have the following characteristics:
- short labels, consisting of just a few words;
- frequent named entities (names, postal addresses, dates, URLs);
- non-standard orthography (use of "_" for token separation, Inconsistent Use of Capitals, frequent abbreviations, etc.);
- absence or rarity of certain parts of speech (e.g. verbs, pronouns);
- non-standard syntax (omission of verbs, prepositions, inverted word order, etc.), e.g. take-away [is] possible, operate [an] uninsured vehicle, death country.
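To make the orthographic point concrete, here is a minimal rule-based sketch of how such labels might be pre-split into tokens. It is not the LoD OpenNLP Tokenizer: the function name `split_label` is hypothetical, and the handling of camelCase boundaries is an added assumption that the list above does not mention.

```python
import re


def split_label(label: str) -> list[str]:
    """Split a short data label into tokens (hypothetical baseline helper)."""
    # Treat "_" as a token separator, as in the header "Full_addr".
    text = label.replace("_", " ")
    # Assumption: also break at lower-to-upper case transitions ("deathCountry").
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Whitespace split; hyphenated words such as "take-away" stay whole.
    return text.split()


tokens_header = split_label("Full_addr")
tokens_value = split_label("take-away possible")
tokens_camel = split_label("deathCountry")
```

A baseline like this only covers the separator conventions; abbreviations and inconsistent capitalisation still need a trained model.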
High-accuracy automated analysis of the textual content of vast datasets is crucial in many applications, such as information retrieval (e.g. for meaning-based indexing of content by search engines), data integration, or AI-based data analytics.
State-of-the-art natural language processing tools are trained on regular text (e.g. Wikipedia) or on social media content (e.g. tweets). They analyse text in context, looking at a window of preceding and following words and phrases. In the Language of Data, context is short or non-existent, and orthography and syntax are used in specific, non-standard ways. Conventional NLP tools vastly underperform on such text (e.g. an F-measure of 10-40% for named entity recognition, 70% accuracy for part-of-speech tagging). The specific grammar of the Language of Data therefore calls for specifically designed NLP tools.
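For readers unfamiliar with the F-measure quoted above, the following sketch shows how an entity-level score is commonly computed as the harmonic mean of precision and recall over exactly matching spans. The function name and the example spans are hypothetical and are not taken from our corpora or evaluation scripts.

```python
def f_measure(gold: set, predicted: set) -> float:
    """Harmonic mean of precision and recall over exactly matching entity spans."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans as (start_token, end_token, entity_type).
gold_spans = {(0, 2, "ORG"), (3, 9, "LOC")}
predicted_spans = {(0, 2, "ORG"), (5, 9, "LOC")}  # second span has a wrong boundary

score = f_measure(gold_spans, predicted_spans)
```

Under exact matching, a span with a wrong boundary counts as both a false positive and a false negative, which is one reason why short, context-poor labels are penalised heavily.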
Our corpora and tools are in their early stages and are under constant development. More resources will follow in the near future. All tools and corpora are licensed under CC BY-NC 4.0, meaning that you are free to share and adapt the material for non-commercial purposes, provided that you give appropriate credit to the authors. Do not hesitate to contact us for individual licensing arrangements.
Name | Version | Task | Language | Accuracy/F1 | Link |
---|---|---|---|---|---|
LoD OpenNLP Tokenizer | 1.0 | tokenization | English | 96.7% | Download |
LoD OpenNLP POS Tagger | 1.0 | POS tagging | English | 85.9% | Download |
LoD OpenNLP Name Finder | 1.0 | NER | English | 50.8% | Download |
LoD BERT-NER | 1.0 | NER | English | 67.4% | coming soon |
Note that sequence labelling tasks such as POS tagging or NER are much harder on the Language of Data, which explains the gap with respect to state-of-the-art scores on regular text.
Name | Description | Language | Nb. labels | Nb. tokens | Link |
---|---|---|---|---|---|
LoD Headers English | Hand-annotated table header labels extracted from English-language Open Data catalogues. Token boundaries, POS and NER tags. | English | 8,558 | 31,127 | Download |
LoD Data English | Hand-annotated data value labels extracted from English-language Open Data catalogues. Token boundaries, POS and NER tags. | English | 8,731 | 39,698 | Download |
LoD Headers Italian | Hand-annotated table header labels extracted from Italian-language Open Data catalogues. Token boundaries, POS and NER tags. | Italian | 3,536 | 9,723 | Download |
LoD Data Italian | Hand-annotated data value labels extracted from Italian-language Open Data catalogues. Token boundaries, POS and NER tags. | Italian | 6,528 | 39,517 | Download |
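As a quick sanity check on how short these labels are, the label and token counts in the table above can be turned into an average number of tokens per label. The snippet below is only illustrative arithmetic on those published counts; the variable names are our own.

```python
# (number of labels, number of tokens), copied from the corpus table.
corpora = {
    "LoD Headers English": (8558, 31127),
    "LoD Data English": (8731, 39698),
    "LoD Headers Italian": (3536, 9723),
    "LoD Data Italian": (6528, 39517),
}

# Average label length in tokens for each corpus.
tokens_per_label = {
    name: tokens / labels for name, (labels, tokens) in corpora.items()
}

for name, average in tokens_per_label.items():
    print(f"{name}: {average:.2f} tokens per label")
```

The averages range from roughly 2.7 to 6.1 tokens per label, which gives a sense of how little context each label provides.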
The main publication supporting our principal hypotheses. Please cite it if you use our resources or tools.
An early publication on using NLP mechanisms tailored to the Language of Data in order to perform multilingual and multi-domain word sense disambiguation:
Research on the Language of Data is being carried out at the Language Diversity Lab of the KnowDive Research Group at the University of Trento, Italy. For any inquiry, do not hesitate to drop us an email.
Contributors:
- Gábor Bella;
- Prof. Fausto Giunchiglia;
- Linda Gremes.