Subject vocabulary formats
A subject vocabulary defines the subjects available for automated indexing or classification. Typically this will be a thesaurus, a classification or a list of subject headings. Annif doesn't care much about the internal structure of a subject vocabulary; it just needs to know the URIs and preferred labels (a.k.a. terms or descriptors) of each subject/class/concept. If the vocabulary also includes notation codes, as UDC does, these can be given as well.
The simple TSV subject vocabulary format only specifies URIs and labels for concepts and only supports one language. The vocabulary file is a UTF-8 encoded TSV (tab-separated values) file with the file extension .tsv, where the first column contains a subject URI, the second column its label, and an optional third column its notation code. The format is the same as the extended subject file format for documents, specified below. For example:
<http://example.org/thesaurus/subj1> computer network
<http://example.org/thesaurus/subj2> computer science
<http://example.org/thesaurus/subj3> Internet Protocol 42.42
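To make the column layout concrete, here is a minimal sketch of reading such a file with Python's standard csv module. It is not Annif's own loader: the function name read_tsv_vocab is hypothetical, and stripping the angle brackets around the URI is an assumption based on the example above.

```python
import csv
import io


def read_tsv_vocab(stream):
    """Yield (uri, label, notation) tuples from a TSV vocabulary stream.

    Hypothetical helper for illustration only, not part of Annif.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank lines and rows without a label
        uri = row[0].strip()
        # Assumption: URIs are wrapped in angle brackets, as in the example.
        if uri.startswith("<") and uri.endswith(">"):
            uri = uri[1:-1]
        label = row[1].strip()
        # The third column (notation code) is optional.
        notation = row[2].strip() if len(row) > 2 and row[2].strip() else None
        yield uri, label, notation


sample = (
    "<http://example.org/thesaurus/subj1>\tcomputer network\n"
    "<http://example.org/thesaurus/subj3>\tInternet Protocol\t42.42\n"
)
subjects = list(read_tsv_vocab(io.StringIO(sample)))
```

For a real file, the same function could be given a handle opened with open(path, encoding="utf-8", newline="").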
The CSV subject vocabulary format only specifies URIs and labels for concepts; labels can be given in many languages. The vocabulary file is a UTF-8 encoded CSV (comma-separated values) file with the file extension .csv. The first row is a header that defines the meaning of the columns in subsequent rows. The header must contain a column called uri and one or more columns called label_XX, where XX is a BCP47 language tag such as en or fr. There may also be a notation column for notations. Example:
uri,label_en,label_fr,notation
http://example.org/thesaurus/subj1,computer network,réseau informatique,
http://example.org/thesaurus/subj2,computer science,informatique,
http://example.org/thesaurus/subj3,Internet Protocol,Internet Protocol,42.42
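The header rules can be checked with a short sketch based on csv.DictReader from the Python standard library. This is an illustration under stated assumptions rather than Annif's implementation: the function name read_csv_vocab and the shape of the returned dictionaries are hypothetical.

```python
import csv
import io


def read_csv_vocab(stream):
    """Yield one dict per concept from a CSV vocabulary stream.

    Hypothetical helper for illustration only, not part of Annif.
    """
    reader = csv.DictReader(stream)
    fields = reader.fieldnames or []
    if "uri" not in fields:
        raise ValueError("header must contain a 'uri' column")
    # Every label_XX column contributes labels for language XX.
    label_cols = [name for name in fields if name.startswith("label_")]
    if not label_cols:
        raise ValueError("header must contain at least one label_XX column")
    for row in reader:
        labels = {
            name[len("label_"):]: row[name] for name in label_cols if row[name]
        }
        yield {
            "uri": row["uri"],
            "labels": labels,
            # The notation column is optional and may be empty.
            "notation": row.get("notation") or None,
        }


sample = (
    "uri,label_en,label_fr,notation\n"
    "http://example.org/thesaurus/subj1,computer network,réseau informatique,\n"
    "http://example.org/thesaurus/subj2,computer science,informatique,\n"
    "http://example.org/thesaurus/subj3,Internet Protocol,Internet Protocol,42.42\n"
)
vocab = list(read_csv_vocab(io.StringIO(sample)))
```

Keying the labels by language tag keeps the multilingual structure of the file, so a caller can pick the label for whichever language it needs.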
A subject vocabulary can also be given as a SKOS/RDF file. All common RDF serializations (i.e. those supported by rdflib) are supported, including RDF/XML, Turtle and N-Triples.