Backend: fastText
The `fasttext` backend implements a text classification algorithm based on word embeddings and machine learning. It is a wrapper around the fastText library created by Facebook Research. The model resembles a feed-forward neural network with one hidden layer, though some shortcuts are used for computational efficiency.
The quality of results can be very good, but many parameters have to be tuned to get optimal results. Training can be computationally intensive; by default, all CPU cores are used in parallel. If you have a machine with a large number of CPU cores (more than 8), it is probably wise to limit the number of cores used for training. A good starting point is `thread=12`, beyond which additional CPU cores do not significantly speed up training and only waste CPU resources.
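The advice above could be turned into a small helper for picking a `thread` value. This is only a sketch: the function name `suggest_thread_count` is hypothetical and not part of Annif, and the cap of 12 is simply the starting point mentioned above.

```python
import os


def suggest_thread_count(cap=12, available=None):
    """Return a value for the `thread` parameter, capped at `cap`.

    `available` defaults to the number of CPU cores reported by the OS.
    """
    if available is None:
        # os.cpu_count() may return None if the count cannot be determined
        available = os.cpu_count() or 1
    return max(1, min(available, cap))


print(f"thread={suggest_thread_count()}")
```

The printed line can be copied into the project configuration as-is.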
The fastText library is an optional dependency; see Optional features and dependencies for how to install it.
This is a simple configuration that creates a relatively small model (1.3GB when trained on the `yso-finna-fi` dataset):
[fasttext-en]
name=fastText English
language=en
backend=fasttext
analyzer=snowball(english)
dim=100
lr=0.25
epoch=5
loss=hs
limit=100
chunksize=24
vocab=yso
This is a more advanced configuration that creates a larger (3.6GB when trained on the `yso-finna-fi` dataset) but more accurate model, which also takes much longer to train:
[yso-fasttext-fi]
name=YSO fastText Finnish
language=fi
backend=fasttext
analyzer=snowball(finnish)
dim=430
lr=0.74
epoch=75
minn=4
maxn=7
minCount=3
loss=hs
limit=1000
chunksize=24
vocab=yso
With the exception of `chunksize` and `limit`, all the backend-specific parameters are passed directly to the fastText algorithm. If you omit a parameter, a default value is used. See the fastText documentation on options for more details about the parameters.
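As an illustration of that split, the following sketch reads the simple configuration shown earlier with Python's standard `configparser` and separates the options meant for fastText from the rest. It is not Annif's actual implementation: the helper `split_params` and the `GENERAL_SETTINGS` set are assumptions made for this example.

```python
import configparser

CONFIG = """
[fasttext-en]
name=fastText English
language=en
backend=fasttext
analyzer=snowball(english)
dim=100
lr=0.25
epoch=5
loss=hs
limit=100
chunksize=24
vocab=yso
"""

# Assumption: settings shared by all projects, not specific to this backend
GENERAL_SETTINGS = {"name", "language", "backend", "analyzer", "vocab"}
# Handled by the backend itself rather than by fastText, per the text above
BACKEND_ONLY = {"chunksize", "limit"}


def split_params(section):
    """Separate the options meant for fastText from everything else."""
    fasttext_params = {}
    other = {}
    for key, value in section.items():
        if key in GENERAL_SETTINGS or key in BACKEND_ONLY:
            other[key] = value
        else:
            fasttext_params[key] = value
    return fasttext_params, other


parser = configparser.ConfigParser()
# Keep option names case-sensitive so that e.g. minCount is not lowercased
parser.optionxform = str
parser.read_string(CONFIG)

fasttext_params, other = split_params(parser["fasttext-en"])
print(fasttext_params)
```

Note that `configparser` lowercases option names by default, which is why the sketch overrides `optionxform`; otherwise a mixed-case name such as `minCount` would not survive parsing unchanged.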
The backend processes longer documents in chunks: the document is represented as a list of sentences, and that list is turned into chunks. With a `chunksize` of 24 (as above), each chunk is made of 24 sentences, except for the last chunk, which may be shorter. Each chunk is analyzed separately and the results are averaged. Setting `chunksize` to a high value such as 10000 will in practice disable chunking.
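The chunking and averaging described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the backend's actual code: the function names are hypothetical, and treating a subject missing from a chunk as a score of 0 is an assumption.

```python
from collections import defaultdict


def make_chunks(sentences, chunksize):
    """Split a list of sentences into consecutive chunks of `chunksize`."""
    return [
        sentences[i:i + chunksize]
        for i in range(0, len(sentences), chunksize)
    ]


def average_scores(chunk_results):
    """Average per-subject scores over all chunks.

    Assumption: a subject missing from a chunk's result counts as 0.
    """
    totals = defaultdict(float)
    for result in chunk_results:
        for subject, score in result.items():
            totals[subject] += score
    return {s: total / len(chunk_results) for s, total in totals.items()}


sentences = [f"Sentence {n}." for n in range(1, 51)]
chunks = make_chunks(sentences, chunksize=24)
print([len(chunk) for chunk in chunks])  # [24, 24, 2]
```

With `chunksize=10000`, the same 50 sentences would end up in a single chunk, which is why a very high value effectively disables chunking.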
The most important parameters are:
Parameter | Description |
---|---|
chunksize | How many sentences per chunk |
limit | Maximum number of results to return |
dim | Dimensionality of word vectors (i.e. hidden layer size) |
lr | Learning rate |
epoch | How many passes over training data to perform |
loss | Loss function: `ns`, `hs` or `softmax`. `hs` is much faster than the others. |
minn | Lower limit of character n-gram length |
maxn | Upper limit of character n-gram length |
minCount | Minimum word (or n-gram) frequency to include it in the model |
wordNgrams | Maximum length of word n-grams (default: 1) |
thread | Number of CPU cores to use for training (default: all of them) |
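To make `minn` and `maxn` more concrete: fastText can represent a word partly through its character n-grams, taken after wrapping the word in `<` and `>` boundary markers. The sketch below enumerates those n-grams in plain Python. It illustrates the idea only; the function `char_ngrams` is hypothetical, and the library's own handling (ordering, hashing into buckets, multi-byte characters) is not reproduced here.

```python
def char_ngrams(word, minn, maxn):
    """List the character n-grams of `word` with lengths minn..maxn."""
    # Boundary markers distinguish prefixes and suffixes from inner n-grams
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Widening the range between `minn` and `maxn` increases the number of n-grams per word, which is one reason the more advanced configuration above produces a larger model.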
Preprocessing the training data can take a significant portion of the training time. If you want to experiment with different parameter settings, you can reuse the preprocessed training data with the `--cached` option (see Reusing preprocessed training data). Only the `analyzer` and `vocab` settings affect preprocessing, so you can use `--cached` as long as you haven't changed these parameters.
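The rule for when `--cached` is safe can be expressed as a simple check. The following is a hypothetical helper for your own experiment scripts, not something Annif provides; it compares only the two settings named above.

```python
# Settings that, per the text above, influence preprocessing
PREPROCESSING_SETTINGS = ("analyzer", "vocab")


def can_reuse_cached_data(previous, current):
    """Return True if no preprocessing-related setting has changed."""
    return all(
        previous.get(key) == current.get(key)
        for key in PREPROCESSING_SETTINGS
    )


previous = {"analyzer": "snowball(english)", "vocab": "yso", "dim": "100"}
tuned = dict(previous, dim="200", epoch="10")
print(can_reuse_cached_data(previous, tuned))  # True
```

Changing `dim` or `epoch` leaves the check true, whereas switching the analyzer or the vocabulary would require training without `--cached`.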
Load a vocabulary:
annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
Train the model:
annif train fasttext-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz
Retrain the model, reusing already preprocessed training data:
annif train fasttext-en --cached
Test the model with a single document:
cat document.txt | annif suggest fasttext-en
Evaluate a directory full of files in fulltext document corpus format:
annif eval fasttext-en /path/to/documents/