Release 0.4

@alanakbik alanakbik released this 19 Dec 19:42
· 5682 commits to master since this release
5b72a44

Release 0.4 with new models, lots of new languages, experimental multilingual models, hyperparameter selection methods, BERT and ELMo embeddings, etc.

New Features

Support for new languages

Flair embeddings

We now include new language models for:

These are in addition to English and German. For instance, you can load FlairEmbeddings for Dutch with:

flair_embeddings = FlairEmbeddings('dutch-forward')

Word Embeddings

We now include pre-trained FastText Embeddings for 30 languages: English, German, Dutch, Italian, French, Spanish, Swedish, Danish, Norwegian, Czech, Polish, Finnish, Bulgarian, Portuguese, Slovenian, Slovakian, Romanian, Serbian, Croatian, Catalan, Russian, Hindi, Arabic, Chinese, Japanese, Korean, Hebrew, Turkish, Persian, Indonesian.

Each language has embeddings trained over Wikipedia or web crawls. Instantiate them with:

# German embeddings computed over Wikipedia
german_wikipedia_embeddings = WordEmbeddings('de-wiki')

# German embeddings computed over web crawls
german_crawl_embeddings = WordEmbeddings('de-crawl')

Named Entity Recognition

Thanks to the Flair community, we now include NER models for:

These are in addition to the previous models for English and German.

Part-of-Speech Tagging

Thanks to the Flair community, we now include PoS models for:

Multilingual models

As a major new feature, we now include models that can tag text in various languages.

12-language Part-of-Speech Tagging

We include a PoS model trained over 12 different languages (English, German, Dutch, Italian, French, Spanish, Portuguese, Swedish, Norwegian, Danish, Finnish, Polish, Czech).

from flair.data import Sentence
from flair.models import SequenceTagger

# load model
tagger = SequenceTagger.load('pos-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort kaufte er einen Hut .')

# predict PoS tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())

4-language Named Entity Recognition

We include a NER model trained over 4 different languages (English, German, Dutch, Spanish).

# load model
tagger = SequenceTagger.load('ner-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort traf er Thomas Jefferson .')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())

This model also works to some extent on other languages, such as French.

Pre-trained classification models (issue 70)

Flair now also includes two pre-trained classification models:

  • de-offensive-language: detecting offensive language in German text (GermEval 2018 Task 1)
  • en-sentiment: detecting positive and negative sentiment in English text (IMDB)

Simply load the TextClassifier using the preferred model, such as

TextClassifier.load('en-sentiment')

BERT and ELMo embeddings

We added both BERT and ELMo embeddings so you can try them out, and mix and match them with Flair embeddings or any other embedding types. We hope this will enable the research community to better compare and combine approaches.

BERT Embeddings (issue 251)

We added BERT embeddings to Flair. We are using the implementation of huggingface. The embeddings can be used as any other embedding type in Flair:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings
# init embedding
embedding = BertEmbeddings()
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)

ELMo Embeddings (issue 260)

Flair now also includes ELMo embeddings. We use the implementation of AllenNLP. As this implementation comes with a lot of sub-dependencies, you need to first install the library via pip install allennlp before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:

from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings
# init embedding
embedding = ELMoEmbeddings()
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)

Multi-Dataset Training (issue 232)

You can now train a model on multiple datasets with the MultiCorpus object. We use this to train our multilingual models.

Just create multiple corpora and put them into MultiCorpus:

from flair.data import MultiCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

english_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
german_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_GERMAN)
dutch_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_DUTCH)

multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])

The multi_corpus can now be used for training, just as any other corpus before. Check the tutorial for more details.

Parameter Selection using Hyperopt (issue 242)

We built a wrapper around hyperopt to allow you to search for the best hyperparameters for your downstream task.

Define your search space and start training using several different parameter settings. The results are written to a specific file called param_selection.txt in the result directory. Check the tutorial for more details.

NLP Dataset Downloader (issue 243)

To make it as easy as possible to start training models, we have a new feature for automatically downloading publicly available NLP datasets. For instance, by running this code:

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

you download the Universal Dependencies corpus for English and can immediately start training models. The list of available datasets can be found in the tutorial.

Model training features

We added various other features to model training.

Saving training log (issue 212)

The training log output will from now on be automatically saved in the result directory you provide for training.
The log will be saved in training.log.
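
As a rough illustration of the idea, the sketch below uses Python's standard logging module to mirror log output into a training.log file inside a result directory. It is not Flair's own code: the helper name add_file_log, the logger name, and the logged values are all made up for this example.

```python
import logging
import tempfile
from pathlib import Path


def add_file_log(logger: logging.Logger, result_dir: Path) -> logging.FileHandler:
    """Hypothetical helper: also write the logger's output to result_dir/training.log."""
    result_dir.mkdir(parents=True, exist_ok=True)
    handler = logging.FileHandler(result_dir / "training.log", mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return handler


log = logging.getLogger("training_demo")
log.setLevel(logging.INFO)

# A temporary directory stands in for the result directory passed to training.
result_dir = Path(tempfile.mkdtemp()) / "results"
handler = add_file_log(log, result_dir)

log.info("epoch 1 - loss 0.5321")  # made-up values for illustration
log.info("epoch 2 - loss 0.4710")

# Detach and close the handler so the file is flushed to disk.
log.removeHandler(handler)
handler.close()

print((result_dir / "training.log").read_text(encoding="utf-8"))
```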

Resuming training (issue 217)

It is now possible to stop training at any point in time and to resume it later by training with checkpoint set to True. Check the tutorial for more details.
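
The general checkpoint-and-resume pattern can be sketched in plain Python as follows. This is a toy loop under stated assumptions, not Flair's mechanism: the file name checkpoint.json, the train function, and its stop_after parameter are hypothetical and exist only to simulate an interruption.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def train(result_dir: Path, max_epochs: int, stop_after: Optional[int] = None) -> dict:
    """Toy training loop that saves its state after every epoch.

    If a checkpoint file already exists in result_dir, training continues from it.
    """
    checkpoint_file = result_dir / "checkpoint.json"  # hypothetical file name
    if checkpoint_file.exists():
        state = json.loads(checkpoint_file.read_text())
    else:
        state = {"epoch": 0, "weight": 0.0}

    while state["epoch"] < max_epochs:
        state["weight"] += 1.0  # stand-in for one epoch of real training
        state["epoch"] += 1
        checkpoint_file.write_text(json.dumps(state))
        if stop_after is not None and state["epoch"] >= stop_after:
            break  # simulate an interruption
    return state


result_dir = Path(tempfile.mkdtemp())
interrupted = train(result_dir, max_epochs=5, stop_after=2)  # stops early
resumed = train(result_dir, max_epochs=5)  # picks up from the saved state
print(interrupted["epoch"], resumed["epoch"])
```

The second call does not repeat the first two epochs, because it starts from the state stored on disk.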

Custom Optimizers (issue 220)

You can now choose other optimizers besides SGD, i.e. any PyTorch optimizer, plus our own modified implementations of SGD and Adam, namely SGDW and AdamW.

Learning Rate Finder (issue 228)

We added a new helper method to assist you in finding a good learning rate for model training.
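
One common way to look for a learning rate is a range test: raise the learning rate exponentially over a number of steps, record the loss at each step, and inspect where the loss falls fastest before it diverges. The sketch below shows only that general idea on a toy quadratic loss. It is not Flair's implementation, and every name in it is hypothetical.

```python
from typing import List, Tuple


def exponential_lr_schedule(start_lr: float, end_lr: float, steps: int) -> List[float]:
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1 / (steps - 1))
    return [start_lr * ratio ** i for i in range(steps)]


def lr_range_test(start_lr: float = 1e-4, end_lr: float = 10.0,
                  steps: int = 50) -> List[Tuple[float, float]]:
    """Take one gradient step per learning rate on the toy loss f(w) = w^2."""
    w = 5.0
    history = []
    for lr in exponential_lr_schedule(start_lr, end_lr, steps):
        grad = 2 * w        # derivative of w^2
        w = w - lr * grad   # plain gradient descent step
        history.append((lr, w * w))
        if w * w > 1e6:     # stop once the loss clearly diverges
            break
    return history


schedule = exponential_lr_schedule(1e-4, 10.0, 50)
history = lr_range_test()

# A real finder would inspect the whole loss curve; here we only report the minimum.
best_lr, best_loss = min(history, key=lambda pair: pair[1])
print(f"lowest loss {best_loss:.4g} reached at lr={best_lr:.4g}")
```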

Breaking Changes

This release introduces breaking changes. The most important are:

Unified Model Trainer (issue 189)

Instead of maintaining two separate trainer classes for sequence labeling and text classification, we now have one model training class, namely ModelTrainer. This replaces the earlier classes SequenceTaggerTrainer and TextClassifierTrainer.

Downstream task models now implement the new flair.nn.Model interface. So, both the SequenceTagger and TextClassifier now inherit from flair.nn.Model. This allows both models to be trained with the ModelTrainer, like this:

# Training sequence tagger
tagger = SequenceTagger(512, embeddings, tag_dictionary, 'ner')
trainer = ModelTrainer(tagger, corpus)
trainer.train('results')

# Training text classifier
classifier = TextClassifier(document_embedding, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)
trainer.train('results')

The advantage is that all training parameters and training procedures are now the same for sequence labeling and text classification, which reduces redundancy and hopefully makes the code easier to understand.

Metric class

The metric class is now refactored to compute micro and macro averages for F1 and accuracy. There is also a new enum EvaluationMetric which you can pass to the ModelTrainer to tell it what to use for evaluation.
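
To make the difference between the two averages concrete, here is a small worked example using the standard definitions of micro- and macro-averaged F1. The per-class counts are made up, and the function names are hypothetical; Flair's Metric class may differ in its details.

```python
from typing import Dict, Tuple

# Per-class counts as (true positives, false positives, false negatives).
# The numbers are made up for illustration.
counts: Dict[str, Tuple[int, int, int]] = {
    "PER": (90, 10, 10),
    "LOC": (10, 10, 30),
}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_f1(class_counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Pool the counts of all classes first, then compute a single F1."""
    tp = sum(c[0] for c in class_counts.values())
    fp = sum(c[1] for c in class_counts.values())
    fn = sum(c[2] for c in class_counts.values())
    return f1(tp, fp, fn)


def macro_f1(class_counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Compute F1 per class, then take the unweighted mean."""
    return sum(f1(*c) for c in class_counts.values()) / len(class_counts)


print(f"micro F1: {micro_f1(counts):.4f}")
print(f"macro F1: {macro_f1(counts):.4f}")
```

The micro average is dominated by the frequent class, while the macro average weights both classes equally, so the weak LOC class pulls it down further.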

Updates and Bug Fixes

Torch 1.0 (issue 176)

Flair now builds on torch 1.0.

Use Pathlib (issue 176)

Flair now uses Path wherever possible to allow easier operations on files/directories. However, our interfaces still allow you to pass a string, which will then be transformed into a Path by Flair.
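
The string-or-Path convention can be sketched like this. It is a minimal illustration of the pattern, not Flair's actual code, and the helper name resolve_result_dir is hypothetical.

```python
from pathlib import Path
from typing import Union


def resolve_result_dir(base_path: Union[str, Path]) -> Path:
    """Hypothetical helper: accept a str or a Path and always return a Path."""
    if isinstance(base_path, str):
        base_path = Path(base_path)
    return base_path


# Both call styles end up with the same Path, so callers can pass either type.
log_file = resolve_result_dir("results") / "training.log"
same_log_file = resolve_result_dir(Path("results")) / "training.log"
print(log_file == same_log_file)
```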

Bug Fixes

  • Fix: Non-whitespaced tokenized text results in an infinite loop (issue 226)
  • Fix: Getting IndexError: list index out of range error (issue 233)
  • Do not reset cache directory always to None (issue 249)
  • Filter sentences with zero tokens (issue 266)