Merge pull request #3137 from flairNLP/bioner-tutorial
Update HunFlair tutorial to Flair 0.12
alanakbik authored Mar 10, 2023
2 parents 0b94c7f + 49b6488 commit 8b3568e
Showing 2 changed files with 44 additions and 26 deletions.
55 changes: 37 additions & 18 deletions resources/docs/HUNFLAIR.md
@@ -23,44 +23,63 @@ Then, in your favorite virtual environment, simply do:
```
pip install flair
```
-Furthermore, we recommend to install [SciSpaCy](https://allenai.github.io/scispacy/) for improved pre-processing
-and tokenization of scientific / biomedical texts:
-```
-pip install scispacy==0.2.5
-pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
-```

-#### Example Usage
+#### Example 1: Biomedical NER
Let's run named entity recognition (NER) over an example sentence. All you need to do is
make a Sentence, load a pre-trained model and use it to predict tags for the sentence:
```python
from flair.data import Sentence
-from flair.models import MultiTagger
-from flair.tokenization import SciSpacyTokenizer
+from flair.nn import Classifier

-# make a sentence and tokenize with SciSpaCy
-sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
-                    use_tokenizer=SciSpacyTokenizer())
+# make a sentence
+sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# load biomedical tagger
-tagger = MultiTagger.load("hunflair")
+tagger = Classifier.load("hunflair")

# tag sentence
tagger.predict(sentence)
```
Done! The Sentence now has entity annotations. Let's print the entities found by the tagger:
```python
-for annotation_layer in sentence.annotation_layers.keys():
-    for entity in sentence.get_spans(annotation_layer):
-        print(entity)
+for entity in sentence.get_labels():
+    print(entity)
```
This should print:
-~~~
+```console
Span[0:2]: "Behavioral abnormalities" → Disease (0.6736)
Span[9:12]: "Fragile X Syndrome" → Disease (0.99)
Span[4:5]: "Fmr1" → Gene (0.838)
Span[6:7]: "Mouse" → Species (0.9979)
-~~~
+```
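
For downstream processing it is often handier to group the predicted entities by type than to print them one by one. The following is a minimal sketch, not part of the tutorial itself. It assumes that each label returned by `sentence.get_labels()` exposes `value`, `score`, and `data_point.text`; the helper name `group_entities_by_type` and the stand-in labels are hypothetical and only mirror the printed output above, so the snippet runs without Flair installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities_by_type(labels):
    """Map each entity type to a list of (text, score) pairs."""
    grouped = defaultdict(list)
    for label in labels:
        # Assumed attributes: label.value (entity type), label.score (confidence),
        # label.data_point.text (surface text of the tagged span).
        grouped[label.value].append((label.data_point.text, label.score))
    return dict(grouped)


def make_label(text, value, score):
    """Build a stand-in label object (hypothetical, for demonstration only)."""
    return SimpleNamespace(value=value, score=score, data_point=SimpleNamespace(text=text))


# Stand-in labels that mirror the printed output above.
demo_labels = [
    make_label("Behavioral abnormalities", "Disease", 0.6736),
    make_label("Fragile X Syndrome", "Disease", 0.99),
    make_label("Fmr1", "Gene", 0.838),
    make_label("Mouse", "Species", 0.9979),
]

entities_by_type = group_entities_by_type(demo_labels)
for entity_type, mentions in entities_by_type.items():
    print(entity_type, mentions)
```

With a real tagged sentence, the call would be `group_entities_by_type(sentence.get_labels())`, provided the attribute names assumed above hold for your Flair version.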


+#### Example 2: Biomedical NER with Better Tokenization

+Scientific texts are difficult to tokenize. For this reason, we recommend installing [SciSpaCy](https://allenai.github.io/scispacy/) for improved pre-processing and tokenization of scientific / biomedical texts:
+```
+pip install scispacy==0.2.5
+pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
+```

+Use this code to apply scientific tokenization:

+```python
+from flair.data import Sentence
+from flair.nn import Classifier
+from flair.tokenization import SciSpacyTokenizer

+# make a sentence and tokenize with SciSpaCy
+sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
+                    use_tokenizer=SciSpacyTokenizer())

+# load biomedical tagger
+tagger = Classifier.load("hunflair")

+# tag sentence
+tagger.predict(sentence)
+```
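
To get a feel for why tokenization matters for entity boundaries, here is a small illustration in plain Python. It is only a sketch: the example sentence is made up, and the regular expression is a crude stand-in that says nothing about how SciSpaCy actually tokenizes. It merely shows that splitting on whitespace leaves brackets and full stops glued to the words an entity tagger needs to see in isolation.

```python
import re

# Hypothetical example sentence with an abbreviation in parentheses.
text = "Fragile X syndrome (FXS) is caused by a mutation in the FMR1 gene."

# Naive approach: split on whitespace only.
whitespace_tokens = text.split()

# Crude stand-in for a punctuation-aware tokenizer:
# runs of word characters, or single punctuation characters.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

In the whitespace split, the abbreviation appears as `(FXS)` and the last word as `gene.`, whereas the punctuation-aware split yields `FXS` and `gene` as separate tokens.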


## Comparison to other biomedical NER tools
Tools for biomedical NER are typically trained and evaluated on rather small gold standard data sets.
15 changes: 7 additions & 8 deletions resources/docs/HUNFLAIR_TUTORIAL_1_TAGGING.md
@@ -7,9 +7,9 @@ Let's use the pre-trained *HunFlair* model for biomedical named entity recogniti
This model was trained on 24 biomedical NER data sets and can recognize five entity types:
cell lines, chemicals, diseases, genes / proteins, and species.
```python
-from flair.models import MultiTagger
+from flair.nn import Classifier

-tagger = MultiTagger.load("hunflair")
+tagger = Classifier.load("hunflair")
```
All you need to do is use the predict() method of the tagger on a sentence.
This will add predicted tags to the tokens in the sentence.
@@ -23,7 +23,7 @@ sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fra
tagger.predict(sentence)

# print sentence with predicted tags
-print(sentence.to_tagged_string())
+print(sentence)
```
This should print:
~~~
@@ -40,7 +40,7 @@ Often named entities consist of multiple words spanning a certain text span in t
"_Behavioral Abnormalities_" or "_Fragile X Syndrome_" in our example sentence.
You can directly get such spans in a tagged sentence like this:
```python
-for disease in sentence.get_spans("hunflair-disease"):
+for disease in sentence.get_labels("hunflair-disease"):
    print(disease)
```
This should print:
@@ -71,9 +71,8 @@ You can retrieve all annotated entities of the other entity types in analogous w
for cell lines, `hunflair-chemical` for chemicals, `hunflair-gene` for genes and proteins, and `hunflair-species`
for species. To get all entities at once, you can run:
```python
-for annotation_layer in sentence.annotation_layers.keys():
-    for entity in sentence.get_spans(annotation_layer):
-        print(entity)
+for entity in sentence.get_labels():
+    print(entity)
```
This should print:
~~~
@@ -117,7 +116,7 @@ abstract = "Fragile X syndrome (FXS) is a developmental disorder caused by a mut
To work with complete abstracts or full-text, we first have to split them into separate sentences.
Again we can apply the integration of the [SciSpaCy](https://allenai.github.io/scispacy/) library:
```python
-from flair.tokenization import SciSpacySentenceSplitter
+from flair.splitter import SciSpacySentenceSplitter

# initialize the sentence splitter
splitter = SciSpacySentenceSplitter()