This repository contains code for solving NER with self-supervised learning (SSL) alone, avoiding supervised learning.
- Post describing the second iteration of this method
- Test repository link used to test this approach
- Medium post describing the previous iteration of this method
- To identify noun phrase spans, Dat Quoc Nguyen's POS tagger/Dependency parser is used.
If the use case is to automatically detect all noun phrase spans in a sentence, then the POS tagger needs to be installed. If we only require specific phrases of interest in a sentence to be tagged (e.g. colorectal cancer above), then the POS tagger install is not required. In the first use case, 7 microservices are started (the POS tagger is made up of two microservices). In the second use case, 5 microservices are started.
Run ./setup.sh
This will install and load all 5 microservices. When done (assuming all goes well), it should display the output of a test query.
(this can be skipped if we only require specific phrases to be tagged)
Install POS service using this link
Make sure to run both services in the install instructions
Note: the POS service requires a Python 2.7 environment.
July 2022
- Added the generation of the bootstrap file. These component files can be edited to improve the bootstrap list. Every time the bootstrap list is updated, we need to run the clustering run.sh (and choose option 6) in bert_vector_clustering to both magnify this list and generate entity signatures for each vocabulary term for use in NER. A labeled set of entity files with instructions is present here
17 Jan 2022
- Ensemble service of NER with two models tested on 11 NER benchmarks as described in this post.
17 Sept 2021
- This can now be run as a service. See run_servers.sh
- Simple Ensembling service added for combining results of multiple NER servers
- If the install runs into issues, we can start the services independently to isolate the problem.
- First install descriptors service. Confirm it works. Then install NER service. Do this for both models (bio and phi). Then test ensemble service. Ensemble is in the subdirectory ensemble in the NER service.
- Test sets to test the output of NER against 11 benchmarks are in this repository.
- This repository can be used as a metric to test a pretrained model trained from scratch. We can give the model an F1-score just as we do for a fine-tuned model. To do this, we need to convert the human labels file (e.g. bootstrap_entities.txt) into magnified entity vectors using this repository. Just invoke run.sh and use the subword neighbor clustering option. If we want to pick the initial terms to label (that is, to create bootstrap_entities.txt itself), run the same tool, but choose the generate cluster option and adaptive clustering. This will yield about 4k cluster pivots. We can start labeling them and then create entity vectors. The entity vectors (e.g. labels.txt) can then be used with the descriptor service to test the model. If we are creating new entity types, then the entity map file needs to be updated accordingly to map subtypes to types, or to add new types.
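To make the F1-score mentioned above concrete, here is a minimal sketch of a token-level F1 computation. It is an illustration only, not the scoring code used by this repository or by the benchmark test sets, which may score at the span level instead. The function name `f1_score`, the `"O"` label for untagged tokens, and the `DISEASE`/`DRUG` labels are assumptions made for this example.

```python
def f1_score(gold, predicted, ignore="O"):
    """Token-level F1 over two equal-length label lists.

    `ignore` is the label for untagged tokens ("O" is an assumed
    convention here, not something this repository specifies).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # A true positive is a tagged token whose predicted label matches the gold label.
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == p and g != ignore)
    predicted_pos = sum(1 for p in predicted if p != ignore)
    gold_pos = sum(1 for g in gold if g != ignore)

    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only.
gold = ["DISEASE", "O", "DRUG", "O"]
predicted = ["DISEASE", "DRUG", "O", "O"]
print(f1_score(gold, predicted))
```

In this example one of the two predicted entity tokens is correct and one of the two gold entity tokens is found, so precision and recall are both 0.5.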
The unsupervised NER tool can be used in three ways.
- To tag canned sentences (option 1)
- $ python3 main_ner.py 1
- To tag custom sentences present in a file (option 2)
- $ python3 main_ner.py 2 sample_test.txt
- To tag single entities in custom sentences present in a file (option 3), where the single entity is specified in a sentence in the format name:__ entity __ . Concrete example: Cats and Dogs:__ entity __ are pets (here Dogs is the term to be tagged). Single or multiple words/phrases within a sentence can also be tagged. Example: Her hypophysitis:__ entity __ secondary to ipilimumab:__ entity __ was well managed with supplemental:__ entity __ hormones:__ entity __
- $ python main_NER.py 3 single_entity_test.txt
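When preparing an input file for option 3, the entity markers can be added programmatically rather than by hand. The sketch below is an illustration under stated assumptions, not part of this repository: the helper names `mark_terms` and `marked_terms` are hypothetical, and the marker string is copied from the format shown above, so adjust its spacing if the tool expects a different variant. It handles single-word terms; how a multi-word phrase should be marked is not specified here.

```python
import re

# Assumed marker string, copied from the format described above.
MARKER = ":__ entity __"


def mark_terms(sentence, terms):
    """Append MARKER to each whole-word occurrence of the given terms."""
    for term in terms:
        # Skip occurrences that already carry a marker.
        pattern = r"\b" + re.escape(term) + r"\b(?!:__)"
        sentence = re.sub(pattern, lambda m: m.group(0) + MARKER, sentence)
    return sentence


def marked_terms(sentence):
    """Return the words directly preceding a marker, tolerating spacing variants."""
    return re.findall(r"(\S+?):__\s*entity\s*__", sentence)


line = mark_terms("Cats and Dogs are pets", ["Dogs"])
print(line)
print(marked_terms(line))
```

Each marked sentence can then be written on its own line of a file such as single_entity_test.txt and passed to option 3.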
This repository is covered by the MIT license.
The POS tagger/Dep parser that this service depends on is covered by a GPL license.