Update example notebooks #267

Merged 1 commit on Nov 3, 2023
2 changes: 1 addition & 1 deletion examples/load_use_ner_model.ipynb
@@ -7,7 +7,7 @@
"source": [
"# Loading and using a NER model\n",
"\n",
"This notebook shows how to load an existing named entity recognition (NER) model from the HuggingFace hub.\n",
"This notebook shows how to load an existing named entity recognition (NER) model from the HuggingFace hub, using T-Res.\n",
"\n",
"We start by importing some libraries, and the `recogniser` script from the `geoparser` folder:"
]
78 changes: 71 additions & 7 deletions examples/run_pipeline_basic.ipynb
@@ -28,7 +28,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the `pipeline` script has been imported (in the previous cell), we create a new object of the `Pipeline` class. Since we don't pass any parameters, it will take all the default values: it will detect toponyms using the fine-grained tagset, it will find candidates using the perfect match approach, and will disambiguate them using the most popular approach. You can see the default `Pipeline` values [here](https://github.com/Living-with-machines/toponym-resolution/blob/main/geoparser/pipeline.py)."
"Once the `pipeline` script has been imported (in the previous cell), we create a new object of the `Pipeline` class. Since we don't pass any parameters, it will take all the default values: it will detect toponyms using the `Livingwithmachines/toponym-19thC-en` NER model, find candidates using the perfect match approach, and disambiguate them using the most popular approach. You can see the default `Pipeline` values [here](https://living-with-machines.github.io/T-Res/reference/geoparser/pipeline.html)."
]
},
{
@@ -40,6 +40,13 @@
"geoparser = pipeline.Pipeline()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the pipeline: end-to-end"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -54,10 +61,8 @@
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"print(resolved)"
]
},
{
@@ -67,8 +72,67 @@
"outputs": [],
"source": [
"resolved = geoparser.run_sentence(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"for r in resolved:\n",
" print(r)"
"print(resolved)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the pipeline: step-wise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline can also be run step by step instead of end to end.\n",
"\n",
"For example, it can perform only toponym recognition (i.e. NER):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mentions = geoparser.run_text_recognition(\"A remarkable case of rattening has just occurred in the building trade at Sheffield.\")\n",
"print(mentions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline can then be used to just perform candidate selection given the output of NER:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"candidates = geoparser.run_candidate_selection(mentions)\n",
"print(candidates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally, the pipeline can be used to perform entity disambiguation, given the output from the previous two steps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"disamb_output = geoparser.run_disambiguation(mentions, candidates)\n",
"print(disamb_output)"
]
}
],
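The step-wise cells above chain three calls, each consuming the output of the previous one. The data flow can be sketched with a toy stand-in. This is not the T-Res implementation: only the three method names are taken from the notebook, while the `ToyPipeline` class, its gazetteer, the placeholder IDs, and the return shapes are assumptions made up for illustration.

```python
# Toy stand-in for the step-wise flow; NOT the T-Res implementation.
# Only the method names mirror the notebook cells. The gazetteer, the
# placeholder IDs, the scores, and the return shapes are invented.


class ToyPipeline:
    def __init__(self, gazetteer):
        # gazetteer: place name -> list of (candidate_id, popularity) pairs
        self.gazetteer = gazetteer

    def run_text_recognition(self, text):
        """Step 1: find mentions by naive substring lookup."""
        mentions = []
        for name in self.gazetteer:
            start = text.find(name)
            if start != -1:
                mentions.append(
                    {"mention": name, "start": start, "end": start + len(name)}
                )
        return mentions

    def run_candidate_selection(self, mentions):
        """Step 2: map each mention to its candidate (id, popularity) pairs."""
        return {m["mention"]: self.gazetteer.get(m["mention"], []) for m in mentions}

    def run_disambiguation(self, mentions, candidates):
        """Step 3: keep the most popular candidate for each mention."""
        resolved = []
        for m in mentions:
            cands = candidates.get(m["mention"], [])
            best = max(cands, key=lambda c: c[1])[0] if cands else None
            resolved.append({**m, "prediction": best})
        return resolved


toy = ToyPipeline({"Sheffield": [("PLACE_A", 0.9), ("PLACE_B", 0.1)]})
text = "A remarkable case of rattening has just occurred in the building trade at Sheffield."

mentions = toy.run_text_recognition(text)             # step 1
candidates = toy.run_candidate_selection(mentions)    # step 2, needs step 1
resolved = toy.run_disambiguation(mentions, candidates)  # step 3, needs both
print(resolved)
```

The point of the sketch is only the dependency order: disambiguation needs both the mentions and the candidates, which is why the notebook keeps both intermediate variables.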
22 changes: 2 additions & 20 deletions examples/run_pipeline_deezy_mostpopular.ipynb
@@ -32,8 +32,6 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
@@ -52,9 +50,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -72,9 +69,6 @@
"mylinker = linking.Linker(\n",
" method=\"mostpopular\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params=dict(),\n",
" overwrite_training=False,\n",
")"
]
},
@@ -87,18 +81,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
20 changes: 3 additions & 17 deletions examples/run_pipeline_deezy_reldisamb+wmtops.ipynb
@@ -35,18 +35,7 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
" \"top_threshold\": 85,\n",
" \"min_len\": 5,\n",
" \"max_len\": 15,\n",
" \"w2v_ocr_path\": str(Path(\"../resources/models/w2v/\").resolve()),\n",
" \"w2v_ocr_model\": \"w2v_*_news\",\n",
" \"overwrite_dataset\": False,\n",
" },\n",
" strvar_parameters=dict(),\n",
" deezy_parameters={\n",
" # Paths and filenames of DeezyMatch models and data:\n",
" \"dm_path\": str(Path(\"../resources/deezymatch/\").resolve()),\n",
@@ -55,9 +44,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +65,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": False,\n",
" \"without_microtoponyms\": True,\n",
25 changes: 2 additions & 23 deletions examples/run_pipeline_deezy_reldisamb+wpubl+wmtops.ipynb
@@ -35,8 +35,6 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
@@ -55,9 +53,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +74,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": True,\n",
" \"without_microtoponyms\": True,\n",
@@ -103,22 +98,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\n",
" \"A remarkable case of rattening has just occurred in the building trade next to the Market-street of Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\", \n",
" place=\"Manchester\", \n",
" place_wqid=\"Q18125\"\n",
")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
27 changes: 3 additions & 24 deletions examples/run_pipeline_deezy_reldisamb+wpubl.ipynb
@@ -35,18 +35,7 @@
"myranker = ranking.Ranker(\n",
" method=\"deezymatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
" strvar_parameters={\n",
" # Parameters to create the string pair dataset:\n",
" \"ocr_threshold\": 60,\n",
" \"top_threshold\": 85,\n",
" \"min_len\": 5,\n",
" \"max_len\": 15,\n",
" \"w2v_ocr_path\": str(Path(\"../resources/models/w2v/\").resolve()),\n",
" \"w2v_ocr_model\": \"w2v_*_news\",\n",
" \"overwrite_dataset\": False,\n",
" },\n",
" strvar_parameters=dict(),\n",
" deezy_parameters={\n",
" # Paths and filenames of DeezyMatch models and data:\n",
" \"dm_path\": str(Path(\"../resources/deezymatch/\").resolve()),\n",
@@ -55,9 +44,8 @@
" \"dm_output\": \"deezymatch_on_the_fly\",\n",
" # Ranking measures:\n",
" \"ranking_metric\": \"faiss\",\n",
" \"selection_threshold\": 25,\n",
" \"num_candidates\": 3,\n",
" \"search_size\": 3,\n",
" \"selection_threshold\": 50,\n",
" \"num_candidates\": 1,\n",
" \"verbose\": False,\n",
" # DeezyMatch training:\n",
" \"overwrite_training\": False,\n",
@@ -77,12 +65,10 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
" \"training_split\": \"originalsplit\",\n",
" \"context_length\": 100,\n",
" \"db_embeddings\": cursor,\n",
" \"with_publication\": True,\n",
" \"without_microtoponyms\": False,\n",
@@ -133,13 +119,6 @@
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
10 changes: 9 additions & 1 deletion examples/run_pipeline_modular.ipynb
@@ -64,7 +64,6 @@
" mylinker = linking.Linker(\n",
" method=\"reldisamb\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" rel_params={\n",
" \"model_path\": \"../resources/models/disambiguation/\",\n",
" \"data_path\": \"../experiments/outputs/data/lwm/\",\n",
@@ -127,6 +126,15 @@
"source": [
"output_disamb = geoparser.run_disambiguation(output, cands)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output_disamb"
]
}
],
"metadata": {
16 changes: 0 additions & 16 deletions examples/run_pipeline_perfect_mostpopular.ipynb
@@ -30,8 +30,6 @@
"myranker = ranking.Ranker(\n",
" method=\"perfectmatch\",\n",
" resources_path=\"../resources/wikidata/\",\n",
" mentions_to_wikidata=dict(),\n",
" wikidata_to_mentions=dict(),\n",
")\n"
]
},
@@ -44,8 +42,6 @@
"mylinker = linking.Linker(\n",
" method=\"mostpopular\",\n",
" resources_path=\"../resources/\",\n",
" linking_resources=dict(),\n",
" overwrite_training=False,\n",
")"
]
},
@@ -58,18 +54,6 @@
"geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"resolved = geoparser.run_text(\"A remarkable case of rattening has just occurred in the building trade at Shefrield, but also in Lancaster. Not in Nottingham though. Not in Ashton either, nor in Salop!\")\n",
" \n",
"for r in resolved:\n",
" print(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
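Several hunks above delete the same names from the notebook cells (`mentions_to_wikidata`, `wikidata_to_mentions`, `linking_resources`, `search_size`, `context_length`). A minimal sketch for spotting leftover uses in other notebooks follows, assuming the standard notebook JSON layout (`cells`, `cell_type`, `source`). The helper `find_dropped_names` and the name list are hypothetical, and the diff does not show whether the library still accepts these names, so the list should be adjusted as needed.

```python
import json

# Names this diff deletes from the example notebooks. Whether the library
# still accepts them is not shown in the diff, so this list is an assumption.
DROPPED_NAMES = (
    "mentions_to_wikidata",
    "wikidata_to_mentions",
    "linking_resources",
    "search_size",
    "context_length",
)


def find_dropped_names(notebook, names=DROPPED_NAMES):
    """Return (cell_index, name) pairs for code cells that mention a dropped name."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # prose in markdown cells is ignored
        source = cell.get("source", "")
        # Notebook JSON stores source as either one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        hits.extend((index, name) for name in names if name in source)
    return hits


# Tiny in-memory notebook standing in for a file read with json.load().
example = json.loads("""
{
  "cells": [
    {"cell_type": "markdown", "source": ["linking_resources is mentioned in prose"]},
    {"cell_type": "code", "source": ["myranker = ranking.Ranker(\\n", "    mentions_to_wikidata=dict(),\\n", ")"]}
  ]
}
""")
print(find_dropped_names(example))
```

For a real file, the same function can be applied to the result of `json.load()` on an opened `.ipynb` file. The check is a plain substring match, so it will also flag names that appear only in comments inside code cells.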