Given the following code snippet:
    import json
    from trankit import Pipeline

    p = Pipeline('auto', embedding='xlm-roberta-large')
    doc = '''Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.'''
    tokens = p(doc, is_sent=True)
    print(json.dumps(tokens, indent=2, ensure_ascii=False))
For some reason, I get `#` in my `lemma`, as seen in this sample `doc`:
    {
      "text": "Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.",
      "tokens": [
        {
          "id": 1,
          "text": "Naton",
          "upos": "PROPN",
          "xpos": "N",
          "feats": "Case=Gen|Number=Sing",
          "head": 2,
          "deprel": "nmod:poss",
          "span": [0, 5],
          "lemma": "Nato"
        },
        {
          "id": 2,
          "text": "päämajassa",
          "upos": "NOUN",
          "xpos": "N",
          "feats": "Case=Ine|Number=Sing",
          "head": 4,
          "deprel": "obl",
          "span": [6, 16],
          "lemma": "pää#maja"    <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
        },
        {
          "id": 3,
          "text": "Brysselissä",
          "upos": "PROPN",
          "xpos": "N",
          "feats": "Case=Ine|Number=Sing",
          "head": 2,
          "deprel": "appos",
          "span": [17, 28],
          "lemma": "Bryssel"
        },
        {
          "id": 4,
          "text": "järjestettiin",
          "upos": "VERB",
          "xpos": "V",
          "feats": "Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Pass",
          "head": 0,
          "deprel": "root",
          "span": [29, 42],
          "lemma": "järjestää"
        },
        {
          "id": 5,
          "text": "iltapäivällä",
          "upos": "NOUN",
          "xpos": "N",
          "feats": "Case=Ade|Number=Sing",
          "head": 4,
          "deprel": "obl",
          "span": [43, 55],
          "lemma": "ilta#päivä"    <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
        },
        {
          "id": 6,
          "text": "Suomen",
          "upos": "PROPN",
          "xpos": "N",
          "feats": "Case=Gen|Number=Sing",
          "head": 8,
          "deprel": "nmod:poss",
          "span": [56, 62],
          "lemma": "Suomi"
        },
        {
          "id": 7,
          "text": "virallinen",
          "upos": "ADJ",
          "xpos": "A",
          "feats": "Case=Nom|Degree=Pos|Derivation=Llinen|Number=Sing",
          "head": 8,
          "deprel": "amod",
          "span": [63, 73],
          "lemma": "virallinen"
        },
        {
          "id": 8,
          "text": "liittymisseremonia",
          "upos": "NOUN",
          "xpos": "N",
          "feats": "Case=Nom|Number=Sing",
          "head": 4,
          "deprel": "obj",
          "span": [74, 92],
          "lemma": "liittyä#seremoni"    <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
        },
        {
          "id": 9,
          "text": ".",
          "upos": "PUNCT",
          "xpos": "Punct",
          "head": 4,
          "deprel": "punct",
          "span": [92, 93],
          "lemma": "."
        }
      ],
      "lang": "finnish"
    }
I tried it both in Colab and in a terminal, but got the same results!
What am I doing wrong?
PS: I do not get the same error on the demo website.
Cheers,
This is not an error: the component words of compound words (Finnish: yhdyssana) are separated by the '#' sign by design.
But this only occurs when the standard package (TDT) is used; FTB does not lead to the same issue.
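If the '#' markers are unwanted downstream, one option is to strip them from the lemmas after the pipeline has run. The following is only a minimal sketch, assuming the output is a dict shaped like the sample in the question (a "tokens" list whose items may carry a "lemma" key). `strip_compound_markers` is a hypothetical helper written for illustration, not part of trankit.

```python
def strip_compound_markers(doc, marker="#"):
    """Return a copy of a sentence-level output dict with `marker`
    removed from every lemma. The input dict is left unchanged.

    Assumption: `doc` has the shape shown in the question, i.e. a
    "tokens" list of dicts, each optionally holding a "lemma" string.
    """
    cleaned = dict(doc)
    new_tokens = []
    for tok in doc.get("tokens", []):
        tok = dict(tok)  # shallow copy so the original token is untouched
        lemma = tok.get("lemma")
        if isinstance(lemma, str):
            # Keep the lemma as-is if stripping would leave it empty,
            # e.g. when the token itself is a literal '#'.
            tok["lemma"] = lemma.replace(marker, "") or lemma
        new_tokens.append(tok)
    cleaned["tokens"] = new_tokens
    return cleaned


# Small hand-written sample mirroring two tokens from the question.
sample = {
    "text": "Naton päämajassa",
    "tokens": [
        {"id": 1, "text": "Naton", "lemma": "Nato"},
        {"id": 2, "text": "päämajassa", "lemma": "pää#maja"},
    ],
    "lang": "finnish",
}

result = strip_compound_markers(sample)
print([t["lemma"] for t in result["tokens"]])
```

If the compound boundaries are useful rather than a nuisance, `lemma.split("#")` on the original value would give the component lemmas instead.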