POS tagging pipeline throws "RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0" #1401
Comments
Something very strange about this is that it wants to split
->
but that shouldn't be a crashing problem anyway, since it should be happy to tag a single word. So this is strange. I'll dig into it.
Okay, two problems here, both of which can & should be fixed. One is that the tokenizer doesn't know about the word The second is that the MWT expander is then processing the second piece, Both are easy to fix, actually. If you come across other words like
Well, it's not crashing now in the As you find more examples of bad tokenization, please let us know, and hopefully over time we can make some progress cleaning those up.
... point being that now it seems to think
Thank you for the quick fix! What I found out in the meantime is that when I split the string (To be honest, I have a very dim idea of the inner workings of Stanza.)
Since my corpus consists of subtitles, sentences often lack final punctuation :-(
I did a quick regex search in a frequency list from my corpus (after tokenization) and found that the following strings are split in the same way ('XXXlo' + 's') if I substitute 'invadirlos' from in
I have more potential culprits, but I haven't checked whether they result from bad tokenization of a verb + 'los'. Here's the whole list:
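For anyone who wants to run a similar check on their own corpus, the regex search for 'XXXlo' + 's' splits mentioned above could be sketched roughly like this. It is only an illustration: `find_lo_s_splits` and the sample token list are hypothetical, and it assumes the tokenizer output is already available as a flat list of token strings rather than a frequency list.

```python
import re

# Hypothetical helper, not part of Stanza: flags places where a word such as
# "invadirlos" appears to have been split into "...lo" + "s".
# Assumption: a suspicious left piece is at least one word character
# followed by "lo", so the bare pronoun "lo" is not flagged.
SPLIT_STEM = re.compile(r"\w+lo", re.IGNORECASE)


def find_lo_s_splits(tokens):
    """Return the rejoined word for every adjacent pair ('XXXlo', 's')."""
    suspects = []
    # Walk over adjacent token pairs; an empty list yields no pairs.
    for left, right in zip(tokens, tokens[1:]):
        if right == "s" and SPLIT_STEM.fullmatch(left):
            suspects.append(left + right)
    return suspects


if __name__ == "__main__":
    # Made-up token sequence for illustration only.
    sample = ["Quieren", "invadirlo", "s", "mañana", "y", "lo", "saben"]
    print(find_lo_s_splits(sample))
```

Counting the returned words (for example with `collections.Counter`) would give a frequency list of candidates to report.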
…okenizer. Addresses an issue where some words were being chopped up in Spanish because the tokenizer had never seen them and interpreted the last character as the end of the document #1401
…ve tokenization errors stanfordnlp/stanza#1401
I added the 11 you put at the top. I know some of the others you listed in the second section are also verbs, but between the additional tokenization hints and a method for teaching the tokenizer that the last character isn't necessarily punctuation, I hope it already covers the problem cases. If you find more, please don't hesitate to tell us.
… incorrectly tokenized in stanfordnlp/stanza#1401 This does not lower the quality of the Stanza tokenizer, and hopefully helps with its tokenization of words with these lemmas
I went ahead and added another 20 or so infinitives, as you can see in the linked git checkin. Between that and the training update, that hopefully addresses this issue, but as I mentioned, feel free to tell us when you come across more tokenization errors for Spanish.
How does this look overall now? If there are still issues, we can rebuild the models with more data as needed.
Describe the bug
When POS tagging a specific string in Spanish, a RuntimeError is reproducibly thrown for no apparent reason.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
No error thrown.
Environment (please complete the following information):
The error happens in PyTorch preprocessing (i.e. regardless of whether inference happens on GPU or CPU).
Additional context
I encountered the problem only in Spanish. I was parsing a longer text, but narrowed it down to this minimal example.