POS tagging pipeline throws "RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0" #1401

Open
adno opened this issue Jul 15, 2024 · 8 comments
@adno

adno commented Jul 15, 2024

Describe the bug
When POS tagging a specific string in Spanish, a RuntimeError is reproducibly thrown for no apparent reason.

To Reproduce
Steps to reproduce the behavior:

  1. Run the following Python script (minimal example):
import stanza
s = ' momento en la historia hacia adelante se ve que los lugareños aprendieron que los blancos los extranjeros en general no eran personas de fiar y que su llegada traía la muerte para sus seres queridos\n\nSe cree entonces que desde ese momento optaron por tener un nivel de violencia altísimo para mantener lejos a cualquier otra persona que intente volver a invadirlos'
pos_nlp = stanza.Pipeline(lang='es', processors='tokenize,mwt,pos')
pos_nlp(s)
  2. A RuntimeError is thrown. Complete error output:
2024-07-15 14:46:12 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 384kB [00:00, 14.5MB/s]                                                                                               
2024-07-15 14:46:12 INFO: Downloaded file to /Users/x/stanza_resources/resources.json
2024-07-15 14:46:12 INFO: Loading these models for language: es (Spanish):
===============================
| Processor | Package         |
-------------------------------
| tokenize  | combined        |
| mwt       | combined        |
| pos       | combined_charlm |
===============================

2024-07-15 14:46:12 INFO: Using device: cpu
2024-07-15 14:46:12 INFO: Loading: tokenize
2024-07-15 14:46:13 INFO: Loading: mwt
2024-07-15 14:46:13 INFO: Loading: pos
2024-07-15 14:46:13 INFO: Done loading processors!
Traceback (most recent call last):
  File "/Users/x/Projects/y/err.py", line 7, in <module>
    pos_nlp(s)
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
          ^^^^^^^^^^^^
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/stanza/pipeline/pos_processor.py", line 88, in process
    preds += self.trainer.predict(b)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/stanza/models/pos/trainer.py", line 98, in predict
    _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/stanza/models/pos/model.py", line 149, in forward
    word_emb = pack(word_emb)
               ^^^^^^^^^^^^^^
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/stanza/models/pos/model.py", line 144, in pack
    return pack_padded_sequence(x, sentlens, batch_first=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.conda/envs/y/lib/python3.11/site-packages/torch/nn/utils/rnn.py", line 264, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0

Expected behavior
No error thrown.

Environment (please complete the following information):

  • OS: Linux and MacOS (reproducible on both)
  • Python version: python 3.11.9 (hb806964_0_cpython conda-forge)
  • Stanza version: 1.8.2
  • torch: 2.3.1

The error happens in PyTorch preprocessing (i.e. regardless of whether inference runs on GPU or CPU).

Additional context
I encountered the problem only in Spanish. I was parsing a longer text, but narrowed it down to this minimal example.

adno added the bug label Jul 15, 2024
@AngledLuffa
Collaborator

Something very strange about this is that it wants to split s into its own sentence at the very end of the second sentence you're giving it:

Se cree entonces que desde ese momento optaron por tener un nivel de violencia altísimo para mantener lejos a cualquier otra persona que intente volver a invadirlos

->

Se cree entonces que desde ese momento optaron por tener un nivel de violencia altísimo para mantener lejos a cualquier otra persona que intente volver a invadirlo
s

but that shouldn't be a crashing problem anyway, since it should be happy to tag a single word. So this is strange. I'll dig into it
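For anyone who wants to look at the same split on their own input, here is a minimal sketch. The helper name `show_segmentation` is made up for illustration; it only assumes a document object with `sentences`, each having `tokens` with a `text` attribute, which is what a Stanza `Document` provides.

```python
def show_segmentation(doc):
    """Return one line per sentence, listing the token texts in that sentence."""
    lines = []
    for i, sentence in enumerate(doc.sentences):
        tokens = [token.text for token in sentence.tokens]
        lines.append(f"sentence {i}: {tokens!r}")
    return lines


# Possible usage with Stanza (requires the Spanish models to be downloaded):
#   import stanza
#   tok_nlp = stanza.Pipeline(lang='es', processors='tokenize')
#   print("\n".join(show_segmentation(tok_nlp(s))))
```

Running only the `tokenize` processor keeps the MWT and POS steps out of the picture, so the sentence boundaries can be inspected without triggering the crash.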

@AngledLuffa
Collaborator

Okay, two problems here, both of which can & should be fixed.

One is that the tokenizer doesn't know about the word invadir, so it's trying to tokenize invadirlos and comes up with invadirlo and s

The second is that the MWT expander is then processing the second piece, s, to a blank string as a hallucination in the seq2seq model.
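A rough way to spot this condition before the tagger runs is sketched below. It is not how Stanza itself handles it; the helper `find_empty_sentences` is hypothetical, and it assumes the blank expansion shows up either as a sentence with no words or as a word whose text is empty.

```python
def find_empty_sentences(doc):
    """Return indices of sentences with no words or with a blank word.

    Assumption: these are the sentences that would reach the POS model
    with a length of zero and trigger the pack_padded_sequence error.
    """
    bad = []
    for i, sentence in enumerate(doc.sentences):
        words = sentence.words
        if len(words) == 0 or any(not (w.text or "").strip() for w in words):
            bad.append(i)
    return bad


# Possible usage with Stanza: run 'tokenize,mwt' only, then check the result
#   mwt_nlp = stanza.Pipeline(lang='es', processors='tokenize,mwt')
#   print(find_empty_sentences(mwt_nlp(s)))
```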

Both are easy to fix, actually. If you come across other words like invadirlos which cause similar problems, please post them and we can improve the tokenizer going forward.

@AngledLuffa
Collaborator

Well, it's not crashing now in the dev branch, as I fixed that error. Overall the fix isn't satisfactory yet, though. The tokenizer is splitting los into lo and s, and it's marking s by itself as an MWT (of only one subword).

As you find more examples of bad tokenization, please let us know, and hopefully over time we can make some progress cleaning those up.

@AngledLuffa
Collaborator

... point being that now it seems to think s is the sentence final punctuation here, simply because there's no . at the very end of the document. I'll have to figure out some way to upgrade the training to make the current tokenizer work a bit better, probably with data augmentation, although long term an upgraded sentence splitter would help
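To make the data augmentation idea concrete, here is one possible sketch. It is not the change that went into Stanza: the function `drop_final_punct`, the `FINAL_PUNCT` set, and the probability are all assumptions for illustration. The idea is to remove the final punctuation from a fraction of training sentences so the model also sees text that ends on an ordinary word.

```python
import random

# Hypothetical set of sentence-final marks to drop
FINAL_PUNCT = ".!?"


def drop_final_punct(sentences, prob=0.1, seed=0):
    """Return a copy of sentences where some final punctuation marks are removed.

    Each sentence ending in one of FINAL_PUNCT loses that mark with
    probability prob; all other sentences are returned unchanged.
    """
    rng = random.Random(seed)
    out = []
    for text in sentences:
        stripped = text.rstrip()
        if stripped and stripped[-1] in FINAL_PUNCT and rng.random() < prob:
            out.append(stripped[:-1].rstrip())
        else:
            out.append(text)
    return out
```

A real tokenizer training set also carries per-character labels, which would have to be trimmed along with the text; that part is left out here.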

@adno
Author

adno commented Jul 17, 2024

Thank you for the quick fix!

What I found out in the meantime is that when I split the string s from my original issue at '\n\n', the problem disappears (I get 'invadir' and 'los' as the final two tokens). This surprised me: I had assumed that Stanza splits strings at '\n\n' into separate sequences to be processed by the models, so I thought it shouldn't have any effect. But it clearly does. I applied the splitting as a recovery from RuntimeErrors during POS tagging and avoided crashes in the whole corpus this way.
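The recovery looks roughly like the sketch below. The function name `process_with_fallback` is only illustrative, and `nlp` stands for any callable such as the `pos_nlp` pipeline from the original report.

```python
def process_with_fallback(nlp, text, separator="\n\n"):
    """Run nlp on the whole text; on RuntimeError, retry chunk by chunk.

    Returns a list of results: one item if the full text went through,
    otherwise one item per non-empty chunk between separators.
    """
    try:
        return [nlp(text)]
    except RuntimeError:
        chunks = [chunk for chunk in text.split(separator) if chunk.strip()]
        return [nlp(chunk) for chunk in chunks]


# Possible usage with the pipeline from the report:
#   docs = process_with_fallback(pos_nlp, s)
```

If a single chunk still raises, the error propagates as before, so this only hides failures that depend on the text around the '\n\n' boundary.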

(To be honest, I have a very dim idea of the inner workings of Stanza.)

> ... point being that now it seems to think s is the sentence final punctuation here, simply because there's no . at the very end of the document.

Since my corpus consists of subtitles, sentences often lack final punctuation :-(

> As you find more examples of bad tokenization, please let us know, and hopefully over time we can make some progress cleaning those up.

I did a quick regex search in a frequency list from my corpus (after tokenization) and found that the following strings are split in the same way ('XXXlo' + 's') if I substitute them for 'invadirlos' in s:

verlos
hacerlos
haberlos
atarlos
compartirlos
saberlos
decirlos
besarlos
llamarlos
tenerlos
usarlos
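The substitution check can be sketched as follows. The names `find_bad_splits` and `tokenize` are made up for illustration; `tokenize` is assumed to be any callable that returns the token strings for a text.

```python
def find_bad_splits(tokenize, template, candidates, placeholder="invadirlos"):
    """Return the candidates that get split into stem + 'lo' followed by 's'.

    Each candidate replaces placeholder in template, the text is tokenized,
    and the last two tokens are compared against the 'XXXlo' + 's' pattern.
    """
    bad = []
    for word in candidates:
        tokens = tokenize(template.replace(placeholder, word))
        if len(tokens) >= 2 and tokens[-1] == "s" and tokens[-2] + "s" == word:
            bad.append(word)
    return bad


# Possible usage with Stanza (tokenizer only):
#   tok_nlp = stanza.Pipeline(lang='es', processors='tokenize')
#   tokenize = lambda t: [tok.text for sent in tok_nlp(t).sentences for tok in sent.tokens]
#   print(find_bad_splits(tokenize, s, ["verlos", "hacerlos"]))
```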

I have more potential culprits, but I haven't checked whether they result from bad tokenization of a verb + 'los'. Here's the whole list:

carlo
irlo
charlo
merlo
pirlo
verlo
mirlo
hacerlo
arlo
burlo
montecarlo
giancarlo
marlo
haberlo
karlo
atarlo
compartirlo
saberlo
serlo
amarlo
porlo
decirlo
chorlo
besarlo
chirlo
murlo
llamarlo
parlo
estarlo
oírlo
tenerlo
usarlo
lograrlo
harlo
romperlo
matarlo
conocerlo
vivirlo
waterlo
sacarlo
scantamburlo
consumirlo
sentirlo
aceptarlo
creerlo
protegerlo
amaticourlo
urlo
capturarlo
ordenarlo
conquistarlo
ayudarlo
simularlo
cobrarlo
stalkearlo
salvarlo
hacecerlo
buscarlo
distraerlo
obligarlo
sublimarlo
pagarlo
jarlo
conservarlo
subirlo
bajarlo
unificarlo
amoblarlo
overlo
poenrlo
comprarlo
ganarlo
controlarlo
corlo
oberlo
orlo
recogerlo
asaltarlo
explicarlo
escogerlo
construirlo
disfrutarlo
meterlo
barlo
cerlo
aplicarlo
trascenderlo
parecerlo
venderlo
atraparlo
devolverlo
perlo
estudiarlo
cambiarlo
heirlo
detenerlo
convencerlo
aprobarlo
acusarlo
evitarlo
firmarlo
impulsarlo
acordarlo
darlo
emocionarlo
incluirlo
...verlo
moverlo
mirarlo
captarlo
traerlo
perderlo
lmperlo
costearlo
prenderlo
transplantarlo
escucharlo
comprobarlo
pelarlo
predicarlo
arreglarlo
esquivarlo
nulificarlo
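A list like the one above can be pulled from a token frequency table with a search along these lines. The exact pattern is a guess based on the entries all ending in 'rlo', and `candidates_from_frequencies` is an illustrative name.

```python
import re

# Assumed pattern: any token that ends in 'rlo' (infinitive + 'lo', but also names)
RLO = re.compile(r"rlo$")


def candidates_from_frequencies(freqs):
    """Return tokens ending in 'rlo', most frequent first, ties in alphabetical order.

    freqs maps each token string to its count in the tokenized corpus.
    """
    hits = [token for token in freqs if RLO.search(token)]
    return sorted(hits, key=lambda token: (-freqs[token], token))
```

As the list shows, this also catches proper names and typos, so each hit still needs a manual check before it is treated as a tokenization error.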

AngledLuffa added a commit that referenced this issue Jul 21, 2024
…okenizer. Addresses an issue where some words were being chopped up in Spanish because the tokenizer had never seen them and interpreted the last character as the end of the document #1401
AngledLuffa added a commit to stanfordnlp/handparsed-treebank that referenced this issue Jul 21, 2024
@AngledLuffa
Collaborator

I added the 11 you put at the top. I know some of the others you listed in the second section are also verbs, but between the additional tokenization hints and a method for teaching the tokenizer that the last character isn't necessarily punctuation, I hope it already covers the problem cases. If you find more, please don't hesitate to tell us.

AngledLuffa added a commit to stanfordnlp/handparsed-treebank that referenced this issue Jul 21, 2024
… incorrectly tokenized in stanfordnlp/stanza#1401  This does not lower the quality of the Stanza tokenizer, and hopefully helps with its tokenization of words with these lemmas
@AngledLuffa
Collaborator

I went ahead and added another 20 or so infinitives, as you can see in the linked git checkin. Between that and the training update, that hopefully addresses this issue, but as I mentioned, feel free to tell us when you come across more tokenization errors for Spanish.

@AngledLuffa
Collaborator

How does this look overall now? If there are still issues, we can rebuild the models with more data as needed.
