Special tokens have been added in the vocabulary #32

Open
kwalcock opened this issue Jul 26, 2023 · 2 comments

Comments

@kwalcock
Member

While manipulating the tokenizers, this warning can appear:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
.../convert_slow_tokenizer.py:446: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
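
For context, a minimal sketch of one way this code path seems to get triggered (assuming the Hugging Face transformers library and that no prebuilt tokenizer.json ships with the model, so the slow sentencepiece tokenizer is converted on the fly):

```python
# Sketch only: requesting the fast tokenizer forces a conversion from the
# slow sentencepiece tokenizer, which goes through convert_slow_tokenizer.py
# and appears to emit the warnings quoted above.
from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-v3-base", use_fast=True
)
```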

It's somewhat unclear what this means, but an example can illustrate the issue. If the Python tokenizer for microsoft/deberta-v3-base is run on the text of a certain Unicode character, 䀀, it is tokenized as

['[CLS]', '▁', '[UNK]', '[SEP]']
[1, 507, 3, 2]
-1, 0, 0, -1

There shouldn't be four tokens here, but aside from that, note the [UNK]. The Python version is not able to map back to the strange character; it goes to the special token that has been added to the vocabulary. The Rust version gets

['[CLS]', '▁', '䀀', '[SEP]']
[1, 507, 3, 2]
-1, 0, 0, -1

so it is able to map back to the original text using the byte fallback option. This doesn't seem like it should affect processors, because I think we are keeping the original texts around, and the 0 for the word index should refer back to the original word. During training, we're only concerned with the numbers (wordIds), and they match up.
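
A hedged sketch of how the two outputs above could be reproduced side by side (the -1 markers in the printouts above come from our own wrapper; Hugging Face's word_ids() returns None for special tokens, so the sketch maps None to -1 for comparison):

```python
# Sketch only: compare the slow (Python) and fast (Rust-backed) tokenizers
# on the single character 䀀 used in the example above.
from transformers import AutoTokenizer

text = "\u4000"  # the character 䀀

for use_fast in (False, True):
    tokenizer = AutoTokenizer.from_pretrained(
        "microsoft/deberta-v3-base", use_fast=use_fast
    )
    encoding = tokenizer(text)
    ids = encoding["input_ids"]
    print("fast (Rust)" if use_fast else "slow (Python)")
    print(tokenizer.convert_ids_to_tokens(ids))
    print(ids)
    if use_fast:
        # word_ids() is only available on fast tokenizers; it returns None
        # for special tokens, mapped here to -1 to match the format above.
        print([-1 if w is None else w for w in encoding.word_ids()])
```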

@kwalcock
Member Author

kwalcock commented Jul 26, 2023

For a different tokenizer, xlm-roberta-base, it looks like this:

['<s>', '▁', '<unk>', '</s>']
[0, 6, 3, 2]
-1, 0, 0, -1

and

['<s>', '▁', '䀀', '</s>']
[0, 6, 3, 2]
-1, 0, 0, -1
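
(The comparison sketch above should reproduce these two outputs if the model name is swapped to xlm-roberta-base, since that checkpoint also uses a sentencepiece-based slow tokenizer.)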

@MihaiSurdeanu
Contributor

Thanks @kwalcock! It's unclear to me why the mapping fails in Python... They should be accessing exactly the same token vocabulary, no?
