Issue with the BertTokenizer - add_tokens() method is missing #61

andreabac3 · 2021-03-13T12:04:52Z

I have a need similar to @LittlePea13's (see issue #58) on the SQuAD task: I want to add some tokens to the language model's vocabulary at training time. But the BertTokenizer (SpanBertcode/pytorch_pretrained_bert/tokenization.py) doesn't load the 'added_tokens.json' file, and the class has no method to load it.

A naive solution is to switch from:

from pytorch_pretrained_bert.tokenization import (BasicTokenizer, BertTokenizer, whitespace_tokenize)
tokenizer = BertTokenizer.from_pretrained(
        args.model, do_lower_case=args.do_lower_case)
# to
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
        args.model, do_lower_case=args.do_lower_case)

Are the two equivalent?
With this change, from_pretrained() loads my custom tokenizer correctly, and AutoTokenizer provides the add_tokens() method.
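
For anyone who has to stay on the bundled tokenizer, here is a rough sketch of what loading 'added_tokens.json' by hand could look like. It is not part of this repo: merge_added_tokens is a hypothetical helper, the toy vocabulary and the [ENT1]/[ENT2] tokens are made up, and I am assuming the file uses the {"token": id} layout that transformers writes when saving a tokenizer.

```python
import json
import os
import tempfile
from collections import OrderedDict


def merge_added_tokens(vocab, added_tokens_path):
    """Return a copy of `vocab` extended with tokens from an added_tokens.json file.

    Hypothetical helper, not part of pytorch_pretrained_bert.
    """
    with open(added_tokens_path, encoding="utf-8") as f:
        added = json.load(f)  # assumed layout: {"token": id, ...}
    merged = OrderedDict(vocab)
    # Keep the saved order, but reassign ids so they stay contiguous
    # with the base vocabulary.
    for token in sorted(added, key=added.get):
        if token not in merged:
            merged[token] = len(merged)
    return merged


# Toy data standing in for vocab.txt and added_tokens.json.
base_vocab = OrderedDict(
    (tok, i) for i, tok in enumerate(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello"])
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "added_tokens.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"[ENT2]": 6, "[ENT1]": 5}, f)
    vocab = merge_added_tokens(base_vocab, path)

# Reverse mapping, rebuilt so id -> token lookups see the new entries too.
ids_to_tokens = OrderedDict((i, tok) for tok, i in vocab.items())
print(vocab["[ENT1]"], vocab["[ENT2]"], len(vocab))
```

Extending the vocabulary is only half of the job: the tokenizer must also avoid splitting the new tokens, and the model's embedding matrix has to grow to the new vocabulary size. In transformers the latter is done with model.resize_token_embeddings(len(tokenizer)); I have not checked whether the code in this repo has an equivalent.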
