transformers.GPT2Config has a parameter n_positions (and n_ctx) that defines how many positional embeddings the model can use, i.e. the maximum sequence length it can handle. It is usually set to 1024; maybe it should be adjusted according to the text length?
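A minimal sketch of adjusting it when building a model from scratch (the value 2048 is only an illustrative choice, not something taken from the docs):

from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative value only: budget for 2048 positional embeddings instead of 1024.
# Pretrained GPT-2 checkpoints were trained with n_positions=1024, so a larger
# value really only makes sense when training from scratch.
config = GPT2Config(n_positions=2048)   # older versions also expose n_ctx
model = GPT2LMHeadModel(config)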
transformers.GPT2Tokenizer - requires a space to start the input string => the encoding methods should be called with the add_prefix_space flag set to True. Otherwise, this tokenizer's encode and decode methods will not conserve the absence of a space at the beginning of a string: tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
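A small round-trip sketch of that behaviour (the exact call signature depends on the transformers version; in more recent releases the flag can also be set once when loading the tokenizer):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Recommended usage per the docstring: have the tokenizer prepend a space.
# The round trip then yields " Hello" even though the input had none,
# i.e. the absence of a leading space is not conserved.
ids = tokenizer.encode("Hello", add_prefix_space=True)
print(tokenizer.decode(ids))  # " Hello"

# In newer transformers versions the flag can also be set at load time:
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)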
special tokens:
class GPT2Tokenizer(PreTrainedTokenizer):
    """
    GPT-2 BPE tokenizer. Peculiarities:

    - Byte-level Byte-Pair-Encoding
    - Requires a space to start the input string => the encoding methods should be called with the
      add_prefix_space flag set to True.
      Otherwise, this tokenizer encode and decode method will not conserve
      the absence of a space at the beginning of a string: tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

    def __init__(self, vocab_file, merges_file, errors='replace', unk_token="<|endoftext|>",
                 bos_token="<|endoftext|>", eos_token="<|endoftext|>", **kwargs):
encode() takes an add_special_tokens argument: if set to True, the sequences will be encoded with the special tokens relative to their model.
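A hedged sketch of what that flag does in practice (for GPT-2 the default template may not insert anything extra, since its only special token is <|endoftext|>; BERT-style models add [CLS]/[SEP], for example):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# With add_special_tokens=True the tokenizer applies its model-specific
# special-token template; with False it encodes the raw text only.
with_special = tokenizer.encode("Hello world", add_special_tokens=True)
without_special = tokenizer.encode("Hello world", add_special_tokens=False)
print(with_special)
print(without_special)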