transformers.GPT2Config has a parameter n_positions (and n_ctx) that defines how many positional embeddings the model can use. It is usually set to 1024; maybe it should be adjusted according to the text length?
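A minimal sketch of how this value could be set when building a model from scratch (the 2048 value is only an illustrative choice, not something an existing 1024-position checkpoint can be stretched to without retraining):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Default GPT-2 context: 1024 positional embeddings.
config = GPT2Config()
print(config.n_positions)  # 1024

# Illustrative: a fresh model configured for a longer context window.
# (Older transformers versions also carried a separate n_ctx field,
# usually kept equal to n_positions.)
long_config = GPT2Config(n_positions=2048)
model = GPT2LMHeadModel(config=long_config)
print(model.config.n_positions)  # 2048
```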

transformers.GPT2Tokenizer - requires a space to start the input string => the encoding methods should be called with the add_prefix_space flag set to True. Otherwise, this tokenizer's encode and decode methods will not conserve the absence of a space at the beginning of a string: tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
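A quick round-trip check of this behaviour (passing add_prefix_space to encode() follows the older docstring quoted below; in more recent releases the flag can also be set on the tokenizer constructor, so its placement is version-dependent):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Round-trip without the flag vs. with it, to see which form preserves
# the exact start of the string for this transformers version.
print(repr(tokenizer.decode(tokenizer.encode("Hello"))))
print(repr(tokenizer.decode(tokenizer.encode("Hello", add_prefix_space=True))))
```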

special tokens (from the GPT2Tokenizer source):

```python
class GPT2Tokenizer(PreTrainedTokenizer):
    """
    GPT-2 BPE tokenizer. Peculiarities:
    - Byte-level Byte-Pair-Encoding
    - Requires a space to start the input string => the encoding methods should be
      called with the add_prefix_space flag set to True. Otherwise, this tokenizer
      encode and decode method will not conserve the absence of a space at the
      beginning of a string: tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

    def __init__(self, vocab_file, merges_file, errors='replace', unk_token="<|endoftext|>",
                 bos_token="<|endoftext|>", eos_token="<|endoftext|>", **kwargs):
```
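For reference, these defaults can be inspected on a pretrained tokenizer (attribute names as in the class body quoted above; since GPT-2 reuses a single <|endoftext|> token for bos, eos, and unk, all three prints show the same string):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 reuses <|endoftext|> for bos, eos, and unk (see the __init__ defaults above).
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)

# Per-checkpoint positional-embedding sizes (class attribute shown above).
print(GPT2Tokenizer.max_model_input_sizes)
```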


encode(): add_special_tokens – if set to True, the sequences will be encoded with the special tokens relative to their model.
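A small illustration of the flag, assuming the gpt2 checkpoint (for GPT-2 the default template adds no extra tokens, so both calls may return the same ids; for models such as BERT the True variant inserts tokens like [CLS] and [SEP]):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Compare the encoding with and without the model-specific special tokens.
with_special = tokenizer.encode("Hello world", add_special_tokens=True)
without_special = tokenizer.encode("Hello world", add_special_tokens=False)
print(with_special)
print(without_special)
```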