Continual Pretraining

We use the nanotron library for continual pretraining.

Setup

Please refer to nanotron for detailed instructions on setting up your training environment and launching jobs, and to smollm/pre-training for an example with the pre-training scripts.

Usage

The nanotron checkpoints for SmolLM2 models are available at: https://huggingface.co/HuggingFaceTB/SmolLM2-nanotron-ckpt.

Example: Finemath

For finemath, we ran continual pretraining of llama3-3B with different data mixtures. The steps below describe how to do the same.

Nanotron

For this example, you need to switch to PR 255 of nanotron:

gh pr checkout 255

Data

The first step is to tokenize the datasets. To do this, we use the datatrove library. We tokenized the following datasets with the llama3 tokenizer:

You can find an example of how to tokenize the datasets in the finemath/finemath-tokenize.py script. If you encounter issues with the tokenization, you can apply the following patches:

  • For Infi-MM/InfiMM-WebMath-40B: finemath/tokenization_InfiMM-WebMath-4OB.patch
  • For others: finemath/tokenization_finemath.patch

To apply the patch, install datatrove from source and run git apply <path_to_patch>.patch in the datatrove directory.

Training

Once the datasets are tokenized, you can launch the training with a script similar to the one in smollm/pre-training. When resuming training from a checkpoint, you can choose whether to keep the learning rate scheduler and optimizer state by setting the following parameters in the YAML file. With both set to false, as shown here, neither is loaded from the checkpoint:

  • load_lr_scheduler: false
  • load_optimizer: false
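
As a minimal sketch, the two flags above can also be set programmatically on a config that has already been parsed into a Python dict (for example with a YAML loader). The helper name set_resume_flags and the trimmed example config are hypothetical, and the sketch assumes the flags sit under a top-level checkpoints section; check your own YAML file and adjust the key if they live elsewhere.

```python
import copy


def set_resume_flags(config: dict, keep_state: bool) -> dict:
    """Return a copy of a parsed config with both resume flags set.

    Assumption: the flags sit under a top-level `checkpoints` section.
    """
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    section = updated.setdefault("checkpoints", {})
    section["load_lr_scheduler"] = keep_state
    section["load_optimizer"] = keep_state
    return updated


# Hypothetical, heavily trimmed config used only for illustration.
base = {"checkpoints": {"checkpoint_interval": 1000}}

# keep_state=False mirrors the values listed above: neither the scheduler
# nor the optimizer state is loaded from the checkpoint.
fresh = set_resume_flags(base, keep_state=False)
print(fresh["checkpoints"])
```

Passing keep_state=True instead sets both flags to true, which keeps the scheduler and optimizer state from the checkpoint. Returning a copy rather than editing in place makes it easy to derive several run configs from one base file.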

Evaluation

For evaluation, you can follow the instructions in smollm/evaluation.