We use the [nanotron](https://github.com/huggingface/nanotron) library for continual pretraining. Please refer to nanotron for detailed instructions on setting up your training environment and launching jobs, and to smollm/pre-training for an example of the pre-training scripts.
The nanotron checkpoints for SmolLM2 models are available at: https://huggingface.co/HuggingFaceTB/SmolLM2-nanotron-ckpt.
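If you want to resume from one of these checkpoints, a quick way to pull them locally is `snapshot_download` from `huggingface_hub`; this is only a sketch, and the destination folder below is just an example:

```python
# Sketch: download the SmolLM2 nanotron checkpoints from the Hub.
# local_dir is an arbitrary example path; the repo may hold checkpoints for
# several model sizes, so you can pass allow_patterns to narrow the download.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-nanotron-ckpt",
    local_dir="checkpoints/SmolLM2-nanotron-ckpt",
)
print(f"Checkpoint files downloaded to {local_path}")
```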
For FineMath, we did continual pretraining of Llama-3.2-3B with different data mixtures. Here we detail the steps to reproduce these runs.
For this example, you need to switch to this nanotron PR:

```bash
gh pr checkout 255
```
The first step is to tokenize the datasets, which we do with the [datatrove](https://github.com/huggingface/datatrove) library. We tokenized the following datasets with the Llama 3 tokenizer:
- HuggingFaceTB/smollm-corpus/fineweb-edu-dedup
- HuggingFaceTB/finemath/finemath-3plus
- HuggingFaceTB/finemath/finemath-4plus
- HuggingFaceTB/finemath/infiwebmath-3plus
- HuggingFaceTB/finemath/infiwebmath-4plus
- Infi-MM/InfiMM-WebMath-40B
- open-web-math/open-web-math
You can find an example of how to tokenize the datasets in the `finemath/finemath-tokenize.py` script. You might encounter some issues during tokenization; to fix them, you can apply the following patches:
- For `Infi-MM/InfiMM-WebMath-40B`: `finemath/tokenization_InfiMM-WebMath-40B.patch`
- For the other datasets: `finemath/tokenization_finemath.patch`
To apply a patch, install datatrove from source and run `git apply <path_to_patch>.patch` in the datatrove directory.
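For reference, here is a minimal sketch of what such a datatrove tokenization pipeline looks like, using the `finemath-4plus` subset as an example. The output path and task count are placeholders, and argument names may differ slightly between datatrove versions, so treat `finemath/finemath-tokenize.py` as the source of truth:

```python
# Sketch of a datatrove tokenization pipeline for one of the datasets above.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import HuggingFaceDatasetReader
from datatrove.pipeline.tokens import DocumentTokenizer

executor = LocalPipelineExecutor(
    pipeline=[
        # Read documents from the Hub subset we want to tokenize
        HuggingFaceDatasetReader(
            dataset="HuggingFaceTB/finemath",
            dataset_options={"name": "finemath-4plus", "split": "train"},
            text_key="text",
        ),
        # Tokenize with the Llama 3 tokenizer and write binary token shards
        DocumentTokenizer(
            output_folder="tokenized/finemath-4plus",  # placeholder output path
            tokenizer_name_or_path="meta-llama/Llama-3.2-3B",
            eos_token="<|end_of_text|>",  # Llama 3 end-of-text token
        ),
    ],
    tasks=32,  # number of parallel tasks; tune to your machine
)

if __name__ == "__main__":
    executor.run()
```

On a cluster you would typically swap `LocalPipelineExecutor` for datatrove's `SlurmPipelineExecutor` to distribute the tasks across nodes.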
Once the datasets are tokenized, you can launch the training with a script similar to the one in smollm/pre-training. When resuming a run from a checkpoint, you can choose whether to restore the learning rate scheduler and optimizer state by changing the following parameters in the YAML config (set them to `false` to start from fresh states):
```yaml
load_lr_scheduler: false
load_optimizer: false
```
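If you prefer not to edit the YAML by hand, you can flip these flags with a small script. Note that this sketch assumes the flags live under a `checkpoints` section of the config, which may not match your config layout:

```python
# Sketch: toggle the resume flags in a nanotron YAML config.
# The "checkpoints" key path is an assumption; adjust it to wherever
# load_lr_scheduler/load_optimizer live in your config file.
import yaml

CONFIG_PATH = "config.yaml"  # hypothetical config path

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

# false = re-initialize the LR scheduler and optimizer instead of
# restoring their state from the checkpoint
config.setdefault("checkpoints", {})["load_lr_scheduler"] = False
config["checkpoints"]["load_optimizer"] = False

with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```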
For evaluation, you can follow the instructions in smollm/evaluation.