This folder contains the implementation for evaluating how much LLM360 models memorize their training data. Such memorization raises privacy concerns, since models may leak private training data, and it can degrade LLM performance when the data contains unintended duplicates or anomalies. This folder adopts the memorization score introduced in Biderman et al. (2023) to measure model memorization. Please refer to Section 4.3 of the LLM360 paper for more implementation details.
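For intuition, the score for a single sequence can be computed as in the minimal sketch below. It assumes a Hugging Face causal LM and the 32-token prompt and continuation lengths used in the LLM360 paper; the function name and tensor handling are illustrative, not the repository's actual code.

```python
import torch

def memorization_score(model, token_ids, prompt_len=32, cont_len=32):
    """Fraction of greedily generated tokens that match the true
    continuation (Biderman et al. 2023). `token_ids` is a 1-D tensor of
    at least prompt_len + cont_len tokens from a training sequence."""
    device = next(model.parameters()).device
    prompt = token_ids[:prompt_len].unsqueeze(0).to(device)        # (1, prompt_len)
    target = token_ids[prompt_len:prompt_len + cont_len].to(device)
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=cont_len, do_sample=False)
    generated = out[0, prompt_len:prompt_len + cont_len]           # greedy continuation
    return (generated == target).float().mean().item()
```

A score of 1.0 means the model reproduces the continuation exactly, i.e., the sequence is fully memorized under greedy decoding.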
The folder contains the training data memorization evaluation for the Amber checkpoints. The LLM360 project releases 360 pretraining checkpoints and the corresponding training data chunks to support transparent and reproducible research on the LLM training process.
`single_ckpt_memorization_eval.py` is the main entry point for running the memorization evaluation on a single model checkpoint. It uses the Python modules in the `utils/` folder.
The `utils/` folder contains helper functions for model/dataset IO:

- `data_utils.py`: dataloader utilities
- `model_utils.py`: checkpoint loader
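Conceptually, the checkpoint loader resolves one of the 360 Amber pretraining checkpoints from the Hugging Face Hub, roughly as in this sketch. The revision name `ckpt_042` is an assumed example based on the LLM360/Amber Hub repository and may differ from what `model_utils.py` actually does:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Each Amber pretraining checkpoint is published as a separate revision
# of the LLM360/Amber Hub repo; "ckpt_042" is an assumed example name.
tokenizer = AutoTokenizer.from_pretrained("LLM360/Amber")
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/Amber",
    revision="ckpt_042",
    torch_dtype="auto",
    device_map="auto",
)
```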
By default, the training data chunks are saved in `./data/train_{data_id}.jsonl`, and the evaluation results are saved in `./result_ckpt-{ckpt_id}/data-{data_id}.json`.
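The I/O convention can be illustrated as follows. The `token_ids` field name and the id values are hypothetical placeholders and are not guaranteed to match the actual chunk schema:

```python
import json
from pathlib import Path

data_id, ckpt_id = 0, 42  # hypothetical ids, for illustration only

# Read one training data chunk: one JSON record per line. The "token_ids"
# field name is an assumption about the chunk schema.
with open(f"./data/train_{data_id}.jsonl") as f:
    sequences = [json.loads(line)["token_ids"] for line in f]

# Write evaluation results to the default location.
out_dir = Path(f"./result_ckpt-{ckpt_id}")
out_dir.mkdir(parents=True, exist_ok=True)
results = {"data_id": data_id, "scores": []}  # scores filled by the eval loop
with open(out_dir / f"data-{data_id}.json", "w") as f:
    json.dump(results, f)
```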
- Clone and enter the folder:

  ```bash
  git clone https://github.com/LLM360/Analysis360.git
  cd Analysis360/analysis/memorization
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  pip install flash-attn --no-build-isolation
  ```
An example usage is provided in `demo.ipynb`, which can be executed on a single A100 80GB GPU.