Minimal implementation of the paper "Training Language Models to Self-Correct via Reinforcement Learning".
To set up the environment for this project, follow these steps:
- Create a new conda environment named "score" with Python 3.9:
conda create -n score python=3.9
- Activate the environment:
conda activate score
- Install the required packages using the requirements.txt file:
pip install -r requirements.txt
To run the scripts:
python score_toy.py
python score_math.py
dataset_relabel.py was used to add the final answer pattern: 'Final Answer: The final answer is $answer$. I hope it is correct.'
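As a rough illustration of the final answer pattern above, the sketch below appends the sentence to a solution string and parses the answer back out. It is not the code in dataset_relabel.py: the helper names `add_final_answer` and `extract_final_answer` are hypothetical, and it assumes the dollar signs in `$answer$` are literal characters wrapping the answer.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; not taken from dataset_relabel.py.
# Assumption: the "$" characters around the answer are literal.
FINAL_ANSWER_RE = re.compile(
    r"Final Answer: The final answer is \$(.*?)\$\. I hope it is correct\.",
    re.DOTALL,
)


def add_final_answer(solution: str, answer: str) -> str:
    """Append the final answer sentence unless the solution already has one."""
    if FINAL_ANSWER_RE.search(solution):
        return solution
    suffix = f"Final Answer: The final answer is ${answer}$. I hope it is correct."
    return solution.rstrip() + "\n" + suffix


def extract_final_answer(text: str) -> Optional[str]:
    """Return the answer inside the pattern, or None if the pattern is absent."""
    match = FINAL_ANSWER_RE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    labeled = add_final_answer("We compute 6 * 7.", "42")
    print(labeled)
    print(extract_final_answer(labeled))
```

A fixed sentence like this makes answers easy to parse with a single regular expression, which is convenient when a response has to be checked against a reference answer.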