To reproduce benchmark metrics on RAGBench, use `calculate_metrics.py`. For example, to reproduce GPT-3.5, RAGAS, and TruLens results on a set of RAGBench component datasets, run:

```bash
python calculate_metrics.py --dataset hotpotqa msmarco hagrid expertqa
```
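The names passed to `--dataset` are RAGBench component datasets. As a point of reference, here is a minimal sketch of loading one subset for inspection; the Hugging Face Hub path `rungalileo/ragbench` and the split layout are assumptions on my part, not stated in this README.

```python
# Illustrative sketch only: load one RAGBench component dataset and inspect it.
# The Hub path "rungalileo/ragbench" is an assumption, not taken from this README.
from datasets import load_dataset

hotpotqa = load_dataset("rungalileo/ragbench", "hotpotqa")

# Print the available splits and the fields each split exposes.
for split_name, split in hotpotqa.items():
    print(split_name, split.column_names)
```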
Use the `run_inference.py` script to evaluate RAG eval frameworks on RAGBench. Input arguments (an illustrative argparse sketch follows the list):

- `dataset`: name of the RAGBench dataset to run inference on
- `model`: the model to evaluate (`trulens` or `ragas`)
- `output`: output directory to store results in
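The sketch below shows one way the argument interface documented above could be declared with argparse; it is illustrative only, and the actual `run_inference.py` may differ.

```python
# Illustrative sketch, not the repo's script: argparse wiring for the documented flags.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Evaluate a RAG eval framework (trulens or ragas) on a RAGBench dataset"
    )
    parser.add_argument("--dataset", required=True,
                        help="name of the RAGBench dataset to run inference on")
    parser.add_argument("--model", required=True, choices=["trulens", "ragas"],
                        help="the model to evaluate")
    parser.add_argument("--output", required=True,
                        help="output directory to store results in")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.dataset, args.model, args.output)
```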
Run TruLens inference on the MSMARCO subset:

```bash
python run_inference.py --dataset msmarco --model trulens --output results
```
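For orientation, the following is a minimal sketch of scoring a single hand-written RAG example with RAGAS outside of `run_inference.py`. The column names and the ~0.1-era `ragas` API shown here are assumptions, and the underlying judge calls require an OpenAI API key; the repo's pipeline may be structured differently.

```python
# Minimal sketch, not the repo's pipeline: score one toy RAG example with RAGAS.
# Assumes the ragas package (~0.1 API) and OPENAI_API_KEY set in the environment.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

example = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "answer":   ["The capital of France is Paris."],
})

scores = evaluate(example, metrics=[faithfulness, answer_relevancy])
print(scores)  # aggregate score per metric for the example
```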