Published on the Trustworthy and Reliable Large-Scale Machine Learning Models ICLR 2023 Workshop.
Abstract: Perception of toxicity evolves over time and often differs between geographies and cultural backgrounds. Similarly, black-box commercially available APIs for detecting toxicity, such as the Perspective API, are not static, but frequently retrained to address any unattended weaknesses and biases. We evaluate the implications of these changes on the reproducibility of findings that compare the relative merits of models and methods that aim to curb toxicity. Our findings suggest that research that relied on inherited automatic toxicity scores to compare models and techniques may have resulted in inaccurate findings. Rescoring all models from HELM, a widely respected living benchmark, for toxicity with the recent version of the API led to a different ranking of widely used foundation models. We suggest caution in applying apples-to-apples comparisons between studies and lay recommendations for a more structured approach to evaluating toxicity over time.
All images, tables and values cited in the paper can be reproduced in the notebooks 01 and 02.
conda env create -f environment.yml
conda init black_box
python -m ipykernel install --user --name=black_box
Rescored toxicity scores and metrics produced for the paper are available at our HuggingFace datasets repo. Published scores from RTP are also needed to reproduce results.
git lfs install
git clone [email protected]:datasets/for-ai/black-box-api-challenges data
wget https://ai2-public-datasets.s3.amazonaws.com/realtoxicityprompts/realtoxicityprompts-data.tar.gz
tar -xvzf realtoxicityprompts-data.tar.gz -C data/
rm realtoxicityprompts-data.tar.gz
There are three main scripts: score
, collate
and evaluate
. Below are examples of how to use each for DExperts rescored generation files that accompany this repo.
You can replace the input_path
for your desired jsonl
file and indicate in which column are the text you want to rescore. The script currently supports text that are contained in dictionaries (text
key), list of dictionaries and columns of strings. This outputs files with _perspective.jsonl
termination.
Perspective API rate limit is 1 by default. Before running this script, don't forget to export your API key.
export PERSPECTIVE_API_KEY=$YOUR_KEY
python -m scripts.score \
data/dexperts/generations/toxicity/dapt/prompted_gens_gpt2_gens_rescored.jsonl \
--column_name generations \
--output_folder data/example \
--perspective_rate_limit 1
To rescore DExperts's 10k non-toxic RTP prompts, for example, you can run
python -m scripts.score \
data/dexperts/prompts/nontoxic_prompts-10k.jsonl \
--column_name prompt \
--output_folder data/example \
--perspective_rate_limit 1
The collate script joins prompts and generations into a single file. We need all three files: with generated text, with scores corresponding to those texts, and the prompts which generated the continuations, if used. This outputs files with _collated.jsonl
termination.
python -m scripts.collate \
data/dexperts/generations/toxicity/dapt/prompted_gens_gpt2_gens_rescored.jsonl \
data/example/prompted_gens_gpt2_gens_rescored_perspective.jsonl \
--prompts_path data/dexperts/prompts/nontoxic_prompts-10k.jsonl
You can collate prompts to their new scores with
python -m scripts.collate_prompts data/dexperts/prompts/nontoxic_prompts-10k.jsonl data/example/nontoxic_prompts-10k_perspective.jsonl
With the evaluate script we can compute toxicity metrics such as Expected Maximum Toxicity, Toxicity Probability and Toxic Fraction This outputs files with _toxicity.csv
termination.
python -m scripts.evaluate --prompted_json data/example/prompted_gens_gpt2_gens_rescored_collated.jsonl
We scrape the website for models benchmarked under the real_toxicity_prompts
task.
Those model names are used to download continuations and published stats.jsonl
files from HELM's buckets.
python -m scripts.helm.scrape \
--task "real_toxicity_prompts" \
--version "v0.2.2" \
--output_folder "data/rescored/helm"
Then, we rescore downloaded continuations and collate those scores with the original prompts. You can pass prompts_path
in case you rescored prompts as well.
HELM uses the first spanScore
, instead of the summaryScores
as other RTP benchmarks. Evaluation is performed exclusively on notebook 02 from the collated files and original stats.jsonl
files.
python -m scripts.helm.score_and_collate \
--perspective_rate_limit 1 \
--base_dir data/rescored/helm/real_toxicity_prompts
@article{pozzobon2023challenges,
title={On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research},
author={Pozzobon, Luiza and Ermis, Beyza and Lewis, Patrick and Hooker, Sara},
journal={arXiv preprint arXiv:2304.12397},
year={2023}
}
- RealToxicityPrompts: https://github.com/allenai/real-toxicity-prompts
- DExperts: https://github.com/stanford-crfm/helm
- HELM: https://github.com/alisawuffles/DExperts