Misc cleanup for HELM Capabilities #3274

yifanmai · 2025-01-15T05:25:58Z

Switch aggregation to mean
Use rescaled rather than raw wildbench score
Move MMLU-Pro from lite_run_specs.py to capabilities_run_specs.py
Clean up run spec names
Change some default arguments
Add revisions for Hugging Face datasets
Remove unused parameter in wildbench
Switch BigCodeBench version from v0.1.2 to v0.1.3
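
For readers unfamiliar with the first two items, here is a minimal illustrative sketch of what "rescaled" scores and mean aggregation could look like. It is not code from this pull request: the function names `rescale` and `aggregate_mean` are hypothetical, and the 1-to-10 raw score range is an assumption made only for the example.

```python
from statistics import mean
from typing import Iterable


def rescale(score: float, low: float = 1.0, high: float = 10.0) -> float:
    """Linearly map a raw score from [low, high] onto [0, 1].

    The default bounds (1 and 10) are assumed for illustration only.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    return (score - low) / (high - low)


def aggregate_mean(scores: Iterable[float]) -> float:
    """Aggregate scores with an unweighted arithmetic mean."""
    values = list(scores)
    if not values:
        raise ValueError("no scores to aggregate")
    return mean(values)


# Hypothetical raw judge scores for three instances.
raw_scores = [1.0, 5.5, 10.0]
rescaled_scores = [rescale(s) for s in raw_scores]
print(aggregate_mean(rescaled_scores))
```

Rescaling before averaging puts every metric on a shared 0-to-1 scale, so a mean across scenarios is not dominated by whichever metric happens to have the widest raw range.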

percyliang · 2025-01-15T05:28:12Z

src/helm/benchmark/static/schema_lite_v2.yaml

@@ -103,6 +108,11 @@ metrics:
    short_display_name: WB Score
    description: Score of the AI output judged by GPT-4o.
    lower_is_better: false
+  - name: wildbench_score


Should this be _rescaled?

yifanmai · 2025-01-15

Yes, thanks for catching that. This pull request was missing a commit; it should be fixed now.

yifanmai added 5 commits January 14, 2025 17:03

Misc cleanup for HELM Capabilities

82a2c71

More fixes

6320ce6

Fix

3b7e1da

Fixes

15e87bf

Fixes

bf90ccb

percyliang reviewed Jan 15, 2025


Fixes

3bb27be

yifanmai merged commit 6d70e98 into main Jan 15, 2025
8 checks passed

yifanmai deleted the yifanmai/fix-capabilities-cleanup branch January 15, 2025 06:16
