A leaderboard focused on evaluating practical open-source language models. We only test models that are:
- Locally deployable
- Quantized
- Runnable with 20GB of VRAM or less
🔗 Online Leaderboard: https://lenml.github.io/lenml-llm-leaderboard/
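To put the hardware constraint in concrete terms, the snippet below is a minimal sketch of loading a 4-bit quantized model on a single local GPU using the Hugging Face `transformers` and `bitsandbytes` stack. The model id and generation settings are placeholders for illustration, not the leaderboard's actual evaluation harness:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model id, used only for illustration.
model_id = "Qwen/Qwen2.5-14B-Instruct"

# 4-bit NF4 quantization keeps a ~14B model comfortably under 20GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the local GPU automatically
)

prompt = "Briefly explain what quantization does to a language model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```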
Current open-source model evaluations face several limitations:
- Most leaderboards focus solely on English capabilities or standardized test scores
- Primary emphasis on very large models (100B+ parameters), which are impractical to run locally
- Evaluation methods are too academic and fail to reflect real-world usage
- Limited coverage of community models, especially ERP variants
We've designed a set of metrics that better align with real-world usage scenarios:
Metric | Description |
---|---|
Hardcore | Evaluates model knowledge in specific (you know) niche domains |
Reject | Measures a model's tendency to refuse to respond (lower is better) |
Creative | Assesses creative writing capabilities |
Long | Measures accuracy in generating content of specified length |
ACG | Evaluates knowledge of Anime, Comics, and Games (ACG culture) |
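As an illustration only, a refusal-oriented metric like Reject can be approximated as a refusal rate over a fixed prompt set. The refusal markers and scoring logic below are assumptions made for this sketch, not the project's actual implementation:

```python
# Hypothetical refusal markers; the real Reject metric may use
# different test prompts and detection logic.
REFUSAL_MARKERS = (
    "i can't help with that",
    "i cannot assist",
    "as an ai",
    "i'm sorry, but",
)

def is_refusal(response: str) -> bool:
    """Crude keyword check for a refusal-style answer."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def reject_score(responses: list[str]) -> float:
    """Fraction of responses that look like refusals (lower is better)."""
    if not responses:
        return 0.0
    refused = sum(is_refusal(r) for r in responses)
    return refused / len(responses)

# Example: two of four sampled answers refuse -> Reject = 0.5
print(reject_score([
    "Sure, here is how you could approach it...",
    "I'm sorry, but I can't help with that request.",
    "As an AI, I cannot assist with this.",
    "Here's a step-by-step outline...",
]))
```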
Planned improvements:
- Custom evaluation formula support (see the sketch after this list)
- Custom test data support
- Automated evaluation implementation
- Additional evaluation dimensions (e.g., lateral thinking puzzles)
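To show what a custom evaluation formula could look like, here is a hedged sketch of a weighted aggregate over the five metrics above. The weight values, metric keys, and function name are hypothetical, not the leaderboard's defined scoring rule:

```python
# Hypothetical weights; refusal tendency is penalized because lower is better.
DEFAULT_WEIGHTS = {
    "hardcore": 0.25,
    "reject": -0.15,
    "creative": 0.25,
    "long": 0.15,
    "acg": 0.20,
}

def overall_score(metrics: dict[str, float],
                  weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum over per-metric scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights)

# Example scorecard for a single model.
print(round(overall_score({
    "hardcore": 0.62,
    "reject": 0.10,
    "creative": 0.71,
    "long": 0.55,
    "acg": 0.48,
}), 3))
```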
Issues and Pull Requests are welcome to help improve this project!