# ML Benchmark Reference
Benchmarks provide standardized evaluation datasets and protocols for comparing model performance. The following table covers major benchmarks used in language model evaluation, code generation, and reasoning tasks. Dataset sizes and evaluation protocols reflect published sources as of early 2025.
## Language Model Benchmarks
| Benchmark | Task Type | Metric | Dataset Size | Description |
|---|---|---|---|---|
| MMLU | Knowledge / reasoning | Accuracy (5-shot) | 15,908 questions across 57 subjects | Massive Multitask Language Understanding. Multiple-choice questions spanning STEM, humanities, social sciences, and professional domains. Tests breadth of knowledge. |
| HellaSwag | Commonsense reasoning | Accuracy (10-shot) | 10,042 questions | Sentence completion requiring commonsense reasoning about physical and social situations. Adversarially filtered to be easy for humans but difficult for models. |
| ARC (Challenge) | Science reasoning | Accuracy (25-shot) | 2,590 questions (Challenge set) | AI2 Reasoning Challenge. Grade-school science questions partitioned into Easy and Challenge sets. Challenge set contains questions that retrieval and co-occurrence methods fail on. |
| WinoGrande | Coreference resolution | Accuracy (5-shot) | 1,267 problems (dev set) | Fill-in-the-blank coreference problems requiring commonsense reasoning. Adversarially constructed and crowdsource-validated for quality. |
| TruthfulQA | Truthfulness | MC1 / MC2 accuracy | 817 questions | Questions designed to elicit common misconceptions and falsehoods. Tests whether models generate truthful answers rather than repeating popular but incorrect claims. |
| GSM8K | Math reasoning | Accuracy (exact match, 5-shot CoT) | 8,500 problems (1,319 test) | Grade School Math 8K. Multi-step arithmetic word problems requiring 2-8 reasoning steps. Tests mathematical reasoning with chain-of-thought prompting. |
| HumanEval | Code generation | pass@1 (functional correctness) | 164 problems | Python programming problems with function signatures, docstrings, and unit tests. Measures functional correctness of generated code, not just syntactic validity. |
| MBPP | Code generation | pass@1 | 974 problems (500 test) | Mostly Basic Python Problems. Short Python functions covering basic programming constructs. Complements HumanEval with more diverse problem types. |
| MATH | Mathematics | Accuracy (exact match) | 12,500 problems (5,000 test) | Competition-level mathematics problems from AMC, AIME, and other competitions. Covers algebra, geometry, number theory, combinatorics, and calculus. Difficulty levels 1-5. |
| GPQA | Expert knowledge | Accuracy | 448 questions (198 in the Diamond subset) | Graduate-level Google-Proof QA. Expert-written multiple-choice questions in biology, physics, and chemistry designed to resist web search: skilled non-experts score only around 34% even with unrestricted web access. |
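The pass@k metric reported for HumanEval and MBPP is typically computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that a random set of k samples contains at least one pass. A minimal sketch (the function name is ours):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k) in a numerically stable
    product form: the probability that a random size-k subset of
    the n samples contains at least one passing sample.
    """
    if n - c < k:  # fewer than k failures: every size-k subset must contain a pass
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

For example, with n = 2 samples of which c = 1 passed, pass@1 evaluates to 0.5, the per-sample pass rate.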
## Benchmark Selection Guidelines
- Choose benchmarks that match the model's intended use case. A code generation model should be evaluated on HumanEval/MBPP, not just MMLU.
- Report results on multiple benchmarks to avoid Goodhart's Law -- optimizing for a single metric at the expense of general capability.
- Specify the exact evaluation protocol: number of few-shot examples, chain-of-thought prompting, temperature, and sampling method.
- Be aware of data contamination. Models trained on web data may have seen benchmark questions during training. Newer benchmarks are less likely to be contaminated.
- Benchmark saturation: when leading models score above 95%, the benchmark no longer discriminates between models. Move to harder benchmarks (e.g., MMLU to GPQA, GSM8K to MATH).
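The contamination check mentioned above is commonly approximated by measuring verbatim n-gram overlap between benchmark items and training documents, in the spirit of the overlap analyses published with GPT-3 and later models. A rough sketch, where the 8-gram window and function names are illustrative choices rather than any standard API:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Set of lowercased word n-grams in a text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that appear verbatim in the document.

    Values near 1.0 suggest the item (or a close paraphrase) was in
    the training data; items above a chosen threshold are typically
    flagged or excluded from evaluation.
    """
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item shorter than n words: no signal
    return len(item & ngrams(corpus_doc, n)) / len(item)
```

In practice this runs against an index of the training corpus rather than single documents, but the per-document logic is the same.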
Related: *Evaluation frameworks* covers platforms for running and tracking benchmark evaluations.