# ML Benchmark Reference
Benchmarks provide standardized evaluation datasets and protocols for comparing model performance. The following table covers major benchmarks used in language model evaluation, code generation, and reasoning tasks. Dataset sizes and evaluation protocols reflect published sources as of early 2025.
## Language Model Benchmarks
| Benchmark | Task Type | Metric | Dataset Size | Description |
|---|---|---|---|---|
| MMLU | Knowledge / reasoning | Accuracy (5-shot) | 15,908 questions across 57 subjects | Massive Multitask Language Understanding. Multiple-choice questions spanning STEM, humanities, social sciences, and professional domains. Tests breadth of knowledge. |
| HellaSwag | Commonsense reasoning | Accuracy (10-shot) | 10,042 questions | Sentence completion requiring commonsense reasoning about physical and social situations. Adversarially filtered to be easy for humans but difficult for models. |
| ARC (Challenge) | Science reasoning | Accuracy (25-shot) | 2,590 questions (Challenge set) | AI2 Reasoning Challenge. Grade-school science questions partitioned into Easy and Challenge sets. Challenge set contains questions that retrieval and co-occurrence methods fail on. |
| WinoGrande | Coreference resolution | Accuracy (5-shot) | 1,267 problems (dev set) | Fill-in-the-blank coreference problems requiring commonsense reasoning. Adversarially constructed and crowdsource-validated for quality. |
| TruthfulQA | Truthfulness | MC1 / MC2 accuracy | 817 questions | Questions designed to elicit common misconceptions and falsehoods. Tests whether models generate truthful answers rather than repeating popular but incorrect claims. |
| GSM8K | Math reasoning | Accuracy (exact match, 5-shot CoT) | 8,500 problems (1,319 test) | Grade School Math 8K. Multi-step arithmetic word problems requiring 2-8 reasoning steps. Tests mathematical reasoning with chain-of-thought prompting. |
| HumanEval | Code generation | pass@1 (functional correctness) | 164 problems | Python programming problems with function signatures, docstrings, and unit tests. Measures functional correctness of generated code, not just syntactic validity. |
| MBPP | Code generation | pass@1 | 974 problems (500 test) | Mostly Basic Python Problems. Short Python functions covering basic programming constructs. Complements HumanEval with more diverse problem types. |
| MATH | Mathematics | Accuracy (exact match) | 12,500 problems (5,000 test) | Competition-level mathematics problems from AMC, AIME, and other competitions. Covers algebra, geometry, number theory, combinatorics, and calculus. Difficulty levels 1-5. |
| GPQA | Expert knowledge | Accuracy | 448 questions (198 in the Diamond subset) | Graduate-level Google-Proof QA. Expert-written multiple-choice questions in biology, physics, and chemistry designed to resist web search: skilled non-experts score only around 34% even with unrestricted web access. |
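The pass@k metric reported for HumanEval and MBPP is typically computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that a random set of k samples contains at least one pass. A minimal sketch (the function name is ours):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k) in a numerically stable
    product form: the probability that a random size-k subset of
    the n samples contains at least one passing sample.
    """
    if n - c < k:  # fewer than k failures: every size-k subset must contain a pass
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

For example, with n = 2 samples of which c = 1 passed, pass@1 evaluates to 0.5, the per-sample pass rate.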
## Benchmark Selection Guidelines
- Choose benchmarks that match the model's intended use case. A code generation model should be evaluated on HumanEval/MBPP, not just MMLU.
- Report results on multiple benchmarks to avoid Goodhart's Law -- optimizing for a single metric at the expense of general capability.
- Specify the exact evaluation protocol: number of few-shot examples, chain-of-thought prompting, temperature, and sampling method.
- Be aware of data contamination. Models trained on web data may have seen benchmark questions during training. Newer benchmarks are less likely to be contaminated.
- Benchmark saturation: when leading models score above 95%, the benchmark no longer discriminates between models. Move to harder benchmarks (e.g., MMLU to GPQA, GSM8K to MATH).
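The contamination check mentioned above is commonly approximated by measuring verbatim n-gram overlap between benchmark items and training documents, in the spirit of the overlap analyses published with GPT-3 and later models. A rough sketch, where the 8-gram window and function names are illustrative choices rather than any standard API:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Set of lowercased word n-grams in a text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that appear verbatim in the document.

    Values near 1.0 suggest the item (or a close paraphrase) was in
    the training data; items above a chosen threshold are typically
    flagged or excluded from evaluation.
    """
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item shorter than n words: no signal
    return len(item & ngrams(corpus_doc, n)) / len(item)
```

In practice this runs against an index of the training corpus rather than single documents, but the per-document logic is the same.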
Related: *Evaluation frameworks* covers platforms for running and tracking benchmark evaluations.