ML Benchmark Reference

Benchmarks provide standardized evaluation datasets and protocols for comparing model performance. The following table covers major benchmarks used in language model evaluation, code generation, and reasoning tasks. Scores are approximate and reflect published results as of early 2025.

Language Model Benchmarks

| Benchmark | Task Type | Metric | Dataset Size | Description |
|---|---|---|---|---|
| MMLU | Knowledge / reasoning | Accuracy (5-shot) | 15,908 questions across 57 subjects | Massive Multitask Language Understanding. Multiple-choice questions spanning STEM, humanities, social sciences, and professional domains. Tests breadth of knowledge. |
| HellaSwag | Commonsense reasoning | Accuracy (10-shot) | 10,042 questions (validation set) | Sentence completion requiring commonsense reasoning about physical and social situations. Adversarially filtered to be easy for humans but difficult for models. |
| ARC (Challenge) | Science reasoning | Accuracy (25-shot) | 2,590 questions (Challenge set) | AI2 Reasoning Challenge. Grade-school science questions partitioned into Easy and Challenge sets. The Challenge set contains questions that retrieval and co-occurrence methods fail on. |
| WinoGrande | Coreference resolution | Accuracy (5-shot) | 1,267 problems (validation set) | Fill-in-the-blank coreference problems requiring commonsense reasoning. Adversarially constructed and crowdsource-validated for quality. |
| TruthfulQA | Truthfulness | MC1 / MC2 accuracy | 817 questions | Questions designed to elicit common misconceptions and falsehoods. Tests whether models generate truthful answers rather than repeating popular but incorrect claims. |
| GSM8K | Math reasoning | Accuracy (exact match, 5-shot CoT) | ~8,500 problems (1,319 test) | Grade School Math 8K. Multi-step arithmetic word problems requiring 2-8 reasoning steps. Tests mathematical reasoning with chain-of-thought prompting. |
| HumanEval | Code generation | pass@1 (functional correctness) | 164 problems | Python programming problems with function signatures, docstrings, and unit tests. Measures functional correctness of generated code, not just syntactic validity. |
| MBPP | Code generation | pass@1 | 974 problems (500 test) | Mostly Basic Python Problems. Short Python functions covering basic programming constructs. Complements HumanEval with more diverse problem types. |
| MATH | Mathematics | Accuracy (exact match) | 12,500 problems (5,000 test) | Competition-level mathematics problems from AMC, AIME, and other competitions. Covers prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. Difficulty levels 1-5. |
| GPQA | Expert knowledge | Accuracy | 448 questions (main set; 198 in the Diamond subset) | Graduate-Level Google-Proof Q&A. Expert-written multiple-choice questions in biology, physics, and chemistry that resist web search. Skilled non-expert validators (experts in other domains, with web access) score around 34%, versus roughly 65% for in-domain experts. |
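
The pass@1 metric listed for HumanEval and MBPP is often computed with the unbiased pass@k estimator published alongside HumanEval: generate n samples per problem, count the c samples that pass all unit tests, and estimate 1 - C(n-c, k) / C(n, k). The sketch below assumes that setup; the function names and the (n, c) input format are illustrative, not taken from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing all tests, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass. Generating more than one sample per problem (n > 1) mainly lowers the variance of the reported number.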

Benchmark Selection Guidelines

  • Choose benchmarks that match the model's intended use case. A code generation model should be evaluated on HumanEval/MBPP, not just MMLU.
  • Report results on multiple benchmarks to guard against Goodhart's Law: optimizing for a single metric tends to come at the expense of general capability.
  • Specify the exact evaluation protocol: number of few-shot examples, chain-of-thought prompting, temperature, and sampling method.
  • Be aware of data contamination. Models trained on web data may have seen benchmark questions during training. Newer benchmarks are less likely to be contaminated.
  • Watch for benchmark saturation. When leading models score above 95%, the benchmark no longer discriminates well between them; move to harder benchmarks (e.g., from MMLU to GPQA, or from GSM8K to MATH).
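
Two of the guidelines above lend themselves to a small sketch: recording the exact evaluation protocol next to each reported score, and flagging a benchmark as saturated once leading scores pass 95%. The class and function names, the field layout, and the example values are all hypothetical; the scores are placeholders, not published results.

```python
from dataclasses import asdict, dataclass

# Threshold taken from the saturation guideline above.
SATURATION_THRESHOLD = 0.95


@dataclass(frozen=True)
class EvalProtocol:
    """Settings that should accompany any reported benchmark score."""

    benchmark: str
    metric: str
    num_fewshot: int
    chain_of_thought: bool
    temperature: float
    sampling: str  # free-form label, e.g. "greedy" or "nucleus"


def is_saturated(leading_scores: list[float],
                 threshold: float = SATURATION_THRESHOLD) -> bool:
    """True when every leading score (as a fraction in [0, 1]) exceeds the threshold."""
    return bool(leading_scores) and all(s > threshold for s in leading_scores)


# Example protocol record; temperature and sampling values are assumptions.
gsm8k_protocol = EvalProtocol(
    benchmark="GSM8K",
    metric="exact match",
    num_fewshot=5,
    chain_of_thought=True,
    temperature=0.0,
    sampling="greedy",
)

if __name__ == "__main__":
    print(asdict(gsm8k_protocol))
    # Placeholder scores for illustration only.
    print(is_saturated([0.97, 0.96, 0.955]))
```

Freezing the dataclass keeps a protocol record from being changed after a score has been logged against it, so two results can be compared simply by checking that their protocol records are equal.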

Related: Evaluation frameworks covers platforms for running and tracking benchmark evaluations.