Model Evaluation Methodology
Data Splitting Strategies
Proper data splitting is the foundation of reliable model evaluation. The goal is to estimate how a model will perform on unseen data while avoiding data leakage.
Train / Validation / Test Split
The standard approach divides data into three non-overlapping partitions. The training set is used to fit model parameters. The validation set is used for hyperparameter tuning and model selection. The test set is held out until final evaluation and used only once to report performance.
- Common ratios: 70/15/15 or 80/10/10 for datasets above 10,000 samples.
- Large datasets (>1M samples): 98/1/1 is acceptable since even 1% provides a statistically meaningful test set.
- Small datasets (<1,000 samples): Use cross-validation instead of a fixed split.
- Stratified splitting: Maintain class proportions in each split when working with imbalanced data.
- Temporal splitting: For time-series data, split chronologically to avoid future data leaking into training.
- Group splitting: When samples are grouped (e.g., multiple images per patient), ensure all samples from the same group are in the same split.
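A stratified split like the one described above can be sketched in a few lines of pure Python. The helper name `stratified_split` is illustrative, not a library function; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split sample indices into train/val/test lists while
    preserving each class's proportion in every partition.

    `ratios` must sum to 1. Returns (train, val, test) index lists.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        # Split this class's indices, then pool across classes.
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test
```

The same grouping idea extends to group splitting: bucket by group ID instead of class label and assign whole buckets to one partition.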
Cross-Validation
Cross-validation provides more robust performance estimates by training and evaluating on multiple partitions of the data.
- K-Fold (k=5 or k=10): Data is divided into k folds. The model is trained k times, each time using k-1 folds for training and 1 fold for validation. The final metric is the mean across all folds.
- Stratified K-Fold: Same as K-Fold but preserving class distribution in each fold. Preferred for classification tasks with imbalanced classes.
- Leave-One-Out (LOO): K-Fold where k equals the number of samples. Computationally expensive but useful for very small datasets (<100 samples).
- Repeated K-Fold: K-Fold repeated multiple times with different random seeds. Provides confidence intervals on performance estimates.
- Nested Cross-Validation: Inner loop for hyperparameter tuning, outer loop for performance estimation. Avoids optimistic bias from tuning on the test fold.
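The K-Fold procedure can be sketched as a small index generator; `kfold_indices` is an illustrative helper (libraries such as scikit-learn provide `KFold` and `StratifiedKFold` for production use). Each fold serves as the validation set exactly once, and the per-fold scores are averaged for the final estimate.

```python
import random

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold CV.

    Each of the k folds is used as the validation set once,
    with the remaining k-1 folds forming the training set.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        val = folds[i]
        train = [j for m in range(k) if m != i for j in folds[m]]
        yield train, val
```

Repeated K-Fold simply reruns this generator with different seeds; nested cross-validation runs a second, inner copy of it on each outer training set to tune hyperparameters.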
Metrics by Task Type
The choice of evaluation metric depends on the task type, class distribution, and business requirements. The following tables summarize common metrics organized by task.
Classification Metrics
| Metric | Formula | When to Use | Range |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Balanced classes only. Misleading for imbalanced data. | 0 to 1 |
| Precision | TP / (TP + FP) | When false positives are costly (e.g., spam filtering). | 0 to 1 |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (e.g., disease detection). | 0 to 1 |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall. Good default for imbalanced data. | 0 to 1 |
| AUC-ROC | Area under ROC curve | Threshold-independent evaluation. Measures discriminative ability across all thresholds. | 0 to 1 (0.5 = random) |
| AUC-PR | Area under Precision-Recall curve | Preferred over AUC-ROC for highly imbalanced datasets. | 0 to 1 |
| Log Loss | -mean(y*log(p) + (1-y)*log(1-p)) | When probability calibration matters, not just ranking. | 0 to inf |
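The confusion-matrix formulas in the table translate directly into code. This is a minimal sketch for binary labels; `classification_metrics` is an illustrative name, and the zero-denominator guards are a common convention (scikit-learn's `precision_score` and friends behave similarly by default).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (0/1),
    computed from the TP/TN/FP/FN counts as in the table above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (no positive predictions, etc.).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```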
Regression Metrics
| Metric | Formula | When to Use | Range |
|---|---|---|---|
| MAE | mean(|y - y_hat|) | Robust to outliers. Interpretable in original units. | 0 to inf |
| MSE | mean((y - y_hat)^2) | Penalizes large errors more than MAE. Standard loss function. | 0 to inf |
| RMSE | sqrt(mean((y - y_hat)^2)) | Same units as target variable. More interpretable than MSE. | 0 to inf |
| R-squared | 1 - SS_res / SS_tot | Proportion of variance explained. Compare models on same dataset. | -inf to 1 |
| MAPE | mean(|y - y_hat| / |y|) * 100 | Percentage-based. Undefined when y=0. Good for business reporting. | 0% to inf |
Generation / NLP Metrics
| Metric | Description | Task Type | Range |
|---|---|---|---|
| BLEU | N-gram overlap between generated and reference text. Brevity penalty for short outputs. | Machine translation | 0 to 1 |
| ROUGE-L | Longest common subsequence between generated and reference text. | Summarization | 0 to 1 |
| Perplexity | Exponentiated average negative log-likelihood. Lower is better. Measures how well a model predicts a sample. | Language modeling | 1 to inf |
| METEOR | Unigram matching with stemming and synonym support. Correlates better with human judgment than BLEU. | Translation, captioning | 0 to 1 |
| BERTScore | Semantic similarity using contextual embeddings from BERT. Captures meaning beyond surface-level n-gram overlap. | General text generation | 0 to 1 |
| pass@k | Probability that at least one of k generated code samples passes all unit tests. | Code generation | 0 to 1 |
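Of these, pass@k is the simplest to compute exactly. A commonly used unbiased estimator (introduced with the Codex work) generates n >= k samples, counts the c that pass all tests, and estimates the probability that a random size-k subset contains at least one passing sample:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples generated, c of them
    pass all unit tests; returns P(at least one of k passes)."""
    if n - c < k:
        # Every size-k subset must contain a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5: a uniformly chosen single sample passes half the time.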
Metric Selection Guidelines
- Always report multiple metrics. A single metric rarely captures all aspects of model performance.
- Choose metrics aligned with business impact. If false negatives cost 10x more than false positives, optimize for recall.
- Report confidence intervals or standard deviations, not just point estimates.
- Disaggregate metrics across relevant subgroups to detect performance disparities.
- Avoid accuracy as the sole metric for imbalanced datasets. A model that always predicts the majority class achieves accuracy equal to the majority-class proportion, which can look high while providing no value.
- For generative tasks, complement automated metrics with human evaluation when feasible.
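One simple way to produce the recommended confidence intervals is a percentile bootstrap over per-fold (or per-resample) scores. `bootstrap_ci` is an illustrative sketch, not a library routine; assumptions here are the percentile method and a default 95% interval.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of
    `scores`: resample with replacement, recompute the mean each
    time, and take the (alpha/2, 1 - alpha/2) percentiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

A wide interval from only five cross-validation folds is itself a useful signal that the point estimate should not be over-interpreted.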
Next: Bias and fairness testing covers how to evaluate model performance across demographic groups.