Model Evaluation Methodology
Data Splitting Strategies
Proper data splitting is the foundation of reliable model evaluation. The goal is to estimate how a model will perform on unseen data while avoiding data leakage.
Train / Validation / Test Split
The standard approach divides data into three non-overlapping partitions. The training set is used to fit model parameters. The validation set is used for hyperparameter tuning and model selection. The test set is held out until final evaluation and used only once to report performance.
- Common ratios: 70/15/15 or 80/10/10 for datasets above 10,000 samples.
- Large datasets (>1M samples): 98/1/1 is acceptable since even 1% provides a statistically meaningful test set.
- Small datasets (<1,000 samples): Use cross-validation instead of a fixed split.
- Stratified splitting: Maintain class proportions in each split when working with imbalanced data.
- Temporal splitting: For time-series data, split chronologically to avoid future data leaking into training.
- Group splitting: When samples are grouped (e.g., multiple images per patient), ensure all samples from the same group are in the same split.
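A stratified split like the one described above can be sketched in a few lines of pure Python. The helper name `stratified_split` is illustrative, not a library function; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split sample indices into train/val/test lists while
    preserving each class's proportion in every partition.

    `ratios` must sum to 1. Returns (train, val, test) index lists.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        # Split this class's indices, then pool across classes.
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test
```

The same grouping idea extends to group splitting: bucket by group ID instead of class label and assign whole buckets to one partition.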
Cross-Validation
Cross-validation provides more robust performance estimates by training and evaluating on multiple partitions of the data.
- K-Fold (k=5 or k=10): Data is divided into k folds. The model is trained k times, each time using k-1 folds for training and 1 fold for validation. The final metric is the mean across all folds.
- Stratified K-Fold: Same as K-Fold but preserving class distribution in each fold. Preferred for classification tasks with imbalanced classes.
- Leave-One-Out (LOO): K-Fold where k equals the number of samples. Computationally expensive but useful for very small datasets (<100 samples).
- Repeated K-Fold: K-Fold repeated multiple times with different random seeds. Provides confidence intervals on performance estimates.
- Nested Cross-Validation: Inner loop for hyperparameter tuning, outer loop for performance estimation. Avoids optimistic bias from tuning on the test fold.
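The K-Fold procedure can be sketched as a small index generator; `kfold_indices` is an illustrative helper (libraries such as scikit-learn provide `KFold` and `StratifiedKFold` for production use). Each fold serves as the validation set exactly once, and the per-fold scores are averaged for the final estimate.

```python
import random

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold CV.

    Each of the k folds is used as the validation set once,
    with the remaining k-1 folds forming the training set.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        val = folds[i]
        train = [j for m in range(k) if m != i for j in folds[m]]
        yield train, val
```

Repeated K-Fold simply reruns this generator with different seeds; nested cross-validation runs a second, inner copy of it on each outer training set to tune hyperparameters.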
Metrics by Task Type
The choice of evaluation metric depends on the task type, class distribution, and business requirements. The following tables summarize common metrics organized by task.
Classification Metrics
| Metric | Formula | When to Use | Range |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Balanced classes only. Misleading for imbalanced data. | 0 to 1 |
| Precision | TP / (TP + FP) | When false positives are costly (e.g., spam filtering). | 0 to 1 |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (e.g., disease detection). | 0 to 1 |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall. Good default for imbalanced data. | 0 to 1 |
| AUC-ROC | Area under ROC curve | Threshold-independent evaluation. Measures discriminative ability across all thresholds. | 0 to 1 (0.5 = random) |
| AUC-PR | Area under Precision-Recall curve | Preferred over AUC-ROC for highly imbalanced datasets. | 0 to 1 |
| Log Loss | -mean(y*log(p) + (1-y)*log(1-p)) | When probability calibration matters, not just ranking. | 0 to inf |
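The confusion-matrix formulas in the table translate directly into code. This is a minimal sketch for binary labels; `classification_metrics` is an illustrative name, and the zero-denominator guards are a common convention (scikit-learn's `precision_score` and friends behave similarly by default).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (0/1),
    computed from the TP/TN/FP/FN counts as in the table above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (no positive predictions, etc.).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```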
Regression Metrics
| Metric | Formula | When to Use | Range |
|---|---|---|---|
| MAE | mean(|y - y_hat|) | Robust to outliers. Interpretable in original units. | 0 to inf |
| MSE | mean((y - y_hat)^2) | Penalizes large errors more than MAE. Standard loss function. | 0 to inf |
| RMSE | sqrt(mean((y - y_hat)^2)) | Same units as target variable. More interpretable than MSE. | 0 to inf |
| R-squared | 1 - SS_res / SS_tot | Proportion of variance explained. Compare models on same dataset. | -inf to 1 |
| MAPE | mean(|y - y_hat| / |y|) * 100 | Percentage-based. Undefined when y=0. Good for business reporting. | 0% to inf |
Generation / NLP Metrics
| Metric | Description | Task Type | Range |
|---|---|---|---|
| BLEU | N-gram overlap between generated and reference text. Brevity penalty for short outputs. | Machine translation | 0 to 1 |
| ROUGE-L | Longest common subsequence between generated and reference text. | Summarization | 0 to 1 |
| Perplexity | Exponentiated average negative log-likelihood. Lower is better. Measures how well a model predicts a sample. | Language modeling | 1 to inf |
| METEOR | Unigram matching with stemming and synonym support. Correlates better with human judgment than BLEU. | Translation, captioning | 0 to 1 |
| BERTScore | Semantic similarity using contextual embeddings from BERT. Captures meaning beyond surface-level n-gram overlap. | General text generation | 0 to 1 |
| pass@k | Probability that at least one of k generated code samples passes all unit tests. | Code generation | 0 to 1 |
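Of these, pass@k is the simplest to compute exactly. A commonly used unbiased estimator (introduced with the Codex work) generates n >= k samples, counts the c that pass all tests, and estimates the probability that a random size-k subset contains at least one passing sample:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples generated, c of them
    pass all unit tests; returns P(at least one of k passes)."""
    if n - c < k:
        # Every size-k subset must contain a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5: a uniformly chosen single sample passes half the time.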
Metric Selection Guidelines
- Always report multiple metrics. A single metric rarely captures all aspects of model performance.
- Choose metrics aligned with business impact. If false negatives cost 10x more than false positives, optimize for recall.
- Report confidence intervals or standard deviations, not just point estimates.
- Disaggregate metrics across relevant subgroups to detect performance disparities.
- Avoid accuracy as the sole metric for imbalanced datasets. A model that always predicts the majority class achieves accuracy equal to the majority-class proportion, which can look high while providing no value.
- For generative tasks, complement automated metrics with human evaluation when feasible.
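One simple way to produce the recommended confidence intervals is a percentile bootstrap over per-fold (or per-resample) scores. `bootstrap_ci` is an illustrative sketch, not a library routine; assumptions here are the percentile method and a default 95% interval.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of
    `scores`: resample with replacement, recompute the mean each
    time, and take the (alpha/2, 1 - alpha/2) percentiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

A wide interval from only five cross-validation folds is itself a useful signal that the point estimate should not be over-interpreted.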
Next: Bias and fairness testing covers how to evaluate model performance across demographic groups.