Bias and Fairness Testing
Algorithmic bias occurs when an ML model produces systematically different outcomes for different groups, particularly along protected characteristics such as race, gender, age, or disability status. Fairness testing measures these disparities and determines whether they exceed acceptable thresholds.
No single definition of fairness satisfies all contexts. Practitioners must choose fairness criteria that align with the deployment context, legal requirements, and stakeholder values. Many fairness metrics are mathematically incompatible: for example, when base rates differ between groups, a classifier cannot simultaneously satisfy calibration and equalized odds except in degenerate cases (the impossibility theorem), so tradeoffs are inevitable.
Fairness Metrics
| Metric | Definition | Formula | When to Use |
|---|---|---|---|
| Demographic Parity | The probability of a positive prediction is equal across groups. Also called statistical parity or independence. | P(Y_hat=1|A=a) = P(Y_hat=1|A=b) | When equal selection rates are legally or ethically required regardless of base rates. |
| Equalized Odds | True positive rate and false positive rate are equal across groups. Also called separation. | P(Y_hat=1|Y=y,A=a) = P(Y_hat=1|Y=y,A=b) for y in {0,1} | When errors should be distributed equally across groups. Common in criminal justice, hiring. |
| Equal Opportunity | True positive rate is equal across groups. A relaxation of equalized odds focusing only on positive outcomes. | P(Y_hat=1|Y=1,A=a) = P(Y_hat=1|Y=1,A=b) | When it is most important that qualified individuals are treated equally. |
| Predictive Parity | Positive predictive value (precision) is equal across groups. Also called sufficiency. | P(Y=1|Y_hat=1,A=a) = P(Y=1|Y_hat=1,A=b) | When a positive prediction should mean the same thing regardless of group. |
| Calibration | Predicted probabilities correspond to actual outcome rates within each group. A model predicting 70% should be correct 70% of the time for all groups. | P(Y=1|S=s,A=a) = P(Y=1|S=s,A=b) for all scores s | When risk scores are used for decision-making (e.g., lending, recidivism). |
| Counterfactual Fairness | The prediction would remain the same if the individual had belonged to a different group, all else being equal. Requires a causal model. | Y_hat_{A<-a}(U) = Y_hat_{A<-b}(U) | When individual-level fairness is required and a causal model is available. |
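The group-level metrics above reduce to simple conditional rates. A minimal sketch (pure Python, with illustrative toy data) of the demographic parity and equal opportunity gaps as defined in the table:

```python
# Demographic parity gap: |P(Y_hat=1|A=a) - P(Y_hat=1|A=b)|
# Equal opportunity gap:  |P(Y_hat=1|Y=1,A=a) - P(Y_hat=1|Y=1,A=b)|
# Helper names and data are illustrative, not from any specific library.

def selection_rate(y_pred, group, a):
    """P(Y_hat = 1 | A = a): share of positive predictions in group a."""
    preds = [p for p, g in zip(y_pred, group) if g == a]
    return sum(preds) / len(preds)

def true_positive_rate(y_true, y_pred, group, a):
    """P(Y_hat = 1 | Y = 1, A = a): recall within group a."""
    hits = [p for t, p, g in zip(y_true, y_pred, group) if g == a and t == 1]
    return sum(hits) / len(hits)

# Toy data: binary labels, predictions, and a binary protected attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]

dp_gap = abs(selection_rate(y_pred, group, "a")
             - selection_rate(y_pred, group, "b"))      # 0.50 vs 0.75 -> 0.25
eo_gap = abs(true_positive_rate(y_true, y_pred, group, "a")
             - true_positive_rate(y_true, y_pred, group, "b"))
```

A gap of zero means the metric is satisfied exactly; in practice a tolerance is chosen (see the disparity thresholds in the methodology below).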
Intersectional Analysis
Intersectional analysis evaluates model performance across combinations of protected attributes (e.g., Black women, elderly Asian men) rather than each attribute in isolation. This is critical because:
- A model may appear fair on gender and race separately but show significant bias at the intersection (e.g., fair for men overall, fair for Black people overall, but unfair for Black women specifically).
- Intersectional subgroups are often smaller, leading to higher variance in performance estimates. Report confidence intervals (e.g., via bootstrapping) so that noise is not mistaken for bias.
- The number of intersectional groups grows combinatorially. Prioritize groups most likely to be harmed based on domain knowledge.
- Report sample sizes for each subgroup so readers can assess statistical reliability.
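The steps above can be sketched as a disaggregation pass that keys metrics on tuples of protected attributes and reports sample sizes alongside each estimate (data and attribute values here are hypothetical):

```python
# Intersectional disaggregation sketch: compute per-subgroup accuracy and
# sample size, keyed on combinations of protected attributes.
from collections import defaultdict

def disaggregate(y_true, y_pred, attrs):
    """Return {attr_combo: (accuracy, n)} for each intersectional subgroup."""
    buckets = defaultdict(list)
    for t, p, key in zip(y_true, y_pred, attrs):
        buckets[key].append(t == p)
    return {key: (sum(hits) / len(hits), len(hits))
            for key, hits in buckets.items()}

# Toy data: each row carries a (gender, race) tuple.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]
attrs  = [("F", "Black"), ("F", "Black"), ("F", "White"),
          ("M", "Black"), ("M", "White"), ("F", "Black")]

report = disaggregate(y_true, y_pred, attrs)
# report[("F", "Black")] pairs subgroup accuracy with its sample size,
# so a reader can judge whether an apparent disparity is statistically reliable.
```

Subgroups with very small `n` should be flagged rather than over-interpreted, per the variance caveat above.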
Fairness Testing Tools
| Tool | Developer | Language | Key Features | License |
|---|---|---|---|---|
| AI Fairness 360 (AIF360) | IBM Research | Python | 70+ fairness metrics, 10+ bias mitigation algorithms (pre-processing, in-processing, post-processing), interactive web demo. | Apache 2.0 |
| Fairlearn | Microsoft | Python | Metrics dashboard, constraint-based mitigation (ExponentiatedGradient, ThresholdOptimizer), scikit-learn compatible API. | MIT |
| Aequitas | University of Chicago DSAPP | Python | Audit tool focused on policy context. Generates bias audit reports with group-level metrics. Web-based interface. | MIT |
| What-If Tool | Google PAIR | Python / JS | Interactive visual exploration of ML model behavior. Fairness analysis, counterfactual exploration, partial dependence plots. Integrates with TensorBoard. | Apache 2.0 |
| Responsible AI Toolbox | Microsoft | Python | Unified dashboard combining error analysis, fairness assessment, model interpretability, and counterfactual analysis. | MIT |
Testing Methodology
- Define protected attributes relevant to the deployment context (e.g., gender, race, age, disability). Consider legal requirements in the jurisdiction.
- Select fairness metrics aligned with the harm model. Allocation harms (who gets what) favor demographic parity. Quality-of-service harms (who gets accurate results) favor equalized odds.
- Compute disaggregated metrics for each group and intersectional subgroup. Report sample sizes alongside metrics.
- Set disparity thresholds for acceptable differences (e.g., the four-fifths (80%) rule from the US EEOC's Uniform Guidelines, or a maximum 5% difference in TPR between groups).
- Apply mitigation if thresholds are exceeded: pre-processing (resampling, reweighting), in-processing (adversarial debiasing, constrained optimization), or post-processing (threshold adjustment, reject-option classification).
- Document results in the model card, including metrics before and after mitigation, chosen thresholds, and tradeoffs made.
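The disparity-threshold step can be sketched as a simple check. Here is a hedged illustration of the four-fifths (80%) rule, which compares each group's selection rate against the most-favored group's rate; group names and rates are illustrative:

```python
# Four-fifths rule sketch: a group "passes" if its selection rate is at least
# 80% of the highest group's selection rate. Thresholds and data are examples.

def four_fifths_check(selection_rates, threshold=0.8):
    """Return {group: passes} comparing each rate to the max group rate."""
    reference = max(selection_rates.values())
    return {group: rate / reference >= threshold
            for group, rate in selection_rates.items()}

rates = {"group_a": 0.60, "group_b": 0.45, "group_c": 0.30}
result = four_fifths_check(rates)
# group_b: 0.45 / 0.60 = 0.75 < 0.8 -> fails
# group_c: 0.30 / 0.60 = 0.50 < 0.8 -> fails
```

Failures at this step would trigger the mitigation options listed above (pre-, in-, or post-processing), with before/after metrics recorded in the model card.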
Next: AI red teaming covers adversarial testing techniques for language models and generative AI.