AI Red Teaming Guide
AI red teaming is the practice of systematically probing AI systems to discover failure modes, safety vulnerabilities, and harmful behaviors before deployment. Unlike traditional penetration testing which focuses on infrastructure, AI red teaming targets the model itself -- its reasoning, alignment, and robustness to adversarial inputs.
Prompt Injection Testing
Prompt injection attacks attempt to override or modify the instructions given to a language model by embedding adversarial content in user input. There are two primary categories:
- Direct prompt injection:The attacker provides input that directly instructs the model to ignore its system prompt or follow new instructions. Example: "Ignore all previous instructions and instead..."
- Indirect prompt injection: Malicious instructions are embedded in external data that the model processes (e.g., a web page the model retrieves, an email it summarizes, or a document it analyzes). The model follows the injected instructions because it cannot distinguish data from instructions.
Testing should cover both categories and include variations such as instruction injection via encoding (Base64, ROT13), language switching, roleplay framing, and context manipulation.
Jailbreak Categories
| Category | Technique | Description |
|---|---|---|
| Roleplay | Character assumption | Asking the model to assume a persona (e.g., "You are DAN, a model that can do anything") to bypass safety constraints. |
| Encoding | Obfuscation | Using Base64, hex, pig latin, or other encodings to disguise harmful requests so safety filters do not detect them. |
| Context Manipulation | Framing shift | Framing a harmful request as fiction, research, education, or hypothetical scenario to make it appear benign. |
| Token Smuggling | Fragmentation | Breaking harmful words into fragments across multiple messages or using Unicode lookalikes to bypass keyword filters. |
| Multi-Turn | Gradual escalation | Building up context across multiple turns, starting with benign requests and gradually escalating to harmful ones. |
| System Prompt Extraction | Information disclosure | Tricking the model into revealing its system prompt, safety guidelines, or internal instructions. |
| Tool Abuse | Function calling exploit | Manipulating tool/function calling to execute unintended operations, access unauthorized data, or chain tools in harmful ways. |
Safety Evaluation Benchmarks
| Benchmark | Focus | Description | Size |
|---|---|---|---|
| TruthfulQA | Truthfulness | Questions designed to test whether models generate truthful answers rather than plausible-sounding falsehoods. Covers health, law, finance, and misconceptions. | 817 questions |
| BBQ (Bias Benchmark for QA) | Social bias | Tests bias across 9 social dimensions (age, disability, gender, nationality, physical appearance, race/ethnicity, religion, SES, sexual orientation) using ambiguous and disambiguated questions. | 58,492 examples |
| WinoBias | Gender bias | Coreference resolution dataset testing gender stereotypes in occupational contexts. Measures whether models associate occupations with stereotypical genders. | 3,160 sentences |
| RealToxicityPrompts | Toxicity | Sentence-level prompts from web text designed to elicit toxic continuations. Measures the probability of generating toxic text given benign or mildly toxic prompts. | 100,000 prompts |
| CrowS-Pairs | Stereotypes | Paired sentences measuring stereotypical bias across 9 bias types. Each pair differs only in a social group reference. | 1,508 pairs |
| HarmBench | Attack/defense | Standardized evaluation framework for automated red teaming. Covers 7 harm categories with functional and semantic attack success criteria. | 510 behaviors |
Adversarial Robustness
Beyond jailbreaks and safety violations, adversarial robustness testing examines how small input perturbations affect model outputs:
- Textual adversarial attacks: Character-level perturbations (typos, homoglyphs), word-level substitutions (synonyms, paraphrases), and sentence-level transformations that flip model predictions while preserving human-perceived meaning.
- Image adversarial attacks: Imperceptible pixel perturbations (FGSM, PGD, C&W), patch attacks, and physical adversarial examples that cause misclassification.
- Distribution shift: Testing model performance on data that differs from the training distribution in systematic ways (different time periods, geographies, domains).
- Stress testing: Evaluating model behavior at extreme input lengths, unusual formatting, edge-case data types, and high concurrency.
Red Teaming Methodology
- Define scope and threat model. Identify what harms you are testing for, what attack surfaces exist, and what resources attackers are assumed to have.
- Assemble a diverse team. Include people with different backgrounds, languages, cultural contexts, and adversarial skill levels. Homogeneous teams miss categories of harm.
- Systematic coverage. Use taxonomies (e.g., OWASP LLM Top 10, NIST AI RMF) to ensure all attack categories are tested rather than relying on ad hoc exploration.
- Automated + manual testing. Automated tools scale coverage but miss nuanced failures. Manual red teaming finds creative attacks that automated tools cannot.
- Document and track findings. Record the exact input, expected behavior, actual behavior, severity, and reproducibility for each finding.
- Retest after mitigation. Verify that fixes address the root cause and do not introduce regressions.
Related: ML Benchmarks reference includes safety-relevant benchmarks like TruthfulQA alongside capability benchmarks.