AI Red Teaming Guide

AI red teaming is the practice of systematically probing AI systems to discover failure modes, safety vulnerabilities, and harmful behaviors before deployment. Unlike traditional penetration testing which focuses on infrastructure, AI red teaming targets the model itself -- its reasoning, alignment, and robustness to adversarial inputs.

Prompt Injection Testing

Prompt injection attacks attempt to override or modify the instructions given to a language model by embedding adversarial content in user input. There are two primary categories:

Direct prompt injection:The attacker provides input that directly instructs the model to ignore its system prompt or follow new instructions. Example: "Ignore all previous instructions and instead..."
Indirect prompt injection: Malicious instructions are embedded in external data that the model processes (e.g., a web page the model retrieves, an email it summarizes, or a document it analyzes). The model follows the injected instructions because it cannot distinguish data from instructions.

Testing should cover both categories and include variations such as instruction injection via encoding (Base64, ROT13), language switching, roleplay framing, and context manipulation.

Jailbreak Categories

Category	Technique	Description
Roleplay	Character assumption	Asking the model to assume a persona (e.g., "You are DAN, a model that can do anything") to bypass safety constraints.
Encoding	Obfuscation	Using Base64, hex, pig latin, or other encodings to disguise harmful requests so safety filters do not detect them.
Context Manipulation	Framing shift	Framing a harmful request as fiction, research, education, or hypothetical scenario to make it appear benign.
Token Smuggling	Fragmentation	Breaking harmful words into fragments across multiple messages or using Unicode lookalikes to bypass keyword filters.
Multi-Turn	Gradual escalation	Building up context across multiple turns, starting with benign requests and gradually escalating to harmful ones.
System Prompt Extraction	Information disclosure	Tricking the model into revealing its system prompt, safety guidelines, or internal instructions.
Tool Abuse	Function calling exploit	Manipulating tool/function calling to execute unintended operations, access unauthorized data, or chain tools in harmful ways.

Safety Evaluation Benchmarks

Benchmark	Focus	Description	Size
TruthfulQA	Truthfulness	Questions designed to test whether models generate truthful answers rather than plausible-sounding falsehoods. Covers health, law, finance, and misconceptions.	817 questions
BBQ (Bias Benchmark for QA)	Social bias	Tests bias across 9 social dimensions (age, disability, gender, nationality, physical appearance, race/ethnicity, religion, SES, sexual orientation) using ambiguous and disambiguated questions.	58,492 examples
WinoBias	Gender bias	Coreference resolution dataset testing gender stereotypes in occupational contexts. Measures whether models associate occupations with stereotypical genders.	3,160 sentences
RealToxicityPrompts	Toxicity	Sentence-level prompts from web text designed to elicit toxic continuations. Measures the probability of generating toxic text given benign or mildly toxic prompts.	100,000 prompts
CrowS-Pairs	Stereotypes	Paired sentences measuring stereotypical bias across 9 bias types. Each pair differs only in a social group reference.	1,508 pairs
HarmBench	Attack/defense	Standardized evaluation framework for automated red teaming. Covers 7 harm categories with functional and semantic attack success criteria.	510 behaviors

Adversarial Robustness

Beyond jailbreaks and safety violations, adversarial robustness testing examines how small input perturbations affect model outputs:

Textual adversarial attacks: Character-level perturbations (typos, homoglyphs), word-level substitutions (synonyms, paraphrases), and sentence-level transformations that flip model predictions while preserving human-perceived meaning.
Image adversarial attacks: Imperceptible pixel perturbations (FGSM, PGD, C&W), patch attacks, and physical adversarial examples that cause misclassification.
Distribution shift: Testing model performance on data that differs from the training distribution in systematic ways (different time periods, geographies, domains).
Stress testing: Evaluating model behavior at extreme input lengths, unusual formatting, edge-case data types, and high concurrency.

Red Teaming Methodology

Define scope and threat model. Identify what harms you are testing for, what attack surfaces exist, and what resources attackers are assumed to have.
Assemble a diverse team. Include people with different backgrounds, languages, cultural contexts, and adversarial skill levels. Homogeneous teams miss categories of harm.
Systematic coverage. Use taxonomies (e.g., OWASP LLM Top 10, NIST AI RMF) to ensure all attack categories are tested rather than relying on ad hoc exploration.
Automated + manual testing. Automated tools scale coverage but miss nuanced failures. Manual red teaming finds creative attacks that automated tools cannot.
Document and track findings. Record the exact input, expected behavior, actual behavior, severity, and reproducibility for each finding.
Retest after mitigation. Verify that fixes address the root cause and do not introduce regressions.

Related: ML Benchmarks reference includes safety-relevant benchmarks like TruthfulQA alongside capability benchmarks.