AI Red Teaming Guide

AI red teaming is the practice of systematically probing AI systems to discover failure modes, safety vulnerabilities, and harmful behaviors before deployment. Unlike traditional penetration testing which focuses on infrastructure, AI red teaming targets the model itself -- its reasoning, alignment, and robustness to adversarial inputs.

Prompt Injection Testing

Prompt injection attacks attempt to override or modify the instructions given to a language model by embedding adversarial content in user input. There are two primary categories:

  • Direct prompt injection:The attacker provides input that directly instructs the model to ignore its system prompt or follow new instructions. Example: "Ignore all previous instructions and instead..."
  • Indirect prompt injection: Malicious instructions are embedded in external data that the model processes (e.g., a web page the model retrieves, an email it summarizes, or a document it analyzes). The model follows the injected instructions because it cannot distinguish data from instructions.

Testing should cover both categories and include variations such as instruction injection via encoding (Base64, ROT13), language switching, roleplay framing, and context manipulation.

Jailbreak Categories

CategoryTechniqueDescription
RoleplayCharacter assumptionAsking the model to assume a persona (e.g., "You are DAN, a model that can do anything") to bypass safety constraints.
EncodingObfuscationUsing Base64, hex, pig latin, or other encodings to disguise harmful requests so safety filters do not detect them.
Context ManipulationFraming shiftFraming a harmful request as fiction, research, education, or hypothetical scenario to make it appear benign.
Token SmugglingFragmentationBreaking harmful words into fragments across multiple messages or using Unicode lookalikes to bypass keyword filters.
Multi-TurnGradual escalationBuilding up context across multiple turns, starting with benign requests and gradually escalating to harmful ones.
System Prompt ExtractionInformation disclosureTricking the model into revealing its system prompt, safety guidelines, or internal instructions.
Tool AbuseFunction calling exploitManipulating tool/function calling to execute unintended operations, access unauthorized data, or chain tools in harmful ways.

Safety Evaluation Benchmarks

BenchmarkFocusDescriptionSize
TruthfulQATruthfulnessQuestions designed to test whether models generate truthful answers rather than plausible-sounding falsehoods. Covers health, law, finance, and misconceptions.817 questions
BBQ (Bias Benchmark for QA)Social biasTests bias across 9 social dimensions (age, disability, gender, nationality, physical appearance, race/ethnicity, religion, SES, sexual orientation) using ambiguous and disambiguated questions.58,492 examples
WinoBiasGender biasCoreference resolution dataset testing gender stereotypes in occupational contexts. Measures whether models associate occupations with stereotypical genders.3,160 sentences
RealToxicityPromptsToxicitySentence-level prompts from web text designed to elicit toxic continuations. Measures the probability of generating toxic text given benign or mildly toxic prompts.100,000 prompts
CrowS-PairsStereotypesPaired sentences measuring stereotypical bias across 9 bias types. Each pair differs only in a social group reference.1,508 pairs
HarmBenchAttack/defenseStandardized evaluation framework for automated red teaming. Covers 7 harm categories with functional and semantic attack success criteria.510 behaviors

Adversarial Robustness

Beyond jailbreaks and safety violations, adversarial robustness testing examines how small input perturbations affect model outputs:

  • Textual adversarial attacks: Character-level perturbations (typos, homoglyphs), word-level substitutions (synonyms, paraphrases), and sentence-level transformations that flip model predictions while preserving human-perceived meaning.
  • Image adversarial attacks: Imperceptible pixel perturbations (FGSM, PGD, C&W), patch attacks, and physical adversarial examples that cause misclassification.
  • Distribution shift: Testing model performance on data that differs from the training distribution in systematic ways (different time periods, geographies, domains).
  • Stress testing: Evaluating model behavior at extreme input lengths, unusual formatting, edge-case data types, and high concurrency.

Red Teaming Methodology

  1. Define scope and threat model. Identify what harms you are testing for, what attack surfaces exist, and what resources attackers are assumed to have.
  2. Assemble a diverse team. Include people with different backgrounds, languages, cultural contexts, and adversarial skill levels. Homogeneous teams miss categories of harm.
  3. Systematic coverage. Use taxonomies (e.g., OWASP LLM Top 10, NIST AI RMF) to ensure all attack categories are tested rather than relying on ad hoc exploration.
  4. Automated + manual testing. Automated tools scale coverage but miss nuanced failures. Manual red teaming finds creative attacks that automated tools cannot.
  5. Document and track findings. Record the exact input, expected behavior, actual behavior, severity, and reproducibility for each finding.
  6. Retest after mitigation. Verify that fixes address the root cause and do not introduce regressions.

Related: ML Benchmarks reference includes safety-relevant benchmarks like TruthfulQA alongside capability benchmarks.