What is red-teaming?

Red-teaming is the practice of deliberately attacking your own AI system, probing for jailbreaks, prompt injection, data leaks, and harmful outputs, to find failure modes before adversaries or real users do.

Red-teaming borrows from security culture: you assign a team (human or automated) to act as the adversary and try to break the system under test. For LLM agents that means systematically probing for jailbreaks, prompt-injection paths, sensitive-data leakage, tool misuse, and harmful or biased outputs across a wide range of crafted inputs. Good red-teaming is structured rather than ad hoc: you maintain a growing suite of attack prompts, run them against each model or prompt change, and treat regressions as bugs. Automated red-teaming uses one model to generate adversarial inputs against another, scaling coverage far beyond what manual testing reaches. The output feeds back into evals, guardrails, and fine-tuning. For agent builders specifically, red-teaming should cover the full tool surface: what happens when a connected MCP server returns malicious content, when a tool argument is manipulated, or when the agent is asked to exfiltrate memory. The goal is not a one-time audit but a continuous discipline, because both the models and the attacks keep changing. Pair red-teaming with strong system boundaries so that a model failure does not become a system compromise.