What is a jailbreak?

A jailbreak is a crafted prompt that gets an LLM to bypass its safety training and produce content it was aligned to refuse, using role-play, obfuscation, or instruction-overriding tricks to defeat the model's guardrails.

A jailbreak is an adversarial prompt designed to make a language model violate its own safety policies, generating disallowed content, revealing hidden system prompts, or performing actions its alignment training was meant to prevent. Where prompt injection targets an agent's instruction-following over untrusted data, jailbreaks target the model's safety layer directly. Common techniques include role-play framing (asking the model to pretend it is an unrestricted persona), token-level obfuscation, encoding instructions in another language or format, and many-shot priming that floods the context with compliant examples. Jailbreaks are an evolving cat-and-mouse problem: providers patch known attacks through fine-tuning and added guardrails, and researchers find new ones through red-teaming. For teams building agents, the lesson is that the model's own refusals are a soft boundary, not a hard one. Real safety comes from system-level controls: scoping tool permissions, validating outputs, keeping humans in the loop for sensitive actions, and never relying on the model to be the only gatekeeper. A memory layer like Glen handles plain observations, not code execution, so even a jailbroken agent is constrained by the privileges of the surrounding system rather than by the model's goodwill.