What are guardrails?

Guardrails are the checks and constraints placed around an LLM or agent, on inputs, outputs, and tool calls, to keep behavior safe, on-policy, and within bounds, independent of the model's own cooperation.

Guardrails are the controls a system imposes around a model so that bad inputs are rejected, bad outputs are caught, and risky actions are blocked, regardless of whether the model itself behaves. They operate at several layers. Input guardrails screen prompts for injection attempts, policy violations, or out-of-scope requests before they reach the model. Output guardrails validate responses by schema-checking structured output, scanning for leaked secrets or PII, and filtering disallowed content; when a check fails, they can trigger a retry or a safe fallback. Action guardrails sit in front of tools: allow-lists of callable tools, argument validation, rate caps, and human-in-the-loop approval for destructive or irreversible operations. The key principle is defense in depth. The system prompt is not a security boundary, because prompt injection and tool poisoning can override it, so real protection comes from deterministic code wrapping the model rather than from politely worded instructions. In MCP (Model Context Protocol) deployments, guardrails include scoping each server's credentials to least privilege, isolating servers from one another, and treating all tool output as untrusted data. Well-designed guardrails fail closed (a check that errors or cannot run counts as a denial), keep an audit trail of what was blocked, and make the safe path the default.
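
The three layers described above can be sketched as plain deterministic code that sits outside the model. What follows is a minimal illustration, not a reference implementation: the class and method names (`Guardrails`, `check_input`, `check_output`, `check_action`), the tool names, the `sk-` secret pattern, and the single injection marker are all hypothetical choices made for this example. Substring matching in particular is only a toy stand-in for real injection detection.

```python
import re
from dataclasses import dataclass, field

# Hypothetical policy values, chosen only for illustration.
ALLOWED_TOOLS = {"read_file", "delete_file"}
NEEDS_APPROVAL = {"delete_file"}          # destructive, so a human must approve
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{16,}")
INJECTION_MARKERS = ("ignore previous instructions",)


@dataclass
class Decision:
    allowed: bool
    reason: str


@dataclass
class Guardrails:
    # Every decision, allowed or blocked, is appended here as an audit trail.
    audit_log: list = field(default_factory=list)

    def _decide(self, stage, check, payload):
        """Run one check. A check returns None for 'no objection' or a
        reason string to block. Any exception is treated as a denial,
        which is what makes the guardrail fail closed."""
        try:
            reason = check(payload)
        except Exception as exc:
            reason = f"check failed: {type(exc).__name__}"
        decision = Decision(allowed=reason is None, reason=reason or "ok")
        self.audit_log.append((stage, decision.allowed, decision.reason))
        return decision

    def check_input(self, prompt):
        """Input guardrail: screen the prompt before it reaches the model."""
        def check(p):
            lowered = p.lower()
            if any(marker in lowered for marker in INJECTION_MARKERS):
                return "possible prompt injection"
            return None
        return self._decide("input", check, prompt)

    def check_output(self, text):
        """Output guardrail: scan the response for leaked secrets."""
        def check(t):
            if SECRET_PATTERN.search(t):
                return "possible leaked secret"
            return None
        return self._decide("output", check, text)

    def check_action(self, tool, args, approved=False):
        """Action guardrail: allow-list, argument validation, approval."""
        def check(call):
            name, arguments = call
            if name not in ALLOWED_TOOLS:
                return "tool not on allow-list"
            path = arguments["path"]  # a missing key raises, so it is denied
            if not isinstance(path, str) or ".." in path:
                return "invalid path argument"
            if name in NEEDS_APPROVAL and not approved:
                return "human approval required"
            return None
        return self._decide("action", check, (tool, args))
```

In this sketch a caller would run `check_input` before invoking the model, `check_output` on the response, and `check_action` before executing any tool call the model requests, proceeding only when `allowed` is true. Routing every check through `_decide` is a deliberate choice: it gives one place where errors turn into denials and where the audit record is written, so no individual check can forget either.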