What is an error budget?
An error budget is the amount of unreliability a service is allowed under its SLO: the gap between the target and 100%. Teams spend it on deploys, experiments, and other risk, and once it is gone they stop shipping and stabilize.
An error budget is the allowable amount of failure a service can absorb over an SLO window while still meeting its objective. If your SLO is 99.9% success over 30 days, your error budget is the remaining 0.1%: about 43 minutes of full downtime (0.1% of the window's 43,200 minutes), or the equivalent in failed requests. The point of an error budget is to convert reliability from an argument into an accounting exercise. As long as budget remains, the team is free to ship features, run risky migrations, and experiment, because some failure is acceptable. When the budget is exhausted, the policy flips: feature work pauses and the team spends its energy on hardening until the budget recovers in the next window. This reframes the recurring tension between velocity and stability as a shared, self-correcting rule rather than a repeated fight. Error budgets also drive alerting. Burn-rate alerts fire when the budget is being consumed faster than the window allows, that is, at a pace that would exhaust it before the window ends, which is typically a more useful signal than a static error-rate threshold. For an AI agent participating in release decisions, the current error budget and its burn rate are decisive context, and a shared memory of recent burn events helps the agent judge whether now is the time to push or to wait.
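The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `ErrorBudget` class and its method names are hypothetical, and burn rate is assumed to mean the observed error rate divided by the error rate the SLO allows, so a value of 1.0 would spend the budget exactly over the window.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper for the budget arithmetic described above."""

    slo_target: float   # e.g. 0.999 for a 99.9% success objective
    window_days: float  # length of the SLO window, e.g. 30

    @property
    def budget_fraction(self) -> float:
        # The gap between the target and 100%.
        return 1.0 - self.slo_target

    def downtime_minutes(self) -> float:
        # Budget expressed as minutes of full downtime over the window.
        return self.budget_fraction * self.window_days * 24 * 60

    def burn_rate(self, failed: int, total: int) -> float:
        # Assumed definition: observed error rate / allowed error rate.
        # Above 1.0, the budget runs out before the window ends.
        if total == 0:
            return 0.0
        return (failed / total) / self.budget_fraction

    def remaining_fraction(self, failed: int, total: int) -> float:
        # Share of the budget left, given requests seen so far in the
        # window. Negative values mean the budget is overspent.
        if total == 0:
            return 1.0
        allowed_failures = self.budget_fraction * total
        return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, window_days=30)
    print(round(budget.downtime_minutes(), 1))          # minutes of full downtime
    print(round(budget.burn_rate(50, 10_000), 2))       # recent sample of traffic
    print(round(budget.remaining_fraction(400, 1_000_000), 2))
```

A release gate built on this might pause feature work once `remaining_fraction` reaches zero, and alert earlier when `burn_rate` stays well above 1.0. The specific thresholds are a policy choice that the text above does not prescribe.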