What is rate limiting?
Rate limiting caps how many requests or tokens a client may consume in a time window, protecting a service from overload and abuse. LLM and MCP APIs enforce it, and agents must handle it gracefully.
Rate limiting is the practice of bounding how frequently a caller can hit an API, expressed as requests per minute, tokens per minute, or concurrent connections. Providers use it to keep a shared service stable, allocate capacity fairly across tenants, control cost, and blunt abuse. It is commonly implemented with algorithms such as the token bucket or sliding window. When a caller exceeds the limit, the server returns HTTP 429 Too Many Requests, often with a `Retry-After` header telling the client how long to wait. For LLM-driven systems, rate limiting is a first-class concern because two separate budgets usually apply: requests and tokens. A single long prompt can exhaust the token budget well before the request limit is reached. Agents and MCP clients that fan out parallel tool calls or batch many model invocations hit these ceilings quickly, so robust clients implement exponential backoff with jitter on 429s, respect `Retry-After`, throttle concurrency, and queue work rather than retrying in a tight loop. On the serving side, MCP servers and gateways apply their own rate limits per API key or user to stay within the limits of the upstream services they wrap. Treating rate limits as expected control flow, not exceptional errors, is what separates a flaky pipeline from a resilient one.
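
As an illustration of the client-side behavior described above, here is a minimal Python sketch of exponential backoff with jitter that prefers the server's `Retry-After` value when one is present. The names `retry_delay`, `call_with_backoff`, and `send`, and the defaults for `base`, `cap`, and `max_retries`, are hypothetical choices made for this example and do not come from any particular SDK. The sketch assumes `send` returns a `(status, headers)` pair and handles only the delay-in-seconds form of `Retry-After`.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str]]  # (HTTP status code, response headers)


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 0.5, cap: float = 30.0,
                rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Honor the server's hint when it is a number of seconds.
            # The HTTP-date form of Retry-After is not parsed in this sketch.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Unparseable hint: fall back to computed backoff.
    # "Full jitter": a random delay in [0, min(cap, base * 2**attempt)).
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(send: Callable[[], Response], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send`, retrying on HTTP 429 up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        status, headers = send()
        if status != 429:
            return status, headers
        if attempt == max_retries:
            break  # Out of retries: hand the 429 back to the caller.
        # Assumes the header key is spelled exactly "Retry-After";
        # real HTTP clients usually expose case-insensitive headers.
        sleep(retry_delay(attempt, headers.get("Retry-After")))
    return status, headers
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting. In an agent that fans out parallel tool calls, a wrapper like this would typically sit behind a concurrency limit or work queue so that retries do not pile up against the same ceiling.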