What is context compaction?

Context compaction shrinks a growing conversation or agent history so it fits the context window, typically by summarizing old turns, dropping low-value content, or offloading detail to external memory.

Context compaction is the practice of keeping an agent's working context small enough to fit the model's token budget while preserving the information the task actually needs. Long-running agents accumulate history fast: every tool call, observation, and reasoning step adds tokens, and once the context window fills, something must give. Common strategies include summarizing earlier turns into a compact recap, truncating or evicting the least relevant messages, deduplicating repeated tool output, and pruning verbose results down to their essentials. A more durable approach offloads detail into an external store: instead of carrying everything in the prompt, the agent writes facts to a memory store or vector database and retrieves only what is relevant on each turn, so the live context stays lean. Compaction is a balancing act. Summarize too aggressively and the agent forgets a constraint it needed; compact too little and it runs out of room or pays for tokens it never uses. Done well, it both lowers cost and improves quality, since models tend to attend better to short, focused contexts than to bloated ones. It pairs naturally with prompt caching (cache the stable prefix, compact the volatile tail) and with persistent memory systems that turn ephemeral conversation into reusable long-term knowledge.
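
The "stable prefix, compact tail" idea can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration only: the Message type, the word-count stand-in for a tokenizer, and the naive_summary placeholder (a real agent would more likely ask a model to write the recap). The sketch keeps leading system messages verbatim, folds older turns into a single recap message, and leaves the most recent turns untouched.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str      # e.g. "system", "user", "assistant", "tool"
    content: str


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): one token per word."""
    return len(text.split())


def naive_summary(messages: List[Message]) -> str:
    """Placeholder summarizer (assumption): keep the first 8 words of each message."""
    lines = []
    for m in messages:
        snippet = " ".join(m.content.split()[:8])
        lines.append(f"{m.role}: {snippet}")
    return "Recap of earlier turns:\n" + "\n".join(lines)


def compact(
    history: List[Message],
    budget: int,
    keep_recent: int = 2,
    summarize: Callable[[List[Message]], str] = naive_summary,
) -> List[Message]:
    """Return a compacted copy of history if it exceeds the token budget."""
    total = sum(count_tokens(m.content) for m in history)
    if total <= budget:
        return list(history)  # already fits, nothing to do

    # Stable prefix: leading system messages are kept verbatim.
    n_head = 0
    while n_head < len(history) and history[n_head].role == "system":
        n_head += 1
    head, rest = history[:n_head], history[n_head:]

    # Volatile tail: the last keep_recent turns stay as-is, older ones are recapped.
    split = max(len(rest) - keep_recent, 0)
    old, recent = rest[:split], rest[split:]
    if not old:
        return list(history)  # nothing old enough to summarize

    recap = Message("system", summarize(old))
    return head + [recap] + recent
```

Note that a single pass like this does not guarantee the result fits the budget. A fuller version might summarize again, shrink keep_recent, or offload the old turns to an external store before dropping them from the prompt.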