What is inference?

Inference is the act of running a trained model to produce outputs: the step where an LLM actually generates tokens for a prompt, as opposed to training. It is where ongoing cost, latency, and throughput are determined in production.

Inference is the use of a trained model to generate predictions. It is distinct from training, the one-time (or periodic) process of learning the weights. For an LLM, inference is the per-request work of turning a prompt into output tokens, and it is where the recurring economics of a deployed system live. LLM inference has two phases with different characteristics: prefill, which processes the whole input prompt in parallel and is typically compute-bound, and decode, which generates output tokens one at a time and is typically memory-bandwidth-bound. This split is why long prompts raise time-to-first-token while long outputs raise total latency. It is also why batching helps: decoding several requests together shares each read of the model weights across the batch, which improves GPU utilization and throughput. Most efficiency work (quantization, distillation, MoE, KV-cache optimization, speculative decoding) exists to make inference cheaper and faster, in many cases without retraining the model. For agent builders, inference cost compounds because agents make many model calls per task. Trimming the context you send therefore lowers prefill cost and latency on every call; this is the central job of a memory layer like Glen, which returns only relevant observations instead of a full transcript.
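The prefill/decode split and the agent cost argument can be made concrete with a small back-of-the-envelope model. The sketch below is illustrative only: the names `LatencyModel` and `agent_prefill_tokens` are hypothetical, and the throughput figures are round numbers chosen for easy arithmetic, not measurements of any real model or GPU. It also ignores batching, queuing, and KV-cache reuse across calls.

```python
from dataclasses import dataclass


@dataclass
class LatencyModel:
    """Toy two-phase latency model. All rates are illustrative assumptions."""
    prefill_tokens_per_s: float  # prompt tokens processed per second (in parallel)
    decode_tokens_per_s: float   # output tokens generated per second (one at a time)

    def time_to_first_token(self, prompt_tokens: int) -> float:
        # Prefill must finish before the first output token can appear,
        # so time-to-first-token grows with prompt length.
        return prompt_tokens / self.prefill_tokens_per_s

    def total_latency(self, prompt_tokens: int, output_tokens: int) -> float:
        # Total latency = prefill time + sequential decode time.
        decode_time = output_tokens / self.decode_tokens_per_s
        return self.time_to_first_token(prompt_tokens) + decode_time


def agent_prefill_tokens(calls: int, context_tokens_per_call: int) -> int:
    # An agent resends its context on every model call, so the prompt
    # tokens to prefill scale with the number of calls per task.
    return calls * context_tokens_per_call


# Assumed rates: 5,000 prompt tokens/s for prefill, 50 output tokens/s for decode.
model = LatencyModel(prefill_tokens_per_s=5000.0, decode_tokens_per_s=50.0)

# A longer prompt raises time-to-first-token.
print(f"TTFT, 1k-token prompt:  {model.time_to_first_token(1_000):.1f} s")
print(f"TTFT, 10k-token prompt: {model.time_to_first_token(10_000):.1f} s")

# A longer output raises total latency even with the same prompt.
print(f"Total, 100 output tokens: {model.total_latency(1_000, 100):.1f} s")
print(f"Total, 500 output tokens: {model.total_latency(1_000, 500):.1f} s")

# A 20-call agent task: full transcript vs. trimmed context on each call.
print("Prefill tokens, full transcript:", agent_prefill_tokens(20, 8_000))
print("Prefill tokens, trimmed context:", agent_prefill_tokens(20, 1_500))
```

Under these assumed numbers, trimming each call from 8,000 to 1,500 context tokens cuts the prompt tokens processed over the task from 160,000 to 30,000. The saving applies to every call, which is why context size matters more for agents than for single-shot requests.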