What is latency vs. throughput?
Latency is how long a single request takes; throughput is how many requests a system completes per unit time. In LLM serving the two trade off: batching raises throughput but can raise per-request latency, so you tune for whichever one your workload depends on.
Latency and throughput are the two axes you tune any serving system against, and for LLM inference they are in tension. Latency is the time an individual request takes, often split into time-to-first-token (how fast the user sees something) and per-request tokens-per-second (how fast the rest streams). Throughput is total useful work per unit time, typically tokens or requests per second across all users. The classic tradeoff is batching: combining many requests into one GPU pass dramatically increases throughput and lowers cost per token, but each request may wait to be batched and then shares compute with the others, which can raise its latency. Interactive products (a coding assistant, a chat agent) optimize for low latency and accept lower throughput; bulk jobs (overnight document processing, evals, embeddings) optimize for throughput and tolerate latency. Continuous batching and paged attention let modern servers get much of the throughput benefit with less of the latency penalty. The practical rule: pick the metric your use case actually cares about and measure it at the tail (p99), not the average, because a good average hides the slow requests users remember. Shrinking the prompt you send, which is what a memory layer does when it returns only the relevant context, helps both axes at once: fewer input tokens mean less work per request, so each request finishes sooner and more requests fit in the same compute.
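
To make the batching tradeoff and the p99 rule concrete, here is a minimal Python sketch. It is illustrative only: the cost model in `batch_pass` (a fixed 50 ms overhead per GPU pass plus 5 ms per request) and the sample latencies are made-up assumptions, not measurements from any real server, and the function names are hypothetical.

```python
import math


def batch_pass(batch_size, base_ms=50.0, per_request_ms=5.0):
    """Toy cost model (assumed, not measured): one pass costs a fixed
    overhead plus a small increment for each request in the batch."""
    pass_ms = base_ms + per_request_ms * batch_size
    requests_per_s = batch_size * 1000.0 / pass_ms
    return pass_ms, requests_per_s


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Larger batches: each pass takes longer, but more requests finish per second.
for size in (1, 8, 32):
    pass_ms, rps = batch_pass(size)
    print(f"batch={size:>2}  pass={pass_ms:.0f} ms  throughput={rps:.1f} req/s")

# Hypothetical latency samples: 97 fast requests and 3 very slow ones.
latencies_ms = [200.0] * 97 + [5000.0] * 3
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms:.0f} ms  p50={p50_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

Under these assumed numbers, going from a batch of 1 to a batch of 32 raises the time per pass from 55 ms to 210 ms while throughput climbs from about 18 to about 152 requests per second. In the latency sample, the mean is 344 ms and the median is 200 ms, but the p99 is 5000 ms: the three slow requests barely move the average yet define the tail.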