What is an eval?

An eval is a structured test that measures how well an LLM or agent performs on a task: it runs over a dataset of cases and scores each output, so you can track quality and catch regressions as prompts and models change.

An eval (short for evaluation) is the AI equivalent of a test suite: a set of input cases, each with a way to score the model's output, run as a batch so you get an aggregate quality number rather than a one-off impression. Evals exist because LLM behavior is non-deterministic and sensitive to small changes. A prompt tweak, a new model version, or a different temperature can quietly improve some cases and break others, and you cannot tell which without measuring. A typical eval defines a dataset of representative or adversarial inputs, the expected behavior or reference answer for each input, and a scorer: exact match for closed tasks, heuristic checks, or an LLM-as-judge for open-ended ones. You run it in CI or before shipping to compare candidates and guard against regressions, the same role unit tests play for ordinary code. For agents, scoring extends to whether the right tools were called, in the right order, with the right arguments, and whether the final task succeeded end to end. Evaluation platforms, several of which are available as MCP servers, store datasets, run scorers, and track scores over time so that quality is a metric you manage rather than guess at.
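The pieces described above (a dataset, a scorer, a batch run, and a regression check) can be sketched in a few lines of Python. This is a minimal illustration, not the API of any particular evaluation platform: the names Case, exact_match, run_eval, check_regression, and tool_calls_match are hypothetical, and the two "models" are hard-coded stand-ins for real LLM calls so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One eval case: an input prompt and its reference answer."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the output equals the reference, ignoring case and
    surrounding whitespace; otherwise 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every case through the model and return the mean score."""
    if not cases:
        raise ValueError("eval dataset is empty")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def check_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Return True if the candidate scores no worse than the baseline,
    allowing a drop of at most `tolerance`."""
    return candidate >= baseline - tolerance


def tool_calls_match(actual: list[tuple[str, dict]], expected: list[tuple[str, dict]]) -> bool:
    """Agent-style check: same tools, in the same order, with the same arguments."""
    return actual == expected


# Hypothetical stand-ins for two candidates (for example, two prompt versions).
def baseline_model(prompt: str) -> str:
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "")


def candidate_model(prompt: str) -> str:
    answers = {"2 + 2 =": "4", "Capital of France?": "paris ", "Opposite of hot?": "cold"}
    return answers.get(prompt, "")


DATASET = [
    Case("2 + 2 =", "4"),
    Case("Capital of France?", "Paris"),
    Case("Opposite of hot?", "cold"),
    Case("Largest planet?", "Jupiter"),
]

if __name__ == "__main__":
    base = run_eval(baseline_model, DATASET)
    cand = run_eval(candidate_model, DATASET)
    print(f"baseline={base:.2f} candidate={cand:.2f}")  # baseline=0.50 candidate=0.75
    # In CI, a failed assertion here would block the change from shipping.
    assert check_regression(base, cand), "candidate regressed against baseline"
```

In practice the model callable would wrap a real LLM request, and the scorer could be swapped for a heuristic check or an LLM-as-judge without changing run_eval. Because outputs are non-deterministic, a real harness would likely also run each case several times or use a non-zero tolerance.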