What is LLM-as-judge?

LLM-as-judge is the use of a language model to score or compare other models' outputs against criteria you define, automating evaluation that would otherwise require slow, expensive human grading.

LLM-as-judge is the practice of having one language model evaluate the output of another: scoring a response for quality, correctness, helpfulness, or adherence to a rubric, or picking the better of two candidates. It exists because evaluating open-ended generations is hard: exact-match metrics fail when many phrasings are equally valid, and human review does not scale to thousands of test cases on every change. A capable model given clear criteria, and often a reference answer, can approximate human judgment cheaply and consistently enough to drive regression testing, A/B comparisons, and ranking. Common patterns are pointwise scoring (rate this answer 1 to 5 on a dimension) and pairwise comparison (which of these two is better, and why); both usually ask the judge for a short rationale, which tends to improve reliability. The method has known biases: judges can favor longer answers, prefer the first option shown, or rate their own family of models higher. Good evaluation harnesses therefore control for position (for example, by running each comparison in both orders), calibrate against a human-labeled sample, and treat scores as signal rather than ground truth. Used with those safeguards, it is a core tool in any serious LLM evaluation pipeline.
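
As a rough illustration of the pairwise pattern with a position control, here is a minimal Python sketch. It assumes a hypothetical `judge` callable that takes a prompt string and returns "A" or "B"; in practice this would wrap whatever model API you use. The prompt template and the names `compare_once`, `compare_debiased`, and `agreement_rate` are made up for this example and do not come from any particular library.

```python
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption for this sketch):
# takes a prompt string and returns the judge's verdict, "A" or "B".
Judge = Callable[[str], str]

# Illustrative prompt template, not a recommended or standard wording.
TEMPLATE = (
    "Question:\n{question}\n\n"
    "Response A:\n{first}\n\n"
    "Response B:\n{second}\n\n"
    "Which response is better? Answer with exactly 'A' or 'B'."
)


def compare_once(judge: Judge, question: str, first: str, second: str) -> str:
    """Ask the judge once, with `first` shown as A and `second` as B."""
    prompt = TEMPLATE.format(question=question, first=first, second=second)
    verdict = judge(prompt).strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unexpected judge verdict: {verdict!r}")
    return verdict


def compare_debiased(judge: Judge, question: str, x: str, y: str) -> str:
    """Run the comparison in both orders to control for position bias.

    Returns "x" or "y" only when the judge picks the same response in
    both orders; otherwise returns "tie".
    """
    forward = compare_once(judge, question, x, y)   # x is shown first
    backward = compare_once(judge, question, y, x)  # y is shown first
    x_wins_forward = forward == "A"
    x_wins_backward = backward == "B"
    if x_wins_forward and x_wins_backward:
        return "x"
    if not x_wins_forward and not x_wins_backward:
        return "y"
    return "tie"  # verdict flipped with the order, so do not trust it


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A judge that always answers "A" regardless of content ends up with "tie" under `compare_debiased`, which is the point of the swap: a verdict that changes with presentation order is discarded rather than counted. `agreement_rate` is one simple way to calibrate against a human-labeled sample; a low rate suggests the rubric or the judge needs work before its scores are used.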