Building an eval harness for LLM-as-judge grading

When you build a product on a language model, you need to know whether its answers are any good, and there are far too many answers for a person to read. The standard fix is LLM-as-judge: use a model to grade the outputs of another model. That just relocates the problem, though, because if the judge is wrong, every metric built on top of it is fiction. This project is a small but complete implementation of the loop that makes a judge trustworthy, and this page walks through how I built it and why each piece looks the way it does. The code is at github.com/gradientsj/llm-eval-harness.

Step 1: write the rubric before anything else

A rubric is the written instruction set for scoring an answer. Mine scores grounded question answering, where the model answers from a provided context passage, on three 1-5 dimensions plus a verdict:

Groundedness: is every claim in the answer actually supported by the context? An answer that correctly says the context lacks the information counts as fully grounded; inventing an answer is the failure, not admitting the gap.
Relevance: does it address the question that was asked?
Coherence: is it well-formed, readable prose?
Overall pass: would you ship this answer to a user? Derived mechanically from the three scores.

One design decision mattered more than any other: the rubric is one file, and both the human labelers and the judge prompt read from it. Calibration only means something when both sides scored against identical instructions; if the rubric drifts for one side, you are comparing two different tasks and the agreement numbers are noise.
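
As a rough illustration of the single-source idea, the sketch below renders one rubric object into both the labeler guide and the judge prompt, so the two cannot drift apart. Everything here is hypothetical: the names (RUBRIC, render_rubric, judge_prompt, overall_pass) are not taken from the repo, and the pass rule (every dimension at or above PASS_FLOOR) is an assumed example of a verdict "derived mechanically", not the project's actual rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    question: str


# Hypothetical in-memory stand-in for the single rubric file.
RUBRIC = (
    Dimension("groundedness", "Is every claim in the answer supported by the context?"),
    Dimension("relevance", "Does the answer address the question that was asked?"),
    Dimension("coherence", "Is the answer well-formed, readable prose?"),
)

PASS_FLOOR = 4  # assumed cutoff for illustration, not the project's value


def render_rubric(rubric=RUBRIC) -> str:
    """Render the one text block that both humans and the judge read."""
    return "\n".join(f"{d.name.capitalize()} (1-5): {d.question}" for d in rubric)


def labeler_guide() -> str:
    """Instructions shown to human labelers."""
    return "Score each answer on these dimensions:\n" + render_rubric()


def judge_prompt(question: str, context: str, answer: str) -> str:
    """Prompt sent to the judge; embeds the same rendered rubric verbatim."""
    return (
        "You are grading an answer against a context passage.\n"
        + render_rubric()
        + f"\n\nContext: {context}\nQuestion: {question}\nAnswer: {answer}\n"
    )


def overall_pass(scores: dict) -> bool:
    """Assumed mechanical verdict: every dimension must reach PASS_FLOOR."""
    return all(scores[d.name] >= PASS_FLOOR for d in RUBRIC)
```

Because both labeler_guide() and judge_prompt() call render_rubric(), editing a dimension's wording changes both sides at once.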

Step 2: build a benchmark designed to break judges

The benchmark is 30 question-answer pairs over short context passages, each labeled by hand against the rubric. I kept it deliberately adversarial: about a third of the examples are constructed traps where shallow scoring methods are known to disagree with people.

A negation flip says the opposite of the context using only the context's own words ("the mix is suitable" vs "the mix is not suitable"), so word-overlap scoring cannot see that anything changed.
A fact recombination attaches real facts from the context to the wrong entity, so every word is "grounded" while the claim itself is false.
A correct refusal rightly says the context does not contain the answer, and because it shares almost no vocabulary with the context, overlap scoring punishes exactly the behavior you want to reward.

The thinking behind the traps is that easy benchmarks make judges look better than they are, and a test set that cannot distinguish a good judge from a cheap heuristic cannot certify anything.

Step 3: a baseline judge that costs nothing

Before pointing an expensive model at the problem, I wrote a lexical baseline: a deterministic judge that scores groundedness by counting how much of the answer's vocabulary appears in the context. It makes no model calls, so the entire pipeline runs in CI for free, and it sets the floor a real LLM judge has to beat to justify its latency and cost. Knowing precisely how good word counting is turns "the LLM judge works" from a feeling into a measured comparison.
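
A minimal sketch of what such a word-overlap judge could look like, assuming a simple lowercase tokenizer and a linear mapping from overlap fraction to a 1-5 score. Both choices, and the function names, are illustrative assumptions and not the repo's code. The example strings are made up to mirror the trap types described in Step 2.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def overlap_fraction(answer: str, context: str) -> float:
    """Fraction of answer tokens that also occur somewhere in the context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(tokens(context))
    hits = sum(1 for t in answer_tokens if t in context_vocab)
    return hits / len(answer_tokens)


def lexical_groundedness(answer: str, context: str) -> int:
    """Map overlap in [0, 1] onto a 1-5 score (assumed linear mapping)."""
    return 1 + round(4 * overlap_fraction(answer, context))


# Hypothetical trap examples in the spirit of Step 2.
context = "The mix is not suitable for outdoor use."
negation_flip = "The mix is suitable for outdoor use."    # contradicts the context
correct_refusal = "I cannot find that information here."  # the desired behavior

flip_score = lexical_groundedness(negation_flip, context)       # 5: every word is in the context
refusal_score = lexical_groundedness(correct_refusal, context)  # 1: no word is in the context
```

The flip scores 5 because dropping "not" leaves only context vocabulary, and the refusal scores 1 because it shares no words with the passage, which is the failure pattern the traps are built to expose.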

Step 4: measure agreement properly

With human labels and judge scores for every example, the question becomes: how well do they agree? No single statistic answers that, so the harness reports several, each covering a blind spot of the others.

Cohen's kappa measures agreement on the ship/no-ship verdict, corrected for chance. Plain accuracy looks good whenever one class dominates, so kappa subtracts the agreement you would get by guessing the base rates: 0 means agreement no better than chance, 1 means perfect agreement, and negative values mean agreement worse than chance.
Quadratic-weighted kappa (QWK) does the same for 1-5 scores, and weights disagreements by their squared distance: calling a 1 a 5 costs far more than calling a 4 a 5. It is the standard statistic for ordinal human ratings.
Spearman correlation asks whether the judge ranks answers in the same order as humans, even if its absolute scores run harsh or lenient. Ranking is what matters when a judge compares two model versions.
MAE (mean absolute error) is the average distance from the human score in rubric points. It is the most readable number of the four, but it is blind to chance agreement, so it gates nothing on its own.
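
The two kappa statistics above share one formula: one minus the observed disagreement divided by the disagreement expected by chance, with only the disagreement weight changing. The sketch below is one plain-Python way to write that, under the assumption that scores are integers on a fixed 1-5 scale; the function names are illustrative and are not the repo's.

```python
import math


def weighted_kappa(a, b, categories, weight):
    """1 - observed weighted disagreement / chance-expected weighted disagreement."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(a)
    observed = sum(weight(x, y) for x, y in zip(a, b)) / n
    # Chance expectation: each rater draws independently from their own base rates.
    expected = sum(
        (a.count(i) / n) * (b.count(j) / n) * weight(i, j)
        for i in categories
        for j in categories
    )
    if expected == 0:
        return math.nan  # degenerate data (e.g. one category only): undefined
    return 1 - observed / expected


def cohens_kappa(a, b):
    """Unweighted kappa: every disagreement costs 1, agreement costs 0."""
    categories = sorted(set(a) | set(b))
    return weighted_kappa(a, b, categories, lambda x, y: 0.0 if x == y else 1.0)


def quadratic_weighted_kappa(a, b, lo=1, hi=5):
    """Kappa with squared-distance weights, normalized by the scale width."""
    span = (hi - lo) ** 2
    return weighted_kappa(a, b, range(lo, hi + 1), lambda x, y: (x - y) ** 2 / span)


human = [1, 2, 3, 4, 5]
near_miss = [1, 2, 3, 5, 5]  # one 4 scored as 5
far_miss = [5, 2, 3, 4, 5]   # one 1 scored as 5
```

Both judges disagree with the human on exactly one item, but quadratic_weighted_kappa(human, near_miss) comes out much higher than quadratic_weighted_kappa(human, far_miss), which is the squared-distance penalty at work.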

Two implementation decisions are easiest to see in the code. First, I implemented these statistics from scratch in about 150 lines of plain Python, tested against hand-computed values, instead of importing scipy, because the agreement math is the product here and it should be auditable. Second, when a statistic is undefined because of zero variance or otherwise degenerate data, it returns NaN, and NaN fails the gate, since "we cannot demonstrate calibration" must never be allowed to read as a pass.
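
To make the NaN rule concrete, here is a hedged sketch of a from-scratch Spearman correlation (Pearson correlation computed on tie-averaged ranks) that returns NaN on zero variance, plus a threshold check that treats NaN as a failure. The function names and the passes() helper are assumptions for illustration; the repo's own implementation may differ.

```python
import math


def average_ranks(values):
    """Rank values from 1; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either side has no variance or no data."""
    n = len(x)
    if n == 0 or n != len(y):
        return math.nan
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan  # undefined, so report it as undefined
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation as Pearson on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def passes(value: float, threshold: float) -> bool:
    """A NaN statistic never passes; the explicit check documents the intent."""
    return not math.isnan(value) and value >= threshold
```

A judge that gives every answer the same score has zero variance, so spearman() returns NaN and passes() returns False no matter how low the threshold is set.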

Step 5: gate the build, in the right order

The harness runs two gates in CI, and the order is the point.

The calibration gate runs first: if judge-human agreement drops below threshold, the build fails, because a judge that has drifted from human judgment must not be trusted to evaluate anything. The regression gate runs second: it re-scores a candidate set of answers and fails if aggregate quality drops beyond tolerance against a frozen baseline. A regression verdict from an uncalibrated judge would be noise, which is why the regression gate never runs unless the calibration gate has passed.
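
The ordering can be sketched as a short-circuit: the regression gate is only evaluated when the calibration gate has passed. This is an illustrative sketch, not the repo's code. It assumes "aggregate quality" is a mean rubric score and that agreement statistics arrive as a dict keyed by name; GateResult, run_gates and the parameter names are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str


def calibration_gate(agreement: dict, thresholds: dict) -> GateResult:
    """Fail if any agreement statistic is missing, NaN, or below its threshold."""
    failures = []
    for stat, floor in thresholds.items():
        value = agreement.get(stat, math.nan)
        if math.isnan(value) or value < floor:
            failures.append(f"{stat}={value:.3f} (needs >= {floor:.3f})")
    return GateResult("calibration", not failures, "; ".join(failures) or "ok")


def regression_gate(candidate_mean: float, baseline_mean: float, tolerance: float) -> GateResult:
    """Fail if the candidate's mean score drops more than `tolerance` below baseline."""
    drop = baseline_mean - candidate_mean
    return GateResult("regression", drop <= tolerance, f"drop={drop:.3f}")


def run_gates(agreement, thresholds, candidate_mean, baseline_mean, tolerance):
    """Run calibration first; skip regression entirely if calibration fails."""
    results = [calibration_gate(agreement, thresholds)]
    if results[0].passed:
        results.append(regression_gate(candidate_mean, baseline_mean, tolerance))
    return results
```

A CI wrapper would then exit non-zero if any returned result has passed set to False; a run with only one result means the regression gate was never consulted.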

The gate thresholds were a design decision of their own: they were initialized just below the lexical baseline's measured agreement, so the floor means something and any judge that cannot beat word counting fails. When a stronger judge lands, the thresholds ratchet up to sit just below its numbers, so future regressions in judge quality get caught.
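
One way to express the ratchet, sketched under assumptions: thresholds sit a fixed MARGIN below a judge's measured agreement and can only move up. The margin value, the function name, and the stronger judge's numbers are invented for illustration; only the baseline figures (kappa 0.400, groundedness QWK 0.256) come from the results reported on this page.

```python
import math

MARGIN = 0.02  # assumed gap between measured agreement and threshold


def ratchet(current: dict, measured: dict, margin: float = MARGIN) -> dict:
    """Move each threshold to just below the measured value, never downward."""
    updated = dict(current)
    for stat, value in measured.items():
        if math.isnan(value):
            continue  # an undefined statistic cannot justify any threshold
        proposed = round(value - margin, 3)
        updated[stat] = max(current.get(stat, -math.inf), proposed)
    return updated


# Initialize from the lexical baseline's measured agreement.
baseline_measured = {"kappa": 0.400, "qwk_groundedness": 0.256}
thresholds = ratchet({}, baseline_measured)

# Hypothetical numbers for a stronger judge, purely for illustration.
stronger_measured = {"kappa": 0.700, "qwk_groundedness": 0.650}
raised = ratchet(thresholds, stronger_measured)

# A later, weaker measurement does not lower the bar.
unchanged = ratchet(raised, {"kappa": 0.300})
```

Taking the max with the current value is what makes this a ratchet: a threshold update driven by a worse judge leaves the floor where it was, so the worse judge fails the gate.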

What the numbers said

The lexical baseline reached kappa 0.400 on ship/no-ship agreement and a revealing gradient across dimensions: QWK 0.725 on coherence, 0.505 on relevance, and only 0.256 on groundedness. That gradient is the finding. Surface features nearly solve coherence, partially solve relevance, and hardly touch groundedness. That is the quantified argument for using an LLM judge on groundedness, and only there, once it beats the floor. The harness's failure-analysis report surfaces the worst judge-human disagreements ranked by size, and they were exactly the planted traps: negation flips scored 5 by the baseline and 1 by humans, correct refusals scored 1 by the baseline and 5 by humans.
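
A failure-analysis ranking of this kind can be sketched in a few lines: sort examples by the absolute gap between judge and human score and keep the top few. The record layout, the example IDs, and the "control" row are hypothetical; the trap scores (5 vs 1 and 1 vs 5) follow the pattern described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    example_id: str  # hypothetical identifier
    tag: str         # trap type, or "control" for an ordinary example
    human: int       # human groundedness score, 1-5
    judge: int       # judge groundedness score, 1-5


def worst_disagreements(rows, top=5):
    """Return the rows with the largest |judge - human| gap, largest first."""
    return sorted(rows, key=lambda r: abs(r.judge - r.human), reverse=True)[:top]


# Illustrative rows only; not data from the benchmark.
rows = [
    Scored("ex-a", "control", human=5, judge=4),
    Scored("ex-b", "negation_flip", human=1, judge=5),
    Scored("ex-c", "correct_refusal", human=5, judge=1),
    Scored("ex-d", "control", human=3, judge=3),
]

report = worst_disagreements(rows, top=2)
```

With these rows the two entries in report are the negation flip and the correct refusal, each with a gap of 4, while the control examples fall to the bottom of the ranking.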

Limitations

Thirty examples labeled by one person, me, are enough to demonstrate the machinery but not to stand as a production benchmark. The real version needs several independent annotators with inter-annotator agreement reported, examples sampled from real traffic, and a held-out split so the judge prompt is never tuned on the examples that certify it. Those caveats are written into the repo rather than discovered later, which I think is the more useful habit.