Eval engineering, a field study

Your LLM judge drifts 2 points between runs. Mine almost reverted a real cost win

A test retest sigma of 0.42 inside a session hid a 2.4 point swing between sessions. Here is the measurement, and the paired grading fix.

by Domantas Kazlauskas · June 17, 2026 · 9 min read

I built a loop to cut the cost of my production AI. It proposes a change, measures the saving, and grades the result against a quality bar before anything ships. On one of the first real cuts, the loop almost reverted a change that was completely fine.

The cut was good. The grader was not. The LLM I used as the judge had quietly become harsher between the day it set the bar and the day it graded the candidate, by more than two points out of ten, on answers that were nearly identical in content. Trust the obvious gate design and you throw away a real saving and file it as a quality regression. Here is the whole study, the cost work it came out of, and the fix.

The question

Can you let a loop optimize a production AI's cost on its own, without it shipping a quality regression you find out about in front of a customer? That is the whole game with self improving systems. The optimizer is the easy half. The half that has to be right is the gate that decides whether a change is safe to keep.

So I built the gate first and made it the product, then pointed it at my own system: an agronomy assistant that answers Lithuanian farmers in production, Sonnet 4.6 driving a few dozen tools. Everything below is measured on that system. The numbers are mine, not a benchmark borrowed from a paper.

How I measured

Blended cost per answer lies. It moves with cache temperature, with how many tools a question fires, and with how long the conversation already is, so an average hides which lever actually matters. I decomposed it two ways.

First, per query class, counted cache independently. I call the expensive tool offline, count the characters it returns, and convert to tokens, so the figure does not depend on whether the cache happened to be warm that minute. That isolates the component a change really touches.
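A minimal sketch of that cache independent count, assuming a fixed characters per token ratio. The ratio of 4.0 and both function names are illustrative assumptions, not values from the study; calibrating the ratio against a real tokenizer on your own payloads would be more accurate.

```python
import math
from typing import List


def estimate_tokens(payload: str, chars_per_token: float = 4.0) -> int:
    """Estimate a token count from character length alone.

    chars_per_token is an assumed ratio (hypothetical default), so the
    result does not depend on whether the cache was warm at call time.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(payload) / chars_per_token)


def component_tokens(payloads: List[str], chars_per_token: float = 4.0) -> int:
    """Sum the estimated tokens over every tool result one answer pulls in."""
    return sum(estimate_tokens(p, chars_per_token) for p in payloads)
```

Because the count comes from the tool output itself, a before and after comparison isolates the component the change touched.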

Second, the eval gated loop. It proposes one change, applies it to an in memory copy of the tool so the live request path is never touched, measures the targeted component, then runs a gate. The gate is the point: it is multi dimensional and absolute, and any single red reverts the change.

propose: next candidate from the backlog
apply: mutate one knob in isolation, reverted on exit
measure: count the component cache independently
gate: quality, cost, latency, new failure, shadow
decide: stage, revert, or quarantine
canary: a human approves, then a small canary

The shape of one cycle. A candidate ships only if every gate dimension passes at once, and any red reverts it automatically. A stage never auto deploys; a human approves it, then a small canary runs the rubric on both cohorts before full rollout.

method shape only, engine internals not shown
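One way the shape of a cycle could be sketched in Python. This is not the engine, whose internals are not shown; the names `isolated_change`, `decide`, and `run_cycle` are hypothetical, and the mapping of a red shadow dimension to quarantine is my reading of the rails described later, not a confirmed rule.

```python
from contextlib import contextmanager
from copy import deepcopy
from typing import Any, Callable, Dict, Iterator

GATE_DIMENSIONS = ("quality", "cost", "latency", "new_failure", "shadow")


@contextmanager
def isolated_change(
    tool_config: Dict[str, Any], knob: str, value: Any
) -> Iterator[Dict[str, Any]]:
    """Yield an in-memory copy with one knob mutated.

    The original config is never modified, so there is nothing to restore
    on exit: the copy is simply dropped.
    """
    trial = deepcopy(tool_config)
    trial[knob] = value
    yield trial


def decide(gate_results: Dict[str, bool]) -> str:
    """Map per-dimension results (True means green) to an outcome."""
    missing = [d for d in GATE_DIMENSIONS if d not in gate_results]
    if missing:
        raise ValueError(f"gate did not report: {missing}")
    if not all(gate_results[d] for d in GATE_DIMENSIONS if d != "shadow"):
        return "revert"       # any single red reverts the change
    if not gate_results["shadow"]:
        return "quarantine"   # assumption: held-out set disagrees, send to a human
    return "stage"            # staged for human approval, never auto deployed


def run_cycle(
    tool_config: Dict[str, Any],
    knob: str,
    value: Any,
    measure: Callable[[Dict[str, Any]], Any],
    gate: Callable[[Dict[str, Any], Any], Dict[str, bool]],
) -> str:
    """Apply one change in isolation, measure it, gate it, and decide."""
    with isolated_change(tool_config, knob, value) as trial:
        measurement = measure(trial)
        return decide(gate(trial, measurement))
```

The `measure` and `gate` callables are left to the caller on purpose: the sketch only fixes the order of the steps and the rule that one red dimension is enough to stop a candidate.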

Where the cost actually was

Across 44 production log lines the envelope ran from 0.59 cents to 9.36 cents per answer, median 3.75, mean 3.8. (The curated set of eight questions I publish elsewhere on the site averages 3.4 cents; this is the wider envelope, multi turn and cold starts included.)

The first thing I went looking for was a fat tail, one or two ruinous query classes carrying the bill. There is none. The most expensive class is 1.37 times the median on a warm cache and 1.71 times on a cold normalized basis. Nothing runs away.

The cost is cross cutting instead.

cache read: $0.30/M
fresh input: $3.00/M
cache write: $3.75/M
output: $15.00/M

What a million tokens costs in each bucket. Cache serves about 95 percent of input on a warm path, so the cheap buckets stay cheap. The bill lands on the two dear ones, output at fifteen dollars and cache write at almost four, which is why the real levers are output discipline and the size of what you write to cache.

rates: showcase published bucket prices, Sonnet 4.6
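To make the bucket arithmetic concrete, here is a small sketch that prices one answer from its token counts. The rates are the four listed above; the function name and the token counts in the usage note are illustrative assumptions, not measurements from my logs.

```python
from typing import Dict

# USD per million tokens, taken from the bucket prices above.
RATES_USD_PER_MTOK: Dict[str, float] = {
    "cache_read": 0.30,
    "fresh_input": 3.00,
    "cache_write": 3.75,
    "output": 15.00,
}


def answer_cost_usd(tokens: Dict[str, int]) -> Dict[str, float]:
    """Return the cost in USD per bucket for one answer."""
    unknown = set(tokens) - set(RATES_USD_PER_MTOK)
    if unknown:
        raise ValueError(f"unknown buckets: {sorted(unknown)}")
    return {
        bucket: count * RATES_USD_PER_MTOK[bucket] / 1_000_000
        for bucket, count in tokens.items()
    }


def total_cost_usd(tokens: Dict[str, int]) -> float:
    """Sum the per-bucket costs into one figure for the answer."""
    return sum(answer_cost_usd(tokens).values())
```

With hypothetical counts such as `{"cache_read": 40_000, "output": 1_200}`, the 1,200 output tokens cost 1.8 cents while the 40,000 cached input tokens cost 1.2 cents, which is the asymmetry the table is pointing at.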

Output tokens at fifteen dollars per million are the single dearest unit, and answers run roughly 600 to 1800 of them. The next driver is cache write, paid the first time a large tool result is written to cache; my agronomy tool returns big payloads, in the 10 to 26 KB range, one to seven of them per answer. The one time cold prefix and the uncached multi turn history fill in the rest, the latter adding two to four cents per follow up.

So the lever is not a magic class to delete. It is output discipline and the size of what you write to cache. I trimmed the worst payload by dropping two long free text columns the model never needed to cite.

full tool result payload, in tokens
before: 11,506
after: 8,258 (28 percent smaller)

The fungicide field, the worst offender: 2,543 to 763 tokens, 70 percent off, by dropping two long free text columns the model did not need to cite.

The one cut that shipped, measured on the component itself rather than the blended bill. The tool result dropped from 11,506 to 8,258 tokens, 28 percent, with the fungicide field down 70 percent. The loop reconciled this to the deployed estimator within $0.000039.

cost_decomp_s85 before and after, cache independent count

The judge changed its mind

Here is the part worth your time. To gate on quality I use an LLM as the judge, a different model from the one under test (Opus 4.8 grading Sonnet 4.6), scoring each answer against a rubric. Before trusting it I calibrated it: graded a frozen set of baseline answers three times in one sitting, and slipped in three deliberately broken answers as bait. It was tight and it was not fooled. The test retest standard deviation within that sitting was 0.42 points out of ten, and it caught all 3 bad anchors, a true negative rate of 1.0. By the usual checks, a reliable judge.

Then I graded a real candidate, the trimmed payload above, on a different day. The same answers that had scored around nine now came back at 6.8. Not because the answers changed; they were nearly identical. The judge had simply gotten harsher between sessions. Inside a session it is steady to within half a point. Between sessions it drifted more than two points.

The same answers, graded twice on different days. Inside one session the judge is steady, a test retest sigma of 0.42. Between sessions its stringency moved 2.4 points on identical content (9.18 then 6.8). The band on session A is plus or minus one sigma.

judge Opus 4.8, my 11 question bank, June 2026
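A sketch of how the two numbers can be put on the same scale. The pooled within item estimator below is one common way to compute a test retest sigma and is an assumption on my part about the method; the function names are hypothetical.

```python
import math
from statistics import variance
from typing import List


def retest_sigma(passes: List[List[float]]) -> float:
    """Pooled within-item standard deviation across repeated grading passes.

    passes[k][i] is the score of answer i on pass k, all in one session.
    """
    if len(passes) < 2:
        raise ValueError("need at least two passes")
    if len({len(p) for p in passes}) != 1:
        raise ValueError("every pass must grade the same answers")
    per_item = list(zip(*passes))  # regroup scores by answer
    pooled_var = sum(variance(scores) for scores in per_item) / len(per_item)
    return math.sqrt(pooled_var)


def drift_in_sigmas(session_a_mean: float, session_b_mean: float, sigma: float) -> float:
    """Express a between-session shift in units of within-session noise."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return abs(session_a_mean - session_b_mean) / sigma
```

Plugging in the figures from this study, `drift_in_sigmas(9.18, 6.8, 0.42)` comes out between 5 and 6 sigma, which is why a judge that looks tight inside one sitting can still move the bar between sittings.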

That breaks the obvious gate. The natural design is to grade your baseline once, freeze the score, and from then on compare every candidate to that frozen number. Do that here and the candidate reads 6.8 against a frozen 8.73, looks like a regression of nearly two points, and gets reverted. A real, safe, measured cost cut, thrown away by a measurement artifact.

Naive gate, frozen baseline (candidate and baseline graded on different sessions)
8.73 frozen baseline
6.8 candidate, later session
Result: reverts a good cut

Paired gate, same session (candidate and baseline graded together, order swapped)
9.27 baseline, same session
9.18 candidate, same session
Result: keeps it, delta is noise

The same cost cut, judged two ways. The naive gate reads the candidate (6.8, a later session) against a baseline frozen on an earlier one (8.73) and reverts a change that was fine. The paired gate grades candidate and baseline together (9.18 against 9.27, delta -0.09) and keeps it.

my bank, paired and unpaired grading, June 2026

The fix is to stop comparing across sessions. Grade the baseline and the candidate in the same session, order swapped so neither gains from going first, and compare them to each other rather than to a number from last week. Graded that way the candidate scored 9.18 against a baseline of 9.27, a delta of -0.09, well inside the judge's own noise. Quality held, the cut was kept, and the frozen score stays in the record as a historical anchor, not the live bar.
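A minimal sketch of order swapped paired grading. It assumes a judge that sees two answers in one call and returns a score for each in the order shown; that signature, and the name `paired_grades`, are hypothetical and may not match how your judge is invoked.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical judge: given (first_answer, second_answer) in display order,
# returns (first_score, second_score).
Judge = Callable[[str, str], Tuple[float, float]]


def paired_grades(
    judge: Judge, baseline: Sequence[str], candidate: Sequence[str]
) -> Tuple[float, float]:
    """Grade each pair twice in one session, once in each order.

    Averaging the two orders cancels any constant advantage from being
    shown first. Returns (baseline_mean, candidate_mean).
    """
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must cover the same questions")
    base_scores, cand_scores = [], []
    for b, c in zip(baseline, candidate):
        b_first, c_second = judge(b, c)   # baseline shown first
        c_first, b_second = judge(c, b)   # candidate shown first
        base_scores.append((b_first + b_second) / 2)
        cand_scores.append((c_first + c_second) / 2)
    return mean(base_scores), mean(cand_scores)
```

Both means come from the same session and the same judge state, so whatever stringency the judge has that day applies to both sides of the comparison.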

the gate rule, in principle

grade(candidate, baseline) together, one session, order swapped
keep quality if mean(candidate) >= mean(baseline) - margin
    margin = 2 * sigma / sqrt(n)        # sigma is the judge test retest noise
revert if any other dimension goes red  # cost, latency, new failure, shadow
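The rule above, written out as a Python sketch. The function names are hypothetical, and treating the 11 question bank as n is my assumption for the worked numbers; the margin formula is the one stated in the rule.

```python
import math
from typing import Dict


def quality_margin(sigma: float, n: int) -> float:
    """margin = 2 * sigma / sqrt(n), with sigma the judge test retest noise."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 2 * sigma / math.sqrt(n)


def quality_holds(candidate_mean: float, baseline_mean: float, sigma: float, n: int) -> bool:
    """True when the candidate is within the noise margin of the baseline."""
    return candidate_mean >= baseline_mean - quality_margin(sigma, n)


def gate(
    candidate_mean: float,
    baseline_mean: float,
    sigma: float,
    n: int,
    other_dimensions: Dict[str, bool],
) -> str:
    """Return 'keep' only if quality holds and every other dimension is green."""
    if not quality_holds(candidate_mean, baseline_mean, sigma, n):
        return "revert"
    if not all(other_dimensions.values()):
        return "revert"  # cost, latency, new failure, or shadow went red
    return "keep"
```

With sigma at 0.42 and n at 11 the margin is about 0.25, so the paired reading (9.18 against 9.27) passes while the cross session reading (6.8 against 8.73) fails by a wide gap. The inputs must come from the same session for the comparison to mean anything.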

The rails that make it safe

A gate that can keep or kill changes on its own needs guards. None of these are optional.

Multi dimensional, never one score.
Quality, targeted cost, cost per correct answer, latency, and new failure modes are all checked. Any one going red reverts the change.
A shadow set the optimizer never sees.
If the visible bank improves while the held out set stays flat, that is the signature of gaming the metric. The candidate is quarantined for a human, not shipped.
An independent judge.
The grader is a different model from the system under test, and any machine written candidate is graded order swapped to measure position bias.
Calibration before the loop runs.
If the judge's test retest noise is too high, or it stops catching the bad anchors, the loop refuses to start.
Convergence and budget caps.
The loop stops when grades go flat and tokens stop moving, after a few non shipping cycles in a row, or at a hard spend cap.
A human gate before deploy.
A passing candidate is staged, not shipped. A person approves it, then a small canary runs the same rubric on both cohorts with automatic rollback before full rollout.
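Two of these rails, calibration and the stop conditions, reduce to simple checks. The sketch below assumes hypothetical thresholds (a sigma ceiling of 0.5, a streak of 3, a 5 dollar cap); none of these values come from the study, so set your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    retest_sigma: float   # within-session test retest noise
    anchors_caught: int   # deliberately broken answers the judge flagged
    anchors_total: int    # deliberately broken answers slipped in


def may_start(cal: Calibration, max_sigma: float = 0.5) -> bool:
    """The loop refuses to start on a noisy judge or a missed bad anchor.

    max_sigma is an assumed threshold, not a value from the study.
    """
    return cal.retest_sigma <= max_sigma and cal.anchors_caught == cal.anchors_total


def should_stop(
    non_shipping_streak: int,
    spend_usd: float,
    max_streak: int = 3,          # assumed value for "a few cycles in a row"
    spend_cap_usd: float = 5.0,   # assumed hard spend cap
) -> bool:
    """Stop after a run of non-shipping cycles or at the spend cap."""
    return non_shipping_streak >= max_streak or spend_usd >= spend_cap_usd
```

The calibration from this study, a sigma of 0.42 with 3 of 3 anchors caught, would pass `may_start` under these assumed thresholds.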

What I did not do, and what is still thin

The honest part. One lever I expected to find, deduplicating repeated tool result writes, turned out to be a phantom: the cache already writes once and reads after, so the duplicate writes I went looking for never fire. I dropped it. The remaining cost levers on this system are largely tapped. This work is a gate and a discipline, not a money printer.

The latency dimension was cross session in the replay, so it only does real work on live cycles. And the bank is small, 11 questions with 5 flagged for a human, and the cross session drift number rests on a re grade of two passes. That is enough to prove the failure mode and the fix. It is not a population study. The honest next step is to widen the bank from production traces, with temporal holdouts, before leaning on it harder. The whole validation ran on already shipped data, at $0.000039 reconciled and zero new spend.

Why this matters past my farm

The specific cut here is small. The discipline is not. Most teams running an LLM in production cannot tell you what a single answer costs, let alone prove a change did not quietly degrade quality, because the thing they would measure quality with, an LLM judge, is itself noisy and drifts.

That is exactly the problem a fabricated citation poses to a legal AI, where an answer that reads fine can get a lawyer sanctioned, and it is the same problem for any support assistant, RAG app, or agent where a plausible wrong answer costs more than a visibly broken one. Measuring cost to the token and gating quality against the judge's own measured noise is what the Wyrum audit, sprint, and Model Watch retainer do on a client stack. The teardowns show each one running live.