all teardowns

Optimization Sprint, the teardown

The sprint, before and after, on real numbers

Your AI bill drops in weeks and the evals prove quality held. Below: the real cost history of our own system, what each technique contributes, and a simulator you can drag.

What you get, week by week

Week 1

Instrument first, exactly as in the audit, then ship the quick wins: cache breakpoints placed right, obvious output caps, retry hygiene. Cheap fixes land before the deep ones.

Weeks 2 to 3

The structural work: prompt caching architecture, model and tool routing, token budgets and output discipline. An eval harness runs from day one so every cut is gated on quality holding.

Week 4

Verification on production traffic, a before and after report reconciled against your provider bill, and handoff: your team owns the levers and the evals.

Forwardable as the scope of work. 2 to 4 weeks, fixed fee with the audit credited, pricing on request.

The real history, told honestly

agrotus, three measured run sets
first published run set5.5 cents meanthe system before consolidation, cold cache on several answers
second published run set3.6 cents meanafter tool consolidation, warm cache
this instrumented set3.4 cents meanper round traces on, all 8 reconciled to the estimator and now shown on the showcase

The honest part: the drop from 5.5 to 3.6 came from warm cache plus tool consolidation, not magic, and the three sets are never blended or averaged. Real systems vary run to run; a sprint moves the whole band down and the report shows you the run level evidence, exactly like this.

With and without caching, per question

Every published question against its own no caching counterfactual. The interactive simulator further down moves the same numbers; this is the underlying per question table.

Show the per question table, all 8
questionmeasuredno cachingmultiplier
q14.1c5.8c1.44x
q24.9c9.4c1.92x
q34.7c9.2c1.95x
q42.3c4.4c1.94x
q51.7c3.9c2.23x
q64.6c6.4c1.4x
q72.0c4.1c2.06x
q83.3c5.3c1.62x

method: every cached token repriced at the fresh input rate, same rounds, same outputcomputed

The cache simulator, on measured tokens

All 8 instrumented answers summed: 72.6% of input tokens came from cache in these runs. Drag the rate and watch what the same workload costs as caching degrades or improves.

computed
at 73% hit rate, the same 8 answers cost27.4 cents(3.43 cents per answer, 0.99x the measured bill)

method: linear between the no caching counterfactual (48.6 cents at 0%), the measured run (27.6 cents at 72.6%, cache writes priced as they really happened), and the all cache floor (16.7 cents at 100%), at published per token rates.

note: the showcase quotes about 95 percent cache hit, the steady state prefix hit rate in production traffic. The 72.6% marked here is a different, stricter measure: cache reads as a share of ALL input tokens within these 8 instrumented runs, including each run’s one time cache writes. Both are real; we label which is which.

Model cascading, the next lever

The cheapest token is the one you never send. The deepest lever past caching is not a cheaper answer, it is no language model call at all for the routine slice: a small task specific classifier handles intent routing, triage and structured extraction at near zero token cost, and the full model is reserved for synthesis. This is model cascading, a standard technique, and distillation is how the small model gets good enough to trust.

This is not hypothetical. The production system behind these numbers already runs a custom trained nine class classifier on its vision path, doing real perception with no language model token cost at all. The slider projects what happens to the model bill as a classifier absorbs more of a workload.

projection
at 40% of messages offloaded, blended model cost is2.07 centsper answer (40% below the measured 3.4 cent mean)

method: a projection, not a measurement. The measured mean here is 3.4 cents per answer across 8 instrumented answers. This models a workload where 40% of messages are routine enough for a task specific classifier to resolve, priced at zero language model tokens, with the full model reserved for synthesis. The classifier’s own inference is a fraction of a cent and not a language model bill; what moves is the model spend you stop paying.

What this is not: routing the hard answer to a weaker model. We tested that on the vision path, it misdiagnosed a disease, and we refused it. Cascading offloads the routine perception step; it does not downgrade the synthesis.stated

What each technique contributes

Prompt caching

measured here: without it these 8 answers cost 1.76x as much (per question range 1.4x to 2.23x)

computed
Tool and context routing

measured on the anatomy run: sending all 45 tool definitions every round instead of the routed 12 adds 4,616 tokens per round

computed
Output discipline

output tokens are 48% of this run set's cost at $15/M; capping and tightening answers is often the largest single lever

computed
Evals holding quality

the gate for every cut: we publish what we refused to optimize because quality paid for it

stated

On a client system the order is decided by the audit: biggest measured gap first. That is the whole trick, and it only works if you measured first.

What this does for your company

Cost

The levers above, shipped on your stack, with savings verified against the provider bill, not estimated.

measured
Reliability

The eval harness that gates the sprint stays with you; quality regressions surface before your customers find them.

stated
Speed

2 to 4 weeks, fixed scope. The audit already ranked the work, so the sprint starts cutting on day one.

stated

The honest ROI math

projected monthly savings

$12K to $36K

projected yearly savings

$144K to $432K

industry benchmark range, not a promisedocumented technique ranges, 30 to 90 percent

Existence proof from our own system: without prompt caching the 8 instrumented answers would cost 1.76x as much, so caching alone removes about 43 percent of that bill.measured

The floor is the guarantee, not a projection: money back if verified savings do not beat the fee within 90 days.

At this spend, engagements like this typically pay for themselves within weeks. Ask for pricing.

Why Wyrum

  • The techniques above run in our own production system with the receipts published, not in a deck.measured
  • The savings guarantee makes the sprint self funding or free.
  • One person builds, measures and hands off. The evals stay with you.

Projections on this page are labeled benchmark ranges; the only promises are the guarantees.industry benchmark range, not a promise

The guarantee

Money back if verified savings do not beat the fee within 90 days. The sprint pays for itself or you do not pay for it.

Ask for pricing

No prices on this site. You get a number and a scope in the first reply.