Optimization Sprint, the teardown
The sprint, before and after, on real numbers
Your AI bill drops in weeks and the evals prove quality held. Below: the real cost history of our own system, what each technique contributes, and a simulator you can drag.
What you get, week by week
Instrument first, exactly as in the audit, then ship the quick wins: cache breakpoints placed right, obvious output caps, retry hygiene. Cheap fixes land before the deep ones.
The structural work: prompt caching architecture, model and tool routing, token budgets and output discipline. An eval harness runs from day one so every cut is gated on quality holding.
Verification on production traffic, a before and after report reconciled against your provider bill, and handoff: your team owns the levers and the evals.
Forwardable as the scope of work. 2 to 4 weeks, fixed fee with the audit credited, pricing on request.
The real history, told honestly
The honest part: the drop from 5.5 to 3.6 came from warm cache plus tool consolidation, not magic, and the three sets are never blended or averaged. Real systems vary run to run; a sprint moves the whole band down and the report shows you the run level evidence, exactly like this.
With and without caching, per question
Every published question against its own no caching counterfactual. The interactive simulator further down moves the same numbers; this is the underlying per question table.
Show the per question table, all 8
| question | measured | no caching | multiplier |
|---|---|---|---|
| q1 | 4.1c | 5.8c | 1.44x |
| q2 | 4.9c | 9.4c | 1.92x |
| q3 | 4.7c | 9.2c | 1.95x |
| q4 | 2.3c | 4.4c | 1.94x |
| q5 | 1.7c | 3.9c | 2.23x |
| q6 | 4.6c | 6.4c | 1.4x |
| q7 | 2.0c | 4.1c | 2.06x |
| q8 | 3.3c | 5.3c | 1.62x |
method: every cached token repriced at the fresh input rate, same rounds, same outputcomputed
The cache simulator, on measured tokens
All 8 instrumented answers summed: 72.6% of input tokens came from cache in these runs. Drag the rate and watch what the same workload costs as caching degrades or improves.
method: linear between the no caching counterfactual (48.6 cents at 0%), the measured run (27.6 cents at 72.6%, cache writes priced as they really happened), and the all cache floor (16.7 cents at 100%), at published per token rates.
note: the showcase quotes about 95 percent cache hit, the steady state prefix hit rate in production traffic. The 72.6% marked here is a different, stricter measure: cache reads as a share of ALL input tokens within these 8 instrumented runs, including each run’s one time cache writes. Both are real; we label which is which.
Model cascading, the next lever
The cheapest token is the one you never send. The deepest lever past caching is not a cheaper answer, it is no language model call at all for the routine slice: a small task specific classifier handles intent routing, triage and structured extraction at near zero token cost, and the full model is reserved for synthesis. This is model cascading, a standard technique, and distillation is how the small model gets good enough to trust.
This is not hypothetical. The production system behind these numbers already runs a custom trained nine class classifier on its vision path, doing real perception with no language model token cost at all. The slider projects what happens to the model bill as a classifier absorbs more of a workload.
method: a projection, not a measurement. The measured mean here is 3.4 cents per answer across 8 instrumented answers. This models a workload where 40% of messages are routine enough for a task specific classifier to resolve, priced at zero language model tokens, with the full model reserved for synthesis. The classifier’s own inference is a fraction of a cent and not a language model bill; what moves is the model spend you stop paying.
What this is not: routing the hard answer to a weaker model. We tested that on the vision path, it misdiagnosed a disease, and we refused it. Cascading offloads the routine perception step; it does not downgrade the synthesis.stated
What each technique contributes
measured here: without it these 8 answers cost 1.76x as much (per question range 1.4x to 2.23x)
computedmeasured on the anatomy run: sending all 45 tool definitions every round instead of the routed 12 adds 4,616 tokens per round
computedoutput tokens are 48% of this run set's cost at $15/M; capping and tightening answers is often the largest single lever
computedthe gate for every cut: we publish what we refused to optimize because quality paid for it
statedOn a client system the order is decided by the audit: biggest measured gap first. That is the whole trick, and it only works if you measured first.
What this does for your company
The levers above, shipped on your stack, with savings verified against the provider bill, not estimated.
measuredThe eval harness that gates the sprint stays with you; quality regressions surface before your customers find them.
stated2 to 4 weeks, fixed scope. The audit already ranked the work, so the sprint starts cutting on day one.
statedThe honest ROI math
projected monthly savings
$12K to $36K
projected yearly savings
$144K to $432K
Existence proof from our own system: without prompt caching the 8 instrumented answers would cost 1.76x as much, so caching alone removes about 43 percent of that bill.measured
The floor is the guarantee, not a projection: money back if verified savings do not beat the fee within 90 days.
At this spend, engagements like this typically pay for themselves within weeks. Ask for pricing.
Why Wyrum
- The techniques above run in our own production system with the receipts published, not in a deck.measured
- The savings guarantee makes the sprint self funding or free.
- One person builds, measures and hands off. The evals stay with you.
Projections on this page are labeled benchmark ranges; the only promises are the guarantees.industry benchmark range, not a promise
The guarantee
Money back if verified savings do not beat the fee within 90 days. The sprint pays for itself or you do not pay for it.
Ask for pricingNo prices on this site. You get a number and a scope in the first reply.
