We added five LLM-powered features to our learner platform last year. Our monthly inference bill peaked at 11% of revenue, then dropped to 3.6% — without removing a single feature. Here is the playbook, in the order we'd apply it again.
Cache aggressively, but at the right key
Most teams cache on the raw prompt. We cache on the prompt plus the slice of application state that actually drives the answer. A 30-day cache with intelligent invalidation cut our token spend by 47% before we touched anything else. Hash the part of the input that determines the answer, not the whole payload.
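A minimal sketch of the keying idea, using an in-process dict for clarity (production would use Redis or similar). The field names `course_id` and `learner_level` and the `generate` callable are hypothetical stand-ins for whatever actually drives your feature's answer:

```python
import hashlib
import json
import time

CACHE_TTL = 30 * 24 * 3600  # 30 days, matching the cache window above
_cache: dict[str, tuple[float, str]] = {}

def cache_key(prompt: str, state: dict) -> str:
    """Hash only the fields that drive the answer.

    The point is what the key EXCLUDES: timestamps, session ids, and
    other volatile payload fields that would turn every request into
    a cache miss.
    """
    relevant = {
        "prompt": prompt,
        "course_id": state.get("course_id"),        # hypothetical field
        "learner_level": state.get("learner_level"), # hypothetical field
    }
    blob = json.dumps(relevant, sort_keys=True)  # deterministic ordering
    return hashlib.sha256(blob.encode()).hexdigest()

def cached_generate(prompt: str, state: dict, generate) -> str:
    key = cache_key(prompt, state)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL:
        return hit[1]
    out = generate(prompt)  # your actual model call goes here
    _cache[key] = (time.time(), out)
    return out
```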
Route to the cheapest model that passes evals
Build a tiny eval set per feature: 50 inputs with known-good outputs. Test every candidate model against it before routing traffic. Half our features run on Haiku at 1/15th the cost of the larger model, with zero quality regression. The eval set is the wedge.
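Here's a sketch of that harness, with hypothetical names throughout: `judge` is whatever comparison fits the feature (exact match, a rubric, or an LLM grader), and `models_by_cost` is your candidate list sorted cheapest first:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected: str  # the known-good output

def passes_evals(model_fn: Callable[[str], str],
                 cases: list[EvalCase],
                 judge: Callable[[str, str], bool],
                 threshold: float = 1.0) -> bool:
    """Run the ~50-case feature eval set against a candidate model.

    threshold=1.0 encodes 'zero quality regression': every case
    must pass before the cheaper model takes traffic.
    """
    passed = sum(judge(model_fn(c.input), c.expected) for c in cases)
    return passed / len(cases) >= threshold

def cheapest_passing_model(models_by_cost, cases, judge):
    """models_by_cost: list of (name, call_fn), sorted cheap-first."""
    for name, fn in models_by_cost:
        if passes_evals(fn, cases, judge):
            return name
    raise RuntimeError("no candidate model passed the eval set")
```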
Stream and stop early
For chat-shaped features, stream tokens and stop generating as soon as the user has the answer they need. Add a stop sequence that matches how the UI actually consumes the output. That change alone saved another 18%.
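A sketch using the Anthropic Python SDK's streaming interface; the model id, the `<answer>` tag convention, and the prompt wording are illustrative. The key move is that `stop_sequences` cuts generation at exactly the point the UI stops rendering, so you never pay for trailing tokens the user would never see:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

def answer(question: str) -> str:
    """Stream a UI-shaped answer, stopping at the tag the UI strips anyway.

    We prompt the model to wrap its answer in <answer>...</answer>; the UI
    only renders what's inside the tags, so stopping at the close tag
    trims any post-answer elaboration.
    """
    chunks = []
    with client.messages.stream(
        model="claude-3-haiku-20240307",   # illustrative model id
        max_tokens=512,
        stop_sequences=["</answer>"],
        messages=[{
            "role": "user",
            "content": f"Answer inside <answer> tags:\n{question}",
        }],
    ) as stream:
        for text in stream.text_stream:
            chunks.append(text)  # forward each delta to the client here
            # Breaking out of this loop (say, the user navigated away)
            # exits the context manager and closes the stream, ending
            # generation early.
    full = "".join(chunks).strip()
    return full.removeprefix("<answer>").strip()
```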
The habit nobody talks about
Log every prompt, every response, and every cost, then review the bottom decile by quality every Friday. The bugs that drive cost are almost always quality bugs in disguise. Once you fix the quality, the cost falls out.
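A minimal version of that logging loop, assuming SQLite and a 0-to-1 `quality` score from whatever grader you already run; the table and column names are made up:

```python
import sqlite3

db = sqlite3.connect("llm_log.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS calls (
        ts REAL, feature TEXT, prompt TEXT, response TEXT,
        cost_usd REAL, quality REAL  -- grader score, 0..1
    )
""")

def log_call(feature, prompt, response, cost_usd, quality):
    db.execute(
        "INSERT INTO calls VALUES (strftime('%s','now'), ?, ?, ?, ?, ?)",
        (feature, prompt, response, cost_usd, quality),
    )
    db.commit()

def bottom_decile(feature: str):
    """Friday review queue: the worst 10% of the week's calls by quality."""
    week_ago = "strftime('%s','now') - 7*24*3600"
    n = db.execute(
        f"SELECT COUNT(*) FROM calls WHERE feature = ? AND ts > {week_ago}",
        (feature,),
    ).fetchone()[0]
    return db.execute(
        f"""SELECT prompt, response, cost_usd, quality FROM calls
            WHERE feature = ? AND ts > {week_ago}
            ORDER BY quality ASC LIMIT ?""",
        (feature, max(1, n // 10)),
    ).fetchall()
```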