The average engineering team doubled its LLM token usage last year. Heavy users quadrupled it. Almost none of them looked at where those tokens actually go.
Here is where they go: 69% of your input tokens are system prompts. Internal instructions, tool guidance, policy definitions, context scaffolding — text that is identical on every single call, re-sent, re-processed, and billed in full every time. And only 28% of LLM calls use cached tokens, meaning 72% of teams are paying full price for computation they already did.
This is not an oversight. It is an engineering leadership failure that compounds with every new feature you ship.
The maths are not in your favour
Token costs scale with volume, and volume is accelerating. Datadog’s 2026 State of AI Engineering report — drawn from production telemetry across thousands of customers — found median token usage per request more than doubled year-over-year. At the 90th percentile it quadrupled. Context windows have expanded from 128K to 2M tokens on some pricing tiers, which means the ceiling on how expensive a single call can get is far higher than most teams modelled.
If your system prompt runs to 2,000 tokens and you make 100,000 LLM calls a day, you send 200 million tokens of identical text to the model every 24 hours. At Claude Sonnet rates, that is around $600/day for text that never changes. With prompt caching enabled, the same prefix costs about $60/day. You are burning $540 a day — roughly $197,000 a year — on redundant computation because nobody configured the cache header.
Multiply that across a moderately busy agentic system and you understand why AI infrastructure is now a top-five cloud expense for companies that were not budgeting for it.
ProjectDiscovery ran the experiment
ProjectDiscovery published the clearest before-and-after case study available on this. Their cache hit rate sat at 7% before they looked at it. One change — moving the static prefix to the correct position so the model could actually cache it — pushed the rate to 74% overnight. Incremental work over the following weeks brought it to 84%.
The cost reduction was 59% against baseline. In the final 10-day period they measured, savings reached 70%. They served 9.8 billion tokens from cache. For their most complex security audit tasks, 1,200+ steps and 67M tokens per run, the cache hit rate was 91.8%.
Their summary is worth quoting: those audits are “now economically viable to run repeatedly.” Before caching, that class of task was something you thought twice about. After caching, you run it whenever it makes sense.
That is the real business impact. Not a smaller invoice — a different decision about what AI-assisted work you can afford to do.
The observability gap
Most teams have no LLM cost visibility at the call level. They see a monthly bill from their provider and a general spend trend. They do not see which feature generates the expensive calls, which system prompt is the longest, or what the cache hit rate is per endpoint. That is not a vendor problem — observability tooling exists. It is a prioritisation problem.
LLM cost observability is the prerequisite for every other cost decision. You cannot fix a prompt you cannot measure. You cannot catch the engineer who added 3,000 tokens of “helpful context” to a system prompt last Tuesday if you do not instrument at the call level.
The Datadog report is direct on the underlying issue: operational complexity, not model intelligence, is now the primary barrier to reliable AI at scale. Cost is one dimension of that. Rate limits — which caused nearly 60% of AI request failures in production as of February 2026 — are another. Both symptoms point to teams that shipped fast and skipped the operational layer.
Why this lands on leadership
Every token sent to an LLM is an architectural decision someone made. The system prompt length, the call frequency, the presence or absence of caching — all of it was decided, deliberately or by default.
Most teams made these choices at prototype stage, when cost was irrelevant and the goal was to get the thing working. Nobody went back. The prototype architecture became the production architecture, and now it scales with load. Token usage doubles and the system prompt that made sense at 10,000 calls a month gets re-sent 1.5 million times in month twelve. The architecture held. The cost assumption did not.
That is what makes this a leadership accountability problem rather than technical debt. Technical debt is accumulated complexity the team plans to address. This is money leaving the account every day because no engineering leader made the call to treat LLM cost as an operational concern alongside latency, availability, and error rate.
Three things to do before the bill arrives
Instrument first. Add LLM observability at the call level: token count in and out, cache hit/miss, cost per call, system prompt version. Langfuse, Helicone, and Datadog LLM Observability all do this. Pick one and turn it on this sprint. You cannot fix what you cannot see.
Audit your system prompts. Pull the top five by call volume. Measure the token length. Ask whether each one needs to be sent in full on every call or whether the static prefix could be cached. The answer is almost always the latter.
Move the static content to the top. Cache implementations work on prefix matching — the model caches the computation for a static prefix and reuses it when the same prefix reappears. If your system prompt has variable content early (user name, session ID, dynamic context) and static content later, flip the order. The ProjectDiscovery result was a single-day change. The rate went from 7% to 74% overnight because the static content was in the wrong position.
None of this requires a rewrite. It requires someone to own it.
If your team is shipping LLM features without cost observability and you want to get ahead of the bill, talk to us.
Sources