What problem does this solve?
Anthropic and OpenAI both offer provider-side prompt caching that can reduce costs by 50–90% for requests with repeated context (system prompts, RAG chunks, long documents). Today, clients must explicitly opt in by adding cache_control breakpoints (Anthropic) or using cache-eligible models (OpenAI). Most clients do not do this, leaving significant cost savings on the table.
This is distinct from Routerly's semantic routing cache — provider-native caching is applied at the upstream API level and affects billing directly.
Proposed solution
Anthropic cache_control injection
For models on the anthropic adapter, automatically inject cache_control: { type: "ephemeral" } at the optimal breakpoint in the message array — typically the last turn of the system prompt or the last user message before the new user query. This converts a standard OpenAI-format request into an Anthropic cache-eligible request transparently.
Config:
{ "provider": "anthropic", "promptCaching": "auto" }
"auto" — inject cache_control at the heuristically determined breakpoint
"passthrough" — forward whatever the client sends (current behaviour)
"disabled" — strip any existing cache_control markers
OpenAI prompt caching
OpenAI's prompt caching is automatic for qualifying models (gpt-4o, o-series) with no client-side changes needed. Routerly should:
- Detect when a cached prompt token discount appears in the usage response
- Log
cachedTokens in the usage record separately from inputTokens
- Surface cache hit rate in the dashboard
Alternatives you've considered
Requiring clients to add cache_control themselves. Works but requires educating every caller and modifying every integration. Most teams never do it.
Who would benefit from this?
Anyone using Anthropic models with long system prompts or RAG-style context injection. Prompt caching typically reduces input token costs by 50–90% for the cached portion, with no quality impact.
Additional context
Anthropic charges 0.1× for cache-read input tokens. For a system prompt of 10K tokens repeated across every request, the savings are immediate and significant. This is a pure cost optimisation that Routerly can provide transparently without any changes to the client application.
What problem does this solve?
Anthropic and OpenAI both offer provider-side prompt caching that can reduce costs by 50–90% for requests with repeated context (system prompts, RAG chunks, long documents). Today, clients must explicitly opt in by adding
cache_controlbreakpoints (Anthropic) or using cache-eligible models (OpenAI). Most clients do not do this, leaving significant cost savings on the table.This is distinct from Routerly's semantic routing cache — provider-native caching is applied at the upstream API level and affects billing directly.
Proposed solution
Anthropic
cache_controlinjectionFor models on the
anthropicadapter, automatically injectcache_control: { type: "ephemeral" }at the optimal breakpoint in the message array — typically the last turn of the system prompt or the last user message before the new user query. This converts a standard OpenAI-format request into an Anthropic cache-eligible request transparently.Config:
{ "provider": "anthropic", "promptCaching": "auto" }"auto"— injectcache_controlat the heuristically determined breakpoint"passthrough"— forward whatever the client sends (current behaviour)"disabled"— strip any existingcache_controlmarkersOpenAI prompt caching
OpenAI's prompt caching is automatic for qualifying models (gpt-4o, o-series) with no client-side changes needed. Routerly should:
cachedTokensin the usage record separately frominputTokensAlternatives you've considered
Requiring clients to add
cache_controlthemselves. Works but requires educating every caller and modifying every integration. Most teams never do it.Who would benefit from this?
Anyone using Anthropic models with long system prompts or RAG-style context injection. Prompt caching typically reduces input token costs by 50–90% for the cached portion, with no quality impact.
Additional context
Anthropic charges 0.1× for cache-read input tokens. For a system prompt of 10K tokens repeated across every request, the savings are immediate and significant. This is a pure cost optimisation that Routerly can provide transparently without any changes to the client application.