Skip to content

[Feature] Provider-native prompt caching — automatically enable Anthropic cache_control and OpenAI prompt caching #97

@carlosatta

Description

@carlosatta

What problem does this solve?

Anthropic and OpenAI both offer provider-side prompt caching that can reduce costs by 50–90% for requests with repeated context (system prompts, RAG chunks, long documents). Today, clients must explicitly opt in by adding cache_control breakpoints (Anthropic) or using cache-eligible models (OpenAI). Most clients do not do this, leaving significant cost savings on the table.

This is distinct from Routerly's semantic routing cache — provider-native caching is applied at the upstream API level and affects billing directly.

Proposed solution

Anthropic cache_control injection

For models on the anthropic adapter, automatically inject cache_control: { type: "ephemeral" } at the optimal breakpoint in the message array — typically the last turn of the system prompt or the last user message before the new user query. This converts a standard OpenAI-format request into an Anthropic cache-eligible request transparently.

Config:

{ "provider": "anthropic", "promptCaching": "auto" }
  • "auto" — inject cache_control at the heuristically determined breakpoint
  • "passthrough" — forward whatever the client sends (current behaviour)
  • "disabled" — strip any existing cache_control markers

OpenAI prompt caching

OpenAI's prompt caching is automatic for qualifying models (gpt-4o, o-series) with no client-side changes needed. Routerly should:

  1. Detect when a cached prompt token discount appears in the usage response
  2. Log cachedTokens in the usage record separately from inputTokens
  3. Surface cache hit rate in the dashboard

Alternatives you've considered

Requiring clients to add cache_control themselves. Works but requires educating every caller and modifying every integration. Most teams never do it.

Who would benefit from this?

Anyone using Anthropic models with long system prompts or RAG-style context injection. Prompt caching typically reduces input token costs by 50–90% for the cached portion, with no quality impact.

Additional context

Anthropic charges 0.1× for cache-read input tokens. For a system prompt of 10K tokens repeated across every request, the savings are immediate and significant. This is a pure cost optimisation that Routerly can provide transparently without any changes to the client application.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions