[Feature] Provider-native prompt caching — automatically enable Anthropic cache_control and OpenAI prompt caching

## What problem does this solve?

Anthropic and OpenAI both offer provider-side prompt caching that can reduce costs by 50–90% for requests with repeated context (system prompts, RAG chunks, long documents). Today, clients must explicitly opt in by adding `cache_control` breakpoints (Anthropic) or using cache-eligible models (OpenAI). Most clients do not do this, leaving significant cost savings on the table.

This is distinct from Routerly's semantic routing cache — provider-native caching is applied at the upstream API level and affects billing directly.

## Proposed solution

**Anthropic `cache_control` injection**

For models on the `anthropic` adapter, automatically inject `cache_control: { type: "ephemeral" }` at the optimal breakpoint in the message array — typically the last turn of the system prompt or the last user message before the new user query. This converts a standard OpenAI-format request into an Anthropic cache-eligible request transparently.

Config:
```json
{ "provider": "anthropic", "promptCaching": "auto" }
```
- `"auto"` — inject `cache_control` at the heuristically determined breakpoint
- `"passthrough"` — forward whatever the client sends (current behaviour)
- `"disabled"` — strip any existing `cache_control` markers

**OpenAI prompt caching**

OpenAI's prompt caching is automatic for qualifying models (gpt-4o, o-series) with no client-side changes needed. Routerly should:
1. Detect when a cached prompt token discount appears in the usage response
2. Log `cachedTokens` in the usage record separately from `inputTokens`
3. Surface cache hit rate in the dashboard

## Alternatives you've considered

Requiring clients to add `cache_control` themselves. Works but requires educating every caller and modifying every integration. Most teams never do it.

## Who would benefit from this?

Anyone using Anthropic models with long system prompts or RAG-style context injection. Prompt caching typically reduces input token costs by 50–90% for the cached portion, with no quality impact.

## Additional context

Anthropic charges 0.1× for cache-read input tokens. For a system prompt of 10K tokens repeated across every request, the savings are immediate and significant. This is a pure cost optimisation that Routerly can provide transparently without any changes to the client application.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Provider-native prompt caching — automatically enable Anthropic cache_control and OpenAI prompt caching #97

What problem does this solve?

Proposed solution

Alternatives you've considered

Who would benefit from this?

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Feature] Provider-native prompt caching — automatically enable Anthropic cache_control and OpenAI prompt caching #97

Description

What problem does this solve?

Proposed solution

Alternatives you've considered

Who would benefit from this?

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions