New `warm-first` Load Balancer #166

Add a backend-agnostic `warm-first` load balancer that prefers endpoints where the requested model is already resident in memory. This avoids cold-load latency (10-30s weight load, plus possible LRU eviction) on multi-model backends.

This generalises the backend-specific approach proposed in #152 into a strategy that works across any backend able to report model residency.

| Backend | Residency source | Notes |
|---|---|---|
| oMLX | `GET /v1/models/status` (`loaded: bool`) | multi-model |
| Ollama | `GET /api/ps` | multi-model |
| LM Studio | JIT model state | hot/cold |
| vLLM / SGLang / llama.cpp | single-model | always warm |
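
As an illustration of how one row of the table could be turned into a residency signal, here is a minimal sketch that extracts the loaded model names from an Ollama `GET /api/ps` response body. The helper name `ollama_resident_models` is hypothetical, and the sample payload is abbreviated for illustration; it is not a captured response.

```python
from typing import Any, Dict, Set


def ollama_resident_models(ps_payload: Dict[str, Any]) -> Set[str]:
    """Collect the names of models listed in an Ollama `GET /api/ps` body."""
    names: Set[str] = set()
    for entry in ps_payload.get("models", []):
        # Entries carry a "name" and a "model" field; accept either if present.
        for key in ("name", "model"):
            value = entry.get(key)
            if value:
                names.add(value)
    return names


# Abbreviated, hand-written sample payload (assumption, for illustration only).
sample_ps = {"models": [{"name": "llama3:latest", "model": "llama3:latest"}]}

resident = ollama_resident_models(sample_ps)
llama_is_warm = "llama3:latest" in resident
mistral_is_warm = "mistral:latest" in resident
```

Other backends would need their own adapter. Single-model backends could simply report their one model as always resident.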

The `warm-first` balancer would be composed as something like `sticky( warmFirst( baseBalancer ) )`, with selection logic along these lines:

```
Select(candidates, requestedModel):
    warm = [e for e in candidates if residency.IsResident(e, requestedModel) == (true, true)]
    if warm is non-empty:
        return base.Select(warm, requestedModel)    # delegate within warm set
    return base.Select(candidates, requestedModel)  # nobody warm → normal behaviour
```
```
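
The pseudocode above can be sketched as a small runnable wrapper. This is only an illustration under stated assumptions: `Endpoint`, `first_available`, and `make_warm_first` are hypothetical names, and the `(true, true)` pair is read here as `(resident, known)`, which the proposal does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Endpoint:
    name: str


# Assumption: the residency check returns (resident, known).
ResidencyFn = Callable[[Endpoint, str], Tuple[bool, bool]]
SelectFn = Callable[[Sequence[Endpoint], str], Endpoint]


def first_available(candidates: Sequence[Endpoint], model: str) -> Endpoint:
    """Stand-in base balancer: always picks the first candidate."""
    return candidates[0]


def make_warm_first(base: SelectFn, is_resident: ResidencyFn) -> SelectFn:
    """Wrap a base balancer so it prefers endpoints with the model loaded."""

    def select(candidates: Sequence[Endpoint], model: str) -> Endpoint:
        warm: List[Endpoint] = [
            e for e in candidates if is_resident(e, model) == (True, True)
        ]
        if warm:
            return base(warm, model)    # delegate within the warm set
        return base(candidates, model)  # nobody warm: normal behaviour

    return select


# Usage with an in-memory residency table (hypothetical data).
a, b, c = Endpoint("a"), Endpoint("b"), Endpoint("c")
loaded = {("b", "llama3"), ("c", "llama3")}


def table_residency(e: Endpoint, model: str) -> Tuple[bool, bool]:
    return ((e.name, model) in loaded, True)


select = make_warm_first(first_available, table_residency)
picked_warm = select([a, b, c], "llama3")   # a is cold, so the warm set is [b, c]
picked_cold = select([a, b, c], "mistral")  # no warm endpoint, falls back to all
```

Because the wrapper only filters the candidate list and then delegates, any existing balancer can serve as the base without modification.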

