Add a backend-agnostic `warm-first` load balancer that prefers endpoints where the requested model is already resident in memory, avoiding cold-load latency (10-30s weight load plus possible LRU eviction) on multi-model backends.
This generalises the backend-specific approach proposed in #152 into a strategy that works across any backend able to report model residency.
| Backend | Source | Notes |
| --- | --- | --- |
| oMLX | `GET /v1/models/status` (`loaded: bool`) | multi-model |
| Ollama | `GET /api/ps` | multi-model |
| LM Studio | JIT model state | hot/cold |
| vLLM / SGLang / llama.cpp | single-model | always warm |
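To make the residency lookup concrete, here is a minimal sketch of how one source from the table could feed a shared cache. It assumes the Ollama `GET /api/ps` body is a JSON object with a `models` array whose entries carry a `name` field, and reads only that field. `ResidencyCache`, `UpdateFromOllamaPS` and the endpoint URLs are hypothetical names for illustration, not part of any existing code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// psResponse mirrors only the part of an Ollama /api/ps body used here.
type psResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

// ResidencyCache is a hypothetical snapshot: endpoint -> set of loaded models.
type ResidencyCache struct {
	loaded map[string]map[string]bool
}

func NewResidencyCache() *ResidencyCache {
	return &ResidencyCache{loaded: make(map[string]map[string]bool)}
}

// UpdateFromOllamaPS replaces the snapshot for one endpoint with the models
// listed in an /api/ps response body.
func (c *ResidencyCache) UpdateFromOllamaPS(endpoint string, body []byte) error {
	var ps psResponse
	if err := json.Unmarshal(body, &ps); err != nil {
		return err
	}
	models := make(map[string]bool, len(ps.Models))
	for _, m := range ps.Models {
		models[m.Name] = true
	}
	c.loaded[endpoint] = models
	return nil
}

// IsResident returns (resident, known). known is false when the endpoint has
// never reported, so callers can tell "cold" apart from "no data".
func (c *ResidencyCache) IsResident(endpoint, model string) (resident, known bool) {
	models, ok := c.loaded[endpoint]
	if !ok {
		return false, false
	}
	return models[model], true
}

func main() {
	cache := NewResidencyCache()
	body := []byte(`{"models":[{"name":"llama3:8b"}]}`)
	if err := cache.UpdateFromOllamaPS("http://ollama-a:11434", body); err != nil {
		panic(err)
	}
	resident, known := cache.IsResident("http://ollama-a:11434", "llama3:8b")
	fmt.Println("ollama-a llama3:8b:", resident, known)
	resident, known = cache.IsResident("http://ollama-b:11434", "llama3:8b")
	fmt.Println("ollama-b llama3:8b:", resident, known)
}
```

The two-value return keeps "cold" and "unknown" distinct, which matters for single-model backends that never report residency but are always warm.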
The `warm-first` balancer would wrap the base balancer, composing as something like `sticky(warmFirst(baseBalancer))`, with high-level selection logic like this:
    Select(candidates, requestedModel):
        warm = [e for e in candidates if residency.IsResident(e, requestedModel) == (true, true)]
        if warm is non-empty:
            return base.Select(warm, requestedModel)    # delegate within warm set
        return base.Select(candidates, requestedModel)  # nobody warm → normal behaviour
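The pseudocode above could translate into Go roughly as follows. This is a sketch under assumptions: the `Balancer` and `Residency` interfaces, the `WarmFirst` type, and the `firstBalancer` and `staticResidency` stand-ins are all hypothetical, and endpoints are reduced to plain strings. The `(true, true)` comparison is read as `(resident, known)`.

```go
package main

import (
	"errors"
	"fmt"
)

// Residency reports whether a model is loaded on an endpoint.
// The second value is false when the backend's state is unknown.
type Residency interface {
	IsResident(endpoint, model string) (resident, known bool)
}

// Balancer is an assumed minimal shape for the existing base balancers.
type Balancer interface {
	Select(candidates []string, model string) (string, error)
}

// WarmFirst narrows candidates to warm endpoints, then delegates to base.
type WarmFirst struct {
	base      Balancer
	residency Residency
}

func (w *WarmFirst) Select(candidates []string, model string) (string, error) {
	warm := make([]string, 0, len(candidates))
	for _, e := range candidates {
		if resident, known := w.residency.IsResident(e, model); resident && known {
			warm = append(warm, e)
		}
	}
	if len(warm) > 0 {
		return w.base.Select(warm, model) // delegate within the warm set
	}
	return w.base.Select(candidates, model) // nobody warm: normal behaviour
}

// firstBalancer is a trivial stand-in base that picks the first candidate.
type firstBalancer struct{}

func (firstBalancer) Select(candidates []string, _ string) (string, error) {
	if len(candidates) == 0 {
		return "", errors.New("no candidates")
	}
	return candidates[0], nil
}

// staticResidency maps each endpoint to the single model it has loaded.
type staticResidency map[string]string

func (s staticResidency) IsResident(endpoint, model string) (bool, bool) {
	loaded, ok := s[endpoint]
	return ok && loaded == model, ok
}

func main() {
	lb := &WarmFirst{
		base:      firstBalancer{},
		residency: staticResidency{"a": "mistral", "b": "llama3"},
	}
	picked, err := lb.Select([]string{"a", "b"}, "llama3")
	fmt.Println(picked, err)
}
```

Because the wrapper only filters the candidate list, the base balancer's own policy still decides among warm endpoints, and the fallback path is identical to running without the wrapper.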