New warm-first Load Balancer #166

@thushan

Description

Add a backend-agnostic warm-first load balancer that prefers endpoints where the requested model is already resident in memory, avoiding cold-load latency (10-30s weight load, plus possible LRU eviction of another model) on multi-model backends.

This generalises the backend-specific approach proposed in #152 into a strategy that works across any backend able to report model residency.

| Backend | Source | Notes |
| --- | --- | --- |
| oMLX | GET /v1/models/status (loaded: bool) | multi-model |
| Ollama | GET /api/ps | multi-model |
| LM Studio | JIT model state | hot/cold |
| vLLM / SGLang / llama.cpp | single-model | always warm |
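
As a rough illustration of how one of these sources could feed a residency check, here is a minimal sketch that turns the body of an Ollama `GET /api/ps` response into a set of loaded model names. The function name `resident_models_from_ollama_ps` and the sample payload are hypothetical, and the sketch assumes only that the response carries a `models` array whose entries have a `name` field; polling, caching and the other backends are left out.

```python
import json


def resident_models_from_ollama_ps(payload: str) -> set[str]:
    """Return the names of models reported as loaded in an /api/ps response body.

    Hypothetical helper for illustration; not part of any existing codebase.
    """
    data = json.loads(payload)
    # Assumption: loaded models are listed under "models", each with a "name".
    models = data.get("models") or []
    return {m["name"] for m in models if "name" in m}


# Hand-written sample payload, trimmed to the fields the sketch relies on.
sample = '{"models": [{"name": "llama3:8b", "model": "llama3:8b"}]}'
resident = resident_models_from_ollama_ps(sample)
print("llama3:8b" in resident)
```

An empty or missing `models` array yields an empty set, which a balancer could treat as "nothing warm on this endpoint".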

The warm-first balancer would wrap the base balancer, composed as something like sticky(warmFirst(baseBalancer)), and would select at a high level like this:

Select(candidates, requestedModel):
    warm = [e for e in candidates if residency.IsResident(e, requestedModel) == (true, true)]
    if warm is non-empty:
        return base.Select(warm, requestedModel)    # delegate within warm set
    return base.Select(candidates, requestedModel)  # nobody warm → normal behaviour
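
The selection logic above could be sketched in runnable form roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: `warm_first`, `first_available`, `is_resident` and the endpoint names are hypothetical, and the `(True, True)` result is assumed to mean (resident, state known), which the pseudocode does not spell out.

```python
from typing import Callable, Optional, Sequence

# Hypothetical types for illustration only.
Endpoint = str
# Assumed to return (resident, known); known=False means residency could not be determined.
ResidencyFn = Callable[[Endpoint, str], tuple[bool, bool]]
SelectFn = Callable[[Sequence[Endpoint], str], Optional[Endpoint]]


def warm_first(base: SelectFn, is_resident: ResidencyFn) -> SelectFn:
    """Wrap a base selector so it prefers endpoints where the model is resident."""

    def select(candidates: Sequence[Endpoint], model: str) -> Optional[Endpoint]:
        # Keep only endpoints positively known to have the model loaded.
        warm = [e for e in candidates if is_resident(e, model) == (True, True)]
        # Delegate within the warm set, or fall back to all candidates if nobody is warm.
        return base(warm if warm else candidates, model)

    return select


def first_available(candidates: Sequence[Endpoint], model: str) -> Optional[Endpoint]:
    """Stand-in base balancer: pick the first candidate, if any."""
    return candidates[0] if candidates else None


# Toy residency data: endpoints "b" and "c" have "llama3" loaded.
loaded = {("b", "llama3"), ("c", "llama3")}


def is_resident(endpoint: Endpoint, model: str) -> tuple[bool, bool]:
    return ((endpoint, model) in loaded, True)


select = warm_first(first_available, is_resident)
print(select(["a", "b", "c"], "llama3"))
print(select(["a", "b", "c"], "mistral"))
```

Because the wrapper only narrows the candidate list before delegating, the base balancer's own policy (round-robin, least-connections, and so on) still decides among the warm endpoints.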

Labels

enhancement, routing
