Skip to content

Multi-GPU model serving and opencode integration#267

Open
tobocop2 wants to merge 90 commits into
mainfrom
feat/local-model-api
Open

Multi-GPU model serving and opencode integration#267
tobocop2 wants to merge 90 commits into
mainfrom
feat/local-model-api

Conversation

@tobocop2

@tobocop2 tobocop2 commented May 17, 2026

Copy link
Copy Markdown
Owner

Problem

lilbee ran its models on a single GPU, and nothing outside lilbee could use them. A coding agent like opencode had no way to reach lilbee's local models, and lilbee couldn't call tools at all. Indexing a large library was strictly one card's worth of work at a time.

What this adds

Models run across all your GPUs

lilbee now spreads its models across every GPU on the machine instead of running on a single card. A model too big for one card tensor-splits across several; smaller roles pack onto the cards that have room. The whole box goes to work, so bigger models and heavier workloads have somewhere to run, and the fleet serves several requests at once instead of one at a time.

Bulk ingest fanned across the whole box

For indexing a large scanned library, lilbee can run several copies of the OCR and embedding models, one per spare GPU, and spread the work across all of them. A single lilbee sync then saturates every card instead of funneling through one, which is the difference between days and hours on a big corpus. Set vision_replicas / embed_replicas and lilbee places one server per GPU, load-balances pages and embedding batches to the least-busy one, and scales the ingest pipeline's file concurrency to match so the extra cards never sit idle.

opencode can use lilbee's models

lilbee speaks the OpenAI API. Point opencode, or any OpenAI-compatible client, at lilbee and it uses your local models directly: chat, tool calls, and lilbee_search over the same connection.

Tool calling

lilbee's models can call tools now. That is what makes the opencode integration useful: the model searches your library and acts on the results instead of only chatting.

Feedback while models load

lilbee launch opencode (and the other launchers) now show real progress while a cold model loads: a byte bar as the weights are read, then an engine-load spinner, then a clear handoff to the client. Before this it sat silent for what could be minutes on a large model. Once a server is up, repeat launches reuse it and start fast.

Better answers and longer conversations

The RAG path gained grounded refusal (it says when the library has nothing relevant instead of guessing), a context budget so retrieved chunks and history fit the model's window, and a cited-subset so a grounded answer is distinguishable from an off-corpus one. Long agentic conversations no longer overflow a native model's context: history is windowed to the served window instead of failing the turn. Retrieval quality improved too, with asymmetric query embedding for instruction-tuned embedders and LLM-based rerankers served alongside cross-encoders.

Indexes images, not just PDFs

Scanned images now route through the same vision OCR path as PDFs, so a folder of page scans indexes the same way a document does.

Architecture

Each request routes to a hosted model or to the local engine, which runs your models as a fleet of llama-server processes spread across the GPUs (one per role: chat, embedding, reranking, vision; plus extra replicas of a role when configured). A llama-swap supervisor owns the process lifecycle, and gguf-parser sizes each load so the fleet packs onto the available memory.

flowchart TD
    CLI[CLI] --> SVC
    TUI[TUI] --> SVC
    MCP[MCP server] --> SVC
    API[HTTP API] --> SVC

    SVC["App services · RAG pipeline<br/>(embed → rerank → chat)"] --> FAC{{"create_provider<br/>config.llm_provider"}}

    FAC -->|REMOTE| SDK["SdkLLMProvider"]
    FAC -->|AUTO| ROUTE["RoutingProvider<br/>(route by model-ref prefix)"]
    ROUTE -->|"hosted ref"| SDK
    ROUTE -->|"local GGUF ref"| FLEET["FleetProvider"]
    SDK --> CLOUD[("Hosted<br/>OpenAI-compatible API")]

    subgraph ENGINE["Local engine · managed llama-server fleet (one process per role, plus replicas)"]
        direction TB
        CHAT["chat · --jinja<br/>chat + native tool calls"]
        EMB["embed · --embeddings<br/>(×N replicas)"]
        RR["rerank · --pooling rank"]
        VIS["vision / OCR · --mmproj<br/>(×N replicas)"]
    end

    FLEET -->|"spawn · /health · restart<br/>+ localhost OpenAI HTTP"| ENGINE
Loading

On one GPU it is a fleet-of-one; on a multi-GPU host the supervisor packs the servers across cards by memory and, with replicas set, fans a role across the spare cards. The engine installs with lilbee, so there is nothing extra to set up.

Performance

Profiling a 595-page scanned PDF on a single GPU showed where OCR time actually went, and the fixes roughly halved it (this run used one card, not the replica fan-out):

Metric Before After
595-page scan, indexed end to end on one GPU 36 min 18 min
OCR pages in flight at once 1 (strictly sequential) up to vision_ocr_concurrency (4) per server, batched on the vision server's slots
Worst page (a repetition runaway) 29,779 chars over 51 s, uncapped capped at vision_ocr_max_tokens (4096)

Rasterization turned out to be negligible (1.9% of wall time); the real costs were a single-sequence decode that left the GPU about half idle and the occasional page that looped into a repetition, which the two changes above address. Embedding also regained a token/sequence-aware sub-batching step, so a large library no longer overruns the embed server's batch limit while indexing. Those numbers are a single GPU with one vision server; with vision_replicas the same OCR work fans across every card on the box on top of this per-GPU gain, so a multi-GPU host scales it further.

Supported models

These families were verified end to end through the local engine, so opencode can chat and call tools against them:

Family Verified ref
Qwen3 Qwen3-4B, Qwen3-Coder-30B-A3B
Llama 3.1 Meta-Llama-3.1-8B-Instruct
Gemma 4 gemma-4-E2B-it
GLM GLM-4.5-Air
MiniMax MiniMax-M2
gpt-oss gpt-oss-120b

Other families were tried but don't tool-call reliably through the engine yet: some ship no tool-call template (Hermes 3, GLM-4-9B), some never dispatch a tool in practice (functionary, the small DeepSeek-R1 distills, Granite, Phi-4-mini), and Mistral Nemo still errors on the fleet path for some message sequences. Very small or reasoning-only models may also decline to call tools.

@tobocop2 tobocop2 changed the title Local-model API: opencode (and any OpenAI/Anthropic client) talks to lilbee models Local-model API: popular clients can talk to lilbee models May 17, 2026
@tobocop2 tobocop2 changed the title Local-model API: popular clients can talk to lilbee models feat: Local-model API: popular clients can talk to lilbee models May 17, 2026
@tobocop2 tobocop2 marked this pull request as draft May 18, 2026 07:37
tobocop2 added a commit that referenced this pull request May 20, 2026
…lidated PR

Four-round audit loop on the full feat/local-model-api surface (PR #267)
found and addressed 22 findings across cross-cutting-api-change and
AGENTS.md compliance. Round 4 converged with zero findings.

Significant changes:

* Response-parser schemas lazy-load via @functools.cache get_schemas()
  instead of running 20 JSON file reads at module import on every
  parent-process bootstrap. Saves ~50ms off every CLI invocation.
* Native supports_tools tightened: the substring set could match
  "tools" anywhere in a chat template (false-positive routing of tool
  requests to incapable models). Now an anchored Jinja-delimiter regex.
* ContextWindowExceededError typed fields (requested / usable_budget /
  n_ctx) survive the worker-to-parent round-trip. Parent parses the
  message via regex when rebuilding the typed exception.
* /api/chat and /api/ask routes translate ModelNotFoundError,
  ModelDoesNotSupportToolsError, and ContextWindowExceededError into
  typed HTTP responses (404/400/400) instead of bubbling to 500.
* _classify_stream_error now uses CompletionsErrorCode enum values
  instead of magic strings.
* Schema-loader, _AbortBridge, and other constant docstrings trimmed
  of multi-paragraph design rationale per AGENTS.md.
* Historical narrative ("Defends against pre-fix manifests", "Without
  this, ...") removed from docstrings.
* pyproject.toml version aligned with main (b479).
tobocop2 added a commit that referenced this pull request May 22, 2026
…ate self-tune demo

- Built on expanded: llama.cpp / llama-cpp-python and Hugging Face Hub /
  huggingface_hub elevated with explicit 'without these there is no lilbee'
  framing. Tesseract, Litestar, MCP SDK, Typer, Pydantic, and tree-sitter-
  language-pack added; they were all load-bearing but uncredited.
- Quick-start 'agent already knows what chunk size, MMR weight, and reranker
  depth do' was implementation-detail leakage. Reframed as 'agents already
  understand search engines, so the right knobs to move are obvious to them.'
- Self-tune demo moved out of 'Grounding for AI agents' (its placement there
  was awkward) and into 'Already using an MCP-aware agent? Hand setup to it.'
  where the topic is exactly agent-driven retrieval tuning. The demo is now
  framed as 'coming with the opencode integration (#267)' since the launcher
  side of the integration isn't shipped yet.
tobocop2 added a commit that referenced this pull request May 22, 2026
…ate self-tune demo

- Built on expanded: llama.cpp / llama-cpp-python and Hugging Face Hub /
  huggingface_hub elevated with explicit 'without these there is no lilbee'
  framing. Tesseract, Litestar, MCP SDK, Typer, Pydantic, and tree-sitter-
  language-pack added; they were all load-bearing but uncredited.
- Quick-start 'agent already knows what chunk size, MMR weight, and reranker
  depth do' was implementation-detail leakage. Reframed as 'agents already
  understand search engines, so the right knobs to move are obvious to them.'
- Self-tune demo moved out of 'Grounding for AI agents' (its placement there
  was awkward) and into 'Already using an MCP-aware agent? Hand setup to it.'
  where the topic is exactly agent-driven retrieval tuning. The demo is now
  framed as 'coming with the opencode integration (#267)' since the launcher
  side of the integration isn't shipped yet.
tobocop2 added a commit that referenced this pull request May 22, 2026
…al-coming section

The MiniMax M2.7 demos in 'Agent integration' weren't obviously cloud
demos at a glance, which understated the lilbee-stays-local benefit and
also gave the wrong impression that local opencode was a shipped story.
Make it loud that opencode is driving a cloud model in both demos and
link up to 'Opencode integration (coming)' where local-model support
is in flight (#267). Drop the specific 'MiniMax M2.7' name in line with
the no-named-models framing.
@tobocop2 tobocop2 changed the title feat: Local-model API: popular clients can talk to lilbee models feat: local-model API + llama-server engine (opencode + multi-GPU fleet) May 27, 2026
@tobocop2 tobocop2 changed the title feat: local-model API + llama-server engine (opencode + multi-GPU fleet) Multi-GPU model serving + OpenAI-compatible API for opencode May 27, 2026
@tobocop2 tobocop2 changed the title Multi-GPU model serving + OpenAI-compatible API for opencode Re-architect local inference onto a multi-GPU llama-server fleet, with opencode integration May 27, 2026
@tobocop2 tobocop2 changed the title Re-architect local inference onto a multi-GPU llama-server fleet, with opencode integration Multi-GPU model serving and opencode integration May 27, 2026
@tobocop2 tobocop2 force-pushed the feat/local-model-api branch 2 times, most recently from 5b843d7 to 559c777 Compare June 4, 2026 05:09
@tobocop2 tobocop2 force-pushed the main branch 3 times, most recently from a5d1a07 to b399947 Compare June 4, 2026 23:37
@tobocop2 tobocop2 marked this pull request as ready for review June 6, 2026 03:26
tobocop2 added 9 commits June 6, 2026 16:01
…f-parser

Run native GGUF models as a fleet of llama-server processes (one per role:
chat/embed/rerank/vision), supervised by llama-swap and VRAM-sized by
gguf-parser. A model tensor-splits across GPUs and roles pack by capacity;
data-parallel replicas fan a role across every card. The engine binaries ship
in the lilbee-engine wheel built by tools/wheel-build.
Route each request by model ref: native GGUF refs go to the local fleet,
remote-prefixed refs to the OpenAI-compatible SDK backend. Adds model
discovery and role classification for installed and remote models.
Serve /v1 chat-completions and embeddings so any OpenAI-compatible client can
use lilbee's local models, with tool-call extraction and a canonical chat
dispatch path shared by the HTTP API, CLI, and MCP.
Launch opencode (and similar agents) against lilbee as their model provider,
with the lilbee_search skill/MCP wired in and the built-in client tools scoped
off so the model retrieves through lilbee.
…, page-text dataset

Fan OCR pages and embeddings across the replica fleet during a single sync,
cache per-page OCR so a downstream failure doesn't re-OCR, batch LanceDB
writes across documents, and record a per-page text dataset alongside chunks.
Real-hardware fleet smoke test, opencode model matrix, and CI that builds the
engine binaries and runs the integration suite against the fleet.
@tobocop2 tobocop2 force-pushed the feat/local-model-api branch from 132938c to 8381ab4 Compare June 6, 2026 20:03
tobocop2 added 2 commits June 6, 2026 16:14
token_cap equalled the per-slot context, so an exact-truncated oversize input
plus the server's re-added BOS overflowed by a token and the overflow retry
still failed. Truncate a few tokens under n_ctx.
opencode 1.16.2 rejects the whole config with ConfigInvalidError 'Missing key
limit.output' when a model's limit has only context, so it crashed on startup
and the matrix's opencode session died. Emit both context and output.
tobocop2 added 30 commits June 19, 2026 16:12
…mpty (#394)

A container orchestrator (SkyPilot, some Docker/Kubernetes GPU setups) can export
an empty CUDA_VISIBLE_DEVICES while exposing the GPU through NVIDIA_VISIBLE_DEVICES.
The empty value reads as no devices, so the fleet device probe reports no
CUDA-capable device and the ingest fails on a host that has a working GPU.

Clear an empty backend visible-devices var during fleet GPU bootstrap when a GPU is
physically present (detected via nvidia-smi/NVML, independent of the var). Explicit
pins and the conventional -1 CPU opt-out are non-empty and survive.
The store swallowed delete failures inside delete-then-add, silently producing duplicate or false-success state (memory add/update/delete, the sources-table replace, and the identity row). The delete helper now reports success and every caller gates the add or reports the real result; identity reads take the newest row. Narrows the wiki draft handlers to a dedicated PathTraversalError so genuine downstream errors are not masked, and rounds out the SSRF embedded-IPv4 coverage. Reviewed to convergence; 100% coverage.
…396)

_max_concurrent only raised file-concurrency to the vision OCR capacity when
vision_replicas > 1. On a single-GPU fleet it fell back to cpu_quota (cpu_count//2),
so a many-core host fanned dozens of concurrent OCR requests at one vision server's
few continuous-batching slots. The server returns HTTP 429; the bounded RATE_LIMIT
retries exhaust under the sustained oversubscription and the file is dropped (observed
live: a 1x A100 dropped 643 files to 429 while the GPU sat at 97%).

Bound file-concurrency by the vision slot capacity (replicas x per-server slots)
whenever a vision model is configured, single server included. This also fixes a
latent multi-GPU flood where cpu_quota exceeded the total slots.
Harden lilbee for concurrent, always-on operation, fixing a class of races that
only appear when many clients share one process:

- Refuse the operations that rebuild the shared engine (vault init, factory
  reset, and a model-provider switch via settings_set / settings_reset / the
  REST config handler) on the HTTP server, where they would tear the singleton
  down under other clients' in-flight requests; the CLI, TUI, and stdio MCP keep
  them. Services swap their reference before closing the old instance.
- Move per-request OCR settings to a ContextVar that propagates into worker
  threads, instead of mutating the global config.
- Guard the lazy provider, spaCy, and HuggingFace xet globals so concurrent
  first-callers can't double-initialize or clobber them.
- Resize the vision-request cap only when the gate is idle, so a mid-run change
  can't momentarily double the real concurrency.
- Hand streaming answer tokens back to the event loop from the worker thread
  safely.
- Remove an edited-to-empty file's stale chunks from the index on sync.
- Coerce string-valued chunk settings before validation so MCP callers can't
  skip the guards.

Full test coverage including the concurrency behaviors.
…398)

* QA harness: fail ungrounded answers, auto-approve reel edits

Two frame-validation findings from the full-matrix + reel run:

- The scenario pass-gate accepted ungrounded answers: a fresh lilbee_search
  dispatch plus a chat completion passed even when the model answered
  'not found in the indexed reference' / 'no documentation exists'. The tier
  prompts target content the indexed Godot reference does contain, so that is
  a real failure (retrieval gap or the model ignoring results), not a clean
  pass. Add an ungrounded-answer marker check to _poll_verdict so those FAIL,
  with unit tests. This stops the matrix from greenlighting demos in which
  lilbee appears to find nothing, and surfaces the underlying retrieval gap
  instead of masking it.

- The coder/giant reels never wrote their artifact: an automated VHS reel has
  no human to satisfy opencode's default 'ask' edit permission, so file writes
  silently never happened and the model narrated the code instead. Add
  permission.edit=allow to the reel's opencode.json so writes auto-approve.

* Harden the ungrounded-answer gate per review

- Apply the check only to a would-be-PASS verdict (a fresh dispatch AND chat
  completion), so a model answering 'not found' from memory without dispatching
  is not mislabeled as an ungrounded-search failure; it keeps waiting and times
  out on the accurate no-dispatch detail.
- Scan only the answer tail, not the whole cumulative pane, so a transient
  'not found' from mid-search reasoning cannot false-fail a recovered grounded
  answer.
- Narrow the markers to the unambiguous 'search found nothing in the reference'
  phrases; drop the generic ones that can appear in a good answer scoping a
  negative.
- Match the sibling marker checks' case handling (m.lower()), and add a
  no-dispatch test.

* Move the grounding gate to the settled pane (fix verdict-timing race)

Round-2 review caught that the in-verdict gate was mistimed: run_scenario
returns the instant dispatch+completion trips, but the harness is designed so
the answer is still streaming then (it renders in wait_for_answer_settle,
after the verdict is fixed). So the gate scanned mid-search reasoning, not the
final answer -- under-firing the real 'not found' answer and re-introducing
the transient false-positive.

Extract the check into a pure downgrade_if_ungrounded(result, settled_pane) and
apply it in run_smoke_scenarios right after wait_for_answer_settle, on a fresh
capture, so grounding is judged on the rendered answer. Tests now inject a
settled pane (how the harness calls it) and cover the no-op-on-non-PASS and
tail-scan cases.
Make the transaction-less store's write paths atomic and lock per-instance:

- Import each source through one locked transaction (cleanup + chunks + page
  texts + source row), with the embedding/dimension checks before the delete, so
  a failure or a concurrent reader never sees the source half-replaced.
- Re-embed memories with the snapshot, embed, and table rebuild all under the
  write lock, so a memory added mid-rebuild is no longer erased.
- Rebuild concept clusters via a single delete-and-add-under-one-lock helper, so
  readers never see the nodes table empty while the graph reports present.
- Key the write lock on each store's own data directory (a per-instance engine
  coordinated on the wrong file before) and share one timeout budget across its
  two stages instead of applying it twice.
- Report a file that yields page text but no searchable chunks as skipped rather
  than added, while still persisting its page text.
- Count only published pages as backlinks in the wiki orphan linter (drafts and
  archived pages no longer exempt a live page).

Full test coverage including the atomicity and locking behaviors.
…-main-20

# Conflicts:
#	pyproject.toml
#	src/lilbee/app/settings_map.py
#	src/lilbee/core/config/enums.py
#	src/lilbee/core/config/model.py
#	src/lilbee/mcp_server.py
#	tests/test_catalog_actions.py
#	tests/test_settings.py
#	tools/wheel-build/build_lilbee_binary.sh
#	uv.lock
* Evict ingest-lock registry entries on release

try_acquire created a permanent dict entry per distinct filename and release
never removed it, so a long-lived server ingesting many files grew one
asyncio.Lock per basename for its whole life. Release now drops the entry; it is
safe because try_acquire only acquires a free lock (these locks never have
waiters) and release runs synchronously, atomic with try_acquire on the loop.

* Suppress ProviderError in the read-only ready probe

_ready_models reaches endpoint(), which raises ProviderError when a concurrent
shutdown has cleared the port. role_ready is a read-only probe (HTTP status, SSE
warming), so the probe now reports nothing-ready instead of throwing.

* Make CUDA LD_LIBRARY_PATH composition idempotent

cuda_runtime_env always prepended the wheel dirs to the existing path, so
re-applying it on every reload pass accumulated duplicate copies that were baked
into each spawned server. Drop existing entries that are already wheel dirs.

* Close the old client pools when re-adopting a reloaded swap

_adopt_swap overwrote self._clients without closing the previous clients, so
every reload (TUI/REST/CLI apply) abandoned one httpx pool per replica per role
with open sockets. The reloaded swap already stopped the old upstreams, so the
old clients are closed in place.

* Terminate the chromium install subprocess on cancel

bootstrap_chromium awaited the ~180MB 'playwright install' with no try/finally,
so an SSE client disconnect cancelled the task and left the install running
orphaned (repeated retries stacked concurrent installs). Wrap the wait in a
finally that terminates then kills the process.

* Free the SDK vision OCR caller at the timeout deadline

The timeout path used 'with ThreadPoolExecutor', whose __exit__ shutdown(wait=
True) blocked until the hung call returned, so the caller was not freed at the
deadline. Shut the pool down without waiting (cancel_futures), matching the
fleet OCR path.

* Account for every shard of a split GGUF

A split GGUF was registered with one manifest tracking only the first shard, so
removal freed only the first shard's blob (orphaning the rest when a sibling
quant kept the repo cache alive) and reported only the first shard's size as
freed. The manifest now records the total size and all shard blob digests;
removal gc's every shard and freed size uses the total. Multi-shard download
progress reports one monotonic 0->100% against the real total instead of N
separate per-shard cycles.

* Forward stream close through the reasoning and bare-JSON wrappers

On reasoning cap-fire, filter_reasoning closed the _text_only generator, but its
plain for-loop did not propagate GeneratorExit to the underlying chat stream, so
the HTTP connection and its in_flight slot leaked until GC. _text_only and the
sibling _recover_bare_json_stream now close their source on exit.

* PR5 coverage: fix bootstrap test fakes, drop dead base-bar callback, cover missing-shard

Add returncode to the existing bootstrap _Proc fakes (the new terminate path
reads it), move the now-superseded callback out of the base progress bar (the
tracker subclass owns it), and cover the missing-shard branch in _shard_accounting.

* Address PR5 review: report full split-GGUF size in list/show; guard stream close

The list and show surfaces still read the first-shard size_bytes after the
removal path moved to disk_size_bytes; switch them too so a split GGUF reports
its real on-disk total everywhere. Suppress teardown errors when forwarding the
bare-JSON stream close (matching the reasoning wrapper), drop a redundant import,
tighten a comment, and gate the bootstrap-cancel test on the task actually
reaching proc.wait().

* Address PR5 review round 2: drain before close, real shard digests, recover legacy shards

Round-2 review caught regressions in the PR5 fixes:
- _adopt_swap now drains each old client's in-flight requests before closing it
  off-thread, so a reader mid-call on an old client snapshot is never closed out.
- _shard_accounting uses _blob_digest (handles copy/non-symlink mode) instead of
  the snapshot filename, and the cache-recovery path records shard accounting too,
  so split GGUFs free every shard outside the symlinked-cache case.
- Removal recovers shard blobs from the cache for pre-accounting manifests.
- Cap the post-kill reap wait in the crawler bootstrap.

* Address PR5 review round 3: REST freed size, progress total guard, drain grace

REST DELETE now delegates to the shared remove path so it reports the real
multi-shard freed size (was a hardcoded 0) like CLI and MCP. Multi-shard
download progress only uses the summed grand total when every shard size is
known (else falls back to per-shard, never exceeding 100%). The reload client
drain waits a short grace so a just-checked-out reader's request enters its
in-flight window before the drain inspects it; note the wedged-OCR thread relies
on the backend httpx timeout.

* Strengthen PR5 review-flagged tests (shard gc passthrough, lock eviction)

* Reserve streaming in_flight eagerly so reload drain can't close mid-handoff

A stream increments in_flight lazily on its first frame, but a stream is built
and handed off (SSE handshake, MessageStart) before that first next(), so a
concurrent reload's client drain saw in_flight==0 and could close the client out
from under a checked-out stream. Reserve the slot at chat/chat_stream_items call
time and release it when the stream closes or exhausts.

* Close reload-retired clients by generation, not by in_flight timing

The in_flight-based drain was racy: in_flight is incremented at request time, so
a checked-out-but-not-yet-issued request (especially a non-stream chat doing
prompt windowing, or a stream before its first frame) read as idle and could be
closed out from under, and eager stream reservation leaked when a stream was
closed before its first iteration. Replace it: a reload retires its old clients
and closes the *previous* reload's retired clients (idle by now, confirmed via
in_flight==0); shutdown closes whatever remains. Revert the eager-reservation
streaming changes (the drain no longer keys on in_flight at close time).

* PR5 review round 6 nits: legacy split freed size, sentinel constant, docstrings

Report the true on-disk total when removing a legacy (pre-accounting) split GGUF
so freed size matches what was actually deleted; use the _SIZE_UNKNOWN sentinel
in the progress-total guard; correct the _ProgressTracker wrapper/subclass
wording.

* Cover the legacy-shard recovery error fallback

* Drop bead-id and review-round references from test comments

Remove bb-* issue ids and '(review round N)' annotations from test comments and
docstrings repo-wide; they were tracking noise, not behavior. No test logic changes.
…ring (#404)

* Fix retrieval-correctness defects in scoring, scope, and clustering

bm25_probe rows never populated relevance_score because LanceDB FTS keys the
score column as _score while SearchChunk only aliased _relevance_score. The
confidence-based expansion-skip therefore always read 0 and never fired, and
its tests mocked values the probe could never return. SearchChunk now accepts
either alias, with a real-probe integration test.

Structured-query prefixes (term:/vec:/hyde:) silently dropped an explicit
chunk_type scope: only the wiki/raw branch honoured it. bm25_probe gained a
chunk_type filter and the vec/hyde branches now thread the scope through.

Concept boost mutated scores but returned store order, so the boost was inert
for callers that consume search() order directly (CLI search, MCP search). The
boosted results are now re-sorted by relevance.

The reranker treated the RRF fusion score as if it were a [0,1] confidence.
Real RRF magnitudes (~0.03) made the fusion term negligible in the blend and
left the BM25-protection pin permanently dead (compared against a 0.8
threshold). The fusion signal is now min-max normalized across the candidate
set like the reranker scores already were, so a strong hybrid hit keeps real
weight and resists demotion; the now-redundant pin is removed. A perfect
vector match (distance 0.0) is no longer misread as a falsy 0.5 fusion signal.

Concept PMI was computed per file with one file's chunk count as the
denominator and the per-file weights were then summed, which inflates pairs
spread across many small files and is not corpus PMI. Edges now store raw
co-occurrence counts; rebuild_clusters computes PMI once over corpus-wide
counts before Leiden clustering.

* Address review: scope the main-search path and sigmoid-normalize the BM25 probe

The alias fix activated the previously-dead expansion-skip path but fed it raw,
unbounded BM25 scores against a [0,1]-bounded threshold (the config calls it a
"sigmoid-normalized BM25 score" but no sigmoid was ever applied), so expansion
would skip on nearly every query. _should_skip_expansion now squashes the probe
score through a logistic before comparing, and its tests use realistic raw-BM25
magnitudes instead of synthetic [0,1] values.

The chunk_type scope was threaded through the structured term:/vec:/hyde: paths
but missed the non-structured main path: the HyDE merge and the confidence probe
both searched the mixed pool for a scoped query, leaking out-of-scope chunks into
the HyDE merge. _merge_hyde_results and _should_skip_expansion now thread the
scope, mirroring _merge_variant_results.

* Address review round 2: cohort-normalize reranker fusion, clamp excerpt relevance

The reranker min-max normalized the fusion signal over the whole candidate set,
but on the default HyDE-enabled path that set mixes two incomparable scales:
hybrid rows carry a tiny RRF relevance_score (~0.02) while HyDE rows carry a
cosine distance (1 - distance ~0.5). Normalizing them together let a HyDE chunk
dominate purely as a scale artifact and demote a strong hybrid top. The fusion
signal is now min-max normalized within each scoring family (RRF vs cosine)
separately, with a mixed-cohort test.

The _score alias also let a raw, unbounded BM25 score reach Excerpt.relevance
(via term: queries through /api/search), which is documented as a [0,1] value.
_to_excerpt now clamps it.

Also added a test asserting build_concept_records emits raw co-occurrence counts.

* Address review round 3: derive corpus PMI from chunk_concepts, tidy reranker

The concept edge table is append-only and not source-scoped, and its weight
column's meaning changed in this PR (PMI -> raw count). Deriving corpus PMI by
summing edge weights would therefore corrupt clustering on an in-place upgrade
(legacy PMI floats summed as counts) and double-count co-occurrences when a
source is re-ingested. rebuild_clusters now computes co-occurrence, concept
frequencies, and the chunk count from the chunk_concepts map, which is
source-scoped and schema-stable, so PMI stays correct across re-ingests and
upgrades regardless of what the edge weights hold.

Extracted the reranker's neutral-score 0.5 into a named constant, reused the
existing probe-result minimum for the confidence probe's top_k, and corrected
two reranker docstrings that still described whole-set (rather than per-family)
fusion normalization.

Also wrapped one pre-existing over-length call in pipeline.py that was tripping
the lint gate.

* Address review round 4: isolate the BM25 score, share per-family fusion ordering

The root cause behind the recurring dual-scale findings was that the _score alias
made one relevance_score field hold three incomparable scales (RRF fusion ~0.03,
raw BM25, and None for vector rows). Raw BM25 is now kept in its own bm25_score
field, so relevance_score stays fusion-scale only: the min_relevance_score filter,
excerpt relevance, and weighting all see a consistent scale again, and the
_to_excerpt clamp added earlier is no longer needed. Only the expansion-skip
confidence probe reads bm25_score (still sigmoid-squashed).

The concept-boost re-sort had reintroduced the RRF-vs-distance scale collision the
reranker already fixed: it sorted hybrid and HyDE rows together by a raw key, so a
HyDE recall's larger 1-distance could outrank a strong hybrid hit. The reranker's
per-family normalization is now a shared helper (order_by_fusion in dedup), used by
both the reranker and the boost re-sort.
…am (#406)

Bring every transport in line with the reference (non-streaming / in-process) path.

- MCP crawl tool validates off-loop and schedules on-loop (was dead); HTTP /api/crawl mirrors the off-loop validation so neither async transport stalls the loop.
- Streaming /v1/chat/completions reports finish_reason 'length' on max_tokens truncation via a new StreamFinish terminator frame in the provider stream contract; fleet client and SDK provider emit it, dispatch maps it without downgrading tool-call streams.
- ask/chat SSE handlers emit the cited subset with a fall-back to the full retrieved set, and refuse the no-embedder case by streaming the refusal as a normal answer token (mirrors Searcher.ask_stream) instead of an SSE error.
- mcp and agent-config commands accept and apply --data-dir/--global.
- wiki status counts every content subdir; an accepted unpaired drift draft is restored to its own page type via an origin marker; wiki prune preserves subdirs; concept/entity collision guard widened.
- OpenAI stream_options.include_usage honored; first streamed delta always carries the assistant role.
- code chunker line off-by-one fixed; SDK chat_with_tools routing added; models-pull arch precheck returns 409 at the route.

Covers bb-ziks.3/.25/.66/.64/.67/.36/.63/.75/.35/.68/.65/.31/.46/.69 and bb-7jg1.9/.18/.21/.24. Review converged over two rounds (9 findings then 1 parity gap, all fixed). 100% coverage on changed lines.
…el-safety (#407)

Fix data corruption and loss across the ingestion paths.

- Code chunker consumes tree-sitter's size-bounded chunks: content is the already-extracted UTF-8 text (no byte-offset slicing that corrupted non-ASCII source), honors chunk_max_size, lists nested symbols, drops the literal 'None' header for anonymous symbols, and carries only the relative source name (no absolute-path leak). Line ranges are derived from each chunk's own line count so they stay correct with or without a trailing newline.
- enable_ocr=False disables OCR entirely (vision and Tesseract); multi-frame TIFF/GIF images OCR every frame; chunk_text is offloaded off the event loop; the OCR cache key includes the per-page timeout.
- The ingest pipeline keeps draining a cancel batch so a sibling that completed in the same batch still flushes before the cancel propagates.
- Remote embeddings are ordered by the response index; SQL predicates stop doubling backslashes (Datafusion treats them literally).

Covers bb-7jg1.4, bb-7jg1.10, bb-ziks.18/.19/.20/.21/.22/.39/.60/.61/.62. Review converged over two rounds; 100% coverage on changed lines.
…over, placement (#409)

Resolve native slash GGUF refs and follow redirects so the unsupported-arch pull guard fires; read general.architecture by walking the GGUF KV table directly (validated byte-for-byte against gguf-py); make read_gguf_metadata return None on any parser failure instead of aborting the fleet build; route vision OCR through replica failover; order discrete single-model placement search-first; raise on out-of-range device pin; offload heavy wiki routes off the event loop.
…filenames, seeded Leiden (#411)

Stop stringifying TOML values so list config fields load natively; write the opencode setup marker atomically; fold a query hash into crawl filenames so distinct queries don't collide (with a robust .md guard for degenerate paths); seed Leiden for reproducible clusters.
…nk invariant (#414)

Add a newline-splitting validator for crawl_browser_extra_args so a persisted value no longer discards the whole config.toml on reload; reset removes stray symlinks instead of aborting; settings rejects lowering chunk_size below the existing overlap; flash_attention warns on bad input; stop persisting derived embedding_dim; widen settings rollback to parse errors; minor DRY/docstring fixes.
…nt-config parity (#415)

Validate wiki-lint source within wiki_root (blocking); stop the spawned serve under a finally; apply max_distance + cap top_k + reject empty queries in CLI search/ask; JSON error envelope for --task; agent-config includes the remote chat model and pins opencode defaults; atomic opencode skill install; memory remove exit code; server probe dedup + log-handle close; sync counter lock; self-check temp cleanup; warm-budget contract.
…ard helpers (#416)

* catalog: assign a selected remote/frontier model to its task's role, not always chat_model

* TUI fixes: per-tab list cache key, single catalog nav, single role reload, scan logging

- _append_more_hf_to_list updates the per-tab _list_cache_keys entry instead of a
  dead singular attribute, so a later _refresh_list sees the appended rows.
- /catalog uses switch_view only (was stacking a second orphaned CatalogScreen).
- settings model-pick passes reload_worker=False so the role server reloads once,
  not twice.
- setup _scan_installed_models logs before swallowing.

* TUI settings: per-key list defaults + regex validation only for regex lists

- _on_list_restore restores each key's own default via get_default (was always
  writing the crawl_exclude_patterns default for every LIST_COLLAPSED setting).
- regex validation runs only for settings marked validate_regex, so the
  crawl_browser_extra_args flag list isn't rejected as 'invalid regex'.
- model-pick reload now happens once (test updated for the new reload path).

* TUI: single-source the app window title via msg.app_title (consistent separator)

* TUI: narrow call_from_thread to shutdown errors so genuine callback bugs surface

* TUI: single-source the nav-view universe (messages.ALL_NAV_VIEWS)

The view set was encoded in three places with a duplicated wiki gate. Define
ALL_NAV_VIEWS once; get_nav_views() gates Wiki, app.get_views() derives its
factory map from get_nav_views(), and the status bar composes ALL_NAV_VIEWS.

* TUI wiki: debounce search so the tree re-walks once on pause, not per keystroke

* TUI nits: settable-key autocomplete filter, fail-closed spacy check, logging, constants

- /set autocomplete (suggester + autocomplete) offers only settable keys, so
  read-only wiki_dir is no longer suggested then refused.
- _spacy_available no longer fails open to 'available' on an unexpected error;
  it logs and reports absent so the install guidance still shows.
- suggester model/document lookups log on failure (parity with autocomplete).
- list-editor id uses EDITOR_ID_PREFIX; size-variant strip typed list[SizeVariant];
  focus-step extracted from a side-effecting ternary; docstring typo fixed.

* TUI nits: fit-chip grammar, canonical api-key field + role map, palette delete cache, frontmatter guard

- _render_fit_pill won't-run reads 'won't run, short by X GB' (was 'won't -X GB').
- catalog uses PROVIDER_API_KEY_FIELD; model_pick reuses MODEL_FIELD_TO_ROLE
  instead of a duplicate map.
- palette _delete_doc invalidates the doc cache and uses CMD_DELETE_SUCCESS.
- wiki _display_page coerces frontmatter via _safe_float/_safe_int so a
  non-numeric value can't crash the node-select handler.

* TUI nits: NATIVE_BACKEND constant + drop native pill, derive digit-tab map, distinct delete-read error

- Extract NATIVE_BACKEND; ModelCard drops the implied 'native' pill (parity with
  grid/list, which now use the constant too).
- catalog on_key derives the 1-6 -> tab index from ALL_TAB_IDS, not a rebuilt map.
- catalog_grouping sorts featured via a typed cast, not getattr-with-default.
- _populate_library_list drops a misleading contextlib.suppress(AttributeError).
- /delete store-read failure reports a distinct 'Could not read' error, not 'no documents'.

* TUI nits: remove dead wiki source helpers; stream timings as a named dataclass

- Drop unused WikiScreen._selected_source/_source_for_slug (no production caller)
  and their tests; drop the now-unused read_page module import.
- Replace the [last_flush, last_scroll] magic-index list with a _StreamTimings
  dataclass.

* TUI nits: shared estimate_min_ram_gb helper; narrow model_bar refresh suppress

- Extract estimate_min_ram_gb (catalog/models.py) so hf_client and the catalog
  install path share one RAM-from-size heuristic (was two diverging copies).
- RoleRow.refresh_state suppresses only NoMatches (children not mounted), not
  every exception, so a real pill/repaint failure surfaces.
- Update the model-pick role-map test to the canonical MODEL_FIELD_TO_ROLE.

* TUI nits: lru_cache the doc completion cache; set_setting download guard

- Replace the _doc_cache module global + global-statement writes with an
  lru_cache'd inner function (cleared by invalidate_document_cache).
- Extract _reject_if_downloading; set_setting now refuses a still-downloading
  model-role ref like set_active_model does.

* TUI nits: public ChatScreen.request_reset; per-batch taskbar failure count

- Extract ChatScreen.request_reset as the public reset entry; the palette calls
  it instead of reaching into the private _cmd_reset.
- TaskBar's FAILED flash counts only the just-finished batch's failures, not
  every failure in persistent history (which inflated the count across runs).

* TUI nits: extract shared catalog-card helpers (DRY across card/grid/detail)

Move the byte-identical name-truncation, spec/status pills, fit-level color map,
and verbose fit-pill renderer into catalog_card_shared; model_card, model_grid,
and catalog_detail import them instead of each carrying a copy. Also fixes the
grid drawer's stale 'won't -X GB' label (now shares the fixed renderer).

* TUI perf: cache ModelGrid card lines per cell so each card builds once per repaint

render_line is called once per terminal row, so each card was fully rebuilt
_CARD_HEIGHT times per repaint. Cache the built lines keyed by
(index, width, selected, border_style); cleared on set_rows/resize/highlight.

* test: align suppressed-error and stream-timings tests with narrowed contracts

* test: cover non-string download guard, out-of-range tab digit, and isolate crawl progress tests

* docs: clarify model_grid card-cache invalidation comment

* test: deterministic wiki search debounce test (remove inter-edit pause race)

* test: make wiki search debounce test fully deterministic (no real timer/wall-clock)
…cation, behavioral parity, event-loop offloads (#417)

* Server review fixes: catalog installed parity, ?source= 422, chunk_type decoder unification, stop-sequence forwarding, crawl unlimited sentinel, one-shot memory extraction, chat search-unavailable refusal, event-loop offloads, add-files lock partition

* test: update catalog kwargs, crawl unlimited sentinel, and reasoning-stream fixtures for server review fixes

* test: rename store-extracted-memories empty-answer test to match what it exercises
llama-swap was spawned with inherited stdout/stderr, so under a TUI or CLI parent
its HTTP access log (POST /upstream/embed-0/tokenize ...) printed straight onto
the terminal and corrupted the render. Capture its stdio to a per-owner
llama-swap.log instead; the per-model upstream logs are unaffected (those come
from llama-swap's /logs API, not its stdout).

Audited the other engine subprocesses while here: gguf-parser, the llama-server
--list-devices/ldd probes, nvidia-smi, and the managed lilbee serve already
redirect or capture their output; the foreground client (opencode) inherits the
terminal by design. llama-swap was the only leak.
* Make model swaps non-blocking with a progress indicator

Switching a model ran the multi-second fleet reload on the Textual event loop,
freezing (and sometimes crashing) the TUI. Two paths did it: a chat swap reset
services in apply_model_change, and an embed/rerank/vision swap reloaded the role
inline in model_pick._persist.

Both now run the reload in a thread worker behind a 'Switching model, loading...'
toast, with a 'Now using <model>' confirmation and an error toast on failure. The
chat reset still cancels the in-flight stream first and waits for workers to drain
(serialized so reset_services never runs twice at once). Updated the affected TUI
tests for the async behavior.

* Block chat input while a model swap loads

A non-blocking swap still left a window where a prompt sent mid-swap raced a
half-torn-down fleet. Add a swapping_model gate: apply_model_change disables the
chat input behind a 'switching' state, the worker resets AND warms the new model
(get_services eager-starts it) so unblock means loaded not just configured, and
the input re-enables with focus restored when the worker finishes or fails. A
submit attempted mid-swap is rejected with a clear toast.

* Review fixes: single async reload, guarded watcher, dead-class drop

Blast-radius review of the model-swap change found:
- Settings model picks reloaded the role twice: once in apply_model_pick's
  worker and again synchronously in the Settings on_done, which both doubled the
  fleet restart and re-blocked the UI thread. Make apply_model_pick the single
  (async) reloader; the Settings on_done now only repaints the button.
- watch_swapping_model touched the chat input unguarded; the unblock fires from
  the worker via call_from_thread and can land after navigating away, so guard
  the input access with suppress(NoMatches).
- Dropped a dead .swapping-model CSS class (the input disabled state is the
  visual); clarified the drain-wait rationale (it protects provider+store
  teardown, not just serialization).

* Convention fixes: name the persist worker, correct stale swap-worker comment

* Optimize chat swap: reload only the chat role, keep the store

The chat swap used reset_services(), which also closed LanceDB and rebuilt the
searcher/concept-graph -- none of which a chat-model change needs (the provider
reads cfg.chat_model late-bound; the Searcher never caches it). Every other role
swap already uses the lighter reload_role().

Switch the chat swap to reload_role(CHAT, wait=True): it re-plans and restarts the
fleet for the new model while keeping the store and searcher. wait=True is a new
synchronous path on reload_role (run in the caller's already-off-event-loop worker)
guarded by a threading.Condition so it returns only once the reload finishes and
preserves the existing single-flight semantics exactly. Threaded the param through
the provider protocol, both implementations, and the services pass-through.

* Oracle review: correct overstated 'model loaded' docstrings

diff-against-oracle found the swap docstrings claimed the model was loaded on
unblock, but reload_role(wait=True) returns when the fleet proxy is healthy; the
model loads lazily on its next request, same as every other role swap. The input
block correctly covers the proxy-restart window. Corrected the comments; no
behavior change (not warming after a reload is faithful to the non-chat
reload_role path this mirrors).

* Update services reload_role delegation test for the wait flag

* Update reload_role forwarding tests for the wait flag

The routing-provider forwarder and the services delegation now thread wait= to
the underlying reload_role, so the assertions must include it. Adds a wait=True
forwarding test for the chat-swap path.

* Fix CI: drop the swap drain-wait (hang), repair drifted tests

The full CI suite surfaced two failures the targeted runs missed:
- test_tui slash-model timed out: apply_model_change's drain-wait looped on
  call_later forever when a background worker never drained, spinning the event
  loop. Root cause + production risk (a long-lived worker would block any swap),
  so drop the drain entirely: reload_role keeps the store and the provider retires
  busy clients across the restart, and the stream is already cancelled, so the
  worker can start immediately. The provider single-flights overlapping reloads.
- test_worker_roles asserted settings.py re-exports MODEL_FIELD_TO_ROLE, which the
  earlier single-reload fix removed (settings no longer reloads); the test is
  obsolete, removed.
Added edge-branch tests (streaming-reject, idle-allow, persist reload-failure
toast) and mocked the reload in the slash-model test.

* Format test_tui_model_bar

* Fix racy embed-picker tests and a flaky drop-models test

CI on Python 3.11 / Windows (not the faster runners) surfaced races my targeted
runs hid:
- The embed picker-dismiss tests asserted cfg/reload_role right after
  _on_picker_dismissed, but _persist now reloads in a thread worker; await
  app.workers.wait_for_complete() inside the mock context before asserting.
- test_drop_loaded_models_async raced on _swap-is-None, which the worker clears
  before swap.shutdown() runs; wait on the shutdown count instead (pre-existing
  flake, unrelated to the swap change).

* Blast-radius review fixes: guard re-entrant chat model swap, run model write on main thread + block reload off-thread in picker worker, align reload_role wait param across mocks/tests
Adding any file failed with 'int() argument ... not NoneType' when the store
held a source row whose nullable stat columns (size_bytes / mtime_ns /
stat_captured_ns) were NULL. source_stat only guarded the SOURCE_STAT_UNKNOWN
sentinel and missing keys via .get-default, so an explicit NULL slipped through
to int(None). Treat NULL like unknown (return None, the caller re-hashes) and
coerce a NULL capture time to the sentinel instead of crashing.

Bug present since the stat-based sync-skip landed in #338; surfaces on any store
carrying a null-stat row.
…ut, tool-linkage, option-translation dedup (#419)

* Providers review fixes (batch 1): skip unreachable local server instead of dropping all models, forward streaming timeout, drop masking getattr default, dedup supports_tools ref

* test: assert streaming chat forwards caller timeout and skip-on-unreachable-server behavior

* Providers review fixes (batch 2): preserve tool-linkage message fields in chat_with_tools, correct _sdk_attr docstring overclaim

* Providers review fixes (batch 3): dedup litellm response model extraction into _response_model helper

* Providers review fixes (batch 4): remove dead write-only model_defaults cache (module, test, and the show_model write path)

* Providers review fixes (batch 5): dedup num_predict->max_tokens/drop num_ctx translation into shared normalize_generation_options

* fix: restore ModelDefaults dataclass (only the write-only cache was dead, not the type used by reasoning/config)

* test: restore generation_options 3-layer merge tests (live behavior dropped with the dead-cache test file)
…concept-boost + perf (#420)

* Retrieval review fixes (batch 1): route wiki:/raw: prefix through the wiki-disabled guard, apply concept boost even when query expansion is skipped

* Retrieval review fixes (batch 2): bare search() applies temporal filter (parity), correct boost_results copy docstring, degree_map Counter type, reasoning cap docstring

* Retrieval review fixes (batch 3): open chunk_concepts table once in boost_results (N+1), single-source structured-query modes, memoize repeated query concept extraction

* test: cover boost_results pass-through when chunk_concepts table is missing

* Retrieval review follow-ups: apply temporal filter to structured (mode:) queries too, clarify search() docstring, assert boost_results opens the table once
…f the event loop, bound crawl depth/max_pages like REST, re-tune search scope on vault switch (#421)
… lint, dedup (#422)

* Wiki review fixes (batch 1): JSON-serialize frontmatter sources (escape quotes/backslashes), normalize whitespace in excerpt location lookup

* Wiki review fixes (batch 2): read-only lint status skips audit-log write, provenance records the effective (fallback-resolved) entity mode

* Wiki review fixes (batch 3): prune log uses WikiLogAction.PRUNE enum, accept_draft reuses _classify_and_strip_markers (drop dead strip helpers)

* Wiki review fixes (batch 4): drop permanently-true synthesis condition, align empty-title fallback with browse resolver, longest-first citation source match
…p_enabled config) (#424)

Lets users opt out of lilbee's injected MCP search tool: opencode_config omits the mcp block, the lilbee-mcp skill install is skipped, and lilbee stays the model provider so a user's own MCP servers still apply. Tri-state CLI flag overrides the config default per launch.
TUI delete dropped index records but kept the source file, so the next sync re-ingested it and the doc reappeared. remove_documents_durably writes a hash-keyed skip-marker for the kept file so sync treats it as unchanged-and-skipped; editing the file or retry-skipped/rebuild restores it. Non-destructive (file stays on disk).
* Data/crawler review fixes (batch 1): offload import_dataset + force_rebuild store writes off the event loop, dedup host-scope via host_in_scope, correct SSRF docstring (TOCTOU, not rebinding protection)

* Data/crawler review fixes (batch 2): include_subdomains parity + nits

Thread include_subdomains through the MCP crawl tool and REST /api/crawl so
the subdomain scope reachable from the CLI and TUI is honored on every entry
point.

Nits: escape LIKE wildcards in the source search filter; warn instead of
silently discarding a corrupt crawl-metadata sidecar; dedup the empty-OCR skip
warning and the chunk char-budget computation; mirror lilbee's global
progress-bar suppression into the semantic-chunking embedding download; log
(not silently swallow) async-generator teardown errors; replace the flush
counter's inline magic-string dict with a typed counter; read LanceDB's
_distance column in the vector-search debug log; correct the crawl_and_save
max_pages docstring to match the safety-cap resolver.
…/) (#423)

* Modernize logging docs + move llama-swap.log into logs/

The fleet replaced the in-process model worker pool, but the docs still described
it. Bring them in sync:
- TROUBLESHOOTING: drop the worker-chat/embed/rerank/vision.log table row and the
  WorkerCrashError section (neither exists anymore); document the real logs --
  llama-swap.log and launcher-serve.log -- and rewrite 'model server crashes'
  around llama-swap's 'exited prematurely' with the server output embedded.
- CONTRIBUTING: 'in-process llama.cpp' -> a bundled llama-server process per model.
- Point the vision-OCR skip message at server.log (worker-vision.log never exists).
- Move llama-swap.log into the data root's logs/ so it sits beside server.log etc.
  instead of the data root, and create the dir on first write.

* Use the resolved per-platform log path in the vision-OCR message

The skipped-vision message hardcoded ~/Library/Application Support/...server.log,
which is macOS-only and wrong on Linux/Windows. Interpolate cfg.data_root (the
already per-platform-resolved data root) so the path is correct everywhere; add a
regression assertion. Audited the whole codebase -- this was the only hardcoded
platform path in a user-facing string (system.py's macOS branch, the Vulkan ICD
XDG paths, and the ~/-abbreviating display formatter are all correct).

Also make the README's TROUBLESHOOTING link absolute (the README ships to PyPI,
where relative links break).
* Align vision/embedding classification order across manifest and remote paths

reclassify_by_name checked vision before embedding while _classify_remote_task
checked embedding before vision, so an image embedder like nomic-embed-vision
(matching both patterns) classified differently depending on the path. Align
both to rerank -> embedding -> vision.

* Don't cache an empty arch from a transient probe failure

resolve_arch_for_pull cached probe_architecture's '' (returned on any network/
parse failure) as if it were a verdict, so one transient failure permanently
disabled the unsupported-arch guard for that ref. Only cache a non-empty arch.

* Guard the native registry walk in gather_known_model_refs

The docstring promises each primitive contributes an empty subset on failure,
but the registry.list_installed() walk was unguarded, so a corrupt manifest or
FS error raised out of the whole resolution. Wrap it to log and contribute no
refs, matching the remote/API primitives.

* Scope mmproj lookup to the matched vision repo's cache subtree

find_mmproj_file matched the right featured entry but then searched the whole
models dir, returning any file containing 'mmproj' regardless of which repo it
belonged to. A chat or unrelated-vision model could inherit another model's
mmproj and be misreported as vision-capable. Search only within the matched
repo's models--<org>--<repo> cache directory.

* Size a catalog row from the quant it names, not the largest GGUF

_estimate_size_from_siblings used the largest GGUF (often an F16/BF16) while
gguf_filename names the picked Q4_K_M quant, so size_gb (and the size-bucket
filter) didn't match the file a pull produces. Size the same picked quant.

* Run persisted-model canonicalization off the event loop at TUI mount

canonicalize_chat_model/embedding_model probe local model servers over HTTP/DNS;
calling them synchronously in on_mount froze the TUI for the probe's duration.
Make on_mount async and offload both probes via asyncio.to_thread, keeping the
ordering so the chat screen still installs against a settled ref.

* Reuse canonical ref helpers instead of redefining them

KnownModelCache.resolve hardcoded the 'ollama/' wire prefix; use OLLAMA.qualify.
role_validator redefined NATIVE_GGUF_REF_MIN_SLASHES and reimplemented the
native-GGUF-ref check; import is_native_gguf_ref from providers.model_ref.

* Fix overclaiming docs and a type-erasing annotation in catalog/modelhub

role_validator typed the catalog entry as Any, erasing CatalogModel; annotate it.
Correct the mmproj F16-preference comment (it doesn't deprioritize BF16 or
compare F16/F32 sizes) and the get_families docstring (variants keep featured
order; recommended comes from the entry flag, not size).

* Remove dead fetch_model_file_size and dedup a registry walk

fetch_model_file_size had no production callers (only its own tests via the
public export); remove the function, its export, and those tests. _is_local_installed
called registry.list_installed() twice per call; call it once.

* Validate the blob path before unlinking and log a swallowed installed-list error

_gc_blob validated the repo cache dir but built the blob path from a digest and
unlinked it unchecked; a traversal digest could escape models_dir. Validate the
blob path too. _get_installed_models swallowed all manager errors silently as
'nothing installed'; log it so a broken registry is visible.

* Read the manifest tree once when freeing a multi-shard model's blobs

remove() called _gc_blob per shard digest, and each call re-walked the whole
manifest tree via list_installed (N+1 for a split GGUF). Compute the surviving
siblings once and pass them in; _gc_blob still reads them itself when called
standalone.

* Cover the no-mmproj-in-repo-cache branch
* Fix test-quality review findings

Patch get_system_ram_gb where the setup screen looks it up, not at its source
module, so _patch_setup_ram actually pins RAM instead of silently using the
host's. Assert TestOptionsPassthrough actually forwards the request body's
options to the generation-options resolver, not just that the call succeeds.
Use entry.display_name (CatalogModel has no .name) in the featured-vision
assertion messages.

* Reformat messages.py to satisfy ruff 0.15.6 format-check

Pre-existing base-branch format-check failure (unrelated to the test fixes):
ruff 0.15.6 collapses a wrapped string literal that an older ruff left split.
Blocks lint on every PR until reformatted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant