Multi-GPU model serving and opencode integration by tobocop2 · Pull Request #267 · tobocop2/lilbee

tobocop2 · 2026-05-17T02:38:22Z

Problem

lilbee ran its models on a single GPU, and nothing outside lilbee could use them. A coding agent like opencode had no way to reach lilbee's local models, and lilbee couldn't call tools at all. Indexing a large library was strictly one card's worth of work at a time.

What this adds

Models run across all your GPUs

lilbee now spreads its models across every GPU on the machine instead of running on a single card. A model too big for one card tensor-splits across several; smaller roles pack onto the cards that have room. The whole box goes to work, so bigger models and heavier workloads have somewhere to run, and the fleet serves several requests at once instead of one at a time.

Bulk ingest fanned across the whole box

For indexing a large scanned library, lilbee can run several copies of the OCR and embedding models, one per spare GPU, and spread the work across all of them. A single lilbee sync then saturates every card instead of funneling through one, which is the difference between days and hours on a big corpus. Set vision_replicas / embed_replicas and lilbee places one server per GPU, load-balances pages and embedding batches to the least-busy one, and scales the ingest pipeline's file concurrency to match so the extra cards never sit idle.

opencode can use lilbee's models

lilbee speaks the OpenAI API. Point opencode, or any OpenAI-compatible client, at lilbee and it uses your local models directly: chat, tool calls, and lilbee_search over the same connection.

Tool calling

lilbee's models can call tools now. That is what makes the opencode integration useful: the model searches your library and acts on the results instead of only chatting.

Feedback while models load

lilbee launch opencode (and the other launchers) now show real progress while a cold model loads: a byte bar as the weights are read, then an engine-load spinner, then a clear handoff to the client. Before this it sat silent for what could be minutes on a large model. Once a server is up, repeat launches reuse it and start fast.

Better answers and longer conversations

The RAG path gained grounded refusal (it says when the library has nothing relevant instead of guessing), a context budget so retrieved chunks and history fit the model's window, and a cited-subset so a grounded answer is distinguishable from an off-corpus one. Long agentic conversations no longer overflow a native model's context: history is windowed to the served window instead of failing the turn. Retrieval quality improved too, with asymmetric query embedding for instruction-tuned embedders and LLM-based rerankers served alongside cross-encoders.

Indexes images, not just PDFs

Scanned images now route through the same vision OCR path as PDFs, so a folder of page scans indexes the same way a document does.

Architecture

Each request routes to a hosted model or to the local engine, which runs your models as a fleet of llama-server processes spread across the GPUs (one per role: chat, embedding, reranking, vision; plus extra replicas of a role when configured). A llama-swap supervisor owns the process lifecycle, and gguf-parser sizes each load so the fleet packs onto the available memory.

flowchart TD
    CLI[CLI] --> SVC
    TUI[TUI] --> SVC
    MCP[MCP server] --> SVC
    API[HTTP API] --> SVC

    SVC["App services · RAG pipeline<br/>(embed → rerank → chat)"] --> FAC{{"create_provider<br/>config.llm_provider"}}

    FAC -->|REMOTE| SDK["SdkLLMProvider"]
    FAC -->|AUTO| ROUTE["RoutingProvider<br/>(route by model-ref prefix)"]
    ROUTE -->|"hosted ref"| SDK
    ROUTE -->|"local GGUF ref"| FLEET["FleetProvider"]
    SDK --> CLOUD[("Hosted<br/>OpenAI-compatible API")]

    subgraph ENGINE["Local engine · managed llama-server fleet (one process per role, plus replicas)"]
        direction TB
        CHAT["chat · --jinja<br/>chat + native tool calls"]
        EMB["embed · --embeddings<br/>(×N replicas)"]
        RR["rerank · --pooling rank"]
        VIS["vision / OCR · --mmproj<br/>(×N replicas)"]
    end

    FLEET -->|"spawn · /health · restart<br/>+ localhost OpenAI HTTP"| ENGINE

On one GPU it is a fleet-of-one; on a multi-GPU host the supervisor packs the servers across cards by memory and, with replicas set, fans a role across the spare cards. The engine installs with lilbee, so there is nothing extra to set up.

Performance

Profiling a 595-page scanned PDF on a single GPU showed where OCR time actually went, and the fixes roughly halved it (this run used one card, not the replica fan-out):

Metric	Before	After
595-page scan, indexed end to end on one GPU	36 min	18 min
OCR pages in flight at once	1 (strictly sequential)	up to `vision_ocr_concurrency` (4) per server, batched on the vision server's slots
Worst page (a repetition runaway)	29,779 chars over 51 s, uncapped	capped at `vision_ocr_max_tokens` (4096)

Rasterization turned out to be negligible (1.9% of wall time); the real costs were a single-sequence decode that left the GPU about half idle and the occasional page that looped into a repetition, which the two changes above address. Embedding also regained a token/sequence-aware sub-batching step, so a large library no longer overruns the embed server's batch limit while indexing. Those numbers are a single GPU with one vision server; with vision_replicas the same OCR work fans across every card on the box on top of this per-GPU gain, so a multi-GPU host scales it further.

Supported models

These families were verified end to end through the local engine, so opencode can chat and call tools against them:

Family	Verified ref
Qwen3	Qwen3-4B, Qwen3-Coder-30B-A3B
Llama 3.1	Meta-Llama-3.1-8B-Instruct
Gemma 4	gemma-4-E2B-it
GLM	GLM-4.5-Air
MiniMax	MiniMax-M2
gpt-oss	gpt-oss-120b

Other families were tried but don't tool-call reliably through the engine yet: some ship no tool-call template (Hermes 3, GLM-4-9B), some never dispatch a tool in practice (functionary, the small DeepSeek-R1 distills, Granite, Phi-4-mini), and Mistral Nemo still errors on the fleet path for some message sequences. Very small or reasoning-only models may also decline to call tools.

…lidated PR Four-round audit loop on the full feat/local-model-api surface (PR #267) found and addressed 22 findings across cross-cutting-api-change and AGENTS.md compliance. Round 4 converged with zero findings. Significant changes: * Response-parser schemas lazy-load via @functools.cache get_schemas() instead of running 20 JSON file reads at module import on every parent-process bootstrap. Saves ~50ms off every CLI invocation. * Native supports_tools tightened: the substring set could match "tools" anywhere in a chat template (false-positive routing of tool requests to incapable models). Now an anchored Jinja-delimiter regex. * ContextWindowExceededError typed fields (requested / usable_budget / n_ctx) survive the worker-to-parent round-trip. Parent parses the message via regex when rebuilding the typed exception. * /api/chat and /api/ask routes translate ModelNotFoundError, ModelDoesNotSupportToolsError, and ContextWindowExceededError into typed HTTP responses (404/400/400) instead of bubbling to 500. * _classify_stream_error now uses CompletionsErrorCode enum values instead of magic strings. * Schema-loader, _AbortBridge, and other constant docstrings trimmed of multi-paragraph design rationale per AGENTS.md. * Historical narrative ("Defends against pre-fix manifests", "Without this, ...") removed from docstrings. * pyproject.toml version aligned with main (b479).

…ate self-tune demo - Built on expanded: llama.cpp / llama-cpp-python and Hugging Face Hub / huggingface_hub elevated with explicit 'without these there is no lilbee' framing. Tesseract, Litestar, MCP SDK, Typer, Pydantic, and tree-sitter- language-pack added; they were all load-bearing but uncredited. - Quick-start 'agent already knows what chunk size, MMR weight, and reranker depth do' was implementation-detail leakage. Reframed as 'agents already understand search engines, so the right knobs to move are obvious to them.' - Self-tune demo moved out of 'Grounding for AI agents' (its placement there was awkward) and into 'Already using an MCP-aware agent? Hand setup to it.' where the topic is exactly agent-driven retrieval tuning. The demo is now framed as 'coming with the opencode integration (#267)' since the launcher side of the integration isn't shipped yet.

…al-coming section The MiniMax M2.7 demos in 'Agent integration' weren't obviously cloud demos at a glance, which understated the lilbee-stays-local benefit and also gave the wrong impression that local opencode was a shipped story. Make it loud that opencode is driving a cloud model in both demos and link up to 'Opencode integration (coming)' where local-model support is in flight (#267). Drop the specific 'MiniMax M2.7' name in line with the no-named-models framing.

…f-parser Run native GGUF models as a fleet of llama-server processes (one per role: chat/embed/rerank/vision), supervised by llama-swap and VRAM-sized by gguf-parser. A model tensor-splits across GPUs and roles pack by capacity; data-parallel replicas fan a role across every card. The engine binaries ship in the lilbee-engine wheel built by tools/wheel-build.

Route each request by model ref: native GGUF refs go to the local fleet, remote-prefixed refs to the OpenAI-compatible SDK backend. Adds model discovery and role classification for installed and remote models.

Serve /v1 chat-completions and embeddings so any OpenAI-compatible client can use lilbee's local models, with tool-call extraction and a canonical chat dispatch path shared by the HTTP API, CLI, and MCP.

Launch opencode (and similar agents) against lilbee as their model provider, with the lilbee_search skill/MCP wired in and the built-in client tools scoped off so the model retrieves through lilbee.

…, page-text dataset Fan OCR pages and embeddings across the replica fleet during a single sync, cache per-page OCR so a downstream failure doesn't re-OCR, batch LanceDB writes across documents, and record a per-page text dataset alongside chunks.

Real-hardware fleet smoke test, opencode model matrix, and CI that builds the engine binaries and runs the integration suite against the fleet.

token_cap equalled the per-slot context, so an exact-truncated oversize input plus the server's re-added BOS overflowed by a token and the overflow retry still failed. Truncate a few tokens under n_ctx.

opencode 1.16.2 rejects the whole config with ConfigInvalidError 'Missing key limit.output' when a model's limit has only context, so it crashed on startup and the matrix's opencode session died. Emit both context and output.

…mpty (#394) A container orchestrator (SkyPilot, some Docker/Kubernetes GPU setups) can export an empty CUDA_VISIBLE_DEVICES while exposing the GPU through NVIDIA_VISIBLE_DEVICES. The empty value reads as no devices, so the fleet device probe reports no CUDA-capable device and the ingest fails on a host that has a working GPU. Clear an empty backend visible-devices var during fleet GPU bootstrap when a GPU is physically present (detected via nvidia-smi/NVML, independent of the var). Explicit pins and the conventional -1 CPU opt-out are non-empty and survive.

The store swallowed delete failures inside delete-then-add, silently producing duplicate or false-success state (memory add/update/delete, the sources-table replace, and the identity row). The delete helper now reports success and every caller gates the add or reports the real result; identity reads take the newest row. Narrows the wiki draft handlers to a dedicated PathTraversalError so genuine downstream errors are not masked, and rounds out the SSRF embedded-IPv4 coverage. Reviewed to convergence; 100% coverage.

…396) _max_concurrent only raised file-concurrency to the vision OCR capacity when vision_replicas > 1. On a single-GPU fleet it fell back to cpu_quota (cpu_count//2), so a many-core host fanned dozens of concurrent OCR requests at one vision server's few continuous-batching slots. The server returns HTTP 429; the bounded RATE_LIMIT retries exhaust under the sustained oversubscription and the file is dropped (observed live: a 1x A100 dropped 643 files to 429 while the GPU sat at 97%). Bound file-concurrency by the vision slot capacity (replicas x per-server slots) whenever a vision model is configured, single server included. This also fixes a latent multi-GPU flood where cpu_quota exceeded the total slots.

Harden lilbee for concurrent, always-on operation, fixing a class of races that only appear when many clients share one process: - Refuse the operations that rebuild the shared engine (vault init, factory reset, and a model-provider switch via settings_set / settings_reset / the REST config handler) on the HTTP server, where they would tear the singleton down under other clients' in-flight requests; the CLI, TUI, and stdio MCP keep them. Services swap their reference before closing the old instance. - Move per-request OCR settings to a ContextVar that propagates into worker threads, instead of mutating the global config. - Guard the lazy provider, spaCy, and HuggingFace xet globals so concurrent first-callers can't double-initialize or clobber them. - Resize the vision-request cap only when the gate is idle, so a mid-run change can't momentarily double the real concurrency. - Hand streaming answer tokens back to the event loop from the worker thread safely. - Remove an edited-to-empty file's stale chunks from the index on sync. - Coerce string-valued chunk settings before validation so MCP callers can't skip the guards. Full test coverage including the concurrency behaviors.

…398) * QA harness: fail ungrounded answers, auto-approve reel edits Two frame-validation findings from the full-matrix + reel run: - The scenario pass-gate accepted ungrounded answers: a fresh lilbee_search dispatch plus a chat completion passed even when the model answered 'not found in the indexed reference' / 'no documentation exists'. The tier prompts target content the indexed Godot reference does contain, so that is a real failure (retrieval gap or the model ignoring results), not a clean pass. Add an ungrounded-answer marker check to _poll_verdict so those FAIL, with unit tests. This stops the matrix from greenlighting demos in which lilbee appears to find nothing, and surfaces the underlying retrieval gap instead of masking it. - The coder/giant reels never wrote their artifact: an automated VHS reel has no human to satisfy opencode's default 'ask' edit permission, so file writes silently never happened and the model narrated the code instead. Add permission.edit=allow to the reel's opencode.json so writes auto-approve. * Harden the ungrounded-answer gate per review - Apply the check only to a would-be-PASS verdict (a fresh dispatch AND chat completion), so a model answering 'not found' from memory without dispatching is not mislabeled as an ungrounded-search failure; it keeps waiting and times out on the accurate no-dispatch detail. - Scan only the answer tail, not the whole cumulative pane, so a transient 'not found' from mid-search reasoning cannot false-fail a recovered grounded answer. - Narrow the markers to the unambiguous 'search found nothing in the reference' phrases; drop the generic ones that can appear in a good answer scoping a negative. - Match the sibling marker checks' case handling (m.lower()), and add a no-dispatch test. * Move the grounding gate to the settled pane (fix verdict-timing race) Round-2 review caught that the in-verdict gate was mistimed: run_scenario returns the instant dispatch+completion trips, but the harness is designed so the answer is still streaming then (it renders in wait_for_answer_settle, after the verdict is fixed). So the gate scanned mid-search reasoning, not the final answer -- under-firing the real 'not found' answer and re-introducing the transient false-positive. Extract the check into a pure downgrade_if_ungrounded(result, settled_pane) and apply it in run_smoke_scenarios right after wait_for_answer_settle, on a fresh capture, so grounding is judged on the rendered answer. Tests now inject a settled pane (how the harness calls it) and cover the no-op-on-non-PASS and tail-scan cases.

Make the transaction-less store's write paths atomic and lock per-instance: - Import each source through one locked transaction (cleanup + chunks + page texts + source row), with the embedding/dimension checks before the delete, so a failure or a concurrent reader never sees the source half-replaced. - Re-embed memories with the snapshot, embed, and table rebuild all under the write lock, so a memory added mid-rebuild is no longer erased. - Rebuild concept clusters via a single delete-and-add-under-one-lock helper, so readers never see the nodes table empty while the graph reports present. - Key the write lock on each store's own data directory (a per-instance engine coordinated on the wrong file before) and share one timeout budget across its two stages instead of applying it twice. - Report a file that yields page text but no searchable chunks as skipped rather than added, while still persisting its page text. - Count only published pages as backlinks in the wiki orphan linter (drafts and archived pages no longer exempt a live page). Full test coverage including the atomicity and locking behaviors.

…-main-20 # Conflicts: # pyproject.toml # src/lilbee/app/settings_map.py # src/lilbee/core/config/enums.py # src/lilbee/core/config/model.py # src/lilbee/mcp_server.py # tests/test_catalog_actions.py # tests/test_settings.py # tools/wheel-build/build_lilbee_binary.sh # uv.lock

* Evict ingest-lock registry entries on release try_acquire created a permanent dict entry per distinct filename and release never removed it, so a long-lived server ingesting many files grew one asyncio.Lock per basename for its whole life. Release now drops the entry; it is safe because try_acquire only acquires a free lock (these locks never have waiters) and release runs synchronously, atomic with try_acquire on the loop. * Suppress ProviderError in the read-only ready probe _ready_models reaches endpoint(), which raises ProviderError when a concurrent shutdown has cleared the port. role_ready is a read-only probe (HTTP status, SSE warming), so the probe now reports nothing-ready instead of throwing. * Make CUDA LD_LIBRARY_PATH composition idempotent cuda_runtime_env always prepended the wheel dirs to the existing path, so re-applying it on every reload pass accumulated duplicate copies that were baked into each spawned server. Drop existing entries that are already wheel dirs. * Close the old client pools when re-adopting a reloaded swap _adopt_swap overwrote self._clients without closing the previous clients, so every reload (TUI/REST/CLI apply) abandoned one httpx pool per replica per role with open sockets. The reloaded swap already stopped the old upstreams, so the old clients are closed in place. * Terminate the chromium install subprocess on cancel bootstrap_chromium awaited the ~180MB 'playwright install' with no try/finally, so an SSE client disconnect cancelled the task and left the install running orphaned (repeated retries stacked concurrent installs). Wrap the wait in a finally that terminates then kills the process. * Free the SDK vision OCR caller at the timeout deadline The timeout path used 'with ThreadPoolExecutor', whose __exit__ shutdown(wait= True) blocked until the hung call returned, so the caller was not freed at the deadline. Shut the pool down without waiting (cancel_futures), matching the fleet OCR path. * Account for every shard of a split GGUF A split GGUF was registered with one manifest tracking only the first shard, so removal freed only the first shard's blob (orphaning the rest when a sibling quant kept the repo cache alive) and reported only the first shard's size as freed. The manifest now records the total size and all shard blob digests; removal gc's every shard and freed size uses the total. Multi-shard download progress reports one monotonic 0->100% against the real total instead of N separate per-shard cycles. * Forward stream close through the reasoning and bare-JSON wrappers On reasoning cap-fire, filter_reasoning closed the _text_only generator, but its plain for-loop did not propagate GeneratorExit to the underlying chat stream, so the HTTP connection and its in_flight slot leaked until GC. _text_only and the sibling _recover_bare_json_stream now close their source on exit. * PR5 coverage: fix bootstrap test fakes, drop dead base-bar callback, cover missing-shard Add returncode to the existing bootstrap _Proc fakes (the new terminate path reads it), move the now-superseded callback out of the base progress bar (the tracker subclass owns it), and cover the missing-shard branch in _shard_accounting. * Address PR5 review: report full split-GGUF size in list/show; guard stream close The list and show surfaces still read the first-shard size_bytes after the removal path moved to disk_size_bytes; switch them too so a split GGUF reports its real on-disk total everywhere. Suppress teardown errors when forwarding the bare-JSON stream close (matching the reasoning wrapper), drop a redundant import, tighten a comment, and gate the bootstrap-cancel test on the task actually reaching proc.wait(). * Address PR5 review round 2: drain before close, real shard digests, recover legacy shards Round-2 review caught regressions in the PR5 fixes: - _adopt_swap now drains each old client's in-flight requests before closing it off-thread, so a reader mid-call on an old client snapshot is never closed out. - _shard_accounting uses _blob_digest (handles copy/non-symlink mode) instead of the snapshot filename, and the cache-recovery path records shard accounting too, so split GGUFs free every shard outside the symlinked-cache case. - Removal recovers shard blobs from the cache for pre-accounting manifests. - Cap the post-kill reap wait in the crawler bootstrap. * Address PR5 review round 3: REST freed size, progress total guard, drain grace REST DELETE now delegates to the shared remove path so it reports the real multi-shard freed size (was a hardcoded 0) like CLI and MCP. Multi-shard download progress only uses the summed grand total when every shard size is known (else falls back to per-shard, never exceeding 100%). The reload client drain waits a short grace so a just-checked-out reader's request enters its in-flight window before the drain inspects it; note the wedged-OCR thread relies on the backend httpx timeout. * Strengthen PR5 review-flagged tests (shard gc passthrough, lock eviction) * Reserve streaming in_flight eagerly so reload drain can't close mid-handoff A stream increments in_flight lazily on its first frame, but a stream is built and handed off (SSE handshake, MessageStart) before that first next(), so a concurrent reload's client drain saw in_flight==0 and could close the client out from under a checked-out stream. Reserve the slot at chat/chat_stream_items call time and release it when the stream closes or exhausts. * Close reload-retired clients by generation, not by in_flight timing The in_flight-based drain was racy: in_flight is incremented at request time, so a checked-out-but-not-yet-issued request (especially a non-stream chat doing prompt windowing, or a stream before its first frame) read as idle and could be closed out from under, and eager stream reservation leaked when a stream was closed before its first iteration. Replace it: a reload retires its old clients and closes the *previous* reload's retired clients (idle by now, confirmed via in_flight==0); shutdown closes whatever remains. Revert the eager-reservation streaming changes (the drain no longer keys on in_flight at close time). * PR5 review round 6 nits: legacy split freed size, sentinel constant, docstrings Report the true on-disk total when removing a legacy (pre-accounting) split GGUF so freed size matches what was actually deleted; use the _SIZE_UNKNOWN sentinel in the progress-total guard; correct the _ProgressTracker wrapper/subclass wording. * Cover the legacy-shard recovery error fallback * Drop bead-id and review-round references from test comments Remove bb-* issue ids and '(review round N)' annotations from test comments and docstrings repo-wide; they were tracking noise, not behavior. No test logic changes.

…ring (#404) * Fix retrieval-correctness defects in scoring, scope, and clustering bm25_probe rows never populated relevance_score because LanceDB FTS keys the score column as _score while SearchChunk only aliased _relevance_score. The confidence-based expansion-skip therefore always read 0 and never fired, and its tests mocked values the probe could never return. SearchChunk now accepts either alias, with a real-probe integration test. Structured-query prefixes (term:/vec:/hyde:) silently dropped an explicit chunk_type scope: only the wiki/raw branch honoured it. bm25_probe gained a chunk_type filter and the vec/hyde branches now thread the scope through. Concept boost mutated scores but returned store order, so the boost was inert for callers that consume search() order directly (CLI search, MCP search). The boosted results are now re-sorted by relevance. The reranker treated the RRF fusion score as if it were a [0,1] confidence. Real RRF magnitudes (~0.03) made the fusion term negligible in the blend and left the BM25-protection pin permanently dead (compared against a 0.8 threshold). The fusion signal is now min-max normalized across the candidate set like the reranker scores already were, so a strong hybrid hit keeps real weight and resists demotion; the now-redundant pin is removed. A perfect vector match (distance 0.0) is no longer misread as a falsy 0.5 fusion signal. Concept PMI was computed per file with one file's chunk count as the denominator and the per-file weights were then summed, which inflates pairs spread across many small files and is not corpus PMI. Edges now store raw co-occurrence counts; rebuild_clusters computes PMI once over corpus-wide counts before Leiden clustering. * Address review: scope the main-search path and sigmoid-normalize the BM25 probe The alias fix activated the previously-dead expansion-skip path but fed it raw, unbounded BM25 scores against a [0,1]-bounded threshold (the config calls it a "sigmoid-normalized BM25 score" but no sigmoid was ever applied), so expansion would skip on nearly every query. _should_skip_expansion now squashes the probe score through a logistic before comparing, and its tests use realistic raw-BM25 magnitudes instead of synthetic [0,1] values. The chunk_type scope was threaded through the structured term:/vec:/hyde: paths but missed the non-structured main path: the HyDE merge and the confidence probe both searched the mixed pool for a scoped query, leaking out-of-scope chunks into the HyDE merge. _merge_hyde_results and _should_skip_expansion now thread the scope, mirroring _merge_variant_results. * Address review round 2: cohort-normalize reranker fusion, clamp excerpt relevance The reranker min-max normalized the fusion signal over the whole candidate set, but on the default HyDE-enabled path that set mixes two incomparable scales: hybrid rows carry a tiny RRF relevance_score (~0.02) while HyDE rows carry a cosine distance (1 - distance ~0.5). Normalizing them together let a HyDE chunk dominate purely as a scale artifact and demote a strong hybrid top. The fusion signal is now min-max normalized within each scoring family (RRF vs cosine) separately, with a mixed-cohort test. The _score alias also let a raw, unbounded BM25 score reach Excerpt.relevance (via term: queries through /api/search), which is documented as a [0,1] value. _to_excerpt now clamps it. Also added a test asserting build_concept_records emits raw co-occurrence counts. * Address review round 3: derive corpus PMI from chunk_concepts, tidy reranker The concept edge table is append-only and not source-scoped, and its weight column's meaning changed in this PR (PMI -> raw count). Deriving corpus PMI by summing edge weights would therefore corrupt clustering on an in-place upgrade (legacy PMI floats summed as counts) and double-count co-occurrences when a source is re-ingested. rebuild_clusters now computes co-occurrence, concept frequencies, and the chunk count from the chunk_concepts map, which is source-scoped and schema-stable, so PMI stays correct across re-ingests and upgrades regardless of what the edge weights hold. Extracted the reranker's neutral-score 0.5 into a named constant, reused the existing probe-result minimum for the confidence probe's top_k, and corrected two reranker docstrings that still described whole-set (rather than per-family) fusion normalization. Also wrapped one pre-existing over-length call in pipeline.py that was tripping the lint gate. * Address review round 4: isolate the BM25 score, share per-family fusion ordering The root cause behind the recurring dual-scale findings was that the _score alias made one relevance_score field hold three incomparable scales (RRF fusion ~0.03, raw BM25, and None for vector rows). Raw BM25 is now kept in its own bm25_score field, so relevance_score stays fusion-scale only: the min_relevance_score filter, excerpt relevance, and weighting all see a consistent scale again, and the _to_excerpt clamp added earlier is no longer needed. Only the expansion-skip confidence probe reads bm25_score (still sigmoid-squashed). The concept-boost re-sort had reintroduced the RRF-vs-distance scale collision the reranker already fixed: it sorted hybrid and HyDE rows together by a raw key, so a HyDE recall's larger 1-distance could outrank a strong hybrid hit. The reranker's per-family normalization is now a shared helper (order_by_fusion in dedup), used by both the reranker and the boost re-sort.

…am (#406) Bring every transport in line with the reference (non-streaming / in-process) path. - MCP crawl tool validates off-loop and schedules on-loop (was dead); HTTP /api/crawl mirrors the off-loop validation so neither async transport stalls the loop. - Streaming /v1/chat/completions reports finish_reason 'length' on max_tokens truncation via a new StreamFinish terminator frame in the provider stream contract; fleet client and SDK provider emit it, dispatch maps it without downgrading tool-call streams. - ask/chat SSE handlers emit the cited subset with a fall-back to the full retrieved set, and refuse the no-embedder case by streaming the refusal as a normal answer token (mirrors Searcher.ask_stream) instead of an SSE error. - mcp and agent-config commands accept and apply --data-dir/--global. - wiki status counts every content subdir; an accepted unpaired drift draft is restored to its own page type via an origin marker; wiki prune preserves subdirs; concept/entity collision guard widened. - OpenAI stream_options.include_usage honored; first streamed delta always carries the assistant role. - code chunker line off-by-one fixed; SDK chat_with_tools routing added; models-pull arch precheck returns 409 at the route. Covers bb-ziks.3/.25/.66/.64/.67/.36/.63/.75/.35/.68/.65/.31/.46/.69 and bb-7jg1.9/.18/.21/.24. Review converged over two rounds (9 findings then 1 parity gap, all fixed). 100% coverage on changed lines.

…el-safety (#407) Fix data corruption and loss across the ingestion paths. - Code chunker consumes tree-sitter's size-bounded chunks: content is the already-extracted UTF-8 text (no byte-offset slicing that corrupted non-ASCII source), honors chunk_max_size, lists nested symbols, drops the literal 'None' header for anonymous symbols, and carries only the relative source name (no absolute-path leak). Line ranges are derived from each chunk's own line count so they stay correct with or without a trailing newline. - enable_ocr=False disables OCR entirely (vision and Tesseract); multi-frame TIFF/GIF images OCR every frame; chunk_text is offloaded off the event loop; the OCR cache key includes the per-page timeout. - The ingest pipeline keeps draining a cancel batch so a sibling that completed in the same batch still flushes before the cancel propagates. - Remote embeddings are ordered by the response index; SQL predicates stop doubling backslashes (Datafusion treats them literally). Covers bb-7jg1.4, bb-7jg1.10, bb-ziks.18/.19/.20/.21/.22/.39/.60/.61/.62. Review converged over two rounds; 100% coverage on changed lines.

…over, placement (#409) Resolve native slash GGUF refs and follow redirects so the unsupported-arch pull guard fires; read general.architecture by walking the GGUF KV table directly (validated byte-for-byte against gguf-py); make read_gguf_metadata return None on any parser failure instead of aborting the fleet build; route vision OCR through replica failover; order discrete single-model placement search-first; raise on out-of-range device pin; offload heavy wiki routes off the event loop.

…filenames, seeded Leiden (#411) Stop stringifying TOML values so list config fields load natively; write the opencode setup marker atomically; fold a query hash into crawl filenames so distinct queries don't collide (with a robust .md guard for degenerate paths); seed Leiden for reproducible clusters.

…nk invariant (#414) Add a newline-splitting validator for crawl_browser_extra_args so a persisted value no longer discards the whole config.toml on reload; reset removes stray symlinks instead of aborting; settings rejects lowering chunk_size below the existing overlap; flash_attention warns on bad input; stop persisting derived embedding_dim; widen settings rollback to parse errors; minor DRY/docstring fixes.

…nt-config parity (#415) Validate wiki-lint source within wiki_root (blocking); stop the spawned serve under a finally; apply max_distance + cap top_k + reject empty queries in CLI search/ask; JSON error envelope for --task; agent-config includes the remote chat model and pins opencode defaults; atomic opencode skill install; memory remove exit code; server probe dedup + log-handle close; sync counter lock; self-check temp cleanup; warm-budget contract.

…ard helpers (#416) * catalog: assign a selected remote/frontier model to its task's role, not always chat_model * TUI fixes: per-tab list cache key, single catalog nav, single role reload, scan logging - _append_more_hf_to_list updates the per-tab _list_cache_keys entry instead of a dead singular attribute, so a later _refresh_list sees the appended rows. - /catalog uses switch_view only (was stacking a second orphaned CatalogScreen). - settings model-pick passes reload_worker=False so the role server reloads once, not twice. - setup _scan_installed_models logs before swallowing. * TUI settings: per-key list defaults + regex validation only for regex lists - _on_list_restore restores each key's own default via get_default (was always writing the crawl_exclude_patterns default for every LIST_COLLAPSED setting). - regex validation runs only for settings marked validate_regex, so the crawl_browser_extra_args flag list isn't rejected as 'invalid regex'. - model-pick reload now happens once (test updated for the new reload path). * TUI: single-source the app window title via msg.app_title (consistent separator) * TUI: narrow call_from_thread to shutdown errors so genuine callback bugs surface * TUI: single-source the nav-view universe (messages.ALL_NAV_VIEWS) The view set was encoded in three places with a duplicated wiki gate. Define ALL_NAV_VIEWS once; get_nav_views() gates Wiki, app.get_views() derives its factory map from get_nav_views(), and the status bar composes ALL_NAV_VIEWS. * TUI wiki: debounce search so the tree re-walks once on pause, not per keystroke * TUI nits: settable-key autocomplete filter, fail-closed spacy check, logging, constants - /set autocomplete (suggester + autocomplete) offers only settable keys, so read-only wiki_dir is no longer suggested then refused. - _spacy_available no longer fails open to 'available' on an unexpected error; it logs and reports absent so the install guidance still shows. - suggester model/document lookups log on failure (parity with autocomplete). - list-editor id uses EDITOR_ID_PREFIX; size-variant strip typed list[SizeVariant]; focus-step extracted from a side-effecting ternary; docstring typo fixed. * TUI nits: fit-chip grammar, canonical api-key field + role map, palette delete cache, frontmatter guard - _render_fit_pill won't-run reads 'won't run, short by X GB' (was 'won't -X GB'). - catalog uses PROVIDER_API_KEY_FIELD; model_pick reuses MODEL_FIELD_TO_ROLE instead of a duplicate map. - palette _delete_doc invalidates the doc cache and uses CMD_DELETE_SUCCESS. - wiki _display_page coerces frontmatter via _safe_float/_safe_int so a non-numeric value can't crash the node-select handler. * TUI nits: NATIVE_BACKEND constant + drop native pill, derive digit-tab map, distinct delete-read error - Extract NATIVE_BACKEND; ModelCard drops the implied 'native' pill (parity with grid/list, which now use the constant too). - catalog on_key derives the 1-6 -> tab index from ALL_TAB_IDS, not a rebuilt map. - catalog_grouping sorts featured via a typed cast, not getattr-with-default. - _populate_library_list drops a misleading contextlib.suppress(AttributeError). - /delete store-read failure reports a distinct 'Could not read' error, not 'no documents'. * TUI nits: remove dead wiki source helpers; stream timings as a named dataclass - Drop unused WikiScreen._selected_source/_source_for_slug (no production caller) and their tests; drop the now-unused read_page module import. - Replace the [last_flush, last_scroll] magic-index list with a _StreamTimings dataclass. * TUI nits: shared estimate_min_ram_gb helper; narrow model_bar refresh suppress - Extract estimate_min_ram_gb (catalog/models.py) so hf_client and the catalog install path share one RAM-from-size heuristic (was two diverging copies). - RoleRow.refresh_state suppresses only NoMatches (children not mounted), not every exception, so a real pill/repaint failure surfaces. - Update the model-pick role-map test to the canonical MODEL_FIELD_TO_ROLE. * TUI nits: lru_cache the doc completion cache; set_setting download guard - Replace the _doc_cache module global + global-statement writes with an lru_cache'd inner function (cleared by invalidate_document_cache). - Extract _reject_if_downloading; set_setting now refuses a still-downloading model-role ref like set_active_model does. * TUI nits: public ChatScreen.request_reset; per-batch taskbar failure count - Extract ChatScreen.request_reset as the public reset entry; the palette calls it instead of reaching into the private _cmd_reset. - TaskBar's FAILED flash counts only the just-finished batch's failures, not every failure in persistent history (which inflated the count across runs). * TUI nits: extract shared catalog-card helpers (DRY across card/grid/detail) Move the byte-identical name-truncation, spec/status pills, fit-level color map, and verbose fit-pill renderer into catalog_card_shared; model_card, model_grid, and catalog_detail import them instead of each carrying a copy. Also fixes the grid drawer's stale 'won't -X GB' label (now shares the fixed renderer). * TUI perf: cache ModelGrid card lines per cell so each card builds once per repaint render_line is called once per terminal row, so each card was fully rebuilt _CARD_HEIGHT times per repaint. Cache the built lines keyed by (index, width, selected, border_style); cleared on set_rows/resize/highlight. * test: align suppressed-error and stream-timings tests with narrowed contracts * test: cover non-string download guard, out-of-range tab digit, and isolate crawl progress tests * docs: clarify model_grid card-cache invalidation comment * test: deterministic wiki search debounce test (remove inter-edit pause race) * test: make wiki search debounce test fully deterministic (no real timer/wall-clock)

…cation, behavioral parity, event-loop offloads (#417) * Server review fixes: catalog installed parity, ?source= 422, chunk_type decoder unification, stop-sequence forwarding, crawl unlimited sentinel, one-shot memory extraction, chat search-unavailable refusal, event-loop offloads, add-files lock partition * test: update catalog kwargs, crawl unlimited sentinel, and reasoning-stream fixtures for server review fixes * test: rename store-extracted-memories empty-answer test to match what it exercises

llama-swap was spawned with inherited stdout/stderr, so under a TUI or CLI parent its HTTP access log (POST /upstream/embed-0/tokenize ...) printed straight onto the terminal and corrupted the render. Capture its stdio to a per-owner llama-swap.log instead; the per-model upstream logs are unaffected (those come from llama-swap's /logs API, not its stdout). Audited the other engine subprocesses while here: gguf-parser, the llama-server --list-devices/ldd probes, nvidia-smi, and the managed lilbee serve already redirect or capture their output; the foreground client (opencode) inherits the terminal by design. llama-swap was the only leak.

* Make model swaps non-blocking with a progress indicator Switching a model ran the multi-second fleet reload on the Textual event loop, freezing (and sometimes crashing) the TUI. Two paths did it: a chat swap reset services in apply_model_change, and an embed/rerank/vision swap reloaded the role inline in model_pick._persist. Both now run the reload in a thread worker behind a 'Switching model, loading...' toast, with a 'Now using <model>' confirmation and an error toast on failure. The chat reset still cancels the in-flight stream first and waits for workers to drain (serialized so reset_services never runs twice at once). Updated the affected TUI tests for the async behavior. * Block chat input while a model swap loads A non-blocking swap still left a window where a prompt sent mid-swap raced a half-torn-down fleet. Add a swapping_model gate: apply_model_change disables the chat input behind a 'switching' state, the worker resets AND warms the new model (get_services eager-starts it) so unblock means loaded not just configured, and the input re-enables with focus restored when the worker finishes or fails. A submit attempted mid-swap is rejected with a clear toast. * Review fixes: single async reload, guarded watcher, dead-class drop Blast-radius review of the model-swap change found: - Settings model picks reloaded the role twice: once in apply_model_pick's worker and again synchronously in the Settings on_done, which both doubled the fleet restart and re-blocked the UI thread. Make apply_model_pick the single (async) reloader; the Settings on_done now only repaints the button. - watch_swapping_model touched the chat input unguarded; the unblock fires from the worker via call_from_thread and can land after navigating away, so guard the input access with suppress(NoMatches). - Dropped a dead .swapping-model CSS class (the input disabled state is the visual); clarified the drain-wait rationale (it protects provider+store teardown, not just serialization). * Convention fixes: name the persist worker, correct stale swap-worker comment * Optimize chat swap: reload only the chat role, keep the store The chat swap used reset_services(), which also closed LanceDB and rebuilt the searcher/concept-graph -- none of which a chat-model change needs (the provider reads cfg.chat_model late-bound; the Searcher never caches it). Every other role swap already uses the lighter reload_role(). Switch the chat swap to reload_role(CHAT, wait=True): it re-plans and restarts the fleet for the new model while keeping the store and searcher. wait=True is a new synchronous path on reload_role (run in the caller's already-off-event-loop worker) guarded by a threading.Condition so it returns only once the reload finishes and preserves the existing single-flight semantics exactly. Threaded the param through the provider protocol, both implementations, and the services pass-through. * Oracle review: correct overstated 'model loaded' docstrings diff-against-oracle found the swap docstrings claimed the model was loaded on unblock, but reload_role(wait=True) returns when the fleet proxy is healthy; the model loads lazily on its next request, same as every other role swap. The input block correctly covers the proxy-restart window. Corrected the comments; no behavior change (not warming after a reload is faithful to the non-chat reload_role path this mirrors). * Update services reload_role delegation test for the wait flag * Update reload_role forwarding tests for the wait flag The routing-provider forwarder and the services delegation now thread wait= to the underlying reload_role, so the assertions must include it. Adds a wait=True forwarding test for the chat-swap path. * Fix CI: drop the swap drain-wait (hang), repair drifted tests The full CI suite surfaced two failures the targeted runs missed: - test_tui slash-model timed out: apply_model_change's drain-wait looped on call_later forever when a background worker never drained, spinning the event loop. Root cause + production risk (a long-lived worker would block any swap), so drop the drain entirely: reload_role keeps the store and the provider retires busy clients across the restart, and the stream is already cancelled, so the worker can start immediately. The provider single-flights overlapping reloads. - test_worker_roles asserted settings.py re-exports MODEL_FIELD_TO_ROLE, which the earlier single-reload fix removed (settings no longer reloads); the test is obsolete, removed. Added edge-branch tests (streaming-reject, idle-allow, persist reload-failure toast) and mocked the reload in the slash-model test. * Format test_tui_model_bar * Fix racy embed-picker tests and a flaky drop-models test CI on Python 3.11 / Windows (not the faster runners) surfaced races my targeted runs hid: - The embed picker-dismiss tests asserted cfg/reload_role right after _on_picker_dismissed, but _persist now reloads in a thread worker; await app.workers.wait_for_complete() inside the mock context before asserting. - test_drop_loaded_models_async raced on _swap-is-None, which the worker clears before swap.shutdown() runs; wait on the shutdown count instead (pre-existing flake, unrelated to the swap change). * Blast-radius review fixes: guard re-entrant chat model swap, run model write on main thread + block reload off-thread in picker worker, align reload_role wait param across mocks/tests

Adding any file failed with 'int() argument ... not NoneType' when the store held a source row whose nullable stat columns (size_bytes / mtime_ns / stat_captured_ns) were NULL. source_stat only guarded the SOURCE_STAT_UNKNOWN sentinel and missing keys via .get-default, so an explicit NULL slipped through to int(None). Treat NULL like unknown (return None, the caller re-hashes) and coerce a NULL capture time to the sentinel instead of crashing. Bug present since the stat-based sync-skip landed in #338; surfaces on any store carrying a null-stat row.

…ut, tool-linkage, option-translation dedup (#419) * Providers review fixes (batch 1): skip unreachable local server instead of dropping all models, forward streaming timeout, drop masking getattr default, dedup supports_tools ref * test: assert streaming chat forwards caller timeout and skip-on-unreachable-server behavior * Providers review fixes (batch 2): preserve tool-linkage message fields in chat_with_tools, correct _sdk_attr docstring overclaim * Providers review fixes (batch 3): dedup litellm response model extraction into _response_model helper * Providers review fixes (batch 4): remove dead write-only model_defaults cache (module, test, and the show_model write path) * Providers review fixes (batch 5): dedup num_predict->max_tokens/drop num_ctx translation into shared normalize_generation_options * fix: restore ModelDefaults dataclass (only the write-only cache was dead, not the type used by reasoning/config) * test: restore generation_options 3-layer merge tests (live behavior dropped with the dead-cache test file)

…concept-boost + perf (#420) * Retrieval review fixes (batch 1): route wiki:/raw: prefix through the wiki-disabled guard, apply concept boost even when query expansion is skipped * Retrieval review fixes (batch 2): bare search() applies temporal filter (parity), correct boost_results copy docstring, degree_map Counter type, reasoning cap docstring * Retrieval review fixes (batch 3): open chunk_concepts table once in boost_results (N+1), single-source structured-query modes, memoize repeated query concept extraction * test: cover boost_results pass-through when chunk_concepts table is missing * Retrieval review follow-ups: apply temporal filter to structured (mode:) queries too, clarify search() docstring, assert boost_results opens the table once

…f the event loop, bound crawl depth/max_pages like REST, re-tune search scope on vault switch (#421)

… lint, dedup (#422) * Wiki review fixes (batch 1): JSON-serialize frontmatter sources (escape quotes/backslashes), normalize whitespace in excerpt location lookup * Wiki review fixes (batch 2): read-only lint status skips audit-log write, provenance records the effective (fallback-resolved) entity mode * Wiki review fixes (batch 3): prune log uses WikiLogAction.PRUNE enum, accept_draft reuses _classify_and_strip_markers (drop dead strip helpers) * Wiki review fixes (batch 4): drop permanently-true synthesis condition, align empty-title fallback with browse resolver, longest-first citation source match

…p_enabled config) (#424) Lets users opt out of lilbee's injected MCP search tool: opencode_config omits the mcp block, the lilbee-mcp skill install is skipped, and lilbee stays the model provider so a user's own MCP servers still apply. Tri-state CLI flag overrides the config default per launch.

TUI delete dropped index records but kept the source file, so the next sync re-ingested it and the doc reappeared. remove_documents_durably writes a hash-keyed skip-marker for the kept file so sync treats it as unchanged-and-skipped; editing the file or retry-skipped/rebuild restores it. Non-destructive (file stays on disk).

* Data/crawler review fixes (batch 1): offload import_dataset + force_rebuild store writes off the event loop, dedup host-scope via host_in_scope, correct SSRF docstring (TOCTOU, not rebinding protection) * Data/crawler review fixes (batch 2): include_subdomains parity + nits Thread include_subdomains through the MCP crawl tool and REST /api/crawl so the subdomain scope reachable from the CLI and TUI is honored on every entry point. Nits: escape LIKE wildcards in the source search filter; warn instead of silently discarding a corrupt crawl-metadata sidecar; dedup the empty-OCR skip warning and the chunk char-budget computation; mirror lilbee's global progress-bar suppression into the semantic-chunking embedding download; log (not silently swallow) async-generator teardown errors; replace the flush counter's inline magic-string dict with a typed counter; read LanceDB's _distance column in the vector-search debug log; correct the crawl_and_save max_pages docstring to match the safety-cap resolver.

…/) (#423) * Modernize logging docs + move llama-swap.log into logs/ The fleet replaced the in-process model worker pool, but the docs still described it. Bring them in sync: - TROUBLESHOOTING: drop the worker-chat/embed/rerank/vision.log table row and the WorkerCrashError section (neither exists anymore); document the real logs -- llama-swap.log and launcher-serve.log -- and rewrite 'model server crashes' around llama-swap's 'exited prematurely' with the server output embedded. - CONTRIBUTING: 'in-process llama.cpp' -> a bundled llama-server process per model. - Point the vision-OCR skip message at server.log (worker-vision.log never exists). - Move llama-swap.log into the data root's logs/ so it sits beside server.log etc. instead of the data root, and create the dir on first write. * Use the resolved per-platform log path in the vision-OCR message The skipped-vision message hardcoded ~/Library/Application Support/...server.log, which is macOS-only and wrong on Linux/Windows. Interpolate cfg.data_root (the already per-platform-resolved data root) so the path is correct everywhere; add a regression assertion. Audited the whole codebase -- this was the only hardcoded platform path in a user-facing string (system.py's macOS branch, the Vulkan ICD XDG paths, and the ~/-abbreviating display formatter are all correct). Also make the README's TROUBLESHOOTING link absolute (the README ships to PyPI, where relative links break).

* Align vision/embedding classification order across manifest and remote paths reclassify_by_name checked vision before embedding while _classify_remote_task checked embedding before vision, so an image embedder like nomic-embed-vision (matching both patterns) classified differently depending on the path. Align both to rerank -> embedding -> vision. * Don't cache an empty arch from a transient probe failure resolve_arch_for_pull cached probe_architecture's '' (returned on any network/ parse failure) as if it were a verdict, so one transient failure permanently disabled the unsupported-arch guard for that ref. Only cache a non-empty arch. * Guard the native registry walk in gather_known_model_refs The docstring promises each primitive contributes an empty subset on failure, but the registry.list_installed() walk was unguarded, so a corrupt manifest or FS error raised out of the whole resolution. Wrap it to log and contribute no refs, matching the remote/API primitives. * Scope mmproj lookup to the matched vision repo's cache subtree find_mmproj_file matched the right featured entry but then searched the whole models dir, returning any file containing 'mmproj' regardless of which repo it belonged to. A chat or unrelated-vision model could inherit another model's mmproj and be misreported as vision-capable. Search only within the matched repo's models--<org>--<repo> cache directory. * Size a catalog row from the quant it names, not the largest GGUF _estimate_size_from_siblings used the largest GGUF (often an F16/BF16) while gguf_filename names the picked Q4_K_M quant, so size_gb (and the size-bucket filter) didn't match the file a pull produces. Size the same picked quant. * Run persisted-model canonicalization off the event loop at TUI mount canonicalize_chat_model/embedding_model probe local model servers over HTTP/DNS; calling them synchronously in on_mount froze the TUI for the probe's duration. Make on_mount async and offload both probes via asyncio.to_thread, keeping the ordering so the chat screen still installs against a settled ref. * Reuse canonical ref helpers instead of redefining them KnownModelCache.resolve hardcoded the 'ollama/' wire prefix; use OLLAMA.qualify. role_validator redefined NATIVE_GGUF_REF_MIN_SLASHES and reimplemented the native-GGUF-ref check; import is_native_gguf_ref from providers.model_ref. * Fix overclaiming docs and a type-erasing annotation in catalog/modelhub role_validator typed the catalog entry as Any, erasing CatalogModel; annotate it. Correct the mmproj F16-preference comment (it doesn't deprioritize BF16 or compare F16/F32 sizes) and the get_families docstring (variants keep featured order; recommended comes from the entry flag, not size). * Remove dead fetch_model_file_size and dedup a registry walk fetch_model_file_size had no production callers (only its own tests via the public export); remove the function, its export, and those tests. _is_local_installed called registry.list_installed() twice per call; call it once. * Validate the blob path before unlinking and log a swallowed installed-list error _gc_blob validated the repo cache dir but built the blob path from a digest and unlinked it unchecked; a traversal digest could escape models_dir. Validate the blob path too. _get_installed_models swallowed all manager errors silently as 'nothing installed'; log it so a broken registry is visible. * Read the manifest tree once when freeing a multi-shard model's blobs remove() called _gc_blob per shard digest, and each call re-walked the whole manifest tree via list_installed (N+1 for a split GGUF). Compute the surviving siblings once and pass them in; _gc_blob still reads them itself when called standalone. * Cover the no-mmproj-in-repo-cache branch

* Fix test-quality review findings Patch get_system_ram_gb where the setup screen looks it up, not at its source module, so _patch_setup_ram actually pins RAM instead of silently using the host's. Assert TestOptionsPassthrough actually forwards the request body's options to the generation-options resolver, not just that the call succeeds. Use entry.display_name (CatalogModel has no .name) in the featured-vision assertion messages. * Reformat messages.py to satisfy ruff 0.15.6 format-check Pre-existing base-branch format-check failure (unrelated to the test fixes): ruff 0.15.6 collapses a wrapped string literal that an older ruff left split. Blocks lint on every PR until reformatted.

tobocop2 changed the title ~~Local-model API: opencode (and any OpenAI/Anthropic client) talks to lilbee models~~ Local-model API: popular clients can talk to lilbee models May 17, 2026

tobocop2 changed the title ~~Local-model API: popular clients can talk to lilbee models~~ feat: Local-model API: popular clients can talk to lilbee models May 17, 2026

tobocop2 marked this pull request as draft May 18, 2026 07:37

tobocop2 changed the title ~~feat: Local-model API: popular clients can talk to lilbee models~~ feat: local-model API + llama-server engine (opencode + multi-GPU fleet) May 27, 2026

tobocop2 mentioned this pull request May 27, 2026

Make llama-server lilbee's local inference engine #297

Closed

tobocop2 changed the title ~~feat: local-model API + llama-server engine (opencode + multi-GPU fleet)~~ Multi-GPU model serving + OpenAI-compatible API for opencode May 27, 2026

tobocop2 changed the title ~~Multi-GPU model serving + OpenAI-compatible API for opencode~~ Re-architect local inference onto a multi-GPU llama-server fleet, with opencode integration May 27, 2026

tobocop2 changed the title ~~Re-architect local inference onto a multi-GPU llama-server fleet, with opencode integration~~ Multi-GPU model serving and opencode integration May 27, 2026

tobocop2 force-pushed the feat/local-model-api branch 2 times, most recently from 5b843d7 to 559c777 Compare June 4, 2026 05:09

tobocop2 force-pushed the main branch 3 times, most recently from a5d1a07 to b399947 Compare June 4, 2026 23:37

tobocop2 marked this pull request as ready for review June 6, 2026 03:26

tobocop2 added 9 commits June 6, 2026 16:01

Provider routing and remote (Ollama / LM Studio / SDK) backends

356a4a0

Route each request by model ref: native GGUF refs go to the local fleet, remote-prefixed refs to the OpenAI-compatible SDK backend. Adds model discovery and role classification for installed and remote models.

OpenAI-compatible server API with native tool calling

1314d24

Serve /v1 chat-completions and embeddings so any OpenAI-compatible client can use lilbee's local models, with tool-call extraction and a canonical chat dispatch path shared by the HTTP API, CLI, and MCP.

opencode and agent-client integration

daaa7ad

Launch opencode (and similar agents) against lilbee as their model provider, with the lilbee_search skill/MCP wired in and the built-in client tools scoped off so the model retrieves through lilbee.

Config, CLI, TUI, MCP, and runtime wiring for the local engine

334720b

Multi-GPU QA harness and engine CI

6dae1a3

Real-hardware fleet smoke test, opencode model matrix, and CI that builds the engine binaries and runs the integration suite against the fleet.

Tests for the local-model engine, server API, ingest, and integrations

4e6b3dd

Docs and project metadata for the local-model engine

8381ab4

tobocop2 force-pushed the feat/local-model-api branch from 132938c to 8381ab4 Compare June 6, 2026 20:03

tobocop2 added 2 commits June 6, 2026 16:14

Truncate embed inputs below n_ctx so re-added BOS fits (bb-54r)

ba5513e

token_cap equalled the per-slot context, so an exact-truncated oversize input plus the server's re-added BOS overflowed by a token and the overflow retry still failed. Truncate a few tokens under n_ctx.

tobocop2 added 30 commits June 19, 2026 16:12

Misc/MCP review fixes: offload add-tool DNS validation + file copy of…

d9f2313

…f the event loop, bound crawl depth/max_pages like REST, re-tune search scope on vault switch (#421)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Multi-GPU model serving and opencode integration#267

Multi-GPU model serving and opencode integration#267
tobocop2 wants to merge 90 commits into
mainfrom
feat/local-model-api

tobocop2 commented May 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tobocop2 commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

What this adds

Models run across all your GPUs

Bulk ingest fanned across the whole box

opencode can use lilbee's models

Tool calling

Feedback while models load

Better answers and longer conversations

Indexes images, not just PDFs

Architecture

Performance

Supported models

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tobocop2 commented May 17, 2026 •

edited

Loading