feat(stt/eou): fused decoder+joint_decision — MIL fusion, three-way WER gate, Swift-measured by Alex-Wengg · Pull Request #69 · FluidInference/mobius

Alex-Wengg · 2026-06-10T15:21:29Z

Research record for the EOU decode-loop fusion (consumed by FluidAudio#680):

- LSTM gate: ios17.lstm in the decoder → ANE categorically closed; joint_decision has zero ANE segments under any compute-unit (CU) config (small-graph floor, measured)
- MIL-lean fusion rebuilt from shipped fp16 weights: 1.68×/step; bit-exact pipeline variant measured 0% speedup (dispatch isn't the bottleneck)
- Full 2,620-file three-way WER gate: traced ≡ lean on every file (fp32-only bit-exactness disproven), ΔWER +0.043 pp, 4 raw-window truncation outliers, all fully absorbed under production overlap chunking (Swift-measured in FluidAudio#680: +7–9% e2e RTFx, WER −0.11 pp)
- Harnesses committed for re-runs: fuse_decoder_joint_decision.py, wer_three_way.py, bench_three_way.py; full writeup in models/stt/parakeet-realtime-eou-120m/OPTIMIZATION.md
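
The gap between the per-step speedup and the end-to-end RTFx gain is the usual Amdahl's-law effect: only the decode loop gets faster. A minimal sketch of that relationship follows. The function names and the `decode_fraction` parameter are hypothetical and not taken from the committed harnesses; the share of wall time spent in the decode loop is not reported here, so it is left as an input.

```python
def end_to_end_speedup(step_speedup: float, decode_fraction: float) -> float:
    """Amdahl's law: overall speedup when only the decode loop is accelerated.

    step_speedup    -- speedup of one decode step (e.g. 1.68 for the lean fusion)
    decode_fraction -- ASSUMED share of baseline wall time spent in the decode loop
    """
    if step_speedup <= 0.0:
        raise ValueError("step_speedup must be positive")
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be in [0, 1]")
    return 1.0 / ((1.0 - decode_fraction) + decode_fraction / step_speedup)


def implied_decode_fraction(step_speedup: float, e2e_speedup: float) -> float:
    """Invert Amdahl's law: decode share implied by a measured end-to-end speedup."""
    if step_speedup <= 1.0 or e2e_speedup < 1.0:
        raise ValueError("expects step_speedup > 1 and e2e_speedup >= 1")
    return (1.0 - 1.0 / e2e_speedup) / (1.0 - 1.0 / step_speedup)
```

Calling `implied_decode_fraction(1.68, 1.08)` estimates how much of the production pipeline's time the decode loop would need to occupy for a 1.68× step speedup to yield +8% end to end, under the assumption that nothing else changed between the two runs.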

🤖 Generated with Claude Code

…ampaign ANE gate: decoder is ios17.lstm-blocked (no ANE kernel, categorical dead end); joint_decision (3 MB) gets no ANE segment under ALL or CPU_AND_NE (small-graph floor). The EOU decode loop can never be ANE-resident.

Fusion branch (Nemotron B1 precedent), built from the shipped fp16 weight blobs with the MIL builder (no NeMo install needed):

- fuse_decoder_joint_decision.py: single-graph decoder+joint_decision. Lean variant: 0.215 -> 0.125 ms/step median (-42%), ~1.6x end-to-end for the decode-dominated EOU pipeline. Not bit-exact: fp16 logits move ~4e-3 relative, flipping argmax on low-margin frames (transcripts change on ~5/20 LibriSpeech files; WER-neutral: 34.85% -> 35.04% over 1036 words). The --replicate-boundary experiment (exact inter-model transpose/cast chain, optimizer passes disabled) failed to recover bit-exactness: the divergence is E5RT kernel selection below MIL control.
- pipeline_decoder_joint_decision.py: ct.utils.make_pipeline of the two shipped specs. Bit-exact (STRICT parity PASS, 675 steps) but 0% faster on M5: per-stage execution overhead dominates, not host dispatch.
- parity_fused_decode.py / wer_ref_vs_fused.py / bench_fused_decode.swift: real-audio autoregressive parity, WER gate, interleaved A/B/C benchmark.

Verdicts in OPTIMIZATION.md: ANE no-ship (settled), pipeline no-ship, MIL-fused conditional ship pending a production-pipeline WER run.
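
To illustrate why a ~4e-3 relative shift in fp16 logits only changes the output on low-margin frames, here is a minimal sketch of a worst-case margin check. The helper names and the example logits are hypothetical and not part of parity_fused_decode.py; the bound simply assumes each logit can move by at most `rel_err` times its magnitude.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def flip_possible(logits, rel_err):
    """True if a worst-case relative perturbation could change the argmax.

    ASSUMPTION: every logit x may move by at most rel_err * |x|, so the
    top-1 / top-2 gap can shrink by rel_err * (|top| + |runner_up|).
    """
    if len(logits) < 2:
        return False
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top, runner_up = logits[order[0]], logits[order[1]]
    margin = top - runner_up
    worst_case_shift = rel_err * (abs(top) + abs(runner_up))
    return margin <= worst_case_shift


# Hypothetical frames: one with a comfortable margin, one nearly tied.
confident_frame = [10.0, 5.0, 3.0]
low_margin_frame = [10.00, 10.02, 3.0]
```

With `rel_err=4e-3`, `flip_possible(confident_frame, 4e-3)` is false while `flip_possible(low_margin_frame, 4e-3)` is true, which matches the pattern described above: most frames are unaffected, and only near-ties can emit a different token.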

… traced strictly dominated (identical sequences, fp32-only bit-exactness)

Single-harness deciding run for the two fused decoder+joint candidates (MIL-lean vs traced re-export) against the shipped two-model pair:

- bench_three_way.py: interleaved per-RNNT-step bench, same process. Lean 1.68-1.71x/step, traced 1.21-1.24x; both prior cross-harness claims reproduce, and ~half the gap is the dropped 1027-way top-k.
- wer_three_way.py: full LibriSpeech test-clean (2,620 files, 52,576 words) with per-file cached encoder outputs feeding all three decode paths identical frames. Ref 35.646%, both fused 35.689% (+0.043 pp); traced and lean emit identical token sequences on all 2,620 files (278 differ from ref, same set), confirming the drift is inherent to single-graph fp16 fusion, not the lean rebuild.
- OPTIMIZATION.md §6: results + verdict. Traced fusion is strictly dominated (same outputs, 1.4x slower); lean is the only fused candidate, pending maintainer call on the 4/2,620 early-truncation tail (all on harness-degenerate utterances).
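
For readers re-running the gate, the WER comparison above can be sketched as a corpus-level word error rate plus a delta in percentage points. This is a generic word-level Levenshtein implementation under stated assumptions (whitespace tokenization, no text normalization); it is not the code in wer_three_way.py, and all names are hypothetical.

```python
def word_errors(ref: str, hyp: str):
    """Return (edit_distance, ref_word_count) using word-level Levenshtein."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(r, 1):
        cur = [i]  # deleting i reference words reaches an empty hypothesis
        for j, hyp_word in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(r)


def corpus_wer(pairs) -> float:
    """Corpus WER in percent: total errors over total reference words."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    if words == 0:
        raise ValueError("no reference words")
    return 100.0 * errors / words


def delta_pp(wer_ref: float, wer_candidate: float) -> float:
    """Difference between two WERs, in percentage points."""
    return wer_candidate - wer_ref
```

Pooling errors over all reference words, rather than averaging per-file WER, keeps a handful of short or truncated utterances from dominating the aggregate, which matters when a few outlier files are under discussion.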

A concurrent commit (6d82f35) appended a second copy of the gate results after the repro block while §6 was being written. Fold its unique detail (per-blowup error counts, aggregate-delta attribution) into §6b and drop the duplicate section. No result changes.

Alex-Wengg added 4 commits June 9, 2026 23:59

docs(stt/eou): full 2620-file three-way WER gate — lean fusion ships,…

6d82f35

… traced strictly dominated (identical sequences, fp32-only bit-exactness)

Alex-Wengg wants to merge 4 commits into main from feat/eou-decode-ane
