feat(stt/eou): fused decoder+joint_decision — MIL fusion, three-way WER gate, Swift-measured#69
Open
Alex-Wengg wants to merge 4 commits into
…ampaign ANE gate: decoder is ios17.lstm-blocked (no ANE kernel, categorical dead end); joint_decision (3 MB) gets no ANE segment under ALL or CPU_AND_NE (small-graph floor). The EOU decode loop can never be ANE-resident.

Fusion branch (Nemotron B1 precedent), built from the shipped fp16 weight blobs with the MIL builder (no NeMo install needed):

- fuse_decoder_joint_decision.py: single-graph decoder+joint_decision. Lean variant: 0.215 -> 0.125 ms/step median (-42%), ~1.6x end-to-end for the decode-dominated EOU pipeline. Not bit-exact: fp16 logits move ~4e-3 relative, flipping argmax on low-margin frames (transcripts change on ~5/20 LibriSpeech files; WER-neutral: 34.85% -> 35.04% over 1036 words). The --replicate-boundary experiment (exact inter-model transpose/cast chain, optimizer passes disabled) failed to recover bit-exactness: the divergence is E5RT kernel selection below MIL control.
- pipeline_decoder_joint_decision.py: ct.utils.make_pipeline of the two shipped specs. Bit-exact (STRICT parity PASS, 675 steps) but 0% faster on M5: per-stage execution overhead dominates, not host dispatch.
- parity_fused_decode.py / wer_ref_vs_fused.py / bench_fused_decode.swift: real-audio autoregressive parity, WER gate, interleaved A/B/C benchmark.

Verdicts in OPTIMIZATION.md: ANE no-ship (settled), pipeline no-ship, MIL-fused conditional ship pending a production-pipeline WER run.
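For readers unfamiliar with the WER gate mentioned above, the following is a minimal sketch of how a reference-vs-candidate WER delta can be computed. It is not the code in wer_ref_vs_fused.py; the function names (`edit_distance`, `corpus_wer`, `wer_delta_pp`) are hypothetical, and it assumes whitespace-tokenized transcripts with no text normalization.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def wer_delta_pp(refs, baseline_hyps, candidate_hyps):
    """WER change of the candidate relative to the baseline, in percentage points."""
    base = corpus_wer(list(zip(refs, baseline_hyps)))
    cand = corpus_wer(list(zip(refs, candidate_hyps)))
    return (cand - base) * 100.0
```

A gate built on this would compare `wer_delta_pp(...)` against a chosen tolerance. Aggregating errors over the whole corpus, rather than averaging per-file WER, keeps short utterances from dominating the result.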
… traced strictly dominated (identical sequences, fp32-only bit-exactness)
Single-harness deciding run for the two fused decoder+joint candidates (MIL-lean vs traced re-export) against the shipped two-model pair:

- bench_three_way.py: interleaved per-RNNT-step bench, same process. Lean 1.68-1.71x/step, traced 1.21-1.24x; both prior cross-harness claims reproduce, and ~half the gap is the dropped 1027-way top-k.
- wer_three_way.py: full LibriSpeech test-clean (2,620 files, 52,576 words) with per-file cached encoder outputs feeding all three decode paths identical frames. Ref 35.646%, both fused 35.689% (+0.043 pp); traced and lean emit identical token sequences on all 2,620 files (278 differ from ref, same set), confirming the drift is inherent to single-graph fp16 fusion, not the lean rebuild.
- OPTIMIZATION.md §6: results + verdict. Traced fusion is strictly dominated (same outputs, 1.4x slower); lean is the only fused candidate, pending maintainer call on the 4/2,620 early-truncation tail (all on harness-degenerate utterances).
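The interleaved, same-process benchmark described above can be sketched roughly as follows. This is an illustration of the technique only, not the contents of bench_three_way.py: `interleaved_bench` and `speedups` are hypothetical names, and each candidate is assumed to be a zero-argument callable that runs exactly one decode step.

```python
import statistics
import time


def interleaved_bench(candidates, rounds=200, warmup=20):
    """Time every candidate once per round, in round-robin order.

    candidates: dict mapping a label to a zero-argument callable (one step).
    Returns a dict mapping each label to its median seconds per step.
    """
    # Warm-up passes are run but not recorded.
    for _ in range(warmup):
        for step in candidates.values():
            step()
    samples = {name: [] for name in candidates}
    for _ in range(rounds):
        # Interleaving exposes all candidates to the same thermal and
        # scheduling conditions instead of benchmarking them back to back.
        for name, step in candidates.items():
            t0 = time.perf_counter()
            step()
            samples[name].append(time.perf_counter() - t0)
    return {name: statistics.median(ts) for name, ts in samples.items()}


def speedups(medians, baseline):
    """Per-step speedup of each candidate relative to the baseline label."""
    return {name: medians[baseline] / m for name, m in medians.items()}
```

As a usage example, `speedups({"ref": 0.215, "lean": 0.125}, "ref")` gives a lean speedup of about 1.72x, using the ms/step medians quoted in the first commit. Reporting the median rather than the mean keeps occasional scheduler stalls from skewing the comparison.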
A concurrent commit (6d82f35) appended a second copy of the gate results after the repro block while §6 was being written. Fold its unique detail (per-blowup error counts, aggregate-delta attribution) into §6b and drop the duplicate section. No result changes.
Research record for the EOU decode-loop fusion (consumed by FluidAudio#680):
- ios17.lstm in the decoder → ANE categorically closed; joint_decision has zero ANE segments under any CU config (small-graph floor, measured)
- fuse_decoder_joint_decision.py, wer_three_way.py, bench_three_way.py; full writeup in models/stt/parakeet-realtime-eou-120m/OPTIMIZATION.md

🤖 Generated with Claude Code