
feat(stt/eou): fused decoder+joint_decision — MIL fusion, three-way WER gate, Swift-measured#69

Open
Alex-Wengg wants to merge 4 commits into main from feat/eou-decode-ane


@Alex-Wengg


Research record for the EOU decode-loop fusion (consumed by FluidAudio#680):

  • LSTM gate: ios17.lstm in the decoder → ANE categorically closed; joint_decision has zero ANE segments under any CU config (small-graph floor, measured)
  • MIL-lean fusion rebuilt from shipped fp16 weights: 1.68×/step; bit-exact pipeline variant measured 0% (dispatch isn't the bottleneck)
  • Full 2,620-file three-way WER gate: traced ≡ lean on every file (fp32-only bit-exactness disproven); ΔWER +0.043 pp; 4 raw-window truncation outliers, fully absorbed under production overlap chunking (Swift-measured in FluidAudio#680: +7–9% e2e RTFx, WER −0.11 pp)
  • Harnesses committed for re-runs: fuse_decoder_joint_decision.py, wer_three_way.py, bench_three_way.py; full writeup in models/stt/parakeet-realtime-eou-120m/OPTIMIZATION.md

🤖 Generated with Claude Code

…ampaign

ANE gate: decoder is ios17.lstm-blocked (no ANE kernel, categorical dead
end); joint_decision (3 MB) gets no ANE segment under ALL or CPU_AND_NE
(small-graph floor). The EOU decode loop can never be ANE-resident.

Fusion branch (Nemotron B1 precedent), built from the shipped fp16 weight
blobs with the MIL builder (no NeMo install needed):
- fuse_decoder_joint_decision.py: single-graph decoder+joint_decision.
  Lean variant: 0.215 -> 0.125 ms/step median (-42%), ~1.6x end-to-end for
  the decode-dominated EOU pipeline. Not bit-exact: fp16 logits move ~4e-3
  relative, flipping argmax on low-margin frames (transcripts change on
  ~5/20 LibriSpeech files; WER-neutral: 34.85% -> 35.04% over 1036 words).
  --replicate-boundary experiment (exact inter-model transpose/cast chain,
  optimizer passes disabled) failed to recover bit-exactness: the
  divergence is E5RT kernel selection below MIL control.
- pipeline_decoder_joint_decision.py: ct.utils.make_pipeline of the two
  shipped specs. Bit-exact (STRICT parity PASS, 675 steps) but 0% faster
  on M5 - per-stage execution overhead dominates, not host dispatch.
- parity_fused_decode.py / wer_ref_vs_fused.py / bench_fused_decode.swift:
  real-audio autoregressive parity, WER gate, interleaved A/B/C benchmark.
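
The interleaved A/B/C benchmark mentioned above can be sketched roughly as follows. This is a minimal illustration, not the contents of bench_fused_decode.swift or any committed harness: `interleaved_bench`, `speedup`, and the callables passed in are hypothetical names, and the sketch assumes each variant can be invoked as a zero-argument function that performs one decode step.

```python
import statistics
import time
from typing import Callable, Dict, List


def interleaved_bench(
    variants: Dict[str, Callable[[], object]],
    rounds: int = 200,
    warmup: int = 20,
) -> Dict[str, float]:
    """Return the median per-call latency in milliseconds for each variant."""
    # Warm every variant first so one-time setup cost is not measured.
    for _ in range(warmup):
        for step in variants.values():
            step()

    samples: Dict[str, List[float]] = {name: [] for name in variants}
    names = list(variants)
    for r in range(rounds):
        # Rotate the starting variant each round so no path always runs first.
        shift = r % len(names)
        for name in names[shift:] + names[:shift]:
            start = time.perf_counter()
            variants[name]()
            samples[name].append((time.perf_counter() - start) * 1e3)

    return {name: statistics.median(vals) for name, vals in samples.items()}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio of baseline latency to candidate latency (higher is faster)."""
    return baseline_ms / candidate_ms
```

Running all variants inside one loop means thermal state and background load affect every path roughly equally, which is the reason to interleave rather than time each path in its own run. As an arithmetic check, `speedup(0.215, 0.125)` gives 1.72 for the lean variant's per-step medians quoted above.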

Verdicts in OPTIMIZATION.md: ANE no-ship (settled), pipeline no-ship,
MIL-fused conditional ship pending a production-pipeline WER run.
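
The low-margin argmax flips described for the lean variant can be illustrated with a small sketch. It is an assumption-laden model, not the parity harness: `at_risk_frames` is a hypothetical helper, and it assumes the ~4e-3 relative movement is measured against the largest-magnitude logit in a frame, so two logits can close a gap of up to twice that bound.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_margin(logits: Sequence[float]) -> float:
    """Gap between the largest and second-largest logit."""
    ordered = sorted(logits, reverse=True)
    return ordered[0] - ordered[1]


def at_risk_frames(
    frames: Sequence[Sequence[float]], rel_err: float = 4e-3
) -> List[int]:
    """Indices of frames whose top-1 margin lies within the perturbation bound.

    Assumption: each logit may move by up to rel_err * max|logit|, so the
    top two logits can close a gap of up to twice that amount.
    """
    risky = []
    for t, logits in enumerate(frames):
        bound = 2.0 * rel_err * max(abs(x) for x in logits)
        if top1_margin(logits) <= bound:
            risky.append(t)
    return risky


# Frame 0 has a 0.01 margin against a 0.08 bound; frame 1 has a margin of 5.
frames = [[10.0, 9.99, 1.0], [10.0, 5.0, 1.0]]
risky = at_risk_frames(frames)

# A perturbation within the assumed bound is enough to flip frame 0.
before = argmax(frames[0])
after = argmax([9.97, 10.02, 1.0])
```

Only frames flagged this way can change their emitted token under the assumed bound. Frames with a comfortable margin decode identically, which is consistent with most transcripts being unchanged.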

… traced strictly dominated (identical sequences, fp32-only bit-exactness)

Single-harness deciding run for the two fused decoder+joint candidates
(MIL-lean vs traced re-export) against the shipped two-model pair:

- bench_three_way.py: interleaved per-RNNT-step bench, same process.
  Lean 1.68-1.71x/step, traced 1.21-1.24x — both prior cross-harness
  claims reproduce; ~half the gap is the dropped 1027-way top-k.
- wer_three_way.py: full LibriSpeech test-clean (2,620 files, 52,576
  words) with per-file cached encoder outputs feeding all three decode
  paths identical frames. Ref 35.646%, both fused 35.689% (+0.043 pp);
  traced and lean emit identical token sequences on all 2,620 files
  (278 differ from ref, same set), confirming the drift is inherent to
  single-graph fp16 fusion, not the lean rebuild.
- OPTIMIZATION.md §6: results + verdict. Traced fusion is strictly
  dominated (same outputs, 1.4x slower); lean is the only fused
  candidate, pending maintainer call on the 4/2,620 early-truncation
  tail (all on harness-degenerate utterances).
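
The three-way gate above reduces to a corpus-level WER per decode path plus two sequence-equality counts. A minimal sketch, assuming transcripts are plain whitespace-separated strings: `three_way_gate` and its argument names are hypothetical, this is not the code in wer_three_way.py, and it compares transcript strings where the real harness compares token sequences.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(refs: List[str], hyps: List[str]) -> float:
    """Aggregate WER in percent: total word errors over total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


def three_way_gate(
    truth: List[str],
    ref_out: List[str],
    traced_out: List[str],
    lean_out: List[str],
) -> Dict[str, float]:
    """Compare two fused decode paths against the reference path."""
    base = corpus_wer(truth, ref_out)
    return {
        "ref_wer": base,
        "traced_delta_pp": corpus_wer(truth, traced_out) - base,
        "lean_delta_pp": corpus_wer(truth, lean_out) - base,
        # Files where the two fused paths disagree with each other.
        "traced_ne_lean": sum(t != l for t, l in zip(traced_out, lean_out)),
        # Files where the fused output differs from the reference output.
        "fused_ne_ref": sum(l != r for l, r in zip(lean_out, ref_out)),
    }
```

In these terms, the result reported above corresponds to `traced_ne_lean == 0`, `fused_ne_ref == 278`, and both deltas equal to +0.043 pp. Summing errors over the whole corpus before dividing keeps long utterances from being under-weighted, as they would be in a mean of per-file WERs.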

A concurrent commit (6d82f35) appended a second copy of the gate results
after the repro block while §6 was being written. Fold its unique detail
(per-blowup error counts, aggregate-delta attribution) into §6b and drop
the duplicate section. No result changes.