Canary streamatt#34

Open
azziko wants to merge 13 commits into hlt-mt:main from azziko:canary-streamatt

Conversation

@azziko

@azziko azziko commented Apr 29, 2026

Changes:

  1. Add flag to the base streamatt, which determines whether the audio history is stored raw or in features
  2. Implement canary with streamatt

Resolves: #28

@azziko

azziko commented Apr 29, 2026

Author

I'll fix the checks and run unit tests. I forgot about them to be honest

Let me know if the overall idea is fine

@mgaido91 mgaido91 left a comment

Contributor

thank you very much for your contribution, @azziko! The approach looks great to me and the code is very clean, thanks. I am only concerned by the leading EOS, which I do not understand.

Only a couple of last points:

  • Can we please add a couple of unit tests for the audio history management? Just to ensure everything works as we expect and that future changes won't break things.
  • This code relies on recent contribs to NeMo (thanks for them as well!), but currently we have in our dependencies nemo_toolkit[asr]==2.4.0 for canary. I think we have to update that.

Thanks!

Comment thread config/canary_streamatt.yaml
Comment thread simulstream/server/speech_processors/base_streamatt.py Outdated
Comment thread simulstream/server/speech_processors/canary_streamatt.py Outdated
Comment thread simulstream/server/speech_processors/canary_streamatt.py Outdated
Comment thread simulstream/server/speech_processors/canary_streamatt.py
Comment thread simulstream/server/speech_processors/canary_streamatt.py Outdated
Comment thread simulstream/server/speech_processors/canary_streamatt.py Outdated
Comment thread simulstream/server/speech_processors/canary_streamatt.py Outdated
Comment thread simulstream/server/speech_processors/canary_streamatt.py

return replace(self.transcription_cfg, prompt={"turns": turns})

def _remove_eos_tokens(self, token_ids: List[int]) -> List[int]:

Contributor

I do not understand, how and when can this happen? isn't it a problem for the attention to have these extra tokens?

@azziko azziko May 2, 2026

Author

When we were testing our system with Canary for IWSLT, there were EOS tokens occasionally in the beginning of the hypothesis. While we haven't traced the exact reason why, I speculate it's because of the forced prefix. In our system we solved it this way. The fix should probably be better done on the NeMo side, though. I will look into that

> isn't it a problem for the attention to have these extra tokens?

In our tests they were output together with the rest of the prediction, so again I assume that they don't disrupt the attention scores.
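
For readers following this thread, here is a minimal sketch of what trimming leading EOS tokens could look like. The body of `_remove_eos_tokens` is not shown in this conversation, so `strip_leading_eos` and the `eos_id` parameter are hypothetical names, and the token ids are made up for illustration.

```python
from typing import List


def strip_leading_eos(token_ids: List[int], eos_id: int) -> List[int]:
    """Drop EOS tokens that appear before the first non-EOS token.

    Hypothetical helper: only the leading run is removed, so a trailing
    EOS (the normal end-of-hypothesis marker) is left untouched.
    """
    start = 0
    while start < len(token_ids) and token_ids[start] == eos_id:
        start += 1
    return token_ids[start:]


# Illustrative ids only: 3 stands in for the EOS id here.
print(strip_leading_eos([3, 3, 17, 42, 3], eos_id=3))  # [17, 42, 3]
```

Note that removing tokens this way also shifts token positions, so any per-token attention rows would need the same leading rows dropped to stay aligned.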

Contributor

if you have a repro, I can also try to debug this, thanks. I would like to make sure here we do not have issues.

@azziko

azziko commented May 2, 2026

Author

thanks for the review @mgaido91,

I pushed the quick fixes for most of the points, I will add some unit tests later too.

Regarding the EOS, I replied in the related conversation.

> This code relies on recent contribs to NeMo (thanks for them as well!), but currently we have in our dependencies nemo_toolkit[asr]==2.4.0 for canary. I think we have to update that.

It does not seem like the contributions have been included in any release yet. I'm using the latest commit from the repo when installing the NeMo toolkit, like so:

pip install "nemo_toolkit[asr] @ git+https://github.com/NVIDIA/NeMo.git"

@mgaido91 mgaido91 left a comment

Contributor

mostly LGTM, thanks, just a few minor comments. The main thing that worries me is the EOS stripping, which I would like to investigate more.

Regarding the version, the next release will be 2.8.0, so we can put that as a dependency. This might also mean we have to wait for that release to merge this, but that may be fine if they stick to their scheduled release (June, so ~1 month from now). Otherwise we can put "@ git+https://github.com/NVIDIA/NeMo.git@main" as a dependency in the pyproject (actually it would be better to use a commit hash than main, to ensure we do not have flaky issues with newer commits coming in). Then we will need another PR once they do the release to use that.
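
To illustrate the point about pinning to a commit hash rather than a branch, here is a small sketch that checks whether a `name @ git+URL@ref` requirement string ends in a full 40-character SHA. The helper names are hypothetical, the all-zero hash is a placeholder rather than a real NeMo commit, and the parsing is deliberately simple (it does not handle SSH-style URLs).

```python
import re
from typing import Optional

_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def git_ref(requirement: str) -> Optional[str]:
    """Return the ref after the URL's trailing '@', or None if there is none."""
    url = requirement.split(" @ ", 1)[-1]
    # Ignore the scheme so that only an '@ref' suffix on the URL is considered.
    rest = url.split("://", 1)[-1]
    if "@" not in rest:
        return None
    return rest.rsplit("@", 1)[1]


def is_pinned_to_commit(requirement: str) -> bool:
    """True only when the ref is a full commit SHA, not a branch or tag name."""
    ref = git_ref(requirement)
    return ref is not None and bool(_FULL_SHA.match(ref))


base = "nemo_toolkit[asr] @ git+https://github.com/NVIDIA/NeMo.git"
placeholder_sha = "0" * 40  # placeholder, not a real commit

print(is_pinned_to_commit(base))                          # False
print(is_pinned_to_commit(base + "@main"))                # False
print(is_pinned_to_commit(base + "@" + placeholder_sha))  # True
```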

- **audio_subsampling_factor (int)**: Subsampling factor of the model, if any.
Defaults to 1.
- **mel_hop_samples (int)**: Number of raw waveform samples per mel frame.
Defaults to 1.

Contributor

Suggested change
- Defaults to 1.
+ Defaults to 160, i.e. 10ms at 16kHz.
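
As a worked example of how these two parameters relate, here is a sketch that converts encoder frames to raw waveform samples. The defaults (160 and 8) and the 16 kHz rate come from this thread. The function names are hypothetical, and the assumption that one encoder frame spans `audio_subsampling_factor` mel frames is a simplification that ignores window and padding effects.

```python
def encoder_frames_to_samples(
    num_frames: int,
    mel_hop_samples: int = 160,
    audio_subsampling_factor: int = 8,
) -> int:
    """Approximate number of raw samples covered by `num_frames` encoder frames."""
    return num_frames * audio_subsampling_factor * mel_hop_samples


def samples_to_seconds(num_samples: int, sample_rate: int = 16000) -> float:
    """Duration in seconds of `num_samples` raw samples."""
    return num_samples / sample_rate


print(samples_to_seconds(160))           # 0.01 (one mel hop is 10 ms at 16 kHz)
print(encoder_frames_to_samples(1))      # 1280
print(samples_to_seconds(1280))          # 0.08 (one encoder frame is about 80 ms)
```

Under these assumptions, cutting the history at an encoder-frame boundary corresponds to dropping a multiple of 1280 raw samples when the history is stored as raw audio.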

Comment thread config/canary_streamatt.yaml
Comment on lines +53 to +55
self.use_raw_audio_history = True
self.mel_hop_samples = getattr(self.config, "mel_hop_samples", 160)
self.audio_subsampling_factor = getattr(self.config, "audio_subsampling_factor", 8)

Contributor

all these things are already set in the parent, no need to have them here.

@azziko

azziko commented May 5, 2026

Author

I agree on the version, I changed it to 2.8.0

Regarding the EOS problem, I looked into the logs I had, and it turned out to be a problem specific to our system, so I removed the EOS trimming in the latest commit. It's probably still a good idea to run the processor on a small test set. I will try it out when I have time.

Comment thread uts/speech_processors/test_streamatt.py Outdated

@mgaido91 mgaido91 left a comment

Contributor

LGTM, only one comment regarding the UT. I agree on testing this more thoroughly, I'll also do that when I find the time.

Since we have to wait for NeMo 2.8.0 to be out, please ping me if I do not notice the release, so that we can merge this as soon as it is available.

Thanks!

Comment thread simulstream/server/speech_processors/canary_streamatt.py Outdated
azziko and others added 3 commits May 6, 2026 10:36
Co-authored-by: Marco Gaido <marcogaido91@gmail.com>
Co-authored-by: Marco Gaido <marcogaido91@gmail.com>
Comment thread uts/speech_processors/test_streamatt.py Outdated
Comment thread simulstream/server/speech_processors/canary_streamatt.py Outdated
Comment thread simulstream/server/speech_processors/canary_streamatt.py Outdated
@mgaido91

mgaido91 commented Jun 9, 2026

Contributor

I have been trying this on MuST-C and got weird results:

| Lang | Frame | COMET | SacreBLEU | Ideal Latency (s) | Comp. Latency (s) | Norm. Erasure | RTF |
|------|-------|--------|-----------|-------------------|-------------------|---------------|--------|
| de | 2 | 0.7104 | 16.6518 | 2.9936 | 3.7225 | 0.0000 | 0.5860 |
| de | 4 | 0.7091 | 16.5604 | 3.0939 | 3.7810 | 0.0000 | 0.5566 |
| de | 6 | 0.7191 | 17.2589 | 3.5756 | 4.5110 | 0.0000 | 0.7091 |
| de | 8 | 0.7228 | 17.1645 | 4.1756 | 4.9336 | 0.0000 | 0.5139 |
| es | 2 | 0.7402 | 20.6929 | 2.8251 | 3.5067 | 0.0000 | 0.5992 |
| es | 4 | 0.7395 | 20.7426 | 2.8670 | 3.4297 | 0.0000 | 0.5112 |
| es | 6 | 0.7452 | 21.0897 | 3.0740 | 3.7319 | 0.0000 | 0.5961 |
| es | 8 | 0.7462 | 21.3364 | 3.0958 | 3.6534 | 0.0000 | 0.4638 |
| fr | 2 | 0.7219 | 23.2862 | 2.8991 | 3.6615 | 0.0000 | 0.6582 |
| fr | 4 | 0.7257 | 23.7139 | 2.9642 | 3.6991 | 0.0000 | 0.6417 |
| fr | 6 | 0.7326 | 24.4975 | 3.0756 | 3.8335 | 0.0000 | 0.6594 |
| fr | 8 | 0.7394 | 25.0632 | 3.3128 | 3.8698 | 0.0000 | 0.4516 |
| it | 2 | 0.7394 | 17.5279 | 2.7474 | 3.4357 | 0.0000 | 0.6084 |
| it | 4 | 0.7390 | 17.6090 | 2.6898 | 3.2818 | 0.0000 | 0.5201 |
| it | 6 | 0.7409 | 18.2894 | 3.0193 | 3.5827 | 0.0000 | |
| it | 8 | 0.7507 | 17.1477 | 3.1959 | 3.8328 | 0.0000 | |
| it | 10 | 0.7602 | 18.2203 | 3.5713 | 4.3174 | 0.0000 | |
| nl | 2 | 0.7514 | 18.2894 | 3.0193 | 3.5827 | 0.0000 | 0.4866 |
| nl | 4 | 0.7520 | 18.5538 | 3.1366 | 3.8632 | 0.0000 | 0.6211 |
| nl | 6 | 0.7546 | 18.8852 | 2.9528 | 3.6708 | 0.0000 | 0.6003 |
| nl | 8 | 0.7584 | 18.5149 | 3.1959 | 3.8328 | 0.0000 | 0.5269 |
| pt | 2 | 0.7518 | 17.1165 | 3.4080 | 3.9666 | 0.0000 | 0.4671 |
| pt | 4 | 0.7507 | 17.1477 | 3.3887 | 4.0860 | 0.0000 | 0.5956 |
| pt | 6 | 0.7552 | 17.5002 | 3.4613 | 4.1795 | 0.0000 | 0.5977 |
| pt | 8 | 0.7602 | 18.2203 | 3.5713 | 4.3174 | 0.0000 | 0.6283 |

does this make sense to you? Is it a behavior you have noticed as well?

@azziko

azziko commented Jun 9, 2026

Author

The quality seems quite low; I would play with the frame threshold a bit. What I found in my tests is that with chunks >2s it's better to also increase the frame threshold, to 16 or even 20. On the MCIF dev subset I got 0.90 xcomet-xl with chunk 3 and frame 16 for the en -> de direction. I will run a grid search on MuST-C and let you know what I get.
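
To make the role of the frame threshold concrete, here is a rough sketch of an attention-based emission rule. It assumes the threshold works as in AlignAtt-style policies: a token is emitted only if its most attended encoder frame lies outside the last `frame_threshold` frames, and emission stops at the first token that fails this check. The PR's actual code is not shown in this conversation, so `tokens_to_emit` and the toy attention values are hypothetical.

```python
from typing import Sequence


def tokens_to_emit(attn: Sequence[Sequence[float]], frame_threshold: int) -> int:
    """Return how many leading tokens may be emitted.

    attn[i][t] is the cross-attention weight of token i on encoder frame t.
    """
    if not attn:
        return 0
    num_frames = len(attn[0])
    for i, row in enumerate(attn):
        # Index of the most attended frame for this token.
        peak = max(range(num_frames), key=row.__getitem__)
        # Stop as soon as a token relies on the most recent frames.
        if peak >= num_frames - frame_threshold:
            return i
    return len(attn)


toy_attn = [
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],  # peak at frame 1
    [0.05, 0.10, 0.15, 0.60, 0.05, 0.05],  # peak at frame 3
    [0.00, 0.05, 0.05, 0.10, 0.20, 0.60],  # peak at frame 5
]
print(tokens_to_emit(toy_attn, frame_threshold=2))  # 2
print(tokens_to_emit(toy_attn, frame_threshold=4))  # 1
```

Under this rule a larger threshold emits fewer tokens per chunk, which would be consistent with latency growing with the frame value.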

@mgaido91

mgaido91 commented Jun 9, 2026

Contributor

Thanks, I was using 1 second as chunk size. This is the config file I used:

type: "simulstream.server.speech_processors.canary_streamatt.CanaryStreamAtt"
model_name: "nvidia/canary-1b-v2"
text_history:
  type: "simulstream.server.speech_processors.base_streamatt.FixedWordsTextHistory"
  history_words: 10
speech_chunk_size: 1.0  # seconds
detokenizer_type: "canary"
cross_attn_layer: -2
cutoff_frame_num: __FRAME__
num_beams: 5
audio_subsampling_factor: 8
audio_history_max_duration: 360  # Maximum length for the audio buffer, in seconds
mel_hop_samples: 160  # Number of audio samples between adjacent mel frames
text_history_max_len: 128
word_level_postprocess: True  # Disable if character-level language
use_raw_audio_history: True

I can also test different values if you think it makes sense. I just want to double check we do not have issues with the code. Thanks.
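
The `__FRAME__` placeholder in the config above suggests one file per frame value. A minimal sketch of that substitution follows, assuming plain string replacement; how the configs were actually generated is not stated in this thread, and `render_configs` is a hypothetical helper. The template is shortened to two keys for brevity.

```python
from typing import Dict, Iterable

# Shortened template: only the keys relevant to the sweep are kept here.
CONFIG_TEMPLATE = (
    "speech_chunk_size: 1.0  # seconds\n"
    "cutoff_frame_num: __FRAME__\n"
)


def render_configs(template: str, frame_values: Iterable[int]) -> Dict[int, str]:
    """Return one rendered config string per frame value."""
    return {
        frame: template.replace("__FRAME__", str(frame))
        for frame in frame_values
    }


configs = render_configs(CONFIG_TEMPLATE, [2, 4, 6, 8])
print(configs[4])
```

Extending the sweep to chunk sizes as well would only need a second placeholder and a nested loop over both value lists.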

@azziko

azziko commented Jun 9, 2026

Author

I see, thanks. Does it mean that "Chunk" in the table you shared actually represents the FRAME? If so, the results are more or less OK for 1 second.

@mgaido91

mgaido91 commented Jun 9, 2026

Contributor

> Does it mean that "Chunk" in the table you shared actually represents the FRAME?

yes, sorry, I made a bit of a mess with the naming.

@mgaido91

Contributor

looks like there will be no 2.8.0. They just bumped the version to 3.0 and then 3.1 without any release. In addition, the current repo is a huge refactor, so everything should be re-tested as soon as they release something, sigh.

Development

Successfully merging this pull request may close these issues.

Canary-v2 streamatt speech processor

2 participants