Skip to content

fix: resolve NPU OOM with default training config#620

Open
curnane-lab wants to merge 5 commits into
sgl-project:mainfrom
curnane-lab:domino_npu
Open

fix: resolve NPU OOM with default training config#620
curnane-lab wants to merge 5 commits into
sgl-project:mainfrom
curnane-lab:domino_npu

Conversation

@curnane-lab

@curnane-lab curnane-lab commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Motivation

The default NPU training examples for Qwen3.5-4B DFlash use --num-anchors values (512) that cause out-of-memory errors on common 64GB Ascend NPU cards such as 910B(A2 node) and 910C(A3 node). This PR lowers the default to a value that fits within the available device memory while keeping the examples runnable out-of-the-box.

Modifications

  • examples/run_qwen3.5_4b_dflash_online_npu.sh
    • Changed --num-anchors from 512 to 186
  • examples/run_qwen3.5_4b_domino_online_npu.sh
    • Changed --num-anchors from 16 to 186

Both scripts now use the same --num-anchors 186 default, which avoids OOM on 64GB NPU devices.

Related Issues

N/A

Accuracy Test

Not applicable — this change only adjusts a training hyper-parameter default in example launch scripts. No model architecture or kernel code is modified.

Benchmark & Profiling

Not applicable — the change reduces memory usage for the default NPU example configuration.

Checklist

mingliangfu and others added 5 commits June 26, 2026 15:56
Read mask_token_id from draft_config.dflash_config before falling back

to tokenizer.mask_token_id or adding a new special token. Apply the same

fallback in both train_dflash.py and train_domino.py for consistency.

Closes sgl-project#500

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the --num-anchors parameter to 186 in both the run_qwen3.5_4b_dflash_online_npu.sh and run_qwen3.5_4b_domino_online_npu.sh example scripts. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants