
[AMD] Enable AITER MoE for MiniMax-M3 MI355X vLLM MTP benchmarks#1955

Draft
Fangzhou-Ai wants to merge 4 commits into main from amd/minimax-m3-mtp-aiter-moe

@Fangzhou-Ai

Collaborator

Summary

  • Enable AITER MoE on the MiniMax-M3 MI355X EAGLE3 MTP launchers (minimaxm3_fp4_mi355x_vllm_mtp.sh, minimaxm3_fp8_mi355x_mtp.sh), mirroring the STP knobs from [AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM STP #1954.
  • Export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter.
  • Export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT6 on the MXFP8 MTP path.
  • Append perf-changelog.yaml triggers for minimaxm3-fp4-mi355x-vllm-mtp and minimaxm3-fp8-mi355x-vllm-mtp.
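A minimal sketch of the launcher changes listed above. `PRECISION` and `MODEL` are hypothetical placeholders: the real launchers are separate files per precision and carry many more serve flags, which are omitted here. The serve command is printed rather than executed, so the sketch runs without vLLM installed.

```shell
#!/usr/bin/env bash
# Sketch only: PRECISION and MODEL are hypothetical placeholders, not
# variables taken from the real MTP launcher scripts.
PRECISION="${PRECISION:-fp8}"
MODEL="${MODEL:-/path/to/model}"

# AITER MoE knobs shared by both MTP launchers (mirroring the STP knobs from #1954).
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1

# INT6 quick-reduce is exported only on the MXFP8 MTP path.
if [ "$PRECISION" = "fp8" ]; then
  export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT6
fi

# Build the serve command as an array so each argument stays a separate word.
SERVE_ARGS=(vllm serve "$MODEL" --moe-backend aiter)

# Print the shell-quoted command instead of running it.
printf '%q ' "${SERVE_ARGS[@]}"
echo
```

Keeping the command in an array makes it easy to append per-precision flags later without re-quoting.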

Why

Manual validation showed EAGLE3 MTP serving works with the same AITER MoE stack as STP. This lands the benchmark coverage Andy asked to defer from #1954.

Test plan

  • bash -n on both MTP benchmark scripts
  • python utils/matrix_logic/generate_sweep_configs.py test-config --config-keys minimaxm3-fp4-mi355x-vllm-mtp minimaxm3-fp8-mi355x-vllm-mtp --config-files .github/configs/amd-master.yaml --no-evals
  • Apply full-sweep-enabled (or full-sweep-fail-fast) after [AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM STP #1954 lands, or dispatch e2e against this branch
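The first test-plan item can be wrapped in a small helper that runs `bash -n` (parse only, no execution) over each script and reports every failure instead of stopping at the first one. The function name `check_syntax` is hypothetical; the script paths in the usage line come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: parse each script with `bash -n` (nothing is executed)
# and return non-zero if any script is missing or fails to parse.
check_syntax() {
  local failed=0 script
  for script in "$@"; do
    if [ ! -f "$script" ]; then
      echo "missing: $script" >&2
      failed=1
    elif ! bash -n "$script"; then
      echo "syntax error: $script" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Usage: `check_syntax benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x_mtp.sh`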

Made with Cursor

Mirror the STP AITER MoE and shared-expert fusion knobs on the MXFP4/MXFP8
EAGLE3 launchers, including INT6 quick-reduce on FP8 MTP.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

Contributor

Thanks for the contribution! For vLLM and SGLang, please make sure your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please first open a PR to the official recipes or cookbook; only then can we merge your single-node PR into the master branch. Let's make sure the documentation is first-class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Actions jobs pass fully after merging. Failures are often just flakes, and simply re-running the failed jobs fixes them. If you re-run failed jobs, you remain responsible for making sure they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.



Fangzhou-Ai marked this pull request as draft on June 29, 2026 21:25

claude (bot) left a comment

Contributor


LGTM — straightforward AITER MoE enablement for the MTP benchmark scripts, mirroring the STP knobs.

Extended reasoning...

Overview

This PR enables AITER MoE on the MiniMax-M3 MI355X EAGLE3 MTP benchmark launchers (MXFP4 and MXFP8 variants) by exporting three VLLM_ROCM_USE_AITER* env vars and passing --moe-backend aiter to vllm serve. The MXFP8 path additionally sets VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT6. A corresponding perf-changelog.yaml entry is appended. Total surface: two bash benchmark recipes and one YAML changelog entry.

Security risks

None. This is benchmark configuration touching env vars and a CLI flag for a vLLM serve invocation — no auth, crypto, permissions, or data handling logic.

Level of scrutiny

Low. These are sandbox benchmark scripts (not production-critical code paths), the changes are mechanical env-var exports plus a single new flag, and the PR description confirms manual validation showed EAGLE3 MTP works with the same AITER MoE stack as STP. Both files pass bash -n per the test plan.

Other factors

The single inline finding is a nit about a forward-reference comment pointing at the STP recipe in minimaxm3_fp4_mi355x_vllm.sh which, on current main, does not yet contain the AITER knobs (those land in #1954). It's a doc-only concern with no runtime impact and the PR description openly acknowledges the #1954 dependency. The change otherwise follows the established pattern visible in the existing perf-changelog entries and recipe scripts.

# minimaxm3_fp4_mi355x_vllm.sh and uses three speculative tokens from
# Inferact/MiniMax-M3-EAGLE3. The pinned nightly includes upstream AMD
# MiniMax-M3 SupportsEagle3 support, so no runtime model patch is needed.
# MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419).

Contributor


🟡 Nit: line 8 says "MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419)." but on current main that STP recipe sets no AITER env vars and no --moe-backend — its own header comment says it "lets vLLM select the MoE backend." Those STP knobs come from #1954, which hasn't landed. If #1955 merges first, this cross-reference is wrong. Consider pointing at #1954 directly, or just describing the AITER setup (with the vllm#46419 credit) without claiming the STP file already does the same.

Extended reasoning...

What the comment claims vs. what is true on main

The new line 8 in benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh reads:

# MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419).

But benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh on current main does not yet set VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_MOE, or VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS, nor does it pass --moe-backend aiter. Its line 6 explicitly says the opposite: "…lets vLLM select the MoE backend." A git grep -e '--moe-backend' confirms that the only occurrences of --moe-backend aiter in the tree are in the two MTP files introduced by this PR.

Step-by-step proof

  1. cat benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh on the current branch's view of main shows lines 5–6 saying the recipe "uses the text-only language-model path and lets vLLM select the MoE backend."
  2. grep -E 'VLLM_ROCM_USE_AITER|moe-backend' benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh returns nothing.
  3. git log --all --oneline | grep -i 1954 returns nothing — [AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM STP #1954 has not landed.
  4. The PR description's own test plan says: "Apply full-sweep-enabled (or full-sweep-fail-fast) after [AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM STP #1954 lands." That acknowledges the ordering dependency exists.
  5. Therefore, if this PR merges without [AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM STP #1954, a reader who opens minimaxm3_fp4_mi355x_vllm_mtp.sh and follows the line-8 pointer to minimaxm3_fp4_mi355x_vllm.sh will find a file whose header comment says vLLM chooses the MoE backend, directly contradicting the "mirrors … AITER MoE" claim.
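Step 2 can be turned into a reusable check. A minimal sketch, assuming a hypothetical helper name `has_aiter_knobs`: it greps a recipe for the same pattern used in step 2 and succeeds only if the file sets an AITER env var or mentions `--moe-backend`.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring step 2: exit 0 only if the recipe sets a
# VLLM_ROCM_USE_AITER* variable or passes a --moe-backend flag.
has_aiter_knobs() {
  grep -qE 'VLLM_ROCM_USE_AITER|moe-backend' "$1"
}
```

According to step 2, running `has_aiter_knobs benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh` against that view of main would exit non-zero, which is exactly the state that makes the "mirrors" comment stale.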

Addressing the refutation

The refutation argues this is (a) a known dependency, (b) flagged in the test plan, (c) a nit, and (d) forward-looking design language is normal for paired PRs. Points (a)–(c) are accurate and are exactly why this is filed at nit severity, not blocking. The remaining concern is narrow: the comment's surface reading is a present-tense factual claim ("mirrors"), and merge order in this repo is not actually pinned — #1955 can land before, after, or instead of #1954. If #1954 is rebased, re-scoped, or abandoned, this comment ships indefinitely as a dangling cross-reference. The refutation's framing of "a comment that's true after the next-PR-in-the-stack merges" assumes a merge order that the PR description hopes for but does not enforce.

Impact

Documentation-only; no runtime effect. The AITER env vars and --moe-backend aiter flag in this file are self-contained and correct on their own. The risk is purely reader confusion if/when they trace the cross-reference and find the STP recipe in a state that contradicts the comment.

How to fix

One of three trivial options:

This is in-scope to flag because the comment is newly added by this PR, the fix is one-line, and option 1 makes the comment robust to any merge order without coupling the two PRs.

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x_mtp.sh
Fangzhou-Ai and others added 2 commits June 29, 2026 21:30
Use the emulation linear backend for MXFP8 EAGLE3 serving on MI355X.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pin nightly-4559c43a for AITER MoE, shared-expert fusion, and FP8
linear-backend emulation support on all four single-node configs.

Co-authored-by: Cursor <cursoragent@cursor.com>
# Run with CUDA graphs (no --enforce-eager): VLLM_USE_BREAKABLE_CUDAGRAPH=0
# avoids the M3-decode breakable-cudagraph path that previously forced eager.
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
export VLLM_ROCM_USE_AITER=1

Collaborator


You will need to set VLLM_ROCM_USE_AITER=0 when enabling EP (expert parallelism).
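A minimal sketch of how a launcher could honor this reminder. It assumes a hypothetical `EP_SIZE` knob where a value greater than 1 means expert parallelism is on, and uses vLLM's `--enable-expert-parallel` flag. The thread does not say how the other AITER knobs or `--moe-backend aiter` should change under EP, so the sketch only toggles `VLLM_ROCM_USE_AITER`.

```shell
#!/usr/bin/env bash
# Sketch only: EP_SIZE and configure_ep are hypothetical names.
# Sets VLLM_ROCM_USE_AITER and collects extra serve args in EXTRA_ARGS.
configure_ep() {
  local ep_size="${1:-1}"
  EXTRA_ARGS=()
  if [ "$ep_size" -gt 1 ]; then
    # Reviewer reminder: AITER must be disabled when EP is enabled.
    export VLLM_ROCM_USE_AITER=0
    EXTRA_ARGS+=(--enable-expert-parallel)
  else
    export VLLM_ROCM_USE_AITER=1
  fi
}

configure_ep "${EP_SIZE:-1}"
```

Putting the toggle in one function keeps the AITER setting and the EP flag from drifting apart when only one of them is edited.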

@functionstackx

Collaborator

Same reminder as in #1954 (comment).

@functionstackx

Collaborator

Splitting this into two smaller PRs to make them easier to review and merge independently — one for FP4 MTP and one for FP8 MTP:

(The original was conflicting because main already moved the FP4 STP config to the new nightly via #1954.)
