[Klaud Cold] [AMD] Enable AITER MoE for MiniMax-M3 MI355X FP4 vLLM MTP benchmark #1958
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4323,3 +4323,12 @@ | |
| - "Enable AITER MoE on MiniMax-M3 MXFP4 MI355X single-node vLLM STP: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter." | ||
| - "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e) for AITER MoE and shared-expert fusion support (vllm-project/vllm#46419, vllm-project/vllm#46545)." | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1954 | ||
|
|
||
| - config-keys: | ||
| - minimaxm3-fp4-mi355x-vllm-mtp | ||
| description: | ||
| - "Enable AITER MoE on the MiniMax-M3 MI355X single-node vLLM EAGLE3 MTP MXFP4 benchmark for non-EP configs: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, and pass --moe-backend aiter." | ||
| - "EP and DP-attention configs disable AITER fused MoE (VLLM_ROCM_USE_AITER_MOE=0) since AITER MoE is incompatible with expert parallelism (vLLM #46419), but keep the general AITER backend on (VLLM_ROCM_USE_AITER=1) so MXFP4 weight dequant uses AITER instead of the Quark path (mxfp4_utils._dequant_mxfp4), which is broken in this nightly (ModuleNotFoundError: torch.ao.quantization.pt2e)." | ||
| - "Drop EP and DP-attention search-space entries for 8k1k (those EP>1 points are off the Pareto curve); 1k1k keeps its EP and DP-attention coverage." | ||
| - "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e)." | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1958 | ||
|
Contributor
🔴 The new changelog entry's `pr-link` still points at a placeholder URL instead of this PR.

Extended reasoning...

What the bug is. The final line of the new entry reads:

    pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING

The author wrote this entry using a placeholder (`PENDING`) and did not swap in the real PR number.

Impact / proof of broken link. After merge, anyone following the changelog link would reach a URL that does not resolve to this PR.

Fix. Replace the placeholder with this PR's number:

    - pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING
    + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1958

This is a one-line change matching the format used by every other entry in the file. Severity is nit/normal: a broken metadata link in a tracked changelog with no runtime/benchmark impact, but it should be fixed before merge so the link resolves correctly.
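A check along the following lines could catch a leftover placeholder before merge. This is only a sketch under stated assumptions: the function name `find_placeholder_pr_links` and the demo file are hypothetical, and the repository's actual changelog path and CI setup are not shown in this review. It flags any `pr-link:` line whose URL does not end in `/pull/<digits>`.

```shell
#!/usr/bin/env bash
# Hypothetical lint (not part of the repository): list changelog pr-link
# lines whose URL does not end in a numeric pull-request id.

find_placeholder_pr_links() {
    # $1: path to a changelog file (the caller supplies the real path).
    # Prints "<line-number>:<line>" for every offending pr-link line.
    # Exit status is 0 if at least one offender was printed, 1 otherwise.
    grep -nE 'pr-link:' "$1" | grep -vE '/pull/[0-9]+[[:space:]]*$'
}

# Demo on a throwaway file that mimics two changelog entries.
demo_file="$(mktemp)"
printf '%s\n' \
    '  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1954' \
    '  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING' \
    > "$demo_file"

if find_placeholder_pr_links "$demo_file"; then
    echo "placeholder pr-link found"
fi
rm -f "$demo_file"
```

Because the function's exit status is 0 only when an offending line is printed, a CI step would need to invert it to fail the build on a placeholder.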
||
🟣 Pre-existing follow-up (not blocking this PR): the STP sibling `benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh` (introduced by #1954, not touched by this PR) unconditionally exports `VLLM_ROCM_USE_AITER=1` / `VLLM_ROCM_USE_AITER_MOE=1` / `VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1` and passes `--moe-backend aiter`, with no DP-attention / `EP_SIZE` gate. Its search space in `.github/configs/amd-master.yaml` (`minimaxm3-fp4-mi355x-vllm`) sweeps several EP / DP-attn points (`tp:8 ep:8`, `tp:4 ep:4`, `tp:2 ep:2`, `tp:8 ep:8 dp-attn:true`), which will hit the same AITER-MoE-incompatible-with-EP issue that the new MTP gate (lines 42-54) was written to avoid. Please backport the same if/else block to the STP recipe in a follow-up.

Extended reasoning...
What the bug is
This PR correctly adds an EP/DP-attention gate around AITER MoE in `minimaxm3_fp4_mi355x_vllm_mtp.sh` (new lines 42-54), so that `VLLM_ROCM_USE_AITER` is set to `0` (and `--moe-backend aiter` is dropped from the serve command) whenever `DP_ATTENTION=true` or `EP_SIZE>1`. The PR description and the file header explicitly justify this: "MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh ... except AITER MoE is gated off when expert parallelism is enabled", which is exactly what @hongxiayang's review on #1955 (`discussion_r3495386866`: "will need to set VLLM_ROCM_USE_AITER=0 when enable ep") asked for.

However, the STP sibling `benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh`, last touched by #1954 and not in this PR's diff, still unconditionally exports the three AITER vars (its lines 35-37) and always passes `--moe-backend aiter` (its line 65), with no DP/EP guard. Its existing `PARALLEL_ARGS` block (lines 44-52) still handles `DP_ATTENTION=true` and `EP_SIZE>1`, proving the recipe IS invoked under EP, so the AITER vars are live during those runs.

The code path that triggers it
The STP search space lives in `.github/configs/amd-master.yaml` under `minimaxm3-fp4-mi355x-vllm` (lines visible in the preloaded modified-files dump) and includes the EP / DP-attention rows `tp:8 ep:8`, `tp:4 ep:4`, `tp:2 ep:2`, and `tp:8 ep:8 dp-attn:true`.

Every one of those rows runs `minimaxm3_fp4_mi355x_vllm.sh` with `EP_SIZE>1` or `DP_ATTENTION=true`, which is exactly the configuration the new MTP gate was added to avoid.

Step-by-step proof
1. The sweep picks the row `{ tp: 8, ep: 8, conc-start: 1, conc-end: 512 }` and invokes `minimaxm3_fp4_mi355x_vllm.sh` with `TP=8` and `EP_SIZE=8`.
2. The script unconditionally exports `VLLM_ROCM_USE_AITER=1`, `VLLM_ROCM_USE_AITER_MOE=1`, and `VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1` (its lines 35-37). There is no gating `if` like the one this PR adds at lines 42-54 of the MTP recipe.
3. The `PARALLEL_ARGS` block (lines 44-52) reaches the `elif [ "$EP_SIZE" -gt 1 ]` branch and adds `--enable-expert-parallel` to the vLLM serve args.
4. The script passes `--moe-backend aiter` to `vllm serve` (also unconditionally).

The same path is hit by the `tp:4 ep:4`, `tp:2 ep:2`, and `tp:8 ep:8 dp-attn:true` rows.

Why existing code doesn't prevent it
There simply is no guard on the STP side: the AITER vars and `--moe-backend aiter` are set at the top of the script, before `PARALLEL_ARGS` is computed, and never reset based on `DP_ATTENTION` / `EP_SIZE`. The MTP recipe in this PR introduces such a guard; the STP recipe was last touched in #1954, which added the unconditional AITER exports without any EP-incompatibility consideration.

Impact
All four EP/DP-attn STP sweep points (`tp:8 ep:8` across both ISL/OSL pairs, `tp:4 ep:4`, `tp:2 ep:2`, and `tp:8 ep:8 dp-attn:true`) will hit the AITER+EP incompatibility: the MoE serving will either crash on startup or silently misbehave, leaving the EP portions of the STP sweep with no usable data. The non-EP STP rows (`tp:8`, `tp:4`) are unaffected because AITER MoE is the intended path there.

How to fix it
Backport the same if/else block from the new MTP recipe (lines 42-54 of `minimaxm3_fp4_mi355x_vllm_mtp.sh`) into the STP recipe, replacing the unconditional exports at lines 35-37 and gating `--moe-backend aiter` behind an array (`MOE_ARGS`) interpolated into the vLLM serve command.

Why this is pre-existing severity (not blocking)
The broken pattern was introduced by #1954 and lives in a file this PR does not modify, call, or extend. This PR scopes itself to the MTP file (description: "Splits the FP4 MTP half out of #1955") and explicitly acknowledges the divergence from STP in the new file header comment. The fix belongs in a separate follow-up PR so this MTP gate can land cleanly on its own.
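The gate described under "How to fix it" could look roughly like the sketch below. It is not the actual block from lines 42-54 of the MTP recipe, which is not reproduced in this review. The EP branch follows the changelog entry (`VLLM_ROCM_USE_AITER_MOE=0` while `VLLM_ROCM_USE_AITER=1` stays on). The function name `configure_aiter_moe`, the defaults for `DP_ATTENTION` and `EP_SIZE`, and switching off `VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS` under EP are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an EP / DP-attention gate for AITER MoE.
# DP_ATTENTION, EP_SIZE and MOE_ARGS are names used in the review; the
# function name and the default values are illustrative assumptions.

configure_aiter_moe() {
    local dp_attention="${DP_ATTENTION:-false}"
    local ep_size="${EP_SIZE:-1}"

    # Per the changelog entry, the general AITER backend stays on in both
    # branches so MXFP4 weight dequant uses AITER.
    export VLLM_ROCM_USE_AITER=1

    if [ "$dp_attention" = "true" ] || [ "$ep_size" -gt 1 ]; then
        # Expert parallelism or DP attention: turn AITER fused MoE off.
        export VLLM_ROCM_USE_AITER_MOE=0
        # Assumption: shared-expert fusion is disabled together with AITER MoE.
        export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
        MOE_ARGS=()
    else
        # Non-EP configs keep the AITER MoE path.
        export VLLM_ROCM_USE_AITER_MOE=1
        export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
        MOE_ARGS=(--moe-backend aiter)
    fi
}

configure_aiter_moe
echo "VLLM_ROCM_USE_AITER_MOE=${VLLM_ROCM_USE_AITER_MOE} MOE_ARGS=${MOE_ARGS[*]}"

# The recipe would then expand the array into the serve command, e.g.
#   vllm serve "$MODEL" "${MOE_ARGS[@]}"
# where MODEL is a placeholder for whatever the recipe already passes.
```

Keeping the flag in an array means an empty `MOE_ARGS` expands to nothing, so the serve command needs no separate branch for the EP case.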