Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 8 additions & 5 deletions .github/configs/amd-master.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2669,10 +2669,15 @@ minimaxm3-fp4-mi355x-vllm:

# EAGLE3 speculative-decoding variant of minimaxm3-fp4-mi355x-vllm. Pair the
# amd/MiniMax-M3-MXFP4 target with Inferact/MiniMax-M3-EAGLE3 and three draft
# tokens. Search space mirrors the MI355X MXFP8 MTP entry, trimming the base
# FP4 sweep at extreme concurrency where speculative decoding loses value.
# tokens.
#
# EP / dp-attn configs disable AITER fused MoE (incompatible with expert
# parallelism) but keep the general AITER backend on so MXFP4 weight dequant
# uses AITER instead of the Quark path (quark.torch.kernel.mx), which is broken
# in the current nightly (torch.ao.quantization.pt2e removed). See the EP branch
# in minimaxm3_fp4_mi355x_vllm_mtp.sh and run 28422097175 (PR #1958).
minimaxm3-fp4-mi355x-vllm-mtp:
image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
image: vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1
model: amd/MiniMax-M3-MXFP4
model-prefix: minimaxm3
runner: mi355x
Expand All @@ -2693,9 +2698,7 @@ minimaxm3-fp4-mi355x-vllm-mtp:
osl: 1024
search-space:
- { tp: 8, conc-start: 1, conc-end: 64, spec-decoding: mtp }
- { tp: 8, ep: 8, conc-start: 1, conc-end: 256, spec-decoding: mtp }
- { tp: 4, conc-start: 1, conc-end: 64, spec-decoding: mtp }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 256, spec-decoding: mtp }

# MiniMax-M3 MXFP4 MI355X atom recipe:
# https://github.com/ROCm/ATOM/blob/5d42d49f9e4292e5b61475917e92e7ec1b1dacb7/recipes/MiniMax-M3.md
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@
# minimaxm3_fp4_mi355x_vllm.sh and uses three speculative tokens from
# Inferact/MiniMax-M3-EAGLE3. The pinned nightly includes upstream AMD
# MiniMax-M3 SupportsEagle3 support, so no runtime model patch is needed.
# MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419),
# except AITER MoE is gated off when expert parallelism is enabled (see below).

source "$(dirname "$0")/../../benchmark_lib.sh"

Expand Down Expand Up @@ -37,6 +39,26 @@ SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0

# AITER MoE accelerates the dense (non-EP) MoE path but is incompatible with
# expert parallelism, so disable AITER *fused MoE* when EP is enabled (DP
# attention or EP > 1). We still keep the general AITER backend enabled in that
# case: it routes the MXFP4 weight dequant through AITER instead of the Quark
# path (mxfp4_utils._dequant_mxfp4 -> `from quark.torch.kernel import mx`),
# which is broken in the current nightly (ModuleNotFoundError:
# torch.ao.quantization.pt2e). Fully disabling AITER here would fall back to
# that broken Quark dequant and crash engine-core startup on every EP config.
# https://github.com/SemiAnalysisAI/InferenceX/pull/1955#discussion_r3495386866
MOE_ARGS=()
if [ "${DP_ATTENTION}" = "true" ] || [ "$EP_SIZE" -gt 1 ]; then
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=0
else
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
MOE_ARGS=(--moe-backend aiter)
fi

Comment on lines +42 to +61

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 Pre-existing follow-up (not blocking this PR): the STP sibling benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh (introduced by #1954, not touched by this PR) unconditionally exports VLLM_ROCM_USE_AITER=1 / VLLM_ROCM_USE_AITER_MOE=1 / VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1 and passes --moe-backend aiter, with no DP-attention / EP_SIZE gate. Its search space in .github/configs/amd-master.yaml (minimaxm3-fp4-mi355x-vllm) sweeps several EP / DP-attn points (tp:8 ep:8, tp:4 ep:4, tp:2 ep:2, tp:8 ep:8 dp-attn:true), which will hit the very same AITER-MoE-incompatible-with-EP issue the new MTP gate (lines 42-54) was written to avoid — please backport the same if/else block to the STP recipe in a follow-up.

Extended reasoning...

What the bug is

This PR correctly adds an EP/DP-attention gate around AITER MoE in minimaxm3_fp4_mi355x_vllm_mtp.sh (new lines 42-54), so that VLLM_ROCM_USE_AITER is set to 0 (and --moe-backend aiter is dropped from the serve command) whenever DP_ATTENTION=true or EP_SIZE>1. The PR description and the file header explicitly justify this: "MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh ... except AITER MoE is gated off when expert parallelism is enabled", which is exactly what @hongxiayang's review on #1955 (discussion_r3495386866"will need to set VLLM_ROCM_USE_AITER=0 when enable ep") asked for.

However, the STP sibling benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh — last touched by #1954, and not in this PR's diff — still unconditionally exports the three AITER vars (its lines 35-37) and always passes --moe-backend aiter (its line 65), with no DP/EP guard. Its existing PARALLEL_ARGS block (lines 44-52) still handles DP_ATTENTION=true and EP_SIZE>1, proving the recipe IS invoked under EP — the AITER vars are live during those runs.

The code path that triggers it

The STP search space lives in .github/configs/amd-master.yaml under minimaxm3-fp4-mi355x-vllm (lines visible in the preloaded modified-files dump):

- { tp: 8, ep: 8, conc-start: 1, conc-end: 512 }
- { tp: 4, ep: 4, conc-start: 64, conc-end: 512 }
- { tp: 2, ep: 2, conc-start: 16, conc-end: 128 }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 1024 }

Every one of those rows runs minimaxm3_fp4_mi355x_vllm.sh with EP_SIZE>1 or DP_ATTENTION=true, which is exactly the configuration the new MTP gate was added to avoid.

Step-by-step proof

  1. The sweep dispatcher picks the row { tp: 8, ep: 8, conc-start: 1, conc-end: 512 } and invokes minimaxm3_fp4_mi355x_vllm.sh with TP=8 and EP_SIZE=8.
  2. Lines 35-37 of that script unconditionally export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1. There is no gating if like the one this PR adds at lines 42-54 of the MTP recipe.
  3. The script's PARALLEL_ARGS block (lines 44-52) reaches the elif [ "$EP_SIZE" -gt 1 ] branch and adds --enable-expert-parallel to the vLLM serve args.
  4. Line 65 then passes --moe-backend aiter to vllm serve (also unconditionally).
  5. vLLM starts with AITER MoE on and expert parallelism on — the exact incompatibility flagged by @hongxiayang and addressed by this PR for the MTP recipe ([ROCm]Enable AITER MoE backend for MiniMax-M3-MXFP4 vllm-project/vllm#46419). The run will crash, or silently produce incorrect MoE outputs, depending on the ROCm/vLLM stack.

The same path is hit by the tp:4 ep:4, tp:2 ep:2, and tp:8 ep:8 dp-attn:true rows.

Why existing code doesn't prevent it

There simply is no guard on the STP side — the AITER vars and --moe-backend aiter are set at the top of the script, before PARALLEL_ARGS is computed, and never reset based on DP_ATTENTION / EP_SIZE. The MTP recipe in this PR introduces such a guard; the STP recipe was last touched in #1954, which added the unconditional AITER exports without any EP-incompatibility consideration.

Impact

All four EP/DP-attn STP sweep points (tp:8 ep:8 across both ISL/OSL pairs, tp:4 ep:4, tp:2 ep:2, and tp:8 ep:8 dp-attn:true) will hit the AITER+EP incompatibility — the MoE serving will either crash on startup or silently misbehave, leaving the EP portions of the STP sweep with no usable data. The non-EP STP rows (tp:8, tp:4) are unaffected because AITER MoE is the intended path there.

How to fix it

Backport the same if/else block from the new MTP recipe (lines 42-54 of minimaxm3_fp4_mi355x_vllm_mtp.sh) into the STP recipe, replacing the unconditional exports at lines 35-37 and gating --moe-backend aiter behind an array (MOE_ARGS) interpolated into the vLLM serve command:

MOE_ARGS=()
if [ "${DP_ATTENTION}" = "true" ] || [ "$EP_SIZE" -gt 1 ]; then
    export VLLM_ROCM_USE_AITER=0
else
    export VLLM_ROCM_USE_AITER=1
    export VLLM_ROCM_USE_AITER_MOE=1
    export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
    MOE_ARGS=(--moe-backend aiter)
fi

Why this is pre-existing severity (not blocking)

The broken pattern was introduced by #1954 and lives in a file this PR does not modify, call, or extend. This PR scopes itself to the MTP file (description: "Splits the FP4 MTP half out of #1955") and explicitly acknowledges the divergence from STP in the new file header comment. The fix belongs in a separate follow-up PR so this MTP gate can land cleanly on its own.

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
fi
Expand Down Expand Up @@ -65,6 +87,7 @@ vllm serve "$MODEL" --port "$PORT" \
--language-model-only \
--max-model-len "$MAX_MODEL_LEN" \
--attention-backend TRITON_ATTN \
"${MOE_ARGS[@]}" \
--speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
--tool-call-parser minimax_m3 \
--enable-auto-tool-choice \
Expand Down
9 changes: 9 additions & 0 deletions perf-changelog.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4323,3 +4323,12 @@
- "Enable AITER MoE on MiniMax-M3 MXFP4 MI355X single-node vLLM STP: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter."
- "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e) for AITER MoE and shared-expert fusion support (vllm-project/vllm#46419, vllm-project/vllm#46545)."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1954

- config-keys:
- minimaxm3-fp4-mi355x-vllm-mtp
description:
- "Enable AITER MoE on the MiniMax-M3 MI355X single-node vLLM EAGLE3 MTP MXFP4 benchmark for non-EP configs: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, and pass --moe-backend aiter."
- "EP and DP-attention configs disable AITER fused MoE (VLLM_ROCM_USE_AITER_MOE=0) since AITER MoE is incompatible with expert parallelism (vLLM #46419), but keep the general AITER backend on (VLLM_ROCM_USE_AITER=1) so MXFP4 weight dequant uses AITER instead of the Quark path (mxfp4_utils._dequant_mxfp4), which is broken in this nightly (ModuleNotFoundError: torch.ao.quantization.pt2e)."
- "Drop EP and DP-attention search-space entries for 8k1k (those EP>1 points are off the Pareto curve); 1k1k keeps its EP and DP-attention coverage."
- "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e)."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1958

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new changelog entry's pr-link is https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING instead of /pull/1958. After merge this leaves a 404 link in the changelog (every other entry in the file uses a real PR number, e.g. 1954/1952 immediately above). The GitHub PR-diff renderer may substitute the actual PR number on display, so this is best verified by reading the file on disk — tail -1 perf-changelog.yaml shows the literal string PENDING. Fix is a one-line replace of PENDING with 1958 before merge.

Extended reasoning...

What the bug is. The final line of perf-changelog.yaml (line 4333) is:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING

The author wrote this entry using a placeholder (PENDING) intended to be swapped for the real PR number before merge, but the swap never happened. Note that GitHub's PR-diff renderer (and the harness's diff_context) often substitute the actual PR number for display in the rendered diff, which can hide this on review — direct disk/git inspection is authoritative.

Verification. Running tail -1 perf-changelog.yaml against the working tree shows pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING. The same string appears in git show HEAD:perf-changelog.yaml (i.e. the committed bytes for f59dca0 on this branch). Every adjacent entry uses a real PR number — the immediately preceding entry uses 1954 on line 4325, and the prior entries use 1952 (4318), 1942 (4312), and 1941 (4305). So the convention in the file is to write the real PR number, and this row visibly diverges.

Impact / proof of broken link. After merge, anyone following the changelog link
https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING will hit a GitHub 404 — there is no PR numbered "PENDING". Concretely, walking through it:

  1. A user (or downstream tool that ingests perf-changelog.yaml) sees the new entry for minimaxm3-fp4-mi355x-vllm-mtp.
  2. They click pr-link: .../pull/PENDING.
  3. GitHub parses PENDING as the PR number, fails to find a matching numeric ID, returns 404.
  4. The user has no way to navigate to the originating PR [Klaud Cold] [AMD] Enable AITER MoE for MiniMax-M3 MI355X FP4 vLLM MTP benchmark #1958 from the changelog.

Why existing tooling doesn't catch it. perf-changelog.yaml is a plain YAML data file — there is no schema validator in this PR's CI that asserts pr-link matches a numeric PR id, so the lint/build is happy with the literal string PENDING. The reviewer eye can also slide past it because GitHub's PR-diff renderer substitutes the actual PR number into the displayed diff when reviewing this PR. (That substitution is presentational only; the committed bytes are unchanged.)

Fix. Replace PENDING with 1958 on line 4333:

-  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1958

This is a one-line change matching the format used by every other entry in the file. Severity is nit/normal — broken metadata link in a tracked changelog, no runtime/benchmark impact, but it should be fixed before merge so the link resolves correctly.

Loading