Add MTP evaluation by hjjq · Pull Request #1953 · SemiAnalysisAI/InferenceX

hjjq · 2026-06-29T19:54:45Z

M1 of #1651
Currently tested against DSV4 vLLM & SGL. Still needs some iterations before merging.

Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on self-hosted B300 runners, optionally opening a PR that updates benchmarks/speedbench-reference-al.yaml. - benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh: per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local pre-staged weights resolved by the launcher), falling back to MODEL for a standalone local run. Carries a temporary --chat-template-kwargs shim until vllm-project/vllm#44244 lands in the benchmark image (idempotent, applied only for thinking-on cells). - runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and SALLOC_TIME_LIMIT hooks; both default to the prior behavior. - .github/workflows/speedbench-al.yml: workflow_dispatch entry point; MODEL is the HF id so the launcher resolves the staged MODEL_PATH.

Make the workflow default to Option 1 (upload the AL matrix as an artifact for manual review/paste) rather than auto-opening a PR. The auto-PR path stays available as an opt-in (open-pr: true), but keeping it off by default avoids exposing a write-scoped PAT on the self-hosted runner and matches the repo's artifact-collection convention.

Address review: - Model is now a workflow input (model + model-prefix, default deepseek-ai/DeepSeek-V4-Pro / dsv4). MODEL, MODEL_PREFIX, EXP_NAME, BENCH_SCRIPT_OVERRIDE, artifact names and the Create-PR branch/title/body are all derived from those inputs. The emitted YAML top-level key is now derived from the model (MODEL_KEY, defaults to the model basename lowercased). - Move the collector to benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh and fix its benchmark_lib.sh source path (../ -> ../../) for the deeper dir.

…vals

… into codex/m1-speedbench-evals

…vals

…-on cells run

…dbench-evals # Conflicts: # .github/workflows/speedbench-al.yml # benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt_mtp.sh # benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt_mtp.sh # benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh # runners/launch_b300-nv.sh # runners/launch_gb200-nv.sh # runners/launch_gb300-cw.sh # utils/collect_eval_results.py # utils/evals/EVALS.md

github-actions · 2026-06-29T19:54:55Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

感谢你的贡献！对于 vLLM 与 SGLang，请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致，请先创建一个 PR，之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准，使整个 ML 社区都能从你的辛勤工作中受益！谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。如果选择重新运行失败的任务，PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档：https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言，PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准，然后再请求核心维护者审阅。

如需更多帮助，PR 作者可通过 Slack 联系核心维护者。

claude · 2026-06-29T20:14:26Z

        conc_label = f"[conc={match.group(1)}] " if match else ""
        with open(f) as fh:
            data = json.load(fh)
+
+        speedbench_ok, speedbench_checked = validate_speedbench_al(data, f)
+        if speedbench_checked:
+            checked += speedbench_checked
+            if not speedbench_ok:
+                failed = True
+            continue
+
        for task, metrics in data.get("results", {}).items():
            min_score, source = resolve_threshold(config, prefix, task, args.min_score)
            for name, val in metrics.items():


🔴 validate_scores.py runs validate_batch_manifest on all globbed results*.json before the per-file SpeedBench skip in the main loop, so SpeedBench AL result files (results_speedbench_al_{mode}_mtp{mtp}.json) trip the manifest check fatally. With --expected-concs (multi-node eval-only), the single-conc path hard-fails with 'eval must produce exactly one result file' because the lm-eval + SpeedBench JSONs make len==2; in batched mode (eval_concs in meta_env.json), every SpeedBench file lacks a _concN suffix and triggers 'batched eval result lacks a concurrency suffix'. Partition SpeedBench files out of result_files before calling validate_batch_manifest (or have validate_batch_manifest skip files containing 'speedbench_al_eval_version').

Extended reasoning...

What the bug is

The PR adds SpeedBench AL eval result files alongside lm-eval results. run_speedbench_al_eval (in benchmark_lib.sh) writes results_speedbench_al_${mode}_mtp${mtp}.json into EVAL_RESULT_DIR (or PWD in the batched branch at benchmark_lib.sh:1713-1714), and write_dynamo_speedbench_al_from_logs.sh writes the same file shape into $GITHUB_WORKSPACE for multi-node Dynamo MTP runs.

validate_scores.py uses --results-glob 'results*.json' by default, which matches both the lm-eval JSONs and the new SpeedBench files. The PR teaches the per-file loop to skip SpeedBench files via the new validate_speedbench_al helper (utils/evals/validate_scores.py:347-360), but validate_batch_manifest is invoked at line 325 — before that loop — and it has no awareness of SpeedBench files.

Code path that triggers the failure

Two fatal paths in validate_batch_manifest:

Non-batched multi-node MTP (--expected-concs passed, eval_concs absent from meta_env.json — exactly the configuration emitted by benchmark-multinode-tmpl.yml:321 and write_dynamo_speedbench_al_from_logs.sh, which does not write meta_env.json):

Line 117-118: if len(result_files) != 1: errors.append('eval must produce exactly one result file').

result_files contains the lm-eval JSON (copied from $LOGS_DIR/eval_results) and the SpeedBench JSON (written by write_dynamo_speedbench_al_from_logs.sh), so len == 2 and the run hard-fails.

Batched single-node lm-eval + MTP (eval_concs in meta_env.json):

Lines 168-176 iterate every result file and require each to match CONC_SUFFIX_RE = r'_conc(\\d+)(?:_\\d+)?\.json$'. SpeedBench files are named results_speedbench_al_${mode}_mtp${mtp}.json — run_speedbench_al_eval is called once per engine, not per concurrency — so they never match. Each SpeedBench file triggers 'batched eval result lacks a concurrency suffix'.

Why existing code doesn't prevent it

The fix for SpeedBench files lives in the per-file loop (validate_speedbench_al at line 350, with continue at line 356), but validate_batch_manifest runs at line 325 — before the loop. By the time the loop's continue fires, failed = True has already been set by the manifest check, and main() returns 1.

Impact

Every multi-node MTP eval-only run after this PR merges will fail the 'Verify eval scores' step (benchmark-multinode-tmpl.yml:321). Every batched single-node lm-eval + MTP run will also fail. This is precisely the headline feature this PR adds (M1 of #1651, 'tested against DSV4 vLLM & SGL'), so the breakage is on the PR's own surface, not on unrelated configs.

Step-by-step proof (Scenario A — batched single-node MTP)

User runs a DSV4 vLLM MTP single-node eval with multiple EVAL_CONCURRENT_REQUESTS (e.g. '1 32').

run_eval (benchmark_lib.sh:1703-1714) takes the batched branch, exports EVAL_RESULT_DIR=$(pwd), and calls run_speedbench_al_eval once — writing results_speedbench_al_on_mtp2.json to PWD.

The same branch then runs lm-eval per conc, producing results_*_conc1.json and results_*_conc32.json. append_lm_eval_summary writes meta_env.json containing eval_concs: [1, 32].

benchmark-tmpl.yml:306 invokes python3 utils/evals/validate_scores.py in PWD (without --expected-concs).

main() globs results*.json → 3 files (2 lm-eval + 1 SpeedBench).

validate_batch_manifest is called at line 325. Meta has eval_concs, so we fall through to lines 168-176.

For each result_file, CONC_SUFFIX_RE.search is called. results_*_conc1.json matches, results_*_conc32.json matches, but results_speedbench_al_on_mtp2.json does not (no _concN suffix).

errors.append('batched eval result lacks a concurrency suffix: results_speedbench_al_on_mtp2.json').

main() prints FAIL: batched eval result lacks a concurrency suffix: … and returns 1, even though the lm-eval and SpeedBench results are individually valid.

How to fix

Filter SpeedBench files out of result_files before invoking validate_batch_manifest — e.g. partition by reading each candidate file once and excluding any whose contents contain the speedbench_al_eval_version key, or by filename pattern results_speedbench_al_*.json. Alternatively, teach validate_batch_manifest itself to ignore SpeedBench files.

claude · 2026-06-29T20:14:26Z

+_speedbench_prepare_dataset() {
+    local speedbench_dir="$1"
+    if [[ -f "$speedbench_dir/qualitative.jsonl" ]]; then
+        return 0
+    fi
+    mkdir -p "$speedbench_dir"
+    python3 -m pip install -q datasets tiktoken
+    curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \
+      | python3 - --config qualitative --output_dir "$speedbench_dir"
+    [[ -f "$speedbench_dir/qualitative.jsonl" ]]
+}


🟣 SpeedBench dataset prep in _speedbench_prepare_dataset carries two pre-existing repo hygiene issues that this PR now centralizes into a shared helper: (1) prepare.py is fetched from NVIDIA-NeMo/Skills@refs/heads/main and piped directly into python3 — unpinned and non-auditable; (2) the default speedbench_dir=$(pwd)/speed_bench_data resolves to /workspace/speed_bench_data in single-node MTP scripts (PWD is /workspace), creating a new directory under /workspace which AGENTS.md explicitly forbids during a benchmark. Both patterns already exist in 7+ benchmarks/single_node/speedbench/*_b300_vllm.sh scripts, but since this PR moves them into a shared helper that more recipes will use, it's a good moment to (a) pin the curl URL to a commit SHA and download-then-execute, and (b) default to /tmp/speed_bench_data (or any non-/workspace path).

Extended reasoning...

Where this lives

benchmarks/benchmark_lib.sh:1189-1199 — the new _speedbench_prepare_dataset helper introduced by this PR:

_speedbench_prepare_dataset() { local speedbench_dir="$1" if [[ -f "$speedbench_dir/qualitative.jsonl" ]]; then return 0 fi mkdir -p "$speedbench_dir" python3 -m pip install -q datasets tiktoken curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \ | python3 - --config qualitative --output_dir "$speedbench_dir" [[ -f "$speedbench_dir/qualitative.jsonl" ]] }

It is invoked from run_speedbench_al_eval (benchmark_lib.sh:1255) which defaults speedbench_dir="${SPEEDBENCH_DIR:-$(pwd)/speed_bench_data}".

Issue 1 — unpinned curl-pipe-python

refs/heads/main resolves to whatever NVIDIA-NeMo/Skills HEAD is at curl time. There is no commit SHA, no tag, and no checksum. Consequences:

Supply chain: any force-push, account compromise, or accidental breaking change to that upstream file silently executes arbitrary Python inside the benchmark CI container, which has cluster credentials, model weights, and CI tokens accessible to it.

Reproducibility: every CI invocation can see a different prepare.py. That defeats the point of the golden-AL comparison this PR introduces — if upstream changes how the SpeedBench corpus is constructed, the AL reference stops being meaningful.

Auditability: piping straight into python3 means the executed bytes never touch disk, so CI logs cannot show what was executed for a post-hoc diff between a known-good and a regressed run.

The fix is mechanical: replace refs/heads/main with a specific commit SHA in the URL and use the download-then-execute idiom (curl -L … -o /tmp/prepare.py && python3 /tmp/prepare.py …) so CI logs can attest to what ran.

Issue 2 — new directory under /workspace

AGENTS.md (line 149) states explicitly:

No new directories in /workspace during a benchmark (files are fine).

Walking through the call chain for the new MTP recipes touched by this PR:

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_vllm_mtp.sh and …_b300_vllm_mtp.sh use absolute /workspace/server.log and --result-dir /workspace/ paths.

runners/launch_b200-dgxc.sh sets --container-workdir=$CONTAINER_MOUNT_DIR with default /workspace for every image except the deepseek-v4-blackwell sglang one (/ix).

So when _speedbench_prepare_dataset runs and SPEEDBENCH_DIR is unset, $(pwd)/speed_bench_data expands to /workspace/speed_bench_data, and mkdir -p creates it on first invocation.

This directory persists across benchmark runs on the same self-hosted runner (the helper is cache-aware: it short-circuits once qualitative.jsonl exists, but the directory itself stays). Self-hosted runners that switch users between jobs can hit checkout failures from root-owned residue under /workspace.

Fix: local speedbench_dir="${SPEEDBENCH_DIR:-/tmp/speed_bench_data}" (or any non-/workspace path). The SPEEDBENCH_DIR override is already plumbed through.

Why pre_existing, not normal

Both patterns are repo conventions today. The unpinned curl … prepare.py | python3 line appears verbatim in benchmarks/single_node/speedbench/{dsr1,dsv4,glm5,glm52,kimik2.5,minimaxm3,qwen3.5}_fp4_b300_vllm.sh. The SPEEDBENCH_DIR=${SPEEDBENCH_DIR:-/workspace/speed_bench_data} pattern likewise pre-dates this PR. Verifiers split between pre_existing (because the diff didn't introduce either anti-pattern) and normal (because the shared helper increases blast radius — the new MTP flow goes through this path on more recipes than the original 7).

I'm filing this as pre_existing: the diff consolidates existing behavior and the right fix is a uniform update across all callers (the new helper plus the 7 existing scripts) rather than a one-off in this PR. It is worth flagging in review now because (a) centralizing the bad pattern is the natural moment to fix it cluster-wide, and (b) the PR description explicitly says it 'needs some iterations before merging'.

Proof — concrete trace

Starting from dsv4_fp4_b200_vllm_mtp.sh with RUN_EVAL=true and SPEC_DECODING=mtp:

The recipe is launched by runners/launch_b200-dgxc.sh with --container-workdir=/workspace (default CONTAINER_MOUNT_DIR).

Inside the recipe, run_eval --framework lm-eval --port "$PORT" is called.

run_eval (benchmark_lib.sh:1764) calls run_speedbench_al_eval "${forwarded[@]}" since the single concurrency branch is taken.

run_speedbench_al_eval (benchmark_lib.sh:1255) sets speedbench_dir="${SPEEDBENCH_DIR:-$(pwd)/speed_bench_data}". With SPEEDBENCH_DIR unset and PWD=/workspace, this is /workspace/speed_bench_data.

_speedbench_prepare_dataset "/workspace/speed_bench_data" runs: qualitative.jsonl doesn't yet exist, so mkdir -p /workspace/speed_bench_data creates a new directory under /workspace — violating AGENTS.md line 149.

The next line, curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 - …, fetches whatever is at main and runs it without saving the script to disk. A determined-attacker scenario aside, the practical risk is reproducibility: a CI run last week and one today can see different generation logic for the very dataset that gates the golden-AL pass/fail check.

Both fixes are one-line each. Recommend applying them in this PR (or a precursor) and back-porting to the 7 existing speedbench scripts.

qiching and others added 30 commits June 2, 2026 14:23

feat: add SpeedBench AL eval validation

b2dd50a

Merge remote-tracking branch 'origin/main' into codex/m1-speedbench-e…

f13ad72

…vals

Merge remote-tracking branch 'qiching/albecheng/speedbench-al-action'…

d2ce037

… into codex/m1-speedbench-evals

test: add SpeedBench AL reference handling

4d72cdb

Merge remote-tracking branch 'origin/main' into codex/m1-speedbench-e…

7f00621

…vals

Add multi-framework SpeedBench AL metrics

f40d6f2

speedbench-al: fix --chat-template-kwargs default quoting so thinking…

f2aba4c

…-on cells run

Apply SpeedBench chat-template shim to eval helper

c12acba

Add native SpeedBench client fallback

fa83900

Use shared GB200 workspace for DSV4 Dynamo

60c19dd

Enable metrics for DSV4 SGLang MTP

081cbca

Enable TRT-LLM spec metrics for DSV4 MTP

4cf5bbf

Use TRT-LLM Prometheus metrics endpoint for SpeedBench

63bf3eb

Use TRT-LLM JSON stats for SpeedBench fallback

4a4fbf1

Use TRT-LLM decoded-token metric as AL fallback

de360bf

Collect Dynamo SpeedBench AL from decode logs

cc4c7e3

Use TRT-LLM server logs for SpeedBench AL fallback

2aef667

Capture GB200 srt-slurm bootstrap logs on early failure

816dd1a

Parse Dynamo SGLang SpeedBench AL from decode logs

9cbb11c

Read GB200 bootstrap log from Slurm stderr path

046f304

update SGL metrics gathering method

16e7aa5

fix: validate SpeedBench AL within golden tolerance

a48fc82

fix: use canonical golden AL distribution

ba7fa14

test: remove golden AL regression coverage

3935383

refactor: limit SpeedBench MTP exports to vLLM

cf98627

fix: restore GB200 Dynamo SpeedBench log collection

7ae7c56

refactor: trim SpeedBench AL integration

f3179ad

hjjq requested a review from a team June 29, 2026 19:54

github-project-automation Bot added this to InferenceMAX Board Jun 29, 2026

claude Bot reviewed Jun 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add MTP evaluation#1953

Add MTP evaluation#1953
hjjq wants to merge 31 commits into
mainfrom
codex/m1-speedbench-evals

hjjq commented Jun 29, 2026

Uh oh!

github-actions Bot commented Jun 29, 2026

Uh oh!

claude Bot Jun 29, 2026

Uh oh!

claude Bot Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

hjjq commented Jun 29, 2026

Uh oh!

github-actions Bot commented Jun 29, 2026

Uh oh!

claude Bot Jun 29, 2026

Choose a reason for hiding this comment

What the bug is

Code path that triggers the failure

Why existing code doesn't prevent it

Impact

Step-by-step proof (Scenario A — batched single-node MTP)

How to fix

Uh oh!

claude Bot Jun 29, 2026

Choose a reason for hiding this comment

Where this lives

Issue 1 — unpinned curl-pipe-python

Issue 2 — new directory under /workspace

Why pre_existing, not normal

Proof — concrete trace

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants