Add MTP evaluation#1953
Conversation
Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on self-hosted B300 runners, optionally opening a PR that updates benchmarks/speedbench-reference-al.yaml. - benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh: per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local pre-staged weights resolved by the launcher), falling back to MODEL for a standalone local run. Carries a temporary --chat-template-kwargs shim until vllm-project/vllm#44244 lands in the benchmark image (idempotent, applied only for thinking-on cells). - runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and SALLOC_TIME_LIMIT hooks; both default to the prior behavior. - .github/workflows/speedbench-al.yml: workflow_dispatch entry point; MODEL is the HF id so the launcher resolves the staged MODEL_PATH.
Make the workflow default to Option 1 (upload the AL matrix as an artifact for manual review/paste) rather than auto-opening a PR. The auto-PR path stays available as an opt-in (open-pr: true), but keeping it off by default avoids exposing a write-scoped PAT on the self-hosted runner and matches the repo's artifact-collection convention.
Address review: - Model is now a workflow input (model + model-prefix, default deepseek-ai/DeepSeek-V4-Pro / dsv4). MODEL, MODEL_PREFIX, EXP_NAME, BENCH_SCRIPT_OVERRIDE, artifact names and the Create-PR branch/title/body are all derived from those inputs. The emitted YAML top-level key is now derived from the model (MODEL_KEY, defaults to the model basename lowercased). - Move the collector to benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh and fix its benchmark_lib.sh source path (../ -> ../../) for the deeper dir.
… into codex/m1-speedbench-evals
…dbench-evals # Conflicts: # .github/workflows/speedbench-al.yml # benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt_mtp.sh # benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt_mtp.sh # benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh # runners/launch_b300-nv.sh # runners/launch_gb200-nv.sh # runners/launch_gb300-cw.sh # utils/collect_eval_results.py # utils/evals/EVALS.md
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. 感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致 如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢 PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow 一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。 如需更多帮助,PR 作者可通过 Slack 联系核心维护者。 |
| conc_label = f"[conc={match.group(1)}] " if match else "" | ||
| with open(f) as fh: | ||
| data = json.load(fh) | ||
|
|
||
| speedbench_ok, speedbench_checked = validate_speedbench_al(data, f) | ||
| if speedbench_checked: | ||
| checked += speedbench_checked | ||
| if not speedbench_ok: | ||
| failed = True | ||
| continue | ||
|
|
||
| for task, metrics in data.get("results", {}).items(): | ||
| min_score, source = resolve_threshold(config, prefix, task, args.min_score) | ||
| for name, val in metrics.items(): |
There was a problem hiding this comment.
🔴 validate_scores.py runs validate_batch_manifest on all globbed results*.json before the per-file SpeedBench skip in the main loop, so SpeedBench AL result files (results_speedbench_al_{mode}_mtp{mtp}.json) trip the manifest check fatally. With --expected-concs (multi-node eval-only), the single-conc path hard-fails with 'eval must produce exactly one result file' because the lm-eval + SpeedBench JSONs make len==2; in batched mode (eval_concs in meta_env.json), every SpeedBench file lacks a _concN suffix and triggers 'batched eval result lacks a concurrency suffix'. Partition SpeedBench files out of result_files before calling validate_batch_manifest (or have validate_batch_manifest skip files containing 'speedbench_al_eval_version').
Extended reasoning...
What the bug is
The PR adds SpeedBench AL eval result files alongside lm-eval results. run_speedbench_al_eval (in benchmark_lib.sh) writes results_speedbench_al_${mode}_mtp${mtp}.json into EVAL_RESULT_DIR (or PWD in the batched branch at benchmark_lib.sh:1713-1714), and write_dynamo_speedbench_al_from_logs.sh writes the same file shape into $GITHUB_WORKSPACE for multi-node Dynamo MTP runs.
validate_scores.py uses --results-glob 'results*.json' by default, which matches both the lm-eval JSONs and the new SpeedBench files. The PR teaches the per-file loop to skip SpeedBench files via the new validate_speedbench_al helper (utils/evals/validate_scores.py:347-360), but validate_batch_manifest is invoked at line 325 — before that loop — and it has no awareness of SpeedBench files.
Code path that triggers the failure
Two fatal paths in validate_batch_manifest:
-
Non-batched multi-node MTP (
--expected-concspassed,eval_concsabsent frommeta_env.json— exactly the configuration emitted bybenchmark-multinode-tmpl.yml:321andwrite_dynamo_speedbench_al_from_logs.sh, which does not write meta_env.json):- Line 117-118:
if len(result_files) != 1: errors.append('eval must produce exactly one result file'). result_filescontains the lm-eval JSON (copied from$LOGS_DIR/eval_results) and the SpeedBench JSON (written bywrite_dynamo_speedbench_al_from_logs.sh), solen == 2and the run hard-fails.
- Line 117-118:
-
Batched single-node lm-eval + MTP (
eval_concsinmeta_env.json):- Lines 168-176 iterate every result file and require each to match
CONC_SUFFIX_RE = r'_conc(\\d+)(?:_\\d+)?\.json$'. SpeedBench files are namedresults_speedbench_al_${mode}_mtp${mtp}.json—run_speedbench_al_evalis called once per engine, not per concurrency — so they never match. Each SpeedBench file triggers 'batched eval result lacks a concurrency suffix'.
- Lines 168-176 iterate every result file and require each to match
Why existing code doesn't prevent it
The fix for SpeedBench files lives in the per-file loop (validate_speedbench_al at line 350, with continue at line 356), but validate_batch_manifest runs at line 325 — before the loop. By the time the loop's continue fires, failed = True has already been set by the manifest check, and main() returns 1.
Impact
Every multi-node MTP eval-only run after this PR merges will fail the 'Verify eval scores' step (benchmark-multinode-tmpl.yml:321). Every batched single-node lm-eval + MTP run will also fail. This is precisely the headline feature this PR adds (M1 of #1651, 'tested against DSV4 vLLM & SGL'), so the breakage is on the PR's own surface, not on unrelated configs.
Step-by-step proof (Scenario A — batched single-node MTP)
- User runs a DSV4 vLLM MTP single-node eval with multiple
EVAL_CONCURRENT_REQUESTS(e.g. '1 32'). run_eval(benchmark_lib.sh:1703-1714) takes the batched branch, exportsEVAL_RESULT_DIR=$(pwd), and callsrun_speedbench_al_evalonce — writingresults_speedbench_al_on_mtp2.jsonto PWD.- The same branch then runs lm-eval per conc, producing
results_*_conc1.jsonandresults_*_conc32.json.append_lm_eval_summarywritesmeta_env.jsoncontainingeval_concs: [1, 32]. benchmark-tmpl.yml:306invokespython3 utils/evals/validate_scores.pyin PWD (without--expected-concs).main()globsresults*.json→ 3 files (2 lm-eval + 1 SpeedBench).validate_batch_manifestis called at line 325. Meta haseval_concs, so we fall through to lines 168-176.- For each result_file,
CONC_SUFFIX_RE.searchis called.results_*_conc1.jsonmatches,results_*_conc32.jsonmatches, butresults_speedbench_al_on_mtp2.jsondoes not (no_concNsuffix). errors.append('batched eval result lacks a concurrency suffix: results_speedbench_al_on_mtp2.json').main()printsFAIL: batched eval result lacks a concurrency suffix: …and returns 1, even though the lm-eval and SpeedBench results are individually valid.
How to fix
Filter SpeedBench files out of result_files before invoking validate_batch_manifest — e.g. partition by reading each candidate file once and excluding any whose contents contain the speedbench_al_eval_version key, or by filename pattern results_speedbench_al_*.json. Alternatively, teach validate_batch_manifest itself to ignore SpeedBench files.
| _speedbench_prepare_dataset() { | ||
| local speedbench_dir="$1" | ||
| if [[ -f "$speedbench_dir/qualitative.jsonl" ]]; then | ||
| return 0 | ||
| fi | ||
| mkdir -p "$speedbench_dir" | ||
| python3 -m pip install -q datasets tiktoken | ||
| curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \ | ||
| | python3 - --config qualitative --output_dir "$speedbench_dir" | ||
| [[ -f "$speedbench_dir/qualitative.jsonl" ]] | ||
| } |
There was a problem hiding this comment.
🟣 SpeedBench dataset prep in _speedbench_prepare_dataset carries two pre-existing repo hygiene issues that this PR now centralizes into a shared helper: (1) prepare.py is fetched from NVIDIA-NeMo/Skills@refs/heads/main and piped directly into python3 — unpinned and non-auditable; (2) the default speedbench_dir=$(pwd)/speed_bench_data resolves to /workspace/speed_bench_data in single-node MTP scripts (PWD is /workspace), creating a new directory under /workspace which AGENTS.md explicitly forbids during a benchmark. Both patterns already exist in 7+ benchmarks/single_node/speedbench/*_b300_vllm.sh scripts, but since this PR moves them into a shared helper that more recipes will use, it's a good moment to (a) pin the curl URL to a commit SHA and download-then-execute, and (b) default to /tmp/speed_bench_data (or any non-/workspace path).
Extended reasoning...
Where this lives
benchmarks/benchmark_lib.sh:1189-1199 — the new _speedbench_prepare_dataset helper introduced by this PR:
_speedbench_prepare_dataset() {
local speedbench_dir="$1"
if [[ -f "$speedbench_dir/qualitative.jsonl" ]]; then
return 0
fi
mkdir -p "$speedbench_dir"
python3 -m pip install -q datasets tiktoken
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \
| python3 - --config qualitative --output_dir "$speedbench_dir"
[[ -f "$speedbench_dir/qualitative.jsonl" ]]
}It is invoked from run_speedbench_al_eval (benchmark_lib.sh:1255) which defaults speedbench_dir="${SPEEDBENCH_DIR:-$(pwd)/speed_bench_data}".
Issue 1 — unpinned curl-pipe-python
refs/heads/main resolves to whatever NVIDIA-NeMo/Skills HEAD is at curl time. There is no commit SHA, no tag, and no checksum. Consequences:
- Supply chain: any force-push, account compromise, or accidental breaking change to that upstream file silently executes arbitrary Python inside the benchmark CI container, which has cluster credentials, model weights, and CI tokens accessible to it.
- Reproducibility: every CI invocation can see a different
prepare.py. That defeats the point of the golden-AL comparison this PR introduces — if upstream changes how the SpeedBench corpus is constructed, the AL reference stops being meaningful. - Auditability: piping straight into
python3means the executed bytes never touch disk, so CI logs cannot show what was executed for a post-hoc diff between a known-good and a regressed run.
The fix is mechanical: replace refs/heads/main with a specific commit SHA in the URL and use the download-then-execute idiom (curl -L … -o /tmp/prepare.py && python3 /tmp/prepare.py …) so CI logs can attest to what ran.
Issue 2 — new directory under /workspace
AGENTS.md (line 149) states explicitly:
No new directories in /workspace during a benchmark (files are fine).
Walking through the call chain for the new MTP recipes touched by this PR:
benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_vllm_mtp.shand…_b300_vllm_mtp.shuse absolute/workspace/server.logand--result-dir /workspace/paths.runners/launch_b200-dgxc.shsets--container-workdir=$CONTAINER_MOUNT_DIRwith default/workspacefor every image except the deepseek-v4-blackwell sglang one (/ix).- So when
_speedbench_prepare_datasetruns andSPEEDBENCH_DIRis unset,$(pwd)/speed_bench_dataexpands to/workspace/speed_bench_data, andmkdir -pcreates it on first invocation.
This directory persists across benchmark runs on the same self-hosted runner (the helper is cache-aware: it short-circuits once qualitative.jsonl exists, but the directory itself stays). Self-hosted runners that switch users between jobs can hit checkout failures from root-owned residue under /workspace.
Fix: local speedbench_dir="${SPEEDBENCH_DIR:-/tmp/speed_bench_data}" (or any non-/workspace path). The SPEEDBENCH_DIR override is already plumbed through.
Why pre_existing, not normal
Both patterns are repo conventions today. The unpinned curl … prepare.py | python3 line appears verbatim in benchmarks/single_node/speedbench/{dsr1,dsv4,glm5,glm52,kimik2.5,minimaxm3,qwen3.5}_fp4_b300_vllm.sh. The SPEEDBENCH_DIR=${SPEEDBENCH_DIR:-/workspace/speed_bench_data} pattern likewise pre-dates this PR. Verifiers split between pre_existing (because the diff didn't introduce either anti-pattern) and normal (because the shared helper increases blast radius — the new MTP flow goes through this path on more recipes than the original 7).
I'm filing this as pre_existing: the diff consolidates existing behavior and the right fix is a uniform update across all callers (the new helper plus the 7 existing scripts) rather than a one-off in this PR. It is worth flagging in review now because (a) centralizing the bad pattern is the natural moment to fix it cluster-wide, and (b) the PR description explicitly says it 'needs some iterations before merging'.
Proof — concrete trace
Starting from dsv4_fp4_b200_vllm_mtp.sh with RUN_EVAL=true and SPEC_DECODING=mtp:
- The recipe is launched by
runners/launch_b200-dgxc.shwith--container-workdir=/workspace(defaultCONTAINER_MOUNT_DIR). - Inside the recipe,
run_eval --framework lm-eval --port "$PORT"is called. run_eval(benchmark_lib.sh:1764) callsrun_speedbench_al_eval "${forwarded[@]}"since the single concurrency branch is taken.run_speedbench_al_eval(benchmark_lib.sh:1255) setsspeedbench_dir="${SPEEDBENCH_DIR:-$(pwd)/speed_bench_data}". WithSPEEDBENCH_DIRunset andPWD=/workspace, this is/workspace/speed_bench_data._speedbench_prepare_dataset "/workspace/speed_bench_data"runs:qualitative.jsonldoesn't yet exist, somkdir -p /workspace/speed_bench_datacreates a new directory under/workspace— violating AGENTS.md line 149.- The next line,
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 - …, fetches whatever is atmainand runs it without saving the script to disk. A determined-attacker scenario aside, the practical risk is reproducibility: a CI run last week and one today can see different generation logic for the very dataset that gates the golden-AL pass/fail check.
Both fixes are one-line each. Recommend applying them in this PR (or a precursor) and back-porting to the 7 existing speedbench scripts.
M1 of #1651
Currently tested against DSV4 vLLM & SGL. Still needs some iterations before merging.