CollectiveX: experimental cross-vendor collective/EP benchmark by Oseltamivir · Pull Request #1896 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-23T07:04:23Z

Adds CollectiveX under experimental/CollectiveX/ — a cross-vendor collective / expert-parallel benchmark — plus an orchestration-only workflow.

What it adds

Per-SKU launch adapters (launchers/launch_<sku>.sh, the launch_${RUNNER_NAME%%_*}.sh convention) that run any benchmark via a CX_BENCH selector (nccl|deepep|all) through a shared launchers/run_in_container.sh.
Benchmarks: run_nccl.py (stock nccl-tests → parsed flat JSON), run_deepep.py (DeepEP dispatch/combine, normal mode), env_capture.py (Layer-0 provenance), plot.py. Every result is correctness-gated and carries a topology-aware comparison_key.
Single multi-arch, digest-pinned container for all NVIDIA SKUs (lmsysorg/sglang@sha256:4219…, amd64+arm64); DeepEP via rebuild-deepep. See CONTAINERS.md.
.github/workflows/collectivex-experimental.yml — push to collectivex (paths experimental/CollectiveX/**) → GB200 NCCL smoke; workflow_dispatch → chosen sku+benchmark (B200, DeepEP, larger sweeps). Logic stays under experimental/.
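
For readers less familiar with the `launch_${RUNNER_NAME%%_*}.sh` convention, here is a minimal Python sketch of what that shell expansion resolves to. The function name `launcher_for` and the runner name in the example are made up for illustration; only the naming pattern comes from this PR.

```python
def launcher_for(runner_name: str) -> str:
    """Mirror the shell expansion launch_${RUNNER_NAME%%_*}.sh."""
    # %%_* strips the longest suffix matching "_*", i.e. everything from the
    # first underscore onward; a name without "_" is left unchanged.
    sku = runner_name.split("_", 1)[0]
    return f"launchers/launch_{sku}.sh"


if __name__ == "__main__":
    # "gb200-nv_0" is a hypothetical runner name, not one taken from the fleet.
    print(launcher_for("gb200-nv_0"))
```

The point of the convention is that the runner label alone picks the adapter, so the workflow step does not need a per-SKU branch.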

Validated on hardware

NCCL primitives: B200 (8× NVLink island) + GB200 (4× NVL72 MNNVL), 4 ops, correctness-passed, topology-keyed distinctly.
DeepEP dispatch/combine on GB200: correctness-gated (token conservation + combine vs DeepEP's own reference), ~154 µs roundtrip, 1.66M tok/s.
Local: shellcheck/bash -n, py_compile, actionlint, parser fixtures.

Notes / deferred

Result JSONs are gitignored (captured env embeds hostnames/UUIDs); CI uploads them as workflow artifacts. Headline numbers are summarized in CONTAINERS.md.
Importing the exact multi-arch digest needs the runner's registry creds (validated on the pre-staged v0.5.11-cu130).
Precision axes (NVFP4/MXFP8/…), low-latency EP, MoRI, EPLB, multinode DeepEP, and other collectives are captured as roadmap in plan.md, not built.

Note

Low Risk
Changes are isolated to experimental/CollectiveX/ and a read-only workflow; no production benchmark matrix or serving launchers are modified. Risk is mainly operational (self-hosted GPU time, Slurm/enroot failures) rather than app or security impact.

Overview
Introduces CollectiveX under experimental/CollectiveX/ — an experimental cross-vendor collective and MoE EP benchmark — plus orchestration-only .github/workflows/collectivex-experimental.yml. Production serving paths are untouched.

Benchmark stack: run_nccl.py wraps nccl-tests/rccl-tests into provenance-tagged JSON; run_deepep.py and run_mori.py add correctness-gated DeepEP and AMD MoRI dispatch/combine; env_capture.py, summarize.py, and plot.py handle environment capture, CI summaries, and plots. Results use topology-aware comparison_keys so unlike fabrics are not merged blindly.

Execution: Per-SKU Slurm launchers (launch_b200-dgxc.sh, launch_gb200-nv.sh, launch_b200-dgxc-slurm.sh, launch_mi355x-amds.sh) follow the same launch_${RUNNER_NAME%%_*}.sh pattern as serving, with shared common.sh (enroot squash by tag, optional CX_STAGE_DIR rsync, in-container nccl/rccl builds). CX_BENCH selects nccl, deepep, mori, or all via run_in_container.sh.

CI: Push to collectivex runs MI355X MoRI on mi355x runners; workflow_dispatch picks SKU and benchmark (GB200/B200 NCCL, DeepEP, etc.), writes markdown to the job summary, and uploads gitignored results/*.json as artifacts.

Reviewed by Cursor Bugbot for commit 871086d. Bugbot is set up for automated code reviews on this repo.

Per-SKU launch adapters (launch_<sku>.sh) that run any benchmark via a CX_BENCH selector through a shared run_in_container.sh; multi-arch digest-pinned sglang container; NCCL-primitive + DeepEP dispatch/combine benchmarks with provenance + correctness gating; and an on:push workflow (GB200 NCCL smoke; workflow_dispatch for B200/DeepEP/larger sweeps). Validated on hardware: NCCL primitives on B200 (8x NVLink) and GB200 (4x NVL72 MNNVL); DeepEP dispatch/combine on GB200 (correctness-gated).

The GB200 on:push smoke hung 25 min in enroot import: a bare digest ref (repo@sha256:) can't form an anonymous Docker Hub token scope, so enroot prompted for a password and blocked in non-interactive CI. Import by the multi-arch TAG instead (anonymous auth works, same as the serving launchers) and add </dev/null so a missing token fails fast rather than hanging. Use v0.5.11-cu130 (multi-arch amd64+arm64, index sha256:061fb71f…): v0.5.12-cu130's 62 layers overflow enroot's overlay-based squash creation on these nodes (failed to mount overlay … Invalid argument). v0.5.11-cu130 imports cleanly and is pre-staged on GB200.

On the GB200 Actions path, CX_STAGE_DIR makes the launcher rsync the tree to compute-visible Lustre and the container writes results/ there; upload-artifact reads the checkout's results/ (empty), so the green smoke produced no artifact. Add cx_collect_results to copy result JSONs from the stage dir back to the checkout after the run (no-op when no staging was used).

Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.

cursor · 2026-06-23T08:25:44Z

+            is_token_in_rank=is_token_in_rank,
+            num_tokens_per_expert=num_tokens_per_expert,
+        )
+        combined_x, _, _ = buffer.combine(recv_x, handle, topk_weights=recv_topk_weights)


Dispatch dtype not applied

Medium Severity

The --dispatch-dtype / CX_DISPATCH_DTYPE value is stored in result metadata but never used when building inputs or calling buffer.dispatch. Runs always use bfloat16 token tensors regardless of fp8 vs bf16, so provenance and comparison keys can describe a different shape than what was measured.

Additional Locations (1)

experimental/CollectiveX/launchers/run_in_container.sh#L64-L65

Reviewed by Cursor Bugbot for commit b384171.

summarize.py --markdown emits GitHub-flavored markdown tables (NCCL + DeepEP); a per-job 'Results summary' workflow step appends it to $GITHUB_STEP_SUMMARY so the run page shows a rendered table (per the GitHub job-summaries feature). Plain-text mode still drives the in-container result gate.

cursor · 2026-06-23T08:52:23Z

+    --timestamp "$TS" || cx_log "WARN: parse $op failed"
+done
+
+cx_log "done — JSON artifacts under $CX_DIR/results/"


Multinode launcher ignores failures

High Severity

The B200 multinode adapter logs warnings when srun or run_nccl.py fail but always exits successfully. Unlike run_in_container.sh, it never runs summarize.py as a non-zero gate, so workflow_dispatch on b200-multinode can finish green with no valid NCCL results.

Reviewed by Cursor Bugbot for commit f48daed.

cursor · 2026-06-23T08:52:23Z

+        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
+      - name: Results summary
+        if: always()
+        run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY"


Workflow skips result failure gate

Medium Severity

Both jobs only run summarize.py --markdown, which is documented to always exit 0. The workflow never runs the plain summarize.py gate on the checkout’s results/ after launch, so a successful Launch step can stay green when the checkout has no valid JSON (e.g. staged runs where copy-back failed).

Additional Locations (1)

.github/workflows/collectivex-experimental.yml#L106-L109

Reviewed by Cursor Bugbot for commit f48daed.

cursor · 2026-06-23T08:52:23Z

+  dst="$repo_root/experimental/CollectiveX/results"
+  mkdir -p "$dst"
+  cp "$mount_src/experimental/CollectiveX/results/"*.json "$dst/" 2>/dev/null || true
+  cx_log "copied results from stage dir -> $dst (for artifact upload)"


Result copy errors ignored

Medium Severity

cx_collect_results wraps the staged-to-checkout cp in 2>/dev/null || true and always logs success, so a failed or empty copy does not affect the launcher exit code and the workflow can pass without uploadable JSON.

Reviewed by Cursor Bugbot for commit f48daed.
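
One way to address the finding above is to make the copy-back step count what it copied and fail when nothing arrived. The sketch below is a Python illustration only: the PR's `cx_collect_results` is a shell function, and `collect_results` and its arguments are hypothetical names.

```python
import glob
import os
import shutil


def collect_results(stage_results: str, checkout_results: str) -> int:
    """Copy result JSONs back to the checkout and raise if none were copied."""
    os.makedirs(checkout_results, exist_ok=True)
    copied = 0
    # glob's "*.json" does not match dot-files, so hidden shard inputs are skipped.
    for src in sorted(glob.glob(os.path.join(stage_results, "*.json"))):
        shutil.copy2(src, checkout_results)
        copied += 1
    if copied == 0:
        # Surface the empty copy instead of logging success unconditionally.
        raise FileNotFoundError(f"no result JSON found under {stage_results}")
    return copied
```

A caller would let the exception propagate, so the launcher exits non-zero and the workflow cannot pass without uploadable JSON.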

First AMD / cross-vendor reach, scaffolded ahead of Milestone 1:

- run_mori.py: MoRI dispatch+combine (normal mode), correctness-gated, mirroring ROCm/mori's dispatch_combine example: int32 routing indices, (n,0) fp8 scales, the zero-copy registered-combine-input-buffer staging step, and expected = input x (#unique destination ranks). Emits the same flat JSON shape (family=moe, backend=mori) with CUDA-event timing.
- launchers/launch_mi355x-amds.sh: AMD adapter. Partition compute, no account, --cpus-per-task=128, node-local /var/lib/squash imported via srun on the allocated node, --container-writable --container-remap-root, forces CX_BENCH=mori, mounts the (compute-visible) checkout at /ix.
- launchers/run_in_container.sh: run_mori_suite + mori case (nccl|deepep|mori|all).
- launchers/common.sh: ROCm MoRI image (rocm/sgl-dev:...-mori-0227-2) in cx_default_image for mi355x*/mi350x*/mi325x*/mi300x*.
- workflow: mi355x sku + mori benchmark options for workflow_dispatch.
- docs: CONTAINERS.md AMD section, README files/run/risks, plan.md status.

Not yet hardware-validated (no MI355X access). MoRI's Python API is version-sensitive (marked ADAPT HERE); the first runner job is the validation, as GB200 was for DeepEP. The ROCm image isn't digest-pinned yet.
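
The correctness reference "expected = input x (#unique destination ranks)" can be illustrated in plain Python. This is a sketch, not the code in run_mori.py: it assumes experts are laid out contiguously across ranks (rank = expert id // experts per rank), and the function names are hypothetical.

```python
def unique_destination_ranks(topk_experts, experts_per_rank):
    """Count the distinct ranks a token is dispatched to.

    Assumption (not confirmed by the PR): expert e lives on rank e // experts_per_rank.
    """
    return len({e // experts_per_rank for e in topk_experts})


def expected_combine(token, topk_experts, experts_per_rank):
    """Reference output: the input token scaled by its number of unique destination ranks."""
    scale = unique_destination_ranks(topk_experts, experts_per_rank)
    return [scale * value for value in token]


if __name__ == "__main__":
    # Experts 0 and 1 share rank 0, expert 8 is on rank 1, expert 17 on rank 2.
    print(expected_combine([1.0, 2.0], [0, 1, 8, 17], experts_per_rank=8))
```

Two experts on the same rank count once, which is why the multiplier is the number of unique ranks rather than top-k.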

- workflow: replace the on:push GB200 NCCL smoke with the MI355X MoRI dispatch/combine run (runs-on: mi355x, CX_BENCH=mori), and name the job "CollectiveX Experimental" (no longer "smoke"). GB200/B200 NCCL + DeepEP remain on workflow_dispatch.
- launch_mi355x-amds.sh: adapt more faithfully to runners/launch_mi355x-amds.sh: squeue by job-name only (no -u), flock -w 600, and clear ROCm gpucore.* dumps after the run so the next checkout is clean. Bump default CX_TIME to 60 for a cold ROCm-image import.
- summarize.py: drop the "N/N results valid." footer from both the job-summary (markdown) and plain output; the failure gate still reports invalid results. Relabel the MoE section "MoE dispatch+combine (DeepEP / MoRI)".
- docs: README/plan describe push -> MI355X MoRI.

cursor · 2026-06-23T09:27:40Z

+    rm -f \"$SQUASH_FILE\"
+    enroot import -o \"$SQUASH_FILE\" \"docker://$IMAGE\" </dev/null
+  fi
+"


MI355X import errors ignored

High Severity

The node-local enroot import runs inside an srun bash snippet without set -e and with no check after import. A failed import still yields exit 0 from that snippet, so the job continues into pyxis with a missing or corrupt squash file.

Reviewed by Cursor Bugbot for commit d8ee9bf.

cursor · 2026-06-23T09:27:40Z

+      - name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }}
+        env:
+          RUNNER_NAME: ${{ runner.name }}
+        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"


Workflow skips multinode staging

Medium Severity

CX_STAGE_DIR is set only when inputs.sku is gb200. The b200-multinode dispatch target uses launch_b200-dgxc-slurm.sh, which documents the same compute-visible checkout requirement but leaves staging unset, so Slurm jobs may not see the repo mount.

Reviewed by Cursor Bugbot for commit d8ee9bf.

… default) First MI355X run reached the MoRI dispatch kernel — salloc, ROCm-image import, mount, torchrun, 8-rank Gloo + shmem init, and EpDispatchCombineConfig/op/dispatch all worked, confirming the API signatures. It OOM'd MoRI's default 2 GiB static symmetric heap (hidden=7168 dispatch/combine buffers across 8 ranks request ~0.9 GiB each). run_mori.py now sets MORI_SHMEM_HEAP_SIZE before `import mori` (default 16 GiB, override CX_MORI_HEAP_BYTES). Docstring + CONTAINERS.md record the finding; correctness/timing validated by the heap-sized re-run.

cursor · 2026-06-23T09:39:52Z

+
+salloc --partition="$PARTITION" --exclude="$EXCLUDE_NODES" --gres=gpu:"$NGPUS" \
+       --exclusive --cpus-per-task=128 --time="$TIME_MIN" --no-shell --job-name="$RUNNER_NAME"
+JOB_ID="$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)"


Slurm job ID not scoped

Medium Severity

launch_mi355x-amds.sh resolves JOB_ID with squeue --name="$RUNNER_NAME" and no -u "$USER", while the other CollectiveX NVIDIA launchers filter by user. On a shared cluster, the first matching job name may belong to another account, so subsequent srun/scancel can target the wrong allocation.

Additional Locations (1)

experimental/CollectiveX/launchers/launch_b200-dgxc.sh#L52-L53

Reviewed by Cursor Bugbot for commit ac3f1b9.

The heap-bump run cleared the 2 GiB OOM but then failed registering the 16 GiB symmetric heap as an RDMA memory region (errno 22 EINVAL, size=17179869184). ROCm/mori's reference test uses MORI_SHMEM_HEAP_SIZE="6G" single-node — big enough for the hidden=7168 dispatch/combine buffers, small enough to register. Match it: default "6G" (override CX_MORI_HEAP_SIZE). The rest of the config already matches the reference (max_num_inp_token_per_rank=4096, hidden=7168, backend cpu:gloo,cuda:nccl), so this lands on the proven single-node setup.

Drove run_mori.py to a correct run on 8x MI355X (on-node via salloc+srun): dispatch+combine numerically correct (combine within tol, max_rel ~2e-3), ~85us round-trip at the decode shape. The first runs surfaced five issues, all fixed and re-validated:

- RDMA MR ceiling: MoRI registers the WHOLE symmetric heap as one RDMA MR at init (even single-node; no disable-RDMA knob). The ionic_rdma NICs cap GPU MRs at ~4 GiB: a 6 GiB heap fails (RegisterRdmaMemoryRegion errno 22), 2 GiB registers. Hold heap at MORI_SHMEM_HEAP_SIZE=2G (override CX_MORI_HEAP_SIZE).
- Buffer sizing: max_num_inp_token_per_rank 4096 -> max(512, n) so the buffers fit the 2 GiB heap (4096 was inherited from the reference test).
- Correctness shape: combine returns the full max-token buffer; compare only combined[:n] against expected.
- recv count: read total_recv BEFORE combine (combine resets recv_num, which made recv_nonzero a false negative).
- Teardown: MoRI's shmem teardown asserts (CheckStatusValid -> SIGABRT) when the op is destroyed after shmem_finalize(); hard-exit after writing results.

Docs (README/plan/CONTAINERS) updated from "scaffolded" to validated, with the fabric constraints recorded.
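
The heap setting only takes effect if it is in the environment before `import mori`. A minimal sketch of that ordering follows; `configure_mori_heap` is a hypothetical helper, and whether run_mori.py overwrites an already-set MORI_SHMEM_HEAP_SIZE is an assumption here.

```python
import os


def configure_mori_heap(env=None, default="2G"):
    """Set MORI_SHMEM_HEAP_SIZE from CX_MORI_HEAP_SIZE, falling back to the default.

    Assumption: an empty CX_MORI_HEAP_SIZE is treated the same as an unset one.
    """
    if env is None:
        env = os.environ
    size = env.get("CX_MORI_HEAP_SIZE") or default
    env["MORI_SHMEM_HEAP_SIZE"] = size
    return size


# Intended call order in a benchmark script (left commented so the sketch
# runs without MoRI installed):
#   configure_mori_heap()
#   import mori
```

Passing a plain dict as `env` makes the helper easy to check without touching the real process environment.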

…CH=nccl) Adds the AMD collective-primitive path so all_reduce/reduce_scatter/all_gather/alltoall run on MI355X, not just MoRI:

- common.sh: cx_build_rccl_tests clones ROCm/rccl-tests and builds with `make` against /opt/rocm (amdclang++/librccl). It's a nccl-tests fork producing the same <op>_perf binaries and output format, so run_nccl.py parses it unchanged. Validated building + running all 4 ops in-container on MI355X (correctness OK).
- run_in_container.sh: run_nccl_suite picks rccl-tests on ROCm (/opt/rocm or hipcc), nccl-tests otherwise; identical op loop + run_nccl.py invocation.
- launch_mi355x-amds.sh: honor CX_BENCH (mori default | nccl) instead of forcing mori; same -g N single-node 8-GPU launch.
- docs: README/CONTAINERS note the rccl path.

B200 already has the nccl path; this makes primitives available on all three SKUs via workflow_dispatch.

…t cancel each other

… job summary

cursor · 2026-06-23T12:00:31Z

+            if name:
+                devices.append(name)
+    elif _run(["ibstat", "-l"]):
+        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]


ibstat fallback may crash capture

Low Severity

In _rdma, the ibstat -l branch calls _run twice. If the first call succeeds but the second returns None, None.splitlines() raises and env_capture.py aborts before writing provenance JSON for that run.

Reviewed by Cursor Bugbot for commit 2b23573.

…on-node launch_gb200-nv.sh now branches on CX_NODES: 1 (default) keeps the single-tray 4-GPU dispatcher path; >1 runs across the NVL72 NVLink fabric (e.g. CX_NODES=2 = 8 GPU) by building nccl-tests MPI=1, running each op across WORLD ranks via `srun --mpi=pmix` (1 GPU/rank) with the MNNVL env, and parsing on the login node, mirroring launch_b200-dgxc-slurm but staying on NVLink instead of IB. Validated on GB200 (2x watchtower-navy trays, 8 GPU): all 4 ops valid, peak busbw all_reduce 822.8 / reduce_scatter 670.6 / all_gather 651.2 / alltoall 625.0 GB/s, ~30% over single-tray and on par with B200 8-GPU NVLink, i.e. MNNVL engaged (not an IB fallback).

- common.sh: cx_build_nccl_tests auto-detects MPI_HOME for MPI=1 (Debian OpenMPI headers live under /usr/lib/<arch>/openmpi/include; MPI_HOME=/usr fails). Works x86_64 + aarch64.
- launch_b200-dgxc-slurm.sh: fix BUILD_IN_CTR path (.nccl-tests/nccl-tests/build).
- workflow: add `nodes` dispatch input -> CX_NODES.

MoonCake on MI355X = evidenced ROCm wall (engine inits on rdma0 but the wheel has no transfer_write_on_hip, only _on_cuda; run 28342781762 invalid/0 groups); it needs an upstream Mooncake ROCm build. MI355X rccl-tests (All-reduce/All-gather tab) keeps failing in the runner checkout/setup step (shared with the agentic fleet): a runner-contention infra flake, not an rccl limitation. mori-io (28.2), copy-engine/SDMA, and rccl-kv (71.7 GB/s) backfilled successfully.

…ests) The persistent MI355X rccl-primitives failure was capability.py rejecting benchmark=nccl on amd (exit 3 in the Validate-capability step, before the launcher ran) — masked earlier by the gharunner06 root-LOGS EACCES. But the nccl BENCHMARK runs on both vendors: run_nccl_suite auto-picks rccl-tests on ROCm. Make COLLECTIVE nccl valid on amd so the All-reduce/All-gather tabs get an MI355X line.

…l-parity sweeps Thread deepep_v2=true (kernel_gen=v2 from-source) and a --backend override that remaps the deepep suite matrix onto uccl/flashinfer/deepep-hybrid/nccl-ep, with a capability pre-filter (resolve() per case) so no doomed dispatch is fired. Enables per-backend full-matrix parity: deepep-v2 242 / uccl 242 / flashinfer 162 / deepep-hybrid 156 NVIDIA cases across H100/H200/B300.

…nodes Add b200 (8x NVLink, sibling of b300) + gb200 (NVL72, sibling of gb300) to platforms.yaml + every relevant suite's platform list (mirroring b300/gb300 coverage). Un-drop gb300 in _gha_suite.sh (runners online now) + map gb200/b200 in the SKU dict. Thread nodes for the rack-scale SKUs (gb200/gb300 = 4 GPU/tray, so EP8 = 2 trays/nodes). Enables full-parity sweeps across all 7 SKUs.

…gate Replace the thousands-of-individual-dispatches model with the InferenceX CI shape: ONE run = setup (resolve suites into shard matrix via sweep_matrix.py) -> sweep (a MATRIX job, one cell per SHARD = sku×backend×mode×resource, each sweeping its cases in ONE allocation via run_in_container SHARD mode) -> aggregate (collect every shard into ONE results/aggregate/*.ndjson via aggregate_results.py). Collapses ~534 deepep dispatches into ~45 cells + 1 aggregated file. run_in_container gains a CX_SHARD_FILE loop (per-case CX_TS keeps outputs unique); sweep_matrix resolves/chunks/capability-filters shards + emits slim matrix (cases via artifact).
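
The shard idea described above (group cases by sku×backend×mode×resource, chunk each group, publish a slim matrix without the case payloads) can be sketched as follows. This is not sweep_matrix.py: the field names `mode` and `resource`, the id format, and both function names are assumptions for illustration.

```python
from collections import defaultdict


def build_shards(cases, max_cases):
    """Group cases by (sku, backend, mode, resource) and chunk each group."""
    groups = defaultdict(list)
    for case in cases:
        key = (case["sku"], case["backend"], case["mode"], case["resource"])
        groups[key].append(case)

    shards = []
    for key in sorted(groups):
        members = groups[key]
        for start in range(0, len(members), max_cases):
            chunk = members[start:start + max_cases]
            shards.append({
                # Hypothetical id format: the key plus a chunk index keeps ids unique.
                "id": "-".join(key) + f"-{start // max_cases}",
                "sku": key[0],
                "backend": key[1],
                "n": len(chunk),
                "cases": chunk,
            })
    return shards


def slim_matrix(shards):
    """Drop the per-shard case lists so the strategy output stays small."""
    return {"include": [{k: v for k, v in s.items() if k != "cases"} for s in shards]}
```

Keeping the cases out of the strategy output matters because the full case list travels as an artifact, while the job matrix only needs one small entry per cell.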

+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.gen.outputs.matrix }}
+      n: ${{ steps.gen.outputs.n }}
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
+        with: { clean: true }
+      - run: pip install --quiet pyyaml
+      - id: gen
+        working-directory: experimental/CollectiveX
+        run: |
+          set -euo pipefail
+          ov=""; [ "${{ inputs.backend }}" != "deepep" ] && ov="--backend ${{ inputs.backend }}"
+          v2=""; [ "${{ inputs.deepep_v2 }}" = "true" ] && v2="--deepep-v2"
+          os=""; [ -n "${{ inputs.only_sku }}" ] && os="--only-sku ${{ inputs.only_sku }}"
+          # full matrix (with cases) -> artifact for the cells; slim (no cases) -> the strategy output.
+          python3 sweep_matrix.py --suites "${{ inputs.suites }}" --max-cases "${{ inputs.max_cases }}" $ov $v2 $os --out matrix_full.json >/dev/null
+          SLIM=$(python3 -c "import json;m=json.load(open('matrix_full.json'));print(json.dumps({'include':[{k:v for k,v in x.items() if k!='cases'} for x in m['include']]}))")
+          echo "matrix=$SLIM" >> "$GITHUB_OUTPUT"
+          echo "n=$(python3 -c "import json;print(len(json.load(open('matrix_full.json'))['include']))")" >> "$GITHUB_OUTPUT"
+          python3 -c "import json;m=json.load(open('matrix_full.json'));print('shard-cells:',len(m['include']),'cases:',sum(x['n'] for x in m['include']))"
+      - uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
+        with:
+          name: cxsweep-matrix-${{ github.run_id }}
+          path: experimental/CollectiveX/matrix_full.json
+          if-no-files-found: error
+
+  # ---- sweep: ONE matrix cell per shard (the parent job with child jobs) ----
+  sweep:


+    needs: setup
+    if: ${{ fromJSON(needs.setup.outputs.n) > 0 }}
+    strategy:
+      fail-fast: false
+      max-parallel: 10            # don't saturate the ~20-runner fleet; cells queue as slots free
+      matrix: ${{ fromJSON(needs.setup.outputs.matrix) }}
+    # h200 label spans two clusters; pin to the validated dgxc pool (mirrors collectivex-experimental).
+    runs-on: ${{ matrix.sku == 'h200' && 'h200-dgxc' || matrix.sku }}
+    timeout-minutes: 350
+    env:
+      CX_BENCH: ${{ matrix.backend }}
+      CX_DEEPEP_V2: ${{ matrix.deepep_v2 && '1' || '' }}
+      CX_NODES: ${{ matrix.nodes }}
+      CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json
+      COLLECTIVEX_SOURCE_SHA: ${{ github.sha }}
+      CX_NODELIST: ${{ matrix.sku == 'mi355x' && 'mia1-p01-g10,mia1-p01-g15' || '' }}
+      CX_STAGE_DIR: ${{ matrix.sku == 'gb200' && '/mnt/lustre01/users-public/sa-shared/cx-stage' || '' }}
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
+        with: { clean: true }
+      - uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
+        with:
+          name: cxsweep-matrix-${{ github.run_id }}
+          path: experimental/CollectiveX
+      - name: Extract this shard's cases (stdlib only — no runner deps)
+        working-directory: experimental/CollectiveX
+        run: |
+          set -euo pipefail
+          python3 -c "
+          import json
+          m=json.load(open('matrix_full.json'))
+          s=[x for x in m['include'] if x['id']=='${{ matrix.id }}']
+          assert s, 'shard ${{ matrix.id }} not in matrix'
+          s=s[0]
+          json.dump({'id':s['id'],'sku':s['sku'],'backend':s['backend'],'nodes':s['nodes'],'deepep_v2':s['deepep_v2'],'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))
+          print('shard ${{ matrix.id }}:', len(s['cases']), 'cases')
+          "
+      - name: Sweep shard ${{ matrix.id }} (${{ matrix.n }} cases, one allocation)
+        env:
+          RUNNER_NAME: ${{ runner.name }}
+        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
+      - name: Shard summary
+        if: always()
+        run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY" || true
+      - name: Upload shard results
+        if: always()
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
+        with:
+          name: cxshard-${{ matrix.id }}-${{ github.run_id }}
+          path: experimental/CollectiveX/results/*.json   # glob skips the hidden .shard_*.json
+          if-no-files-found: warn
+
+  # ---- aggregate: collect every shard into ONE ndjson (the "result aggregator at the end") ----
+  aggregate:


+                continue
+            p = os.path.join(root, f)
+            if f.endswith(".ndjson"):
+                for line in open(p):


+                            pass
+            elif f.endswith(".json"):
+                try:
+                    yield p, json.load(open(p))


Two fixes for the errored sweep cells (FileNotFoundError .cx_workloads/<wid>.manifest.json):

1. run_in_container SHARD loop unsets CX_WORKLOAD_DIR per case. cx_stage_canonical short-circuits when it's set, so the first case's staged dir was reused for all later cases (different routing/dims -> missing manifest). Now each re-stages.
2. sweep_matrix sets canonical=false: the broad sweep runs seeded-runtime (comparable-experimental; fixed seed = same cross-SKU trace), so no per-case canonical staging is needed, removing the dependency + overhead entirely.

The 6 re-fired sweeps left residual failures concentrated on rack-scale EP8 (gb200/gb300) and b200 DeepEP-V2. Two distinct bugs:

1. cx_build_deepep_v2 built arch=9.0 (Hopper) for b200: its CX_RUNNER arch case omitted b200* (sm100). DeepEP-V2 on b200 ran the wrong kernels. Mirror the hybrid builder: b300*|gb300*|b200* -> 10.0.
2. The gb200/gb300 EP8 path runs run_ep.py directly across trays (not run_in_container's shard loop), so in sweep mode it (a) referenced bare $CX_DISPATCH_DTYPE etc., unbound under set -u, crashing the whole gb300 job on its first line, and (b) ran a single CX_* config instead of the shard's N cases, so rack-scale EP8 was never swept. Make the EP8 path shard-aware: expand CX_SHARD_FILE into one '|'-separated arg-line per case (| not tab: tab is IFS-whitespace, so read collapses empty fields like a false eplb and shifts columns), loop every case with per-case defaults, full axis set for parity.

Add sweep_matrix --min-nodes + the workflow min_nodes input so the rack-scale EP8 cells can be re-run alone, without redoing the already-good single-tray EP4 shards (scarce gb200/gb300 trays).
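
The tab-versus-'|' point has a close analogue in Python, shown here only as an illustration (the launcher itself is shell, and the field order below is made up): splitting on whitespace collapses an empty field, while splitting on an explicit '|' keeps it.

```python
def encode_case(fields):
    """Join one case's fields into a '|'-separated arg-line."""
    return "|".join(fields)


def decode_case(line):
    """Split on the explicit delimiter, so empty fields keep their column."""
    return line.split("|")


if __name__ == "__main__":
    # Hypothetical field order: dispatch dtype, eplb flag (empty = false), mode.
    fields = ["bf16", "", "normal"]

    # Whitespace splitting drops the empty middle field and shifts columns,
    # which is analogous to `read` with a tab in IFS.
    print("\t".join(fields).split())

    # The '|' form round-trips all three columns.
    print(decode_case(encode_case(fields)))
```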

The matrix sweep runs many cells concurrently; each launcher resolved its Slurm JOB_ID with `squeue --name=$RUNNER | head -1`, but the job-name is not unique per cell, so concurrent same-named allocations returned a SIBLING cell's id. Observed on gb300: salloc granted 11354 but the name lookup returned a still-pending 11356 -> srun "Expired or invalid job 11356" -> the cell failed though its own allocation was fine. Systematic on the contended gb200/gb300 clusters (uccl gb200 11/11, deepep gb300 4/6, hybrid gb200 6); single-node SKUs got occasional one-offs (h100). The old one-config-at-a-time dispatch path never hit it (serialized). Add cx_salloc_jobid() to common.sh: run salloc and parse the GRANTED id from its OWN output (race-free), streaming progress live via tee. Route every launcher's salloc through it (gb300-nv, gb200-nv, b200-dgxc, b200-dgxc-slurm, b300, h100-dgxc-slurm, h200, mi355x-amds).
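
The race-free approach amounts to reading the id from the "Granted job allocation" line that salloc itself prints, rather than looking the job up by name afterwards. A small Python sketch of that parsing step follows; cx_salloc_jobid is a shell function in the PR, and `granted_job_id` is a hypothetical name.

```python
import re

# salloc reports the allocation it obtained as "salloc: Granted job allocation <id>".
_GRANTED = re.compile(r"Granted job allocation (\d+)")


def granted_job_id(salloc_output: str) -> str:
    """Return the job id that this salloc invocation was granted."""
    match = _GRANTED.search(salloc_output)
    if match is None:
        raise RuntimeError("salloc output did not contain a granted job id")
    return match.group(1)


if __name__ == "__main__":
    sample = (
        "salloc: Pending job allocation 11354\n"
        "salloc: job 11354 queued and waiting for resources\n"
        "salloc: Granted job allocation 11354\n"
    )
    print(granted_job_id(sample))
```

Since the id comes from the invocation's own output, a sibling cell with the same job name cannot be picked up by mistake.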

The gb200/gb300 EP8 path expands CX_SHARD_FILE on the SUBMIT HOST (cwd = repo root), but CX_SHARD_FILE is workflow-relative (results/.shard_<id>.json) and the Extract step writes it under working-directory=experimental/CollectiveX. So `[ -f "$CX_SHARD_FILE" ]` failed, the SHARD branch was skipped, and cx_ep8_cases fell back to ONE default case (bf16/normal/uniform) instead of the shard's N. The gb300/gb200 'successes' ran 1/14 of the work (logs show a lone EP8[1], 1 JSON per 14-case shard). The single-node/EP4 path was unaffected: run_in_container reads the file from inside the container at /ix/experimental/CollectiveX. Resolve CX_SHARD_FILE against $CID when not found as-is (both rack launchers). Verified: relative path + cwd!=CX_DIR now finds the shard and emits every case.

…sult JSONs) The restructure's goal was 'a result aggregator so there aren't so many individual result files', but plot_ep only globbed *.json, so locally results/ still held ~1200 loose per-case JSONs. Add _iter_docs(): yield docs from *.json AND one-per-line from each *.ndjson (the aggregate), and route all 7 load_*_series loaders through it. Now the single results/aggregate/collectivex_ep.ndjson is a valid plot source — the per-case JSONs can be merged in (aggregate_results.py) and deleted. Verified: 1204-doc ndjson -> identical 1136 series; results/ 103M -> 43M, ~1200 files -> 1.
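
A loader of the kind described, yielding documents from loose *.json files and line by line from *.ndjson aggregates, might look like the sketch below. It is not the PR's `_iter_docs`: skipping dot-files and silently ignoring unparsable input are assumptions made for this illustration.

```python
import json
import os


def iter_docs(results_dir):
    """Yield (path, doc) from *.json files and from each line of *.ndjson files."""
    for root, _dirs, files in os.walk(results_dir):
        for name in sorted(files):
            if name.startswith("."):
                # Assumption: hidden files (e.g. shard inputs) are not results.
                continue
            path = os.path.join(root, name)
            if name.endswith(".ndjson"):
                with open(path) as handle:
                    for line in handle:
                        line = line.strip()
                        if not line:
                            continue
                        try:
                            doc = json.loads(line)
                        except json.JSONDecodeError:
                            continue  # skip a malformed line, keep the rest
                        yield path, doc
            elif name.endswith(".json"):
                try:
                    with open(path) as handle:
                        doc = json.load(handle)
                except (OSError, json.JSONDecodeError):
                    continue
                yield path, doc
```

Routing every series loader through one generator like this is what lets a single aggregate file stand in for the per-case JSONs.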

Previously each EP library was a separate workflow_dispatch (deepep, uccl, flashinfer, deepep-hybrid, nccl-ep, +deepep_v2) = ~6-8 runs to cover the matrix. sweep_matrix gains --backends ('all'|comma-list): each deepep-origin case is emitted once per target backend (capability-filtered; mori stays AMD-native), with deepep-v2 as a per-cell variant (kernel_gen=v2). The shard key/id carry backend+v2 so cells stay distinct. collectivex-sweep.yml's backend input gains 'all' (now the default): setup resolves the union matrix, the existing per-cell sweep job already reads matrix.backend/deepep_v2, and one aggregate folds everything into the ndjson. backend=all -> 211 shard-cells / 2474 cases in ONE run (under the GHA 256-cell matrix cap; slim matrix 35KB << 1MB output cap). --backend/--deepep-v2 single modes kept for targeted re-runs. One dispatch replaces the ~8.

The 22 tools/_*.sh were the pre-GHA dev-orchestration layer (SSH salloc/srun drivers, per-SKU probes, the old _gha_suite/_gha_matrix/_gha_collect dispatch path). The GHA model — collectivex-sweep.yml (now one combined backend=all run) + sweep_matrix.py + launchers/ + run_in_container.sh + the cron-driven aggregate_results/plot pipeline — fully supersedes all of it. Verified the live path (launchers, runtime, tests, workflow) never invokes tools/; only two sweep_matrix doc-comments mentioned _gha_suite.sh (updated). Recoverable from git history if any SSH-orchestration helper is needed again.

The combined backend=all sweep confirmed uccl and deepep-hybrid fail entirely on the aarch64 Grace-Blackwell SKUs (0 valid docs at both EP4 and EP8) while working on x86 single-node and while flashinfer/nccl-ep/deepep land full rack coverage on the same clusters — a backend-specific aarch64 from-source-build/transport wall (their builds were probe-confirmed on x86 B300 only), not a launcher issue. Rack-scale coverage is complete via the three backends that do run there.

The page opened on pub=official-headline, but the sweep is seeded-runtime (comparable-experimental, wid=null), which official/publishable exclude by design — so only ~373 official series showed and the 2586-series bulk looked missing. Default pub to 'all' (the full sweep is the point of this dashboard); official/publishable remain one toggle away for the canonical-wid cohort. Updated the stale footer note.

… on aarch64 rack) The rack EP8 launcher path runs run_ep.py via multi-srun and bypasses cx_build_deepep_v2 (which lives in run_in_container), so deepep_v2=true on gb200/gb300 EP8 silently ran bundled V1 (1.1.0) while the artifact got the 'deepep-v2' name — the doc kernel_gen was honestly v1, but the name implied V2. aarch64 Grace-Blackwell has also never produced a genuine from-source V2 (same wall class as uccl/deepep-hybrid). Genuine V2 (2.0.0+af9a040) is x86 single-node only (h100/h200/b300/b200, where the EP4/single-node path builds it once). Exclude v2 from gb200/gb300 in sweep_matrix so no mislabeled artifact is produced; deepep V1 still covers rack. Documented in gated.md; fixed my earlier wrong 'V2 works on aarch64' claim.

…s genuine results ep_uccl.py docstring + gated.md claimed UCCL was 'SCAFFOLD — NOT yet producing results / fails loudly, deferred', but cx_build_uccl vendors UCCL's deep_ep_wrapper as uccl_deepep (git-cloned from uccl-project/uccl at the wheel-matched tag) and ep_uccl.py runs genuine uccl.ep dispatch/combine through it — 507 valid docs, correct=True, uccl_version=0.1.1, intranode NVLink on h100/h200/b300/b200. The inverse of the deepep-v2 mislabel: docs under-claimed working data. (aarch64 gb200/gb300 still walled.)

…8 rack deferred) Correcting my earlier wrong 'aarch64 V2 walled' claim: gb300 EP4 (run 28429220764) built genuine kernel_gen=v2 / deepep_version=2.0.0 via run_in_container's cx_build_deepep_v2. The V1 fallback was solely because gb300 defaults to EP8 (2 trays) and the rack multi-srun path bypasses the build (8 separate per-rank containers). sweep_matrix now allows v2 on gb200/gb300 at EP4 (nodes='') and excludes only EP8 (nodes set), so aarch64 V2 is genuinely covered at EP4 with no mislabel. EP8 rack V2 deferred (needs a build-once-per-container step in the multi-srun).

…as silently running EP8) All gb300/gb200 deepep docs were world_size=8: the sweep passed CX_NODES='' for EP4 cells, and the gb300 launcher's NODES=${CX_NODES:-2} coerced empty to 2 (EP8) — so 'EP4' cells ran the EP8 rack multi-srun, which also bypasses cx_build_deepep_v2/cx_build_flashinfer_latest (hence the deepep-v2 sweep producing V1). Set nodes explicitly: EP4->'1', EP8->'2'. Now EP4 cells pass CX_NODES=1 -> launcher EP4 path -> run_in_container -> genuine V2/quant-combine at world=4. v2 exclusion updated to gate on tray count>1 (EP8) not truthy.
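The two failure modes in this commit (empty CX_NODES coerced to 2, and a truthiness gate that stops working once nodes is always set) can be sketched in Python as below. This is a minimal illustration, not the actual sweep_matrix or launcher code: `EP_TO_NODES`, `launcher_nodes`, `exclude_v2_truthy`, and `exclude_v2_by_tray_count` are hypothetical names, and `launcher_nodes` only mimics the shell default `NODES=${CX_NODES:-2}`.

```python
from typing import Optional

# Hypothetical mapping: EP degree -> tray count passed explicitly as CX_NODES.
EP_TO_NODES = {4: "1", 8: "2"}


def launcher_nodes(cx_nodes: Optional[str]) -> int:
    """Mimic the shell default NODES=${CX_NODES:-2}.

    The ':-' form substitutes the default when the variable is unset OR empty,
    so an empty string (the old 'EP4' signal) silently becomes 2 trays.
    """
    return int(cx_nodes) if cx_nodes else 2


def exclude_v2_truthy(nodes: str) -> bool:
    """Old-style gate (assumed): exclude V2 whenever nodes is set at all."""
    return bool(nodes)


def exclude_v2_by_tray_count(nodes: str) -> bool:
    """Gate on tray count: only multi-tray (EP8) cells are excluded."""
    return int(nodes) > 1


if __name__ == "__main__":
    print("empty CX_NODES ->", launcher_nodes(""), "trays")
    for ep, nodes in EP_TO_NODES.items():
        print(f"EP{ep}: nodes={nodes!r}",
              "truthy-gate excludes:", exclude_v2_truthy(nodes),
              "tray-gate excludes:", exclude_v2_by_tray_count(nodes))
```

With nodes always explicit, the truthy gate would exclude EP4 as well (the string '1' is truthy), which is why the exclusion has to compare the tray count instead.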

…rsistent container) The gb300 EP8 rack path ran run_ep.py over a per-rank multi-srun (8 separate ephemeral containers), bypassing the from-source build hooks AND never threading --combine-dtype — so EP8 V2 ran V1 and EP8 quant-combine ran none. Fix: a setup-srun builds the kernels ONCE PER NODE into a persistent --container-name (via run_in_container's new CX_BUILD_ONLY mode), and every case-srun reuses that named container (build visible to all 8 ranks); the case-srun now also threads --combine-dtype/--combine-quant-mode. Keeps the proven MNNVL transport. run_in_container gains CX_BUILD_ONLY (build + exit).

… internode DeepEP V2 EP8 (2 trays, world=8) crashed with cudaErrorIllegalAddress at csrc/legacy/buffer.hpp:301 while combine-fp8/nvfp4 EP8 succeeded. Root cause is not the build-once container (build rc=0, NCCL 2.30.7 satisfies V2): NVSHMEM's MNNVL auto-detect (NVSHMEM_DISABLE_MNNVL defaults false) wires the cross-tray NVL72 fabric as multi-node-NVLink, but DeepEP's LL kernels are architected around the RDMA topology team (cpu_rdma_team) and issue IBGDA WQE writes from device code -> transport mismatch -> illegal address. Per DeepEP hardware-integration docs, force NVSHMEM_DISABLE_MNNVL=1 (+IBGDA enable) for the deepep EP8 case so the LL device code's expected transport matches. DeepEP-gated; flashinfer EP8 keeps riding NCCL's MNNVL transport untouched.

…s (the real fix) The EP8 illegal-address was NOT a hardware wall: bundled-V1 DeepEP runs 180 correct cross-tray EP8 docs (ws8/nodes2/mnnvl) on the same gb300. Upstream DeepEP V2's legacy Buffer ADDED an allow_mnnvl param (default False); when off, DeepEP itself sets NVSHMEM_DISABLE_MNNVL=1 and the buffer takes the intranode-only CUDA-IPC peer path -> cudaErrorIllegalAddress at csrc/legacy/buffer.hpp across NVL72 trays. (This is why an *external* NVSHMEM_DISABLE_MNNVL had no effect: DeepEP was already forcing it.) tests/ep_deepep.py now passes allow_mnnvl=True on both Buffer constructions when CX_ALLOW_MNNVL=1, gated on the param actually existing (inspect) so bundled-V1 and x86 single-node are byte-for-byte unchanged; recorded in backend_provenance. launch_gb300-nv.sh exports CX_ALLOW_MNNVL=1 for the deepep EP8 case.
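The inspect-based gating described here could look roughly like the sketch below. It assumes the Buffer constructor is introspectable with `inspect.signature`; `buffer_kwargs`, `BufferV1`, and `BufferV2` are hypothetical stand-ins and not the real DeepEP classes or the actual tests/ep_deepep.py code.

```python
import inspect
import os


def buffer_kwargs(buffer_cls, env=None):
    """Return extra constructor kwargs for buffer_cls.

    allow_mnnvl=True is passed only when CX_ALLOW_MNNVL=1 AND the constructor
    actually declares an allow_mnnvl parameter, so older builds are untouched.
    """
    env = os.environ if env is None else env
    params = inspect.signature(buffer_cls.__init__).parameters
    if env.get("CX_ALLOW_MNNVL") == "1" and "allow_mnnvl" in params:
        return {"allow_mnnvl": True}
    return {}


class BufferV1:
    """Stand-in for a bundled-V1 style buffer without the parameter."""

    def __init__(self, group, num_nvl_bytes=0):
        self.group = group


class BufferV2:
    """Stand-in for a V2 style buffer that added allow_mnnvl (default False)."""

    def __init__(self, group, num_nvl_bytes=0, allow_mnnvl=False):
        self.group = group
        self.allow_mnnvl = allow_mnnvl


if __name__ == "__main__":
    env = {"CX_ALLOW_MNNVL": "1"}
    buf = BufferV2("group0", **buffer_kwargs(BufferV2, env))
    print("V2 allow_mnnvl:", buf.allow_mnnvl)
    print("V1 extra kwargs:", buffer_kwargs(BufferV1, env))
```

Checking the signature rather than the package version keeps the call site unchanged for any build that lacks the parameter, which matches the "byte-for-byte unchanged" goal stated in the commit.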

…200 mirror, docs) gb300 EP8 deepep-v2 validated genuine (run 28434764062: kernel_gen=v2/ws8/nodes2/mnnvl/allow_mnnvl=True/correct=8/8). Finalize:
- sweep_matrix: drop the EP8-v2 exclusion for gb200/gb300 (v2 now runs at every EP degree via build-once + allow_mnnvl).
- launch_gb200-nv.sh: mirror the proven gb300 EP8 fix: build-once into a persistent --container-name, thread combine args, export CX_ALLOW_MNNVL=1 for deepep. (gb200 re-validation pending an allocation; pattern identical to the validated gb300 run.)
- gated.md: DeepEP V2 rack EP8 moved from 'deferred' to DONE with the allow_mnnvl root cause + validation run.

…d16) h100 nccl-ep was ws8-only — the h100 launcher was single-node, lacking the CX_NODES>1 FileStore-rendezvous block that launch_h200.sh has (so cross-node world16 was never obtainable on h100). Port that block h200->h100 (adapting partition/account/exclude + h100-multinode-ib topology): one container task per node, FileStore rdzv on the compute-visible /mnt/nfs mount, AVOIDS torchrun's unreachable elastic TCPStore. nccl-ep is the validated portable cross-node EP.

@WS8

…h covers H100') gated.md claimed cross-node 'DONE via nccl-ep' for H100/H200, but only h200 ws16 was ever actually run; the h100 claim was aspirational. Attempt (run 28446105759, launcher cross-node block ported h200->h100): 2-node alloc + per-node containers come up, but the nccl-ep run reproducibly hangs to the 900s timeout on both decode and prefill (gloo+NCCL FileStore bringup that auto-detects the iface on h200 doesn't converge on hpc-gpu-1; no SSH to set SOCKET_IFNAME). Not a systematic-matrix data point either (sweep_matrix places h100 single-node only). h100 single-node EP (all backends @WS8) remains complete.

…for EP4-only sweeps Final completeness check found 12 uncovered cells, all gb300 EP4 (ws4): EP4 was left thin (probes only) under the prior 'ignore EP4' directive while EP8 was swept fully. The current goal includes gb300 EP4. --max-nodes 1 lets the sweep target single-tray (EP4) shards only, so EP4 can be filled without redundantly re-running the expensive 2-tray EP8 allocations.

+    needs: sweep
+    if: always()
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
+        with: { clean: true }
+      - uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
+        with:
+          pattern: cxshard-*-${{ github.run_id }}
+          path: _shards
+          merge-multiple: true
+      - name: Aggregate shards -> one ndjson
+        working-directory: experimental/CollectiveX
+        run: |
+          set -euo pipefail
+          tag="${{ inputs.backend }}${{ inputs.deepep_v2 && '-v2' || '' }}"
+          python3 aggregate_results.py --in-dir ../../_shards --out "results/aggregate/collectivex_${tag}_${{ github.run_id }}.ndjson"
+          {
+            echo "## CollectiveX sweep aggregate (${tag})"
+            echo '```'
+            wc -l results/aggregate/*.ndjson 2>/dev/null || echo "no ndjson"
+            echo '```'
+          } >> "$GITHUB_STEP_SUMMARY"
+      - name: Upload aggregate
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
+        with:
+          name: cxsweep-aggregate-${{ inputs.backend }}${{ inputs.deepep_v2 && '-v2' || '' }}-${{ github.run_id }}
+          path: experimental/CollectiveX/results/aggregate/*.ndjson
+          if-no-files-found: warn
+
+  update-frontend-snapshot:


+    name: Update InferenceX-app snapshot
+    needs: aggregate
+    if: always() && needs.aggregate.result == 'success'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Trigger CollectiveX snapshot update
+        env:
+          FRONTEND_PAT: ${{ secrets.INFX_FRONTEND_PAT }}
+        run: |
+          set -euo pipefail
+          tmp="$(mktemp -d)"
+          trap 'rm -rf "$tmp"' EXIT
+          git clone --quiet --depth 1 --branch collectivex \
+            "https://x-access-token:${FRONTEND_PAT}@github.com/SemiAnalysisAI/InferenceX-app.git" \
+            "$tmp/app"
+          cd "$tmp/app"
+          git pull --rebase origin collectivex
+          mkdir -p .github
+          {
+            echo "source_run_id=${{ github.run_id }}"
+            echo "source_sha=${{ github.sha }}"
+            echo "source_workflow=${{ github.workflow }}"
+            echo "source_run_url=https://github.com/SemiAnalysisAI/InferenceX/actions/runs/${{ github.run_id }}"
+            echo "triggered_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+          } > .github/collectivex-source-run.env
+
+          git config user.name "InferenceX Data Bot"
+          git config user.email "actions@users.noreply.github.com"
+          git add .github/collectivex-source-run.env
+          if git diff --cached --quiet; then
+            echo "CollectiveX source-run marker is already current."
+            exit 0
+          fi
+          git commit -m "chore: trigger CollectiveX data update for ${{ github.run_id }}"
+          git push origin HEAD:collectivex


… fresh runs) Caught a stale blanket wall while verifying EP4 correctness. Fresh per-backend re-validation on gb300:
- deepep-hybrid EP4 WORKS (run 28452161275: 30 docs, 169/169 correct, branch=hybrid-ep); the old '0 valid docs at EP4' was wrong.
- deepep-hybrid EP8 WALL (run 28457026077: HybridEPBuffer not exposed in the multi-srun container + intranode-NVLink buffer can't span trays).
- uccl aarch64 WALL confirmed (run 28457032490: ModuleNotFoundError uccl.ep).
Corrected gated.md from the blanket claim to per-EP-degree truth.

…steps Web (UCCL-EP paper) confirms NVIDIA HybridEP supports inter-node (IBGDA) + Grace Blackwell, so the gb300 EP8 'intranode-only wall' was a misdiagnosis. Real cause: cx_build_deepep_hybrid builds with build_ext --inplace and sets PYTHONPATH=/tmp/DeepEP_hybrid + NVSHMEM LD_LIBRARY_PATH process-locally. EP4 single-node runs in that same process (works); EP8 multi-srun runs build-once and case in SEPARATE srun steps sharing only the pyxis --container-name fs, so the env doesn't cross -> 'module deep_ep has no attribute HybridEPBuffer'. Fix: build-once persists the env to /tmp/.cx_hybrid_env (lives in the named container); the EP8 case WRAP sources it (gb300+gb200). No-op for other backends.

Oseltamivir requested a review from a team June 23, 2026 07:04

github-project-automation Bot added this to InferenceMAX Board Jun 23, 2026