Fix KernelCh completion ordering in profiler proxy by usernamehaha2022 · Pull Request #2221 · NVIDIA/nccl

usernamehaha2022 · 2026-06-08T06:59:09Z

Description

This PR fixes a race in the NCCL profiler proxy where a KernelCh completion event could be processed before the corresponding start event was delivered.

The proxy now only stops/drains a KernelCh profiler sub-operation after StartKernelChEvent has been posted.
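The ordering problem can be illustrated with a small toy model. This is a sketch under stated assumptions, not the NCCL implementation: `SubOp`, `EventLog`, and `progress` are hypothetical names, and the model assumes the start event is delivered once the last step has been posted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one proxy sub-operation; not NCCL's real struct.
struct SubOp {
  int nsteps = 0;           // steps the sub-operation must post in total
  int posted = 0;           // steps posted so far
  bool kernelDone = false;  // the kernel side has already reported completion
  bool started = false;     // start event delivered to the profiler
  bool stopped = false;     // stop event delivered to the profiler
};

using EventLog = std::vector<std::string>;

// One pass of a toy progress loop. With guarded == false it mimics the racy
// ordering; with guarded == true it applies the posted == nsteps check.
void progress(SubOp& sub, EventLog& log, bool guarded) {
  if (sub.posted < sub.nsteps) sub.posted++;
  assert(sub.posted <= sub.nsteps);

  // Modeling assumption: the start event goes out once the last step is posted.
  if (sub.posted == sub.nsteps && !sub.started) {
    log.push_back("start");
    sub.started = true;
  }

  // Drain the completion. The guard defers it until posting has finished.
  if (sub.kernelDone && !sub.stopped &&
      (!guarded || sub.posted == sub.nsteps)) {
    log.push_back("stop");
    sub.stopped = true;
  }
}
```

With `nsteps = 3` and `kernelDone` already set, which is the situation a very small collective can produce, three guarded passes log "start" then "stop", while three unguarded passes log "stop" then "start".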

In stress testing with many small Ring+LL collectives, inspector logs showed missing coll_sn records across ranks, indicating dropped profiler records caused by this ordering race. Example gaps included:

Related Issues

N/A

Changes & Impact

Adds a guard so ncclProfilerStopKernelChEvent only runs after sub->posted == sub->nsteps.
Reuses the computed profiler ring index instead of recomputing it.
No API or ABI changes.
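A rough sketch of what the first two changes might look like in isolation follows. Only the condition `posted == nsteps` comes from the PR text; the ring layout, `kRingSize`, `KernelChSlot`, the `counter` field, and `tryStopKernelCh` are assumptions made for illustration and do not mirror NCCL's actual data structures.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kRingSize = 8;  // hypothetical depth of the profiler ring

// Hypothetical ring entry written by the kernel side.
struct KernelChSlot {
  bool done;           // completion has been recorded for this slot
  uint64_t timestamp;  // completion timestamp recorded with it
};

// Hypothetical view of the proxy sub-operation fields that matter here.
struct SubOp {
  int nsteps;        // total steps to post
  int posted;        // steps posted so far
  uint64_t counter;  // monotonically increasing work counter
  bool stopped;      // stop event already emitted
};

// Returns true and writes the stop timestamp when the stop event may be
// emitted. Returns false while posting is incomplete or nothing is pending.
bool tryStopKernelCh(SubOp& sub, const KernelChSlot* ring, uint64_t* stopOut) {
  // Guard from the PR description: do nothing until every step is posted.
  if (sub.stopped || sub.posted != sub.nsteps) return false;

  // Compute the ring index once and reuse it for every access below.
  const uint64_t idx = sub.counter % kRingSize;
  if (!ring[idx].done) return false;

  *stopOut = ring[idx].timestamp;
  sub.stopped = true;
  return true;
}
```

In this sketch a caller would invoke `tryStopKernelCh` on each progress pass and forward `*stopOut` to the profiler only when the function returns true, so a completion that arrives early is simply picked up on a later pass.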

Performance Impact

No expected impact on normal NCCL collective performance.
The change is limited to the profiler proxy path and adds only one CPU-side condition check, which runs only when the profiler or inspector is enabled.

xiaofanl-nvidia · 2026-06-09T02:52:42Z

++ @armratner @rishdas to take a look

armratner · 2026-06-09T03:13:38Z

This will no longer be relevant, as we decouple from the proxy in 2.31, and we have covered this issue.

Wait for KernelCh start before profiler completion

01acf05

usernamehaha2022 wants to merge 1 commit into NVIDIA:master from usernamehaha2022:master
