Fix KernelCh completion ordering in profiler proxy #2221

Open
usernamehaha2022 wants to merge 1 commit into NVIDIA:master from usernamehaha2022:master

@usernamehaha2022

Description

This PR fixes a race in the NCCL profiler proxy where a KernelCh completion event could be processed before the corresponding start event was delivered.

The proxy now only stops/drains a KernelCh profiler sub-operation after StartKernelChEvent has been posted.

In stress testing with many small Ring+LL collectives, inspector logs showed missing coll_sn records across ranks, indicating dropped profiler records caused by this ordering race. Example gaps included:

[image: inspector log excerpt showing the coll_sn gaps]

Related Issues

N/A

Changes & Impact

  1. Adds a guard so ncclProfilerStopKernelChEvent only runs after sub->posted == sub->nsteps.
  2. Reuses the computed profiler ring index instead of recomputing it.
  3. No API or ABI changes.
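
The guard in item 1 can be sketched roughly as follows. This is a simplified, standalone model and not the actual NCCL code: the `SubOp` struct, the `progress_once` function, and the boolean "ready" flags (standing in for the device-side start/stop counters) are all hypothetical, and it assumes the start branch marks delivery by setting `posted` to `nsteps`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a profiler proxy sub-operation. */
struct SubOp {
  int posted;       /* assumed: set to nsteps once the start event is emitted */
  int transmitted;  /* assumed: set to nsteps once the stop event is emitted */
  int nsteps;
  int startEvents;  /* number of start events delivered (for illustration) */
  int stopEvents;   /* number of stop events delivered (for illustration) */
};

/* One polling pass over a sub-operation.
 * startReady / stopReady model "the device counter for this event has advanced".
 * Returns 1 when the sub-operation completed during this pass, 0 otherwise. */
static int progress_once(struct SubOp* sub, bool startReady, bool stopReady) {
  if (sub->posted < sub->nsteps && startReady) {
    sub->startEvents++;
    sub->posted = sub->nsteps;  /* start event has now been posted */
    return 0;                   /* handle the stop on a later pass */
  }
  /* The guard: never stop before the start event has been posted. */
  if (sub->posted == sub->nsteps &&
      sub->transmitted < sub->nsteps && stopReady) {
    assert(sub->startEvents == 1);  /* ordering invariant the guard protects */
    sub->stopEvents++;
    sub->transmitted = sub->nsteps;
    return 1;
  }
  return 0;
}
```

In this model, a pass that sees the completion counter advance before the start counter simply does nothing, and the stop event is emitted on a later pass once the start has been posted.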

Performance Impact

No expected impact on normal NCCL collective performance.
The change is limited to the profiler proxy path and only adds one CPU-side condition when profiler/inspector is enabled.

@xiaofanl-nvidia

Collaborator

++ @armratner @rishdas to take a look

@armratner

Collaborator

This will no longer be relevant: we decouple from the proxy in 2.31, and we have covered this issue.
