Skip to content

Add NULL checks after malloc and improve diagnostic logging in shadow pool functions#2175

Open
Pritiks23 wants to merge 218 commits into
NVIDIA:masterfrom
Pritiks23:combine-prs
Open

Add NULL checks after malloc and improve diagnostic logging in shadow pool functions#2175
Pritiks23 wants to merge 218 commits into
NVIDIA:masterfrom
Pritiks23:combine-prs

Conversation

@Pritiks23

Copy link
Copy Markdown

Description

This pull request combines the changes from PRs #2172 and #2173:

Added NULL checks after malloc in ncclShadowPoolAlloc to prevent null pointer dereferences and silent crashes.
Improved diagnostic logging in ncclShadowPoolFree error path to include device pointer address, hash bucket index, and total object count for better debugging.
These changes enhance the robustness and maintainability of the code by addressing potential memory allocation failures and improving error diagnostics.

Related Issues

PR #2172: Add NULL checks after malloc in ncclShadowPoolAlloc
PR #2173: Improve diagnostic logging in ncclShadowPoolFree error path

Changes & Impact

Added NULL checks after malloc calls in ncclShadowPoolAlloc:
pool->table
table1 (hash table expansion)
page (shadow pool metadata)
obj (shadow pool objects)
Improved WARN messages in ncclShadowPoolFree to include:
Device pointer address
Hash bucket index
Total object count in the pool
These changes prevent silent crashes and make debugging memory issues easier.

Performance Impact

No significant performance impact is expected.
Changes have been tested to ensure no regressions in functionality.

Signed-off-by:
Pritika Vipin [pritikavipin@gmail.com]

pakmarkthub and others added 30 commits April 11, 2026 06:18
…resource sharing mode support to all GDAKI APIs

## Summary
- Introduce `NCCL_GIN_RESOURCE_SHARING_THREAD`.
- Add resource sharing support to Get and Flush in GDAKI.
- Add resource sharing thread to all GDAKI APIs.

## Implementation
- Add `NCCL_GIN_RESOURCE_SHARING_THREAD` to `enum ncclGinResourceSharingMode`.
- Modify Get and Flush APIs in GDAKI to support all resource sharing mode.
- Add resource sharing mode thread support to Put and PutValue in GDAKI.

Signed-off-by: Pak Markthub <pmarkthub@nvidia.com>

Signed-off-by: Xiaofan Li <xiaofanl@nvidia.com>

Mirrored-from: c622884df5b4bc3f8fee42de0368baddea93143a
worldGinBarrier sends a signal among all peers. With railed GIN, not all peers are connected, so worldGinBarrier does not work

Signed-off-by: Katie Gioioso <kgioioso@nvidia.com>

Mirrored-from: 40f87759058e2a86837c14085717150b1318336c
Mirrored-from: 95f7222a565c69972bb7cc65b1e5105444ec93ff
Signed-off-by: Abhilash Kolluri <akolluri@nvidia.com>

Mirrored-from: 986da69af53986859ad84384631e6ad54dfd0358
Signed-off-by: Abhilash Kolluri <akolluri@nvidia.com>

Mirrored-from: baae513fdbd309cd2012cbeb074ef267a02d6b30
Signed-off-by: Abhilash Kolluri <akolluri@nvidia.com>

Mirrored-from: 1b9e5172ba033638fcca26c2e3fd69f459628914
Signed-off-by: Rami Benayed <rbenayed@nvidia.com>

Mirrored-from: a7818c7952acf2d3222e4dcbe28efd5982617ede
Signed-off-by: Rami Benayed <rbenayed@nvidia.com>

Mirrored-from: 86836e195224482b51651d556479410c6d993471
When multiple GPUs share the network devices, selecting a different
starting device is key to avoid conflicts as much as possible.

To enforce consistency, we use the `gpu.dev % localGpuCount` in order to
avoid breaking it with the mirrorBit operation.

This change affect cases with no NET sharing (localGpuCount = 1) and
cases with 4 NETs to be shared by multiple GPUs.

Signed-off-by: Thomas Gillis <tgillis@nvidia.com>

Mirrored-from: f431b8143728d9e6d2032dbc479d83c57e57ddc5
…communicators

When multiple NCCL communicators are created in the same process, the
first communicator creates virtual NICs (vNICs) via ncclTopoMakeVNics
and caches the vNIC count. Subsequent communicators skip vNIC creation
(topo.cc:1413) but fail to mark the merged physical NICs as keep=0.

This causes the second communicator's topology to contain both physical
and virtual NICs, giving the ring search extra NIC paths. When two
sub-communicators (e.g., DP groups in a TP+DP setup) are created after
a TP communicator, they can end up with different ring topologies due to
the extra NICs, leading to different reduction orders in reduce_scatter
and bitwise non-deterministic results.

The fix queries each existing vNIC's constituent physical devices and
marks them keep=0, matching what ncclTopoMakeVnic does at line 1037.

Signed-off-by: Naman Goyal <ngoyal2707@gmail.com>

Mirrored-from: d7f092607f32b68d2499441c885130e785fc20f5
When a communicator is created after the fusion has already been done,
NCCL must ignore the already fused NICs.
The physical NICs are always marked as keep = 1, then the virtual NICs
are added to the topology as keep = 1 and the fused physical NICs are
removed with keep = 0.

Signed-off-by: Thomas Gillis <tgillis@nvidia.com>

Mirrored-from: 7309217a19f06b782861a9be6177d10523935de7
Fixes: NVIDIA#2026

Signed-off-by: Ronald Eytchison <ronald.eytchison@trailofbits.com>

Mirrored-from: c1cbee3d05ab3acac075abf6f44922cc27a150b2
WSAPoll was called on the first 1+maxnpeers slots instead of reaching
the listen socket at pollfds[maxProxyConnections], polling INVALID_SOCKET
and causing WSAENOTSOCK (10038) during proxy initialization.

Signed-off-by: Timothy Lo <tilo@nvidia.com>

Mirrored-from: 27b6898359fd3a60d2ba525b134168bb7b388ea6
After creation of the vNICs for a specific network plugin, the physical
NICs are marked as unused by that plugin.

If the physical dev is unused by all plugin, the dev is marked as keep=0
and trimmed later.

Signed-off-by: Ahsan Pervaiz <apervaiz@nvidia.com>
Signed-off-by: Thomas Gillis <tgillis@nvidia.com>

Mirrored-from: 7beadaadc11b0f7c35e97716363521eccb6ef6ee
Signed-off-by: Mark Santesson <msantesson@nvidia.com>

Mirrored-from: ad34e21727f315056b3c32d2c5b3227c79f40578
Reduce LL kernels memory consumption by including the top-k
indices in the token message payloads.

On dispatch, this allows to avoid maintaining a separate
buffer space per local-expert/remote-rank pair and instead
having one space for each remote rank.
This reduces the memory overhead from O(E x B x H) down to
O(N x B x H), where E is number of experts, N - number of ranks,
B - batch size, and H - token hidden dimension

On combine, the top-k indices are used to reduce the communication
buffer from O(E x B x H) to O(K x B x H).

Signed-off-by: Artem Y. Polyakov <artemp@nvidia.com>

Mirrored-from: 9abffbbc4a83968034f5c5966c44cb7b525d4d1c
Creating CUDA memory pools with the default maxSize=0 causes the driver
to reserve virtual address space equal to twice the GPU's physical memory,
leading to severe VA exhaustion on 48-bit systems during multi-communicator
training. This commit caps the shadow pool's maxSize to a default of 1 GiB,
Users can easily customize or disable this limit with a new environment
variable.

Signed-off-by: Serapheim Dimitropoulos <sdimitropoulos@coreweave.com>

Mirrored-from: 5610a4048efcb1a0fe98746576768e52e31335a7
In MoE training loops the expert routing (topk_idx) changes every
iteration but buffer sizes are determined by group-level constants.
The current API forces a full destroy+create cycle, causing unnecessary
cudaMalloc/cudaFree traffic on every iteration.

ncclEpUpdateHandle reuses the existing handle's buffers and re-runs
only the topk_idx-dependent computation (buffer resets, bitmap
conversion, allgather, metadata preprocessing). ncclEpCreateHandle now
delegates to it after allocation, so the first-call behavior is
unchanged.

Signed-off-by: Ke Wen <kwen@nvidia.com>

Mirrored-from: 13dcf1fbf49d441f004d3a7d2bc1f8023996c42a
…n HT mode

Signed-off-by: Subhadeep Bhattacharya <subhadeepb@nvidia.com>

Mirrored-from: 2110fd0b94d180dade732404e601419cb1cfb542
Signed-off-by: Katie Gioioso <kgioioso@nvidia.com>

Mirrored-from: 4f0a5727cf123c158aad0897af0eb2e84945a3a8
Move platform-specific PCI topology discovery functions out of
graph/xml.cc and into the OS abstraction layer (linux.cc/windows.cc),
declared in os.h

Signed-off-by: Timothy Lo <tilo@nvidia.com>

Mirrored-from: c95e613cf3e63b19d0a106a7886a0eb90f31518f
Introduce platform-specific code paths in ncclTopoGetXmlFromSys and
ncclTopoGetXmlFromGpu to support topology detection on Windows

Signed-off-by: Timothy Lo <tilo@nvidia.com>

Mirrored-from: 89c7ac993392ab8d4ed42c242679ee5470e57235
Add thin wrappers in nvmlwrap.h/nvmlwrap.cc for NVML APIs used by the
Windows topology discovery code paths

Signed-off-by: Timothy Lo <tilo@nvidia.com>

Mirrored-from: 144b16eec3cb13c1c6f7fe8521efe3b77e0c95f0
Signed-off-by: Pak Markthub <pmarkthub@nvidia.com>

Mirrored-from: a622fa1430843d91c739c6258a59c8ff4edb2925
The environment plugin init function does not currently take the nccl
debug logger function as other plugins do. Adding the logger function to
the env plugin init function avoid the plugin implementing adhoc logging
solutions.

Signed-off-by: Giuseppe Congiu <gcongiu@nvidia.com>

Mirrored-from: 7ce326dfaf79cae742b7f13d45a6243fd20faabb
… sizes

Replace 6 individual cudaMemsetAsync calls with 2 bulk calls using
the preprocessing_zero_region_size and preprocessing_s2d_size already
computed during allocation.  The buffers are sized to max_tokens
capacity, so clearing the full region is safe and eliminates per-call
size recomputation with no meaningful performance difference.

Signed-off-by: Ke Wen <kwen@nvidia.com>

Mirrored-from: fbdcc10487e6bf33987f220d3210786396c63cbc
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

Mirrored-from: 36a79161b86ab70e343a1efd43c423f9fd0923a3
… HT dispatch

Signed-off-by: Subhadeep Bhattacharya <subhadeepb@nvidia.com>

Mirrored-from: 228c3aa7405a5462c2311af3710a9cc88deed531
Remove the HOST-tag path that staged per-expert counts through a
device temporary and copied them back via cudaMemcpyAsync D2H.
Only the DEVICE tag is now accepted, which writes directly to the
caller's device buffer with no D2H copy.

The remaining D2H copy (num_tokens_for_experts_host mirror) is
annotated; removing it is a separate step toward full CUDA graph
compatibility.

Signed-off-by: Ke Wen <kwen@nvidia.com>

Mirrored-from: 25fe7df1f923d16a35b90e8d1704cd7de92d2b76
…ecvTokens

The async D2H mirror of num_tokens_for_experts in ncclEpUpdateHandle
was the last remaining obstacle to CUDA graph capture. Move the copy
to ncclEpHandleGetNumRecvTokens (a synchronous query API that doesn't
need to be graphable) and remove the host-pinned cache field entirely.

Signed-off-by: Ke Wen <kwen@nvidia.com>

Mirrored-from: a91d5d39bc89a7849dd8673bbf506172c06e6ce7
sjeaugey and others added 11 commits May 14, 2026 00:28
Indicate to plugins when they only need to connect to ranks on the
same rail.

Implement rankStride in the RMA IB plugin.
The parameter is currently ignored in the GDAKI plugin

Signed-off-by: Sylvain Jeaugey <sjeaugey@nvidia.com>

Mirrored-from: 80aa5e8ad689b8bee76cfdc82d45bb2d4ab95360
## Summary

Refactors NCCL internal logging so full location flows through **`ncclDebugLogInternal`** (`file`, `func`, `line`), while `ncclDebugLog` keeps the historical **`ncclDebugLogger_t`** calling convention to be ABI-safe.

Adds **`ALLOC_HOST`** under **`NCCL_DEBUG_SUBSYS`** for optional verbose host-heap **`INFO`** traces. Uses **`INFO_LOC` / `INFO_LOC_FN`** (and **`TRACE_LOC` / `TRACE_LOC_FN`**) so messages that should carry **`file:line (func)`** stay uniform without repeating **`"%s:%d (%s)"`** everywhere.

## Implementation details

### Debug symbols & ABI

- **`ncclDebugLogInternal(level, flags, file, func, line, fmt, …)`**: hidden/internal implementation path used by **`VERSION` / `WARN` / `INFO` / `TRACE`** macros (Linux: **`visibility("hidden")`**).
- **`ncclDebugLog`**: default-visible; **`legacy`** third argument semantics (**WARN** → path/file string; **TRACE** → function name string). Implemented via shared **`ncclDebugLogV`**.

### Host allocation diagnostics (**`ALLOC_HOST`**)

- Verbose host **`malloc` / `calloc` / realloc**, CUDA host alloc helpers, and related **`INFO`** lines use subsystem **`ALLOC_HOST`**; **`NCCL_DEBUG_SUBSYS`** must include it
- **`ncclCallocQuiet` / `ncclReallocQuiet`** suppress **`ALLOC_HOST`** **`INFO`** on high-churn paths (e.g. RAS).
- **`ncclCalloc` / `ncclRealloc`** **`unique_ptr`-style overloads** take **`bool logHostAlloc`** so **`ncclCalloc`** macros stay consistent.

### **`INFO_LOC`** helpers

- **`INFO`** still passes **`nullptr`** **`file`/`func`** into **`ncclDebugLogInternal`**; messages embed location only via **`INFO_LOC`** / **`INFO_LOC_FN`** (wrappers pass **`__FILE__`/`__LINE__`/`__func__`** or explicit **`file`/`line`/`caller`** from **`alloc.h`** debug wrappers).
- **`TRACE_LOC` / `TRACE_LOC_FN`** mirror **`INFO_LOC`** when **`ENABLE_TRACE`** is defined.

Signed-off-by: David Addison <daddison@nvidia.com>

Signed-off-by: Xiaofan Li <xiaofanl@nvidia.com>

Mirrored-from: f03e1b354e71062bb456b959fc3a33d5af04786d
… (Please DO NOT squash)

## Summary

* Extend the allocator interface per customer request
* Address C99 issues spotted during python integration (see !2611)

Signed-off-by: Artem Y. Polyakov <artemp@nvidia.com>

Signed-off-by: Ke Wen <kwen@nvidia.com>

Mirrored-from: 4eb8d7ec9f1e89152e30881e3ea44a4ad301a903
- prefer path with higher bandwidth over shorter hop count,
- refactor ncclGetLocalCpu function to use ncclTopoGetLocal

Signed-off-by: Thomas Gillis <tgillis@nvidia.com>

Mirrored-from: 6a5b60617f6dfb5f098633c6500631a3e22887e5
Refactor the multirank GPU detection to not trigger when MloPart is
being used in the communicator.

We still silently disable NVLS when two ranks are using the same NVML
dev.

Signed-off-by: Thomas Gillis <tgillis@nvidia.com>

Mirrored-from: 83f0005568223da5405bffe0ff7f3f65610c0432
When MloPart is enabled, NCCL will split the device into two sub DEV nodes, one for each of the uGPU.

This will lead to 50% NVLink bw for each sub DEV to the NVS and a LOC link between both sub DEVs.
Further, each sub-dev is able to saturate 100% of the C2C bw.

Signed-off-by: Thomas Gillis <tgillis@nvidia.com>

Mirrored-from: f5a945fa7680fefa20669160d0844c5868d15fe6
gdrSupport is disabled if unsupported.

Signed-off-by: Thomas Gillis <tgillis@nvidia.com>

Mirrored-from: feef28ea1b47202a9c4ac9a0ea837ed38b857ff1
Disable the UB registration path if the mlopart buffer doesn't have RDMA.
At enqueue, if the buffer hasn't been registered, NCCL will use the
internal buffering.

Signed-off-by: Thomas Gillis <tgillis@nvidia.com>

Mirrored-from: 3f2473766411b7ff2cadf610b95fa7773b921f24
## Summary
Adds a Python device API that lets users write NCCL device-side kernels
in CuTeDSL (`cutlass.cute`).

Adds an example (`examples/cute/main.py`): 2-rank, 1 MiB GIN put with
signal/wait. Two separate registered windows for clarity; one-step
`@cute.jit` direct call with `cutlass.Int64`-annotated args; host-side
element-wise validation after `device.sync()`.

Signed-off-by: Xiakun Lu <xiakunl@nvidia.com>

Mirrored-from: 0ad86ca37281a2acb888725146992e5c4f9ce072
Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

Mirrored-from: 3d9d627c3facb45a2c4308aafc6cde20c7f66a09
Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

Mirrored-from: 33a689942a57801bd217626d078397a750cc0d38

@Pritiks23 Pritiks23 left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate your review and valuable insights @justus-nv, Please see my changes.

@justus-nv

Copy link
Copy Markdown
Contributor

Appreciate your review and valuable insights @justus-nv, Please see my changes.

Thank you for incorporating my feedback. It is in much better shape. There is one possible memory leak for page (noted in comments). Once the leak is properly handled. The change should be good to go.

@Pritiks23 Pritiks23 left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justus-nv Thank you, Implemented. In the page-allocation path, I converted both cudaMallocFromPoolAsync and cudaMemsetAsync checks to CUDACHECKGOTO. I also added cleanup in the error path so if either call fails, any half-created page is properly freed. While checking this logic, I fixed the same kind of leak in the non-page path and initialized the page pointer to prevent undefined behavior if execution jumps to the fail label before page is assigned.

@justus-nv justus-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick follow up. The changes look good.

There are two minor issues that need to be addressed.

  1. All commits need sign-off (use -s git option). I checked your commit messages, but only one commit has a sign-off. The easiest solution may be to squash and rebase with a new signed-off commit.
  2. pool->count is an int, but the formatting is %lu in updated warning log in ncclShadowPoolFree.

Once those two issues are dealt with, we should be good to move forward.

@Pritiks23 Pritiks23 force-pushed the combine-prs branch 2 times, most recently from 83fd6ea to 6b1c0a9 Compare May 18, 2026 19:28
@Pritiks23

Copy link
Copy Markdown
Author

Appreciate you pointing these out. I’ve fixed the Signed-off-by handling and updated the format specifier to match pool->count. @justus-nv

@justus-nv

Copy link
Copy Markdown
Contributor

/mirror v2.31

1 similar comment
@jskrobola

Copy link
Copy Markdown
Collaborator

/mirror v2.31

@NCCL-Bot

Copy link
Copy Markdown
Collaborator

Mirroring to the internal repository failed.

The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.

kiskra-nvidia and others added 3 commits May 20, 2026 17:06
Casting FLAGS to unsigned long was unnecessary and was in fact hiding
errors if somebody forgot to specify the debug subsystem.

Also fix a compiler warning about mismatched argument types when
compiling with TRACE enabled.

Signed-off-by: Kamil Iskra <kiskra@nvidia.com>

Mirrored-from: 99144ee3beeb73ef0f57a687b65c660a4964f0dd
Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

Mirrored-from: 57d9f7ae7a4396e5232d1472797a56aff3d8c231
…acros that was incorrectly reverted during merge

Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

Mirrored-from: d048f213dea659ad4634c76aba4a1a1c8fc3dbe2
@justus-nv

Copy link
Copy Markdown
Contributor

/mirror v2.31

@justus-nv

Copy link
Copy Markdown
Contributor

@Pritiks23 Can you rebase your branch onto dev. This will enable us to pull your branch into our internal rep.

Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>
@Pritiks23

Copy link
Copy Markdown
Author

Sure @justus-nv

#2188

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.