Add NULL checks after malloc and improve diagnostic logging in shadow pool functions#2175
Add NULL checks after malloc and improve diagnostic logging in shadow pool functions#2175Pritiks23 wants to merge 218 commits into
Conversation
…resource sharing mode support to all GDAKI APIs ## Summary - Introduce `NCCL_GIN_RESOURCE_SHARING_THREAD`. - Add resource sharing support to Get and Flush in GDAKI. - Add resource sharing thread to all GDAKI APIs. ## Implementation - Add `NCCL_GIN_RESOURCE_SHARING_THREAD` to `enum ncclGinResourceSharingMode`. - Modify Get and Flush APIs in GDAKI to support all resource sharing mode. - Add resource sharing mode thread support to Put and PutValue in GDAKI. Signed-off-by: Pak Markthub <pmarkthub@nvidia.com> Signed-off-by: Xiaofan Li <xiaofanl@nvidia.com> Mirrored-from: c622884df5b4bc3f8fee42de0368baddea93143a
worldGinBarrier sends a signal among all peers. With railed GIN, not all peers are connected, so worldGinBarrier does not work Signed-off-by: Katie Gioioso <kgioioso@nvidia.com> Mirrored-from: 40f87759058e2a86837c14085717150b1318336c
Mirrored-from: 95f7222a565c69972bb7cc65b1e5105444ec93ff
Signed-off-by: Abhilash Kolluri <akolluri@nvidia.com> Mirrored-from: 986da69af53986859ad84384631e6ad54dfd0358
Signed-off-by: Abhilash Kolluri <akolluri@nvidia.com> Mirrored-from: baae513fdbd309cd2012cbeb074ef267a02d6b30
Signed-off-by: Abhilash Kolluri <akolluri@nvidia.com> Mirrored-from: 1b9e5172ba033638fcca26c2e3fd69f459628914
Signed-off-by: Rami Benayed <rbenayed@nvidia.com> Mirrored-from: a7818c7952acf2d3222e4dcbe28efd5982617ede
Signed-off-by: Rami Benayed <rbenayed@nvidia.com> Mirrored-from: 86836e195224482b51651d556479410c6d993471
When multiple GPUs share the network devices, selecting a different starting device is key to avoid conflicts as much as possible. To enforce consistency, we use the `gpu.dev % localGpuCount` in order to avoid breaking it with the mirrorBit operation. This change affect cases with no NET sharing (localGpuCount = 1) and cases with 4 NETs to be shared by multiple GPUs. Signed-off-by: Thomas Gillis <tgillis@nvidia.com> Mirrored-from: f431b8143728d9e6d2032dbc479d83c57e57ddc5
…communicators When multiple NCCL communicators are created in the same process, the first communicator creates virtual NICs (vNICs) via ncclTopoMakeVNics and caches the vNIC count. Subsequent communicators skip vNIC creation (topo.cc:1413) but fail to mark the merged physical NICs as keep=0. This causes the second communicator's topology to contain both physical and virtual NICs, giving the ring search extra NIC paths. When two sub-communicators (e.g., DP groups in a TP+DP setup) are created after a TP communicator, they can end up with different ring topologies due to the extra NICs, leading to different reduction orders in reduce_scatter and bitwise non-deterministic results. The fix queries each existing vNIC's constituent physical devices and marks them keep=0, matching what ncclTopoMakeVnic does at line 1037. Signed-off-by: Naman Goyal <ngoyal2707@gmail.com> Mirrored-from: d7f092607f32b68d2499441c885130e785fc20f5
When a communicator is created after the fusion has already been done, NCCL must ignore the already fused NICs. The physical NICs are always marked as keep = 1, then the virtual NICs are added to the topology as keep = 1 and the fused physical NICs are removed with keep = 0. Signed-off-by: Thomas Gillis <tgillis@nvidia.com> Mirrored-from: 7309217a19f06b782861a9be6177d10523935de7
Fixes: NVIDIA#2026 Signed-off-by: Ronald Eytchison <ronald.eytchison@trailofbits.com> Mirrored-from: c1cbee3d05ab3acac075abf6f44922cc27a150b2
WSAPoll was called on the first 1+maxnpeers slots instead of reaching the listen socket at pollfds[maxProxyConnections], polling INVALID_SOCKET and causing WSAENOTSOCK (10038) during proxy initialization. Signed-off-by: Timothy Lo <tilo@nvidia.com> Mirrored-from: 27b6898359fd3a60d2ba525b134168bb7b388ea6
After creation of the vNICs for a specific network plugin, the physical NICs are marked as unused by that plugin. If the physical dev is unused by all plugin, the dev is marked as keep=0 and trimmed later. Signed-off-by: Ahsan Pervaiz <apervaiz@nvidia.com> Signed-off-by: Thomas Gillis <tgillis@nvidia.com> Mirrored-from: 7beadaadc11b0f7c35e97716363521eccb6ef6ee
Signed-off-by: Mark Santesson <msantesson@nvidia.com> Mirrored-from: ad34e21727f315056b3c32d2c5b3227c79f40578
Reduce LL kernels memory consumption by including the top-k indices in the token message payloads. On dispatch, this allows to avoid maintaining a separate buffer space per local-expert/remote-rank pair and instead having one space for each remote rank. This reduces the memory overhead from O(E x B x H) down to O(N x B x H), where E is number of experts, N - number of ranks, B - batch size, and H - token hidden dimension On combine, the top-k indices are used to reduce the communication buffer from O(E x B x H) to O(K x B x H). Signed-off-by: Artem Y. Polyakov <artemp@nvidia.com> Mirrored-from: 9abffbbc4a83968034f5c5966c44cb7b525d4d1c
Creating CUDA memory pools with the default maxSize=0 causes the driver to reserve virtual address space equal to twice the GPU's physical memory, leading to severe VA exhaustion on 48-bit systems during multi-communicator training. This commit caps the shadow pool's maxSize to a default of 1 GiB, Users can easily customize or disable this limit with a new environment variable. Signed-off-by: Serapheim Dimitropoulos <sdimitropoulos@coreweave.com> Mirrored-from: 5610a4048efcb1a0fe98746576768e52e31335a7
In MoE training loops the expert routing (topk_idx) changes every iteration but buffer sizes are determined by group-level constants. The current API forces a full destroy+create cycle, causing unnecessary cudaMalloc/cudaFree traffic on every iteration. ncclEpUpdateHandle reuses the existing handle's buffers and re-runs only the topk_idx-dependent computation (buffer resets, bitmap conversion, allgather, metadata preprocessing). ncclEpCreateHandle now delegates to it after allocation, so the first-call behavior is unchanged. Signed-off-by: Ke Wen <kwen@nvidia.com> Mirrored-from: 13dcf1fbf49d441f004d3a7d2bc1f8023996c42a
…n HT mode Signed-off-by: Subhadeep Bhattacharya <subhadeepb@nvidia.com> Mirrored-from: 2110fd0b94d180dade732404e601419cb1cfb542
Signed-off-by: Katie Gioioso <kgioioso@nvidia.com> Mirrored-from: 4f0a5727cf123c158aad0897af0eb2e84945a3a8
Move platform-specific PCI topology discovery functions out of graph/xml.cc and into the OS abstraction layer (linux.cc/windows.cc), declared in os.h Signed-off-by: Timothy Lo <tilo@nvidia.com> Mirrored-from: c95e613cf3e63b19d0a106a7886a0eb90f31518f
Introduce platform-specific code paths in ncclTopoGetXmlFromSys and ncclTopoGetXmlFromGpu to support topology detection on Windows Signed-off-by: Timothy Lo <tilo@nvidia.com> Mirrored-from: 89c7ac993392ab8d4ed42c242679ee5470e57235
Add thin wrappers in nvmlwrap.h/nvmlwrap.cc for NVML APIs used by the Windows topology discovery code paths Signed-off-by: Timothy Lo <tilo@nvidia.com> Mirrored-from: 144b16eec3cb13c1c6f7fe8521efe3b77e0c95f0
Signed-off-by: Pak Markthub <pmarkthub@nvidia.com> Mirrored-from: a622fa1430843d91c739c6258a59c8ff4edb2925
The environment plugin init function does not currently take the nccl debug logger function as other plugins do. Adding the logger function to the env plugin init function avoid the plugin implementing adhoc logging solutions. Signed-off-by: Giuseppe Congiu <gcongiu@nvidia.com> Mirrored-from: 7ce326dfaf79cae742b7f13d45a6243fd20faabb
… sizes Replace 6 individual cudaMemsetAsync calls with 2 bulk calls using the preprocessing_zero_region_size and preprocessing_s2d_size already computed during allocation. The buffers are sized to max_tokens capacity, so clearing the full region is safe and eliminates per-call size recomputation with no meaningful performance difference. Signed-off-by: Ke Wen <kwen@nvidia.com> Mirrored-from: fbdcc10487e6bf33987f220d3210786396c63cbc
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com> Mirrored-from: 36a79161b86ab70e343a1efd43c423f9fd0923a3
… HT dispatch Signed-off-by: Subhadeep Bhattacharya <subhadeepb@nvidia.com> Mirrored-from: 228c3aa7405a5462c2311af3710a9cc88deed531
Remove the HOST-tag path that staged per-expert counts through a device temporary and copied them back via cudaMemcpyAsync D2H. Only the DEVICE tag is now accepted, which writes directly to the caller's device buffer with no D2H copy. The remaining D2H copy (num_tokens_for_experts_host mirror) is annotated; removing it is a separate step toward full CUDA graph compatibility. Signed-off-by: Ke Wen <kwen@nvidia.com> Mirrored-from: 25fe7df1f923d16a35b90e8d1704cd7de92d2b76
…ecvTokens The async D2H mirror of num_tokens_for_experts in ncclEpUpdateHandle was the last remaining obstacle to CUDA graph capture. Move the copy to ncclEpHandleGetNumRecvTokens (a synchronous query API that doesn't need to be graphable) and remove the host-pinned cache field entirely. Signed-off-by: Ke Wen <kwen@nvidia.com> Mirrored-from: a91d5d39bc89a7849dd8673bbf506172c06e6ce7
Indicate to plugins when they only need to connect to ranks on the same rail. Implement rankStride in the RMA IB plugin. The parameter is currently ignored in the GDAKI plugin Signed-off-by: Sylvain Jeaugey <sjeaugey@nvidia.com> Mirrored-from: 80aa5e8ad689b8bee76cfdc82d45bb2d4ab95360
## Summary
Refactors NCCL internal logging so full location flows through **`ncclDebugLogInternal`** (`file`, `func`, `line`), while `ncclDebugLog` keeps the historical **`ncclDebugLogger_t`** calling convention to be ABI-safe.
Adds **`ALLOC_HOST`** under **`NCCL_DEBUG_SUBSYS`** for optional verbose host-heap **`INFO`** traces. Uses **`INFO_LOC` / `INFO_LOC_FN`** (and **`TRACE_LOC` / `TRACE_LOC_FN`**) so messages that should carry **`file:line (func)`** stay uniform without repeating **`"%s:%d (%s)"`** everywhere.
## Implementation details
### Debug symbols & ABI
- **`ncclDebugLogInternal(level, flags, file, func, line, fmt, …)`**: hidden/internal implementation path used by **`VERSION` / `WARN` / `INFO` / `TRACE`** macros (Linux: **`visibility("hidden")`**).
- **`ncclDebugLog`**: default-visible; **`legacy`** third argument semantics (**WARN** → path/file string; **TRACE** → function name string). Implemented via shared **`ncclDebugLogV`**.
### Host allocation diagnostics (**`ALLOC_HOST`**)
- Verbose host **`malloc` / `calloc` / realloc**, CUDA host alloc helpers, and related **`INFO`** lines use subsystem **`ALLOC_HOST`**; **`NCCL_DEBUG_SUBSYS`** must include it
- **`ncclCallocQuiet` / `ncclReallocQuiet`** suppress **`ALLOC_HOST`** **`INFO`** on high-churn paths (e.g. RAS).
- **`ncclCalloc` / `ncclRealloc`** **`unique_ptr`-style overloads** take **`bool logHostAlloc`** so **`ncclCalloc`** macros stay consistent.
### **`INFO_LOC`** helpers
- **`INFO`** still passes **`nullptr`** **`file`/`func`** into **`ncclDebugLogInternal`**; messages embed location only via **`INFO_LOC`** / **`INFO_LOC_FN`** (wrappers pass **`__FILE__`/`__LINE__`/`__func__`** or explicit **`file`/`line`/`caller`** from **`alloc.h`** debug wrappers).
- **`TRACE_LOC` / `TRACE_LOC_FN`** mirror **`INFO_LOC`** when **`ENABLE_TRACE`** is defined.
Signed-off-by: David Addison <daddison@nvidia.com>
Signed-off-by: Xiaofan Li <xiaofanl@nvidia.com>
Mirrored-from: f03e1b354e71062bb456b959fc3a33d5af04786d
… (Please DO NOT squash) ## Summary * Extend the allocator interface per customer request * Address C99 issues spotted during python integration (see !2611) Signed-off-by: Artem Y. Polyakov <artemp@nvidia.com> Signed-off-by: Ke Wen <kwen@nvidia.com> Mirrored-from: 4eb8d7ec9f1e89152e30881e3ea44a4ad301a903
- prefer path with higher bandwidth over shorter hop count, - refactor ncclGetLocalCpu function to use ncclTopoGetLocal Signed-off-by: Thomas Gillis <tgillis@nvidia.com> Mirrored-from: 6a5b60617f6dfb5f098633c6500631a3e22887e5
Refactor the multirank GPU detection to not trigger when MloPart is being used in the communicator. We still silently disable NVLS when two ranks are using the same NVML dev. Signed-off-by: Thomas Gillis <tgillis@nvidia.com> Mirrored-from: 83f0005568223da5405bffe0ff7f3f65610c0432
When MloPart is enabled, NCCL will split the device into two sub DEV nodes, one for each of the uGPU. This will lead to 50% NVLink bw for each sub DEV to the NVS and a LOC link between both sub DEVs. Further, each sub-dev is able to saturate 100% of the C2C bw. Signed-off-by: Thomas Gillis <tgillis@nvidia.com> Mirrored-from: f5a945fa7680fefa20669160d0844c5868d15fe6
gdrSupport is disabled if unsupported. Signed-off-by: Thomas Gillis <tgillis@nvidia.com> Mirrored-from: feef28ea1b47202a9c4ac9a0ea837ed38b857ff1
Disable the UB registration path if the mlopart buffer doesn't have RDMA. At enqueue, if the buffer hasn't been registered, NCCL will use the internal buffering. Signed-off-by: Thomas Gillis <tgillis@nvidia.com> Mirrored-from: 3f2473766411b7ff2cadf610b95fa7773b921f24
## Summary Adds a Python device API that lets users write NCCL device-side kernels in CuTeDSL (`cutlass.cute`). Adds an example (`examples/cute/main.py`): 2-rank, 1 MiB GIN put with signal/wait. Two separate registered windows for clarity; one-step `@cute.jit` direct call with `cutlass.Int64`-annotated args; host-side element-wise validation after `device.sync()`. Signed-off-by: Xiakun Lu <xiakunl@nvidia.com> Mirrored-from: 0ad86ca37281a2acb888725146992e5c4f9ce072
Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com> Mirrored-from: 3d9d627c3facb45a2c4308aafc6cde20c7f66a09
Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com> Mirrored-from: 33a689942a57801bd217626d078397a750cc0d38
Pritiks23
left a comment
There was a problem hiding this comment.
Appreciate your review and valuable insights @justus-nv, Please see my changes.
Thank you for incorporating my feedback. It is in much better shape. There is one possible memory leak for |
Pritiks23
left a comment
There was a problem hiding this comment.
@justus-nv Thank you, Implemented. In the page-allocation path, I converted both cudaMallocFromPoolAsync and cudaMemsetAsync checks to CUDACHECKGOTO. I also added cleanup in the error path so if either call fails, any half-created page is properly freed. While checking this logic, I fixed the same kind of leak in the non-page path and initialized the page pointer to prevent undefined behavior if execution jumps to the fail label before page is assigned.
justus-nv
left a comment
There was a problem hiding this comment.
Thanks for the quick follow up. The changes look good.
There are two minor issues that need to be addressed.
- All commits need sign-off (use -s git option). I checked your commit messages, but only one commit has a sign-off. The easiest solution may be to squash and rebase with a new signed-off commit.
pool->countis anint, but the formatting is%luin updated warning log in ncclShadowPoolFree.
Once those two issues are dealt with, we should be good to move forward.
83fd6ea to
6b1c0a9
Compare
|
Appreciate you pointing these out. I’ve fixed the Signed-off-by handling and updated the format specifier to match pool->count. @justus-nv |
|
/mirror v2.31 |
1 similar comment
|
/mirror v2.31 |
|
Mirroring to the internal repository failed. The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes. |
Casting FLAGS to unsigned long was unnecessary and was in fact hiding errors if somebody forgot to specify the debug subsystem. Also fix a compiler warning about mismatched argument types when compiling with TRACE enabled. Signed-off-by: Kamil Iskra <kiskra@nvidia.com> Mirrored-from: 99144ee3beeb73ef0f57a687b65c660a4964f0dd
Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com> Mirrored-from: 57d9f7ae7a4396e5232d1472797a56aff3d8c231
…acros that was incorrectly reverted during merge Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com> Mirrored-from: d048f213dea659ad4634c76aba4a1a1c8fc3dbe2
|
/mirror v2.31 |
|
@Pritiks23 Can you rebase your branch onto dev. This will enable us to pull your branch into our internal rep. |
Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>
|
Sure @justus-nv |
Description
This pull request combines the changes from PRs #2172 and #2173:
Added NULL checks after malloc in ncclShadowPoolAlloc to prevent null pointer dereferences and silent crashes.
Improved diagnostic logging in ncclShadowPoolFree error path to include device pointer address, hash bucket index, and total object count for better debugging.
These changes enhance the robustness and maintainability of the code by addressing potential memory allocation failures and improving error diagnostics.
Related Issues
PR #2172: Add NULL checks after malloc in ncclShadowPoolAlloc
PR #2173: Improve diagnostic logging in ncclShadowPoolFree error path
Changes & Impact
Added NULL checks after malloc calls in ncclShadowPoolAlloc:
pool->table
table1 (hash table expansion)
page (shadow pool metadata)
obj (shadow pool objects)
Improved WARN messages in ncclShadowPoolFree to include:
Device pointer address
Hash bucket index
Total object count in the pool
These changes prevent silent crashes and make debugging memory issues easier.
Performance Impact
No significant performance impact is expected.
Changes have been tested to ensure no regressions in functionality.
Signed-off-by:
Pritika Vipin [pritikavipin@gmail.com]