Skip to content

plugin: finalize contexts on init fallback failure#2200

Open
wanglei875 wants to merge 1 commit into
NVIDIA:devfrom
wanglei875:fix-plugin-init-cleanup
Open

plugin: finalize contexts on init fallback failure#2200
wanglei875 wants to merge 1 commit into
NVIDIA:devfrom
wanglei875:fix-plugin-init-cleanup

Conversation

@wanglei875

@wanglei875 wanglei875 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Description

Related Issues

Summary

Fix plugin-owned context leaks on initialization fallback paths.

When RMA or CollNet plugin init() succeeds but later capability discovery or validation fails, NCCL disables the plugin without finalizing the per-communicator context. Normal teardown only finalizes enabled plugin states, so those contexts can be leaked.

This change mirrors the existing GIN cleanup pattern by tracking successful init and explicitly finalizing the context before disabling the plugin.

Fixed paths:

  • RMA plugin: init() succeeds, then devices() fails or returns 0
  • CollNet plugin: init() succeeds, then devices() fails or returns 0
  • RMA v13 compatibility adapter: init() succeeds, then getProperties() fails or reports a non-GIN-proxy device type

Changes

  • Finalize and clear comm->rmaContext when RMA devices discovery fails after init.
  • Finalize and clear comm->collNetContext when CollNet devices discovery fails after init.
  • Finalize and clear v13 RMA adapter context when post-init validation fails.
  • Add shared cleanup helper used by production cleanup paths and the standalone lifecycle test.
  • Add a standalone C++ mock lifecycle test for the affected failure paths.

Validation

make -C tests/plugin_lifecycle run
valgrind --leak-check=full --show-leak-kinds=all --errors-for-leak-kinds=definite --error-exitcode=99 build/test/plugin_lifecycle/plugin_lifecycle_test
git diff --check

```bash
make -C tests/plugin_lifecycle run

<!-- Reference any related issues or PRs -->

## Changes & Impact

<!-- Note any breaking changes or API modifications -->

## Performance Impact

<!-- If possible include benchmark results for performance changes & list what testing you've performed -->

Copilot AI review requested due to automatic review settings June 2, 2026 05:53

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds regression tests and fixes plugin lifecycle cleanup to ensure plugin contexts are finalized and cleared when initialization/validation fails.

Changes:

  • Ensure RMA v13 init finalizes plugin context on getProperties failure or invalid device type.
  • Finalize and null out RMA/CollNet contexts when devices() fails after a successful init().
  • Add a standalone lifecycle cleanup test binary and Makefile to reproduce/verify the cleanup behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/plugin_lifecycle/plugin_lifecycle_test.cc Adds a standalone test program that models “before vs after” behavior and asserts cleanup on failure paths.
tests/plugin_lifecycle/Makefile Provides a minimal build/run target for the new standalone test binary.
src/plugin/rma/rma_v13.cc Adds fail-path cleanup to finalize and clear the RMA v13 context when validation fails.
src/plugin/rma.cc Finalizes and clears comm->rmaContext when devices() fails after init().
src/plugin/net.cc Finalizes and clears comm->collNetContext when devices() fails after init().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/plugin/rma/rma_v13.cc
Comment thread src/plugin/rma.cc
Comment thread src/plugin/net.cc
Comment thread tests/plugin_lifecycle/plugin_lifecycle_test.cc
Comment thread src/plugin/rma/rma_v13.cc
@wanglei875 wanglei875 force-pushed the fix-plugin-init-cleanup branch 3 times, most recently from d249166 to b9da964 Compare June 2, 2026 06:10
Signed-off-by: WangLei <1539790288@qq.com>
@theherrovn

Copy link
Copy Markdown

Nice fix — clean, well-structured, and the mock tests are excellent evidence. A few observations:

Strengths

  1. Correct root cause: The leak is real — init() allocates context but devices()/validation failure skips finalize because the state was set to disabled before normal teardown runs.

  2. Consistent pattern: Mirrors the existing GIN cleanup pattern (src/plugin/net.cc already does this for GIN proxy contexts). Good consistency with codebase conventions.

  3. Complete coverage: CollNet, RMA, and RMA v13 all handled. The v13 fix is particularly important since it previously return ncclInternalError directly without cleanup (now uses goto fail).

  4. Excellent tests: The mock framework with beforeFix/afterFix side-by-side is exactly right. Tests prove:

    • Context is leaked before fix (ctx != nullptr, finalizeCalls == 0)
    • Context is cleaned after fix (ctx == nullptr, finalizeCalls == 1)

Minor suggestions

  1. plugin_cleanup.h uses ncclResult_t without including the definition. It works because callers already include nccl headers, but the header itself should be self-contained. Consider adding #include "nccl.h" or using int return type.

  2. The rma.cc change follows the same pattern as net.cc CollNet path — but is there a risk that pluginLib->ncclRma is NULL when rmaInitCompleted is true? The guard at line 108 checks pluginLib->ncclRma non-null before init; rmaInitCompleted is only set true inside that same block, so it's fine. Worth a comment maybe?

LGTM

This is a solid fix. The performance impact is zero in the success path, and the fallback path runs finalize exactly once. +1 from me.

(I came across this while working on GB10 topology fixes in #2202 — good to see NCCL getting attention on plugin robustness.)

@wanglei875

Copy link
Copy Markdown
Contributor Author

@xiaofanl-nvidia

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants