plugin: finalize contexts on init fallback failure #2200
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds regression tests and fixes plugin lifecycle cleanup to ensure plugin contexts are finalized and cleared when initialization/validation fails.
Changes:
- Ensure RMA v13 init finalizes plugin context on `getProperties` failure or invalid device type.
- Finalize and null out RMA/CollNet contexts when `devices()` fails after a successful `init()`.
- Add a standalone lifecycle cleanup test binary and Makefile to reproduce/verify the cleanup behavior.
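
The first change above can be sketched roughly as follows. This is a minimal illustration only: `MockV13`, `Props`, `DeviceType`, and `wrapperInit` are hypothetical names invented for the example, and it does not reproduce the real RMA v13 interface or its signatures. It assumes the wrapper should hand a context to the caller only after validation has passed.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the real plugin interface.
enum class DeviceType { GinProxy, Other };

struct Props {
  DeviceType type = DeviceType::GinProxy;
};

struct MockV13 {
  bool propsFail = false;                      // make getProperties() fail
  DeviceType reported = DeviceType::GinProxy;  // device type it reports
  int finalizeCalls = 0;                       // how often finalize() ran

  bool init(void** ctx) {
    *ctx = &finalizeCalls;  // any non-null marker serves as the context here
    return true;
  }
  bool getProperties(Props* out) {
    if (propsFail) return false;
    out->type = reported;
    return true;
  }
  void finalize(void* ctx) {
    assert(ctx != nullptr);
    ++finalizeCalls;
  }
};

// Returns true only if init and validation both succeed. On a validation
// failure the context created by init() is finalized before returning, and
// the caller's pointer is left null.
bool wrapperInit(MockV13* plugin, void** ctx) {
  *ctx = nullptr;
  void* local = nullptr;
  if (!plugin->init(&local)) return false;

  Props props;
  if (!plugin->getProperties(&props) || props.type != DeviceType::GinProxy) {
    plugin->finalize(local);  // fail path: release what init() created
    return false;
  }
  *ctx = local;  // publish the context only once it is known to be usable
  return true;
}
```

Keeping the context in a local until validation passes means the caller never observes a pointer to a context that has already been finalized.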
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/plugin_lifecycle/plugin_lifecycle_test.cc | Adds a standalone test program that models “before vs after” behavior and asserts cleanup on failure paths. |
| tests/plugin_lifecycle/Makefile | Provides a minimal build/run target for the new standalone test binary. |
| src/plugin/rma/rma_v13.cc | Adds fail-path cleanup to finalize and clear the RMA v13 context when validation fails. |
| src/plugin/rma.cc | Finalizes and clears comm->rmaContext when devices() fails after init(). |
| src/plugin/net.cc | Finalizes and clears comm->collNetContext when devices() fails after init(). |
d249166 to b9da964
Signed-off-by: WangLei <1539790288@qq.com>
Nice fix: clean, well-structured, and the mock tests are excellent evidence. A few observations:
Strengths
Minor suggestions
LGTM
This is a solid fix. The performance impact is zero in the success path, and the fallback path runs finalize exactly once. +1 from me. (I came across this while working on GB10 topology fixes in #2202; good to see NCCL getting attention on plugin robustness.)
Description
Related Issues
Summary
Fix plugin-owned context leaks on initialization fallback paths.
When RMA or CollNet plugin `init()` succeeds but later capability discovery or validation fails, NCCL disables the plugin without finalizing the per-communicator context. Normal teardown only finalizes enabled plugin states, so those contexts can be leaked.

This change mirrors the existing GIN cleanup pattern by tracking successful init and explicitly finalizing the context before disabling the plugin.
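
A minimal sketch of that pattern, for the case where device discovery fails after a successful init. Everything here is a hypothetical stand-in written for illustration (`MockPlugin`, `MockComm`, `initPluginWithFallback`); it is not NCCL's actual communicator structure or plugin API, and it assumes that a failing plugin is disabled rather than treated as a fatal error.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not NCCL's real types.
struct MockPlugin {
  int liveContexts = 0;       // contexts created by init() and not yet finalized
  bool devicesFails = false;  // make devices() return an error
  int deviceCount = 1;        // value reported by devices() on success

  bool init(void** ctx) {
    *ctx = new int(0);
    ++liveContexts;
    return true;
  }
  bool devices(int* n) {
    if (devicesFails) return false;
    *n = deviceCount;
    return true;
  }
  void finalize(void* ctx) {
    assert(ctx != nullptr);
    delete static_cast<int*>(ctx);
    --liveContexts;
  }
};

struct MockComm {
  void* pluginContext = nullptr;
  bool pluginEnabled = false;
};

// Tries to enable the plugin for this communicator. Any failure leaves the
// plugin disabled with no context left behind.
void initPluginWithFallback(MockComm* comm, MockPlugin* plugin) {
  comm->pluginEnabled = false;
  if (!plugin->init(&comm->pluginContext)) {
    comm->pluginContext = nullptr;  // nothing was created, nothing to finalize
    return;
  }
  int ndev = 0;
  if (!plugin->devices(&ndev) || ndev <= 0) {
    // init() succeeded, so the context exists. Teardown skips disabled
    // plugins, so it has to be finalized here.
    plugin->finalize(comm->pluginContext);
    comm->pluginContext = nullptr;
    return;
  }
  comm->pluginEnabled = true;
}
```

With `devicesFails = true` or `deviceCount = 0`, `liveContexts` is back at zero after the call; without the `finalize` call in the fallback branch it would stay at one, which is the leak described above.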
Fixed paths:
- RMA: `init()` succeeds, then `devices()` fails or returns `0`
- CollNet: `init()` succeeds, then `devices()` fails or returns `0`
- RMA v13: `init()` succeeds, then `getProperties()` fails or reports a non-GIN-proxy device type

Changes
- Finalize and clear `comm->rmaContext` when RMA devices discovery fails after init.
- Finalize and clear `comm->collNetContext` when CollNet devices discovery fails after init.

Validation