dev_runtime: restore capture mode on GIN setup failure #2229
Open
wanglei875 wants to merge 1 commit into
Signed-off-by: WangLei <1539790288@qq.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates error handling in ncclDevrCommCreateInternal to route a setup failure through the common cleanup path.
Changes:
- Switches `ncclGinDevCommSetup` failure handling from `NCCLCHECK` to `NCCLCHECKGOTO(..., ret, fail)`.
Description
`ncclDevrCommCreateInternal()` switches the CUDA stream capture mode before setting up device communicator resources. Most failure paths after that point use `GOTO` cleanup labels so the original capture mode is restored.

The GIN setup path used `NCCLCHECK(ncclGinDevCommSetup(...))`, which returns directly on failure and skips the common `fail` path. This can leave the thread capture mode unrestored when GIN setup fails.

This change routes the failure through the existing `fail` label by using `NCCLCHECKGOTO`.

Changes & Impact
`ncclGinDevCommSetup()` failure.

Testing
- `git diff --check`
- `make -j$(nproc) src.build`; `src/dev_runtime.cc` compiled successfully. Full device kernel build was not completed due to long compile time.
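
For reviewers who want to see the failure mode in isolation, here is a minimal sketch of the difference between an early-return check and a goto-based check. Everything in it is a hypothetical stand-in: `CHECK`, `CHECKGOTO`, `ginSetup`, `g_captureMode`, and the `exit`/`fail` label layout only imitate the shape of the pattern described above and are not copied from NCCL's macros or from `src/dev_runtime.cc`.

```cpp
// Hypothetical stand-ins for illustration only; not NCCL's real types.
enum Result { kSuccess = 0, kInternalError = 1 };
enum CaptureMode { kGlobal, kRelaxed };

// Stand-in for the calling thread's stream capture mode.
static thread_local CaptureMode g_captureMode = kGlobal;

// Early-return style (in the spirit of NCCLCHECK): leaves the function at once.
#define CHECK(call) \
  do { Result r_ = (call); if (r_ != kSuccess) return r_; } while (0)

// Goto style (in the spirit of NCCLCHECKGOTO): stores the error, jumps to a label.
#define CHECKGOTO(call, res, label) \
  do { (res) = (call); if ((res) != kSuccess) goto label; } while (0)

// Stand-in for a setup step that may fail.
static Result ginSetup(bool shouldFail) {
  return shouldFail ? kInternalError : kSuccess;
}

// Before: a failure returns directly, so the saved mode is never restored.
Result createWithEarlyReturn(bool ginFails) {
  CaptureMode saved = g_captureMode;
  g_captureMode = kRelaxed;      // switch mode for the setup phase
  CHECK(ginSetup(ginFails));     // on failure: returns with kRelaxed still set
  g_captureMode = saved;
  return kSuccess;
}

// After: a failure jumps to `fail`, which rejoins the shared exit path.
Result createWithGoto(bool ginFails) {
  Result ret = kSuccess;
  CaptureMode saved = g_captureMode;
  g_captureMode = kRelaxed;
  CHECKGOTO(ginSetup(ginFails), ret, fail);
exit:
  g_captureMode = saved;         // restored on both success and failure
  return ret;
fail:
  // Resource cleanup for partially built state would go here.
  goto exit;
}
```

Calling `createWithEarlyReturn(true)` leaves `g_captureMode` at `kRelaxed`, while `createWithGoto(true)` returns the same error with the mode back at its saved value.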