[RFE]: Include qp_num in detailed RoCE completion error log #2219

@zrss

What is the goal of this request?

Include wc.qp_num in the detailed RoCE completion error log. When debugging RoCE failures like IBV_WC_RETRY_EXC_ERR(12), it is useful to know both the RoCE path and the QP number.

Who will benefit from this feature?

Users and support engineers debugging NCCL over RoCE, especially failures involving IBV_WC_RETRY_EXC_ERR(12), GID/routing issues, cross-NIC paths, timeout settings, or fabric connectivity problems.

Is this request for a specific GPU architecture or network infrastructure?

This is not specific to a GPU architecture. It is mainly useful for RoCE deployments.

How will this feature improve current workflows or processes?

Today NCCL already prints both pieces of information, but they are split across two WARN lines.

One line has the QP number:

  NET/IB: Got CQE with error (... devIndex=..., req=..., comm=..., wr_id=..., qp_num=...)

The next detailed line has the peer, status, vendor error, local/remote GID, and HCA:

  NET/IB: Got completion from peer ... status=IBV_WC_RETRY_EXC_ERR(12) opcode=... vendor_err=... localGid ... remoteGid ... hca ...

So the current workflow is usually:

  1. Find the detailed error line with localGid/remoteGid.
  2. Look at the nearby CQE error line to get qp_num.
  3. Search earlier logs for qp_num/qpn to find the QP creation/connect information.

This is workable, but a little fragile when logs from multiple ranks are interleaved, or when users only paste the detailed error line in an issue/support thread.
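To illustrate the fragility, here is a minimal Python sketch of the "look at the nearby CQE error line" step. It is not part of NCCL. The function name `pair_qp_num`, the `window` parameter, and all sample values (peer names, GIDs, QP numbers) are made up for illustration, and the sample lines only imitate the general shape of the two WARN lines quoted above, not their exact layout.

```python
import re

# Match only tokens that appear in the two WARN lines quoted in this issue.
# Everything else on each line is treated as opaque text.
CQE_RE = re.compile(r"NET/IB: Got CQE with error .*\bqp_num=(\w+)")
DETAIL_RE = re.compile(r"NET/IB: Got completion from peer .*\bstatus=(\S+)")


def pair_qp_num(lines, window=2):
    """Attach the qp_num of the closest preceding CQE error line to each detailed line.

    Hypothetical helper: returns a list of (qp_num or None, detailed_line).
    """
    results = []
    for i, line in enumerate(lines):
        if not DETAIL_RE.search(line):
            continue
        qp_num = None
        # Walk backwards over at most `window` earlier lines.
        for prev in reversed(lines[max(0, i - window):i]):
            match = CQE_RE.search(prev)
            if match:
                qp_num = match.group(1)
                break
        results.append((qp_num, line))
    return results


# Invented sample lines; real NCCL output carries more fields.
cqe_a = "NET/IB: Got CQE with error (devIndex=0, req=0x1, comm=0x2, wr_id=3, qp_num=1234)"
cqe_b = "NET/IB: Got CQE with error (devIndex=1, req=0x4, comm=0x5, wr_id=6, qp_num=5678)"
detail_a = ("NET/IB: Got completion from peer peerA status=IBV_WC_RETRY_EXC_ERR(12) "
            "opcode=0 vendor_err=129 localGid gidA remoteGid gidB hca hcaA")
detail_b = ("NET/IB: Got completion from peer peerB status=IBV_WC_RETRY_EXC_ERR(12) "
            "opcode=0 vendor_err=129 localGid gidC remoteGid gidD hca hcaB")

in_order = pair_qp_num([cqe_a, detail_a, cqe_b, detail_b])
interleaved = pair_qp_num([cqe_a, cqe_b, detail_a, detail_b])
```

With the lines in order, each detailed line picks up the right QP number. In the interleaved case, `detail_a` is attached to the QP number from `cqe_b`, which is the kind of mismatch that a `qp_num=...` field on the detailed line itself would avoid.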

Could we include wc.qp_num in the detailed completion error line as qp_num=..., consistent with the existing NET/IB logs?

For example:

  NET/IB: Got completion from peer ... status=IBV_WC_RETRY_EXC_ERR(12) opcode=... vendor_err=... qp_num=... localGid ... remoteGid ... hca ...

This would make the detailed RoCE error line self-contained enough to start debugging from: peer, status, vendor_err, qp_num, localGid, remoteGid, and hca are all in one place.
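As a rough sketch of what "self-contained" buys on the tooling side, the Python snippet below pulls the key=value fields out of a single line in the proposed format. The helper name `parse_detail` and the sample values are hypothetical, and only the `key=value` tokens shown in the example above are parsed. The peer, localGid, remoteGid, and hca parts are left alone because their exact layout is not spelled out here.

```python
import re

# Only the key=value tokens from the proposed line are extracted.
KV_RE = re.compile(r"\b(status|opcode|vendor_err|qp_num)=(\S+)")


def parse_detail(line):
    """Return the key=value fields found on one detailed completion error line."""
    return dict(KV_RE.findall(line))


# Invented sample line in the proposed format; all values are placeholders.
proposed = ("NET/IB: Got completion from peer peerA status=IBV_WC_RETRY_EXC_ERR(12) "
            "opcode=0 vendor_err=129 qp_num=1234 localGid gidA remoteGid gidB hca hcaA")

fields = parse_detail(proposed)
```

Here `fields["qp_num"]` comes straight from the pasted line, so no neighbouring log line is needed to start searching for the QP creation/connect information.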

This is only a logging improvement and should not change NCCL behavior.

What is the priority level of this request?

Medium. It is a small logging-only improvement, but it can make RoCE error debugging less ambiguous.
