fix(build): check for IB verbs extension(s) support #2178

Open
miEsMar wants to merge 1 commit into NVIDIA:master from miEsMar:BSC-fix-ibv-api


miEsMar commented May 14, 2026


Description

Recent NCCL versions added support for InfiniBand extension features, such as `active_speed_ex` in `struct ibv_port_attr`. However, this support was not added in a backward-compatible way.
This fix adds configuration-time checks that verify whether the InfiniBand installation supports the newer features and extensions. Until now, NCCL assumed they were always available, which causes compile errors in the "net_ib" transport component on older installations.
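To illustrate what "checking for support" can look like at the language level, here is a minimal C++17 sketch that detects the member at compile time. It is a different technique from the configuration-time check this PR describes, and it is not code from the PR. `OldPortAttr`, `NewPortAttr`, `HasActiveSpeedEx`, and `speedCode` are hypothetical names, and the two structs are simplified stand-ins for `struct ibv_port_attr` so that the example compiles without `<infiniband/verbs.h>`. The field types are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <utility>

// Simplified stand-ins (assumption) for two generations of
// `struct ibv_port_attr`; real code would use <infiniband/verbs.h>.
struct OldPortAttr {
  std::uint8_t active_width;
  std::uint8_t active_speed;
};
struct NewPortAttr {
  std::uint8_t active_width;
  std::uint8_t active_speed;
  std::uint16_t active_speed_ex;  // extension field, absent in OldPortAttr
};

// Primary template: assume the member is absent.
template <typename T, typename = void>
struct HasActiveSpeedEx : std::false_type {};

// Selected only when `T::active_speed_ex` is a valid expression.
template <typename T>
struct HasActiveSpeedEx<
    T, std::void_t<decltype(std::declval<T&>().active_speed_ex)>>
    : std::true_type {};

// Returns the extended speed code when the field exists and is non-zero,
// otherwise the base `active_speed` code.
template <typename Attr>
unsigned speedCode(const Attr& attr) {
  if constexpr (HasActiveSpeedEx<Attr>::value) {
    // This branch is discarded for types without the member,
    // so the access below is never compiled for them.
    if (attr.active_speed_ex) return attr.active_speed_ex;
  }
  return attr.active_speed;
}
```

With these stand-ins, `speedCode(OldPortAttr{2, 32})` compiles and returns the base code, whereas an unconditional `attr.active_speed_ex` would fail to compile for `OldPortAttr`.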

Credits: AccelCom @ Barcelona Supercomputing Center

Related Issues

N/A.

Changes & Impact

N/A.

Performance Impact

N/A.

(c) Barcelona Supercomputing Center 2026
@sjeaugey

Member

I'm not sure I understand what problem we're solving here.

Under which conditions do you see compilation errors? Can you provide the build command line and the errors in the output?


miEsMar commented May 18, 2026

Author

If NCCL is built with RDMA_CORE=1 and the system's IB verbs installation is older than the version(s) that introduced these extensions, compilation fails, because NCCL assumes the extensions are always available.

Build command:

    make src.${_make_tgt} \
        VERBOSE=1 \
        TRACE=1 \
        RDMA_CORE=1 \
        NET_PROFILER=1 \
        CC=gcc \
        CXX=g++ \
        PREFIX=${INSTALLDIR} \
        CUDA_HOME=${NVCUDA_ROOT} \
        NVCC_GENCODE=${_gpu_archs} \
        -j ${_nthreads}

Error:

    transport/net_ib/init.cc:357:28: error: 'struct ibv_port_attr' has no member named 'active_speed_ex'; did you mean 'active_speed'?
      357 |               if (portAttr.active_speed_ex) {
          |                            ^~~~~~~~~~~~~~~
          |                            active_speed
    transport/net_ib/init.cc:359:70: error: 'struct ibv_port_attr' has no member named 'active_speed_ex'; did you mean 'active_speed'?
      359 |                 ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed_ex) * ncclIbWidth(portAttr.active_width);
          |                                                                      ^~~~~~~~~~~~~~~
          |                                                                      active_speed

Therefore, the PR adds a build-time check for actual IB extension support. The extensions are used only when the check finds them; otherwise the code falls back to the base features.
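A minimal sketch of what such a guarded fallback might look like, assuming the build-time check defines a preprocessor macro when the extension is present. The macro name `NCCL_HAVE_IBV_ACTIVE_SPEED_EX`, the `PortAttr` stand-in struct, and `selectSpeedCode` are hypothetical and are not taken from the PR diff. The stand-in replaces `struct ibv_port_attr` so the example builds without rdma-core headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macro: a build-time probe would pass it on the command line
// (e.g. -DNCCL_HAVE_IBV_ACTIVE_SPEED_EX=1) only when the installed verbs
// headers declare `active_speed_ex`. It is left undefined here.

// Stand-in (assumption) for `struct ibv_port_attr`, reduced to the fields
// that appear in the error output above.
struct PortAttr {
  std::uint8_t active_width;
  std::uint8_t active_speed;
#ifdef NCCL_HAVE_IBV_ACTIVE_SPEED_EX
  std::uint16_t active_speed_ex;  // present only on newer installations
#endif
};

// Prefer the extended speed code when it is available and non-zero;
// otherwise fall back to the base `active_speed` code.
unsigned selectSpeedCode(const PortAttr& attr) {
#ifdef NCCL_HAVE_IBV_ACTIVE_SPEED_EX
  if (attr.active_speed_ex) return attr.active_speed_ex;
#endif
  return attr.active_speed;
}
```

Without the macro, the reference to `active_speed_ex` is never compiled, so the errors at `init.cc:357` and `init.cc:359` could not occur on older headers. With it defined, the extended field takes precedence whenever it is non-zero.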
