Skip to content

feat: integrate NEP CUDA kernels into MD runner with complete GPU compute path#7526

Open
lijianing-sudo wants to merge 22 commits into
deepmodeling:developfrom
Audrey-777:fix/cuda-gpu-integration
Open

feat: integrate NEP CUDA kernels into MD runner with complete GPU compute path#7526
lijianing-sudo wants to merge 22 commits into
deepmodeling:developfrom
Audrey-777:fix/cuda-gpu-integration

Conversation

@lijianing-sudo

Copy link
Copy Markdown

Linked Issue

N/A — new feature: GPU acceleration for NEP potential in MD module.

Unit Tests and/or Case Tests for my changes

  • Two HfO2 supercell test cases added: tests/04_FF/101_NEP_HfO2_S2/ (192 atoms) and tests/04_FF/101_NEP_HfO2_S4/ (1536
    atoms)
  • GPU path correctness verified bit-exact vs CPU on 24-atom HfO2 (two independent runs)
  • All existing CPU-path functionality preserved; GPU path only activated when device gpu is set in INPUT

What's changed?

This PR connects the previously-written NEP CUDA kernels (nep_cuda_compute.cu, three kernels: descriptor+ANN, radial
force, angular force) into ESolver_NEP::runner(). Previously these kernels existed but were never called — the runner
used CPU nep.compute() followed by GPU postprocessing only. This change completes the GPU path: CPU neighbor list → GPU
neural network forward pass → GPU postprocessing.

Key changes:

  • Added #ifdef __CUDA GPU path in runner() alongside the existing CPU path, switchable via PARAM.inp.device == "gpu" at
    runtime
  • Fixed 6 compilation bugs in the original kernel code (CMake bin name fallback, CUDA::cudart target, GCC-11/CUDA-11.5
    incompatibility via auto-detected GCC-10, NEP API member name correction, extern declaration for .cuh header, kernel
    argument reordering)
  • CPU path fully preserved with zero changes to behavior
  • Added GCC-10 auto-detection in CMakeLists.txt for nvcc 11.x compatibility

Performance on Tesla T4 (200 NVE steps):

  • 24 atoms: GPU 5.6ms/step vs CPU 4.65ms/step (GPU overhead dominates)
  • 192 atoms: GPU 7.6ms vs CPU 7.5ms (≈ tied)
  • 1536 atoms: GPU 69.4ms vs CPU 69.4ms (≈ tied)
    GPU catches up from -20% at 24 atoms to parity. The current NEP model (ANN=30-30-1, 1921 params) is too small for GPU to
    outperform — the framework is correct and ready for larger models.

Any changes of core modules? (ignore if not applicable)

  • ESolver_NEP class: added GPU neighbor list buffers (g_NN_radial, g_NL_radial, etc.) and NepCudaPostprocessWorkspace,
    guarded by #ifdef __CUDA
  • runner(): added #ifdef __CUDA branch that calls find_neighbor_list_small_box() → nep_cuda_compute() before
    postprocess_outputs(); existing CPU path unchanged
  • No changes to ESolver base class or other solvers

Donetella and others added 22 commits May 30, 2026 16:53
…ngular force, ZBL) with 15 device functions and fine-grained cudaEvent timing. Comprehensive report v2 with phase 2 details and test results.
- Fix ABACUS_BIN_NAME unset when ENABLE_MPI=OFF (both PW and LCAO paths)
- Fix cudart/nvToolsExt → CUDA::cudart/CUDA::nvToolsExt for imported targets
- Set default CMAKE_CUDA_STANDARD=14 for GCC 11 + CUDA 11.5 compat
- Add nep_cuda_compute.cu to source_esolver build list
- Integrate GPU compute path in esolver_nep::runner() with neighbor list
- Add CUDA neighbor list buffers to ESolver_NEP class

Co-Authored-By: Claude <noreply@anthropic.com>
- Set CMAKE_CUDA_STANDARD as CACHE variable (regular set() not honored)
- Add -Xcompiler=-std=c++14 to CMAKE_CUDA_FLAGS before CUDA objects

Co-Authored-By: Claude <noreply@anthropic.com>
The -Xcompiler=-std=c++14 flag and CMAKE_CUDA_STANDARD must be set
before enable_language(CUDA) or CMake's compiler detection uses
the host's default C++17, causing std::function pack errors with
GCC 11 + CUDA 11.5.

Co-Authored-By: Claude <noreply@anthropic.com>
The remote NEP_CPU library uses different naming than expected:
- nep.ann → nep.annmb (ANN annmb member)
- w0/b0/w1 are const double* [94] arrays, pass w0[0] etc.

Co-Authored-By: Claude <noreply@anthropic.com>
The function was defined in .cu but never declared, causing
'not declared in this scope' when esolver_nep.cpp includes the header.

Co-Authored-By: Claude <noreply@anthropic.com>
nep_cuda_compute.cuh contains __device__ syntax that g++ cannot parse.
Replace #include with extern forward declaration for the host-callable
function, which is the only symbol esolver_nep.cpp needs from it.

Co-Authored-By: Claude <noreply@anthropic.com>
Kernel signature was changed (added n_max_radial, dim_angular params)
but call sites at lines 736 and 929 were not updated — wrong order,
wrong types, missing arguments. Reorder to match kernel declaration.

Co-Authored-By: Claude <noreply@anthropic.com>
When nvcc 11.x runs with GCC 11+ host, template errors occur
in std::function. Auto-find gcc-10/g++-10 and pass -ccbin to nvcc.

Co-Authored-By: Claude <noreply@anthropic.com>
CMAKE_CUDA_FLAGS is not applied during cmake's CUDA compiler ID test.
CMAKE_CUDA_HOST_COMPILER is the proper variable and must be set
before enable_language(CUDA).

Co-Authored-By: Claude <noreply@anthropic.com>
Explain why passing w0[0] (first type pointer) works: NEP_CPU
allocates all atom-type weights contiguously in a single block.
Verified with HfO2 2-element system — energy bit-exact vs CPU.

Co-Authored-By: Claude <noreply@anthropic.com>
Generated from 101_NEP_HfO2 (24 atoms) by replicating 2x in each
lattice direction. 64 Hf + 128 O atoms. Lattice vectors scaled 2x.
INPUT set to 200 MD steps + device gpu.

Co-Authored-By: Claude <noreply@anthropic.com>
512 Hf + 1024 O atoms, lattice vectors scaled 4x.
INPUT set to 200 MD NVE steps + device gpu.

Co-Authored-By: Claude <noreply@anthropic.com>
esolver_nep.h included neighbor_nep.h under #ifdef __CUDA, but
this header is only available when NEP_CPU library is installed.
CI environments with CUDA but without NEP would fail at the
include stage (before even reaching the compile).

Move the include to esolver_nep.cpp where it is actually used,
keeping the header dependency minimal.

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove hardcoded 'device gpu' from supercell test INPUTs
- Change md_type to NVE (NPT fails with zero initial velocities)
- Set init_vel=0 for supercell tests
- Add README for S4 test case

Co-Authored-By: Claude <noreply@anthropic.com>
CI has CUDA enabled but no NEP_CPU library installed. The
neighbor_nep.h header requires NEP, so guard the include with
both __CUDA and __NEP.

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants