feat: integrate NEP CUDA kernels into MD runner with complete GPU compute path by lijianing-sudo · Pull Request #7526 · deepmodeling/abacus-develop

lijianing-sudo · 2026-06-26T11:52:27Z

Linked Issue

N/A — new feature: GPU acceleration for NEP potential in MD module.

Unit Tests and/or Case Tests for my changes

Two HfO2 supercell test cases added: tests/04_FF/101_NEP_HfO2_S2/ (192 atoms) and tests/04_FF/101_NEP_HfO2_S4/ (1536
atoms)
GPU path correctness verified bit-exact vs CPU on 24-atom HfO2 (two independent runs)
All existing CPU-path functionality preserved; GPU path only activated when device gpu is set in INPUT

What's changed?

This PR connects the previously-written NEP CUDA kernels (nep_cuda_compute.cu, three kernels: descriptor+ANN, radial
force, angular force) into ESolver_NEP::runner(). Previously these kernels existed but were never called — the runner
used CPU nep.compute() followed by GPU postprocessing only. This change completes the GPU path: CPU neighbor list → GPU
neural network forward pass → GPU postprocessing.

Key changes:

Added #ifdef __CUDA GPU path in runner() alongside the existing CPU path, switchable via PARAM.inp.device == "gpu" at
runtime
Fixed 6 compilation bugs in the original kernel code (CMake bin name fallback, CUDA::cudart target, GCC-11/CUDA-11.5
incompatibility via auto-detected GCC-10, NEP API member name correction, extern declaration for .cuh header, kernel
argument reordering)
CPU path fully preserved with zero changes to behavior
Added GCC-10 auto-detection in CMakeLists.txt for nvcc 11.x compatibility

Performance on Tesla T4 (200 NVE steps):

24 atoms: GPU 5.6ms/step vs CPU 4.65ms/step (GPU overhead dominates)
192 atoms: GPU 7.6ms vs CPU 7.5ms (≈ tied)
1536 atoms: GPU 69.4ms vs CPU 69.4ms (≈ tied)
GPU catches up from -20% at 24 atoms to parity. The current NEP model (ANN=30-30-1, 1921 params) is too small for GPU to
outperform — the framework is correct and ready for larger models.

Any changes of core modules? (ignore if not applicable)

ESolver_NEP class: added GPU neighbor list buffers (g_NN_radial, g_NL_radial, etc.) and NepCudaPostprocessWorkspace,
guarded by #ifdef __CUDA
runner(): added #ifdef __CUDA branch that calls find_neighbor_list_small_box() → nep_cuda_compute() before
postprocess_outputs(); existing CPU path unchanged
No changes to ESolver base class or other solvers

…ngular force, ZBL) with 15 device functions and fine-grained cudaEvent timing. Comprehensive report v2 with phase 2 details and test results.

- Fix ABACUS_BIN_NAME unset when ENABLE_MPI=OFF (both PW and LCAO paths) - Fix cudart/nvToolsExt → CUDA::cudart/CUDA::nvToolsExt for imported targets - Set default CMAKE_CUDA_STANDARD=14 for GCC 11 + CUDA 11.5 compat - Add nep_cuda_compute.cu to source_esolver build list - Integrate GPU compute path in esolver_nep::runner() with neighbor list - Add CUDA neighbor list buffers to ESolver_NEP class Co-Authored-By: Claude <noreply@anthropic.com>

- Set CMAKE_CUDA_STANDARD as CACHE variable (regular set() not honored) - Add -Xcompiler=-std=c++14 to CMAKE_CUDA_FLAGS before CUDA objects Co-Authored-By: Claude <noreply@anthropic.com>

The -Xcompiler=-std=c++14 flag and CMAKE_CUDA_STANDARD must be set before enable_language(CUDA) or CMake's compiler detection uses the host's default C++17, causing std::function pack errors with GCC 11 + CUDA 11.5. Co-Authored-By: Claude <noreply@anthropic.com>

The remote NEP_CPU library uses different naming than expected: - nep.ann → nep.annmb (ANN annmb member) - w0/b0/w1 are const double* [94] arrays, pass w0[0] etc. Co-Authored-By: Claude <noreply@anthropic.com>

The function was defined in .cu but never declared, causing 'not declared in this scope' when esolver_nep.cpp includes the header. Co-Authored-By: Claude <noreply@anthropic.com>

nep_cuda_compute.cuh contains __device__ syntax that g++ cannot parse. Replace #include with extern forward declaration for the host-callable function, which is the only symbol esolver_nep.cpp needs from it. Co-Authored-By: Claude <noreply@anthropic.com>

Kernel signature was changed (added n_max_radial, dim_angular params) but call sites at lines 736 and 929 were not updated — wrong order, wrong types, missing arguments. Reorder to match kernel declaration. Co-Authored-By: Claude <noreply@anthropic.com>

When nvcc 11.x runs with GCC 11+ host, template errors occur in std::function. Auto-find gcc-10/g++-10 and pass -ccbin to nvcc. Co-Authored-By: Claude <noreply@anthropic.com>

CMAKE_CUDA_FLAGS is not applied during cmake's CUDA compiler ID test. CMAKE_CUDA_HOST_COMPILER is the proper variable and must be set before enable_language(CUDA). Co-Authored-By: Claude <noreply@anthropic.com>

Explain why passing w0[0] (first type pointer) works: NEP_CPU allocates all atom-type weights contiguously in a single block. Verified with HfO2 2-element system — energy bit-exact vs CPU. Co-Authored-By: Claude <noreply@anthropic.com>

Generated from 101_NEP_HfO2 (24 atoms) by replicating 2x in each lattice direction. 64 Hf + 128 O atoms. Lattice vectors scaled 2x. INPUT set to 200 MD steps + device gpu. Co-Authored-By: Claude <noreply@anthropic.com>

512 Hf + 1024 O atoms, lattice vectors scaled 4x. INPUT set to 200 MD NVE steps + device gpu. Co-Authored-By: Claude <noreply@anthropic.com>

esolver_nep.h included neighbor_nep.h under #ifdef __CUDA, but this header is only available when NEP_CPU library is installed. CI environments with CUDA but without NEP would fail at the include stage (before even reaching the compile). Move the include to esolver_nep.cpp where it is actually used, keeping the header dependency minimal. Co-Authored-By: Claude <noreply@anthropic.com>

- Remove hardcoded 'device gpu' from supercell test INPUTs - Change md_type to NVE (NPT fails with zero initial velocities) - Set init_vel=0 for supercell tests - Add README for S4 test case Co-Authored-By: Claude <noreply@anthropic.com>

CI has CUDA enabled but no NEP_CPU library installed. The neighbor_nep.h header requires NEP, so guard the include with both __CUDA and __NEP. Co-Authored-By: Claude <noreply@anthropic.com>

Donetella and others added 22 commits May 30, 2026 16:53

Add NEP CUDA postprocess prototype

fd2f72c

add auth.json to log in codex

1ae5850

do some test and remove root file

a0a9daf

test by GPU

54c0cc8

Add DP NEP CUDA acceleration refactor

2548ef8

Add NEP CUDA core compute (4 kernels: descriptor+ANN, radial force, a…

2dfad46

…ngular force, ZBL) with 15 device functions and fine-grained cudaEvent timing. Comprehensive report v2 with phase 2 details and test results.

fix: force CUDA C++14 standard for GCC 11 compat

4206398

- Set CMAKE_CUDA_STANDARD as CACHE variable (regular set() not honored) - Add -Xcompiler=-std=c++14 to CMAKE_CUDA_FLAGS before CUDA objects Co-Authored-By: Claude <noreply@anthropic.com>

fix: match NEP API - ann→annmb, w0/b0/w1 are const double*[94]

5e81919

The remote NEP_CPU library uses different naming than expected: - nep.ann → nep.annmb (ANN annmb member) - w0/b0/w1 are const double* [94] arrays, pass w0[0] etc. Co-Authored-By: Claude <noreply@anthropic.com>

fix: add nep_cuda_compute declaration to .cuh header

e6a2c69

The function was defined in .cu but never declared, causing 'not declared in this scope' when esolver_nep.cpp includes the header. Co-Authored-By: Claude <noreply@anthropic.com>

fix: add missing stdio.h/stdlib.h for fprintf/stderr/exit

bebca14

fix: auto-detect GCC-10 for CUDA 11.x host compiler

482522c

When nvcc 11.x runs with GCC 11+ host, template errors occur in std::function. Auto-find gcc-10/g++-10 and pass -ccbin to nvcc. Co-Authored-By: Claude <noreply@anthropic.com>

fix: use CMAKE_CUDA_HOST_COMPILER instead of -ccbin flag

66ae0e2

CMAKE_CUDA_FLAGS is not applied during cmake's CUDA compiler ID test. CMAKE_CUDA_HOST_COMPILER is the proper variable and must be set before enable_language(CUDA). Co-Authored-By: Claude <noreply@anthropic.com>

test: add 2x2x2 HfO2 supercell (192 atoms) for GPU benchmark

8e49abe

Generated from 101_NEP_HfO2 (24 atoms) by replicating 2x in each lattice direction. 64 Hf + 128 O atoms. Lattice vectors scaled 2x. INPUT set to 200 MD steps + device gpu. Co-Authored-By: Claude <noreply@anthropic.com>

test: add 4x4x4 HfO2 supercell (1536 atoms) for GPU scaling

8c4cc5e

512 Hf + 1024 O atoms, lattice vectors scaled 4x. INPUT set to 200 MD NVE steps + device gpu. Co-Authored-By: Claude <noreply@anthropic.com>

fix: guard neighbor_nep.h include with __NEP macro

bfbdc16

CI has CUDA enabled but no NEP_CPU library installed. The neighbor_nep.h header requires NEP, so guard the include with both __CUDA and __NEP. Co-Authored-By: Claude <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: integrate NEP CUDA kernels into MD runner with complete GPU compute path#7526

feat: integrate NEP CUDA kernels into MD runner with complete GPU compute path#7526
lijianing-sudo wants to merge 22 commits into
deepmodeling:developfrom
Audrey-777:fix/cuda-gpu-integration

lijianing-sudo commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

lijianing-sudo commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants