feat: integrate NEP CUDA kernels into MD runner with complete GPU compute path#7526
Open
lijianing-sudo wants to merge 22 commits into
Open
feat: integrate NEP CUDA kernels into MD runner with complete GPU compute path#7526lijianing-sudo wants to merge 22 commits into
lijianing-sudo wants to merge 22 commits into
Conversation
…ngular force, ZBL) with 15 device functions and fine-grained cudaEvent timing. Comprehensive report v2 with phase 2 details and test results.
- Fix ABACUS_BIN_NAME unset when ENABLE_MPI=OFF (both PW and LCAO paths) - Fix cudart/nvToolsExt → CUDA::cudart/CUDA::nvToolsExt for imported targets - Set default CMAKE_CUDA_STANDARD=14 for GCC 11 + CUDA 11.5 compat - Add nep_cuda_compute.cu to source_esolver build list - Integrate GPU compute path in esolver_nep::runner() with neighbor list - Add CUDA neighbor list buffers to ESolver_NEP class Co-Authored-By: Claude <noreply@anthropic.com>
- Set CMAKE_CUDA_STANDARD as CACHE variable (regular set() not honored) - Add -Xcompiler=-std=c++14 to CMAKE_CUDA_FLAGS before CUDA objects Co-Authored-By: Claude <noreply@anthropic.com>
The -Xcompiler=-std=c++14 flag and CMAKE_CUDA_STANDARD must be set before enable_language(CUDA) or CMake's compiler detection uses the host's default C++17, causing std::function pack errors with GCC 11 + CUDA 11.5. Co-Authored-By: Claude <noreply@anthropic.com>
The remote NEP_CPU library uses different naming than expected: - nep.ann → nep.annmb (ANN annmb member) - w0/b0/w1 are const double* [94] arrays, pass w0[0] etc. Co-Authored-By: Claude <noreply@anthropic.com>
The function was defined in .cu but never declared, causing 'not declared in this scope' when esolver_nep.cpp includes the header. Co-Authored-By: Claude <noreply@anthropic.com>
nep_cuda_compute.cuh contains __device__ syntax that g++ cannot parse. Replace #include with extern forward declaration for the host-callable function, which is the only symbol esolver_nep.cpp needs from it. Co-Authored-By: Claude <noreply@anthropic.com>
Kernel signature was changed (added n_max_radial, dim_angular params) but call sites at lines 736 and 929 were not updated — wrong order, wrong types, missing arguments. Reorder to match kernel declaration. Co-Authored-By: Claude <noreply@anthropic.com>
When nvcc 11.x runs with GCC 11+ host, template errors occur in std::function. Auto-find gcc-10/g++-10 and pass -ccbin to nvcc. Co-Authored-By: Claude <noreply@anthropic.com>
CMAKE_CUDA_FLAGS is not applied during cmake's CUDA compiler ID test. CMAKE_CUDA_HOST_COMPILER is the proper variable and must be set before enable_language(CUDA). Co-Authored-By: Claude <noreply@anthropic.com>
Explain why passing w0[0] (first type pointer) works: NEP_CPU allocates all atom-type weights contiguously in a single block. Verified with HfO2 2-element system — energy bit-exact vs CPU. Co-Authored-By: Claude <noreply@anthropic.com>
Generated from 101_NEP_HfO2 (24 atoms) by replicating 2x in each lattice direction. 64 Hf + 128 O atoms. Lattice vectors scaled 2x. INPUT set to 200 MD steps + device gpu. Co-Authored-By: Claude <noreply@anthropic.com>
512 Hf + 1024 O atoms, lattice vectors scaled 4x. INPUT set to 200 MD NVE steps + device gpu. Co-Authored-By: Claude <noreply@anthropic.com>
esolver_nep.h included neighbor_nep.h under #ifdef __CUDA, but this header is only available when NEP_CPU library is installed. CI environments with CUDA but without NEP would fail at the include stage (before even reaching the compile). Move the include to esolver_nep.cpp where it is actually used, keeping the header dependency minimal. Co-Authored-By: Claude <noreply@anthropic.com>
- Remove hardcoded 'device gpu' from supercell test INPUTs - Change md_type to NVE (NPT fails with zero initial velocities) - Set init_vel=0 for supercell tests - Add README for S4 test case Co-Authored-By: Claude <noreply@anthropic.com>
CI has CUDA enabled but no NEP_CPU library installed. The neighbor_nep.h header requires NEP, so guard the include with both __CUDA and __NEP. Co-Authored-By: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Linked Issue
N/A — new feature: GPU acceleration for NEP potential in MD module.
Unit Tests and/or Case Tests for my changes
atoms)
What's changed?
This PR connects the previously-written NEP CUDA kernels (nep_cuda_compute.cu, three kernels: descriptor+ANN, radial
force, angular force) into ESolver_NEP::runner(). Previously these kernels existed but were never called — the runner
used CPU nep.compute() followed by GPU postprocessing only. This change completes the GPU path: CPU neighbor list → GPU
neural network forward pass → GPU postprocessing.
Key changes:
runtime
incompatibility via auto-detected GCC-10, NEP API member name correction, extern declaration for .cuh header, kernel
argument reordering)
Performance on Tesla T4 (200 NVE steps):
GPU catches up from -20% at 24 atoms to parity. The current NEP model (ANN=30-30-1, 1921 params) is too small for GPU to
outperform — the framework is correct and ready for larger models.
Any changes of core modules? (ignore if not applicable)
guarded by #ifdef __CUDA
postprocess_outputs(); existing CPU path unchanged