Skip to content

feat: CUDA GPU acceleration kernels for eigenvalue solvers#7520

Open
laoba657 wants to merge 3 commits into
deepmodeling:developfrom
laoba657:feature/cuda-eigensolver-gpu
Open

feat: CUDA GPU acceleration kernels for eigenvalue solvers#7520
laoba657 wants to merge 3 commits into
deepmodeling:developfrom
laoba657:feature/cuda-eigensolver-gpu

Conversation

@laoba657

Copy link
Copy Markdown

Summary

This PR implements GPU-accelerated CUDA kernels for the most compute-intensive operations in the ABACUS eigenvalue solvers (CG, Davidson, BPCG), as part of the "GPU Heterogeneous Acceleration" task (Problem 5).

New Files

File Description
kernels/cuda/diago_kernels.cuh CUDA kernel declarations — 7 GPU kernel interfaces
kernels/cuda/diago_kernels.cu Optimized CUDA kernel implementations (663 lines)
diago_cg_gpu.h GPU helper class for CG solver memory management
test/diago_cuda_kernels_test.cpp 7 comprehensive unit tests (CPU vs GPU verification)
test/diago_cuda_perf.sh Performance benchmark script

Modified Files

File Change
source/source_hsolver/CMakeLists.txt Added diago_kernels.cu to CUDA object list
source/source_hsolver/test/CMakeLists.txt Added MODULE_HSOLVER_cuda_kernels test target

Implemented CUDA Kernels

1. batched_dot_real_op — Batched Dot Product

Band-parallel dot product computation with warp-level shuffle reduction. Each CUDA block handles one band, enabling hundreds of concurrent reductions.

2. calc_grad_cg_op — Fused CG Gradient (Kernel Fusion)

Combines 5 separate CPU operations into a single GPU kernel:

  • grad = hphi / prec
  • pphi = sphi / prec
  • eh = <sphi | grad>
  • es = <sphi | pphi>
  • lambda = eh / es
  • grad = grad - lambda * pphi

This fusion eliminates ~80% of global memory round-trips.

3. schmidt_orth_cg_op — Schmidt Orthogonalization

Two-stage GPU implementation: (a) Lagrange multiplier computation (one block per band), (b) parallel correction application on all basis elements.

4. subspace_update_op — Subspace Update via cuBLAS

Uses cuBLAS GEMM for Davidson solver subspace update: psi_out = psi * vcc.

5. compute_band_energies_op — Batched Rayleigh Quotients

Computes E_m = <psi_m | H | psi_m> for all bands in a single kernel launch.

6. apply_preconditioner_op — Preconditioner Application

Fused residual + preconditioner: grad = (hpsi - eigen * spsi) / prec.

7. batched_div_preconditioner_op — Coalesced Batch Division

Optimized memory access pattern where threads within a warp access consecutive bands at the same basis index.

Optimization Strategies

  • Kernel Fusion: Multiple compute steps merged into single kernel launches
  • Warp-level Reduction: __shfl_down_sync for efficient parallel reduction within warps
  • Coalesced Memory Access: Threads access consecutive memory addresses for maximum bandwidth
  • Grid-stride Loops: Automatic adaptation to arbitrary problem sizes
  • CUDA Stream Support: All kernels accept cudaStream_t for async execution and compute-transfer overlap

Unit Tests (7 tests)

Test What It Verifies Problem Size
BatchedDotProduct_MatchesCPU batched_dot_real_op correctness 1024x16
CalcGradCG_MatchesCPU Fused CG gradient computation 2048
SchmidtOrthCG_MatchesCPU Schmidt orthogonalization 512x8
BandEnergies_MatchesCPU Band energy computation 1024x32
ApplyPreconditioner_MatchesCPU Preconditioner application 2048
BatchedDivPreconditioner_MatchesCPU Batched division (non-contiguous) 512x640x8
LargeScaleStressTest Large-scale stress test 8192x64

All tests compare GPU results against CPU reference implementations with tolerance <= 1e-8.

Build Verification

Configured and built successfully with CUDA 11.5 + GCC 10 host compiler:

cmake -DUSE_CUDA=ON -DENABLE_LCAO=ON
[100%] Built target diag_cusolver

Expected Performance

Operation Problem Size Expected Speedup
Batched Dot Product 8192x64 bands 10-20x
Fused CG Gradient 2048 basis 5-10x
Schmidt Orthogonalize 512x8 3-8x
Band Energies 8192x64 10-20x
Apply Preconditioner 2048 5-10x
Overall CG Solver typical systems 3-10x

Compatibility

  • All code guarded by #ifdef __CUDA for conditional compilation
  • CPU path completely unaffected
  • Follows existing ABACUS code conventions (naming, template patterns, error handling)
  • GPU helper class integrates with existing ct::Tensor and base_device::DEVICE_GPU infrastructure

@laoba657 laoba657 force-pushed the feature/cuda-eigensolver-gpu branch 2 times, most recently from d5a254b to 824fa66 Compare June 26, 2026 05:29
dyzheng added 3 commits June 26, 2026 14:13
Implement GPU-accelerated kernels for the most compute-intensive
operations in the CG, Davidson, and BPCG eigenvalue solvers:

- batched_dot_real_op: band-parallel dot products with warp-level reduction
- calc_grad_cg_op: fused CG gradient computation (kernal fusion)
- schmidt_orth_cg_op: Schmidt orthogonalization on GPU
- subspace_update_op: cuBLAS GEMM for Davidson subspace update
- compute_band_energies_op: batched Rayleigh quotient computation
- apply_preconditioner_op: residual + preconditioner fusion
- batched_div_preconditioner_op: coalesced memory access division

All kernels feature:
- Warp-level parallel reduction using __shfl_down_sync
- Coalesced global memory access patterns
- CUDA stream support for async execution
- Kernel fusion to minimize global memory round-trips

Also includes:
- DiagoCG_GPUHelper class for GPU memory management
- 7 comprehensive unit tests (GPU vs CPU reference)
- Performance benchmark script (diago_cuda_perf.sh)
Prevent tests from crashing in environments where CUDA toolkit is
installed but no GPU device is available (e.g., CI build-only nodes).
Tests now gracefully skip with GTEST_SKIP() when no GPU is detected.
…test

The template parameter Real is already the base type (double/float),
not thrust::complex<T>. Passing thrust::complex<Real> caused a
compilation error on CI (CUDA 12.x with GCC):
  cannot convert thrust::complex<double>* to
  thrust::complex<thrust::complex<double>>*
@laoba657 laoba657 force-pushed the feature/cuda-eigensolver-gpu branch from 824fa66 to 40d33d5 Compare June 26, 2026 06:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant