[QDP] [feature] Pr5 implicit hadamard engine by aloha1357 · Pull Request #1390 · apache/mahout

aloha1357 · 2026-06-07T19:34:03Z

Related Issues

related #1385

Changes

Why

As established in the previous Kronecker Decomposition PR, memory is a significant bottleneck when processing high-qubit circuits ($N \ge 14$). A dense representation of the full Hadamard transform is a $2^N \times 2^N$ matrix, so its storage grows as $O(4^N)$ and quickly exceeds the available GPU VRAM, causing out-of-memory errors.
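To make the $O(4^N)$ growth concrete, here is a minimal Rust sketch of the storage arithmetic. The function name `dense_hadamard_bytes` and the per-entry sizes are illustrative assumptions, not code from this PR.

```rust
/// Bytes needed to store a dense 2^N x 2^N Hadamard matrix.
///
/// `bytes_per_entry` is an assumption chosen by the caller
/// (e.g. 1 for INT8, 4 for FP32, 8 for FP64).
/// Returns `None` if the size does not fit in a `u64`.
fn dense_hadamard_bytes(num_qubits: u32, bytes_per_entry: u64) -> Option<u64> {
    // The matrix has 2^N rows and 2^N columns, i.e. 4^N = 2^(2N) entries.
    let shift = num_qubits.checked_mul(2)?;
    let entries = 1u64.checked_shl(shift)?;
    entries.checked_mul(bytes_per_entry)
}
```

With 8-byte entries, `dense_hadamard_bytes(14, 8)` is 2 GiB and `dense_hadamard_bytes(16, 8)` is 32 GiB: each added qubit multiplies the footprint by four.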

Even with the Kronecker Decomposition splitting the matrix into smaller blocks, generating and storing the explicit dense $H$ blocks in memory before applying Tensor Core operations still spends memory and bandwidth on data that can be computed from the indices alone.

We need a way to perform Dense Matrix Multiplications (GEMM) on the Tensor Cores without ever storing the Hadamard Matrix in Global Memory.

How

This PR introduces the Matrix-Free Implicit Hadamard Ozaki Engine.

Implicit Matrix Generation: The ImplicitHadamardOzakiEngine uses the closed form of the Hadamard matrix entries, $h_{i,j} = (-1)^{\text{popc}(i \,\&\, j)}$, where popc counts the set bits in the bitwise AND of the row and column indices, to compute the matrix elements on the fly directly in Shared Memory.
Ozaki Multi-pass Tensor Core Execution: Using the Ozaki INT8 scheme, the GEMM runs natively in hardware on the .m16n8k32.s8 Tensor Core instructions. Because every Hadamard entry is $\pm 1$, the Hadamard operand is represented exactly in INT8 and contributes no quantization error.
Removed Fallback: Replaced the naive_implicit_hadamard_gemm_kernel placeholder from PR 4 with the actual calls to engine.execute_implicit_hadamard.
Build System Fix: Updated build.rs to drop the unsupported sm_75 (Turing) fallback target, because the .m16n8k32.s8 Tensor Core instruction requires sm_80 (Ampere) or higher.
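The implicit-generation idea in the list above can be illustrated on the CPU. The following is a minimal Rust sketch, not the PR's CUDA kernel: `hadamard_entry` and `implicit_hadamard_matvec` are hypothetical names, the input is assumed to already be a single INT8 slice (as one pass of an Ozaki-style split might produce), and products are accumulated in 32-bit integers, as the s8 Tensor Core path does.

```rust
/// Entry (i, j) of the unnormalized 2^N x 2^N Hadamard matrix:
/// h[i][j] = (-1)^popcount(i & j).
/// The result is always +1 or -1, so it is exact in INT8.
fn hadamard_entry(i: usize, j: usize) -> i8 {
    if (i & j).count_ones() % 2 == 0 {
        1
    } else {
        -1
    }
}

/// Matrix-free product y = H * x for an INT8 input vector.
/// Each matrix entry is recomputed from its indices, so no
/// 2^N x 2^N matrix is ever allocated. Accumulation is in i32.
fn implicit_hadamard_matvec(x: &[i8]) -> Vec<i32> {
    assert!(x.len().is_power_of_two(), "input length must be 2^N");
    (0..x.len())
        .map(|i| {
            x.iter()
                .enumerate()
                .map(|(j, &xj)| i32::from(hadamard_entry(i, j)) * i32::from(xj))
                .sum::<i32>()
        })
        .collect()
}
```

For example, `implicit_hadamard_matvec(&[1, 2, 3, 4])` returns `[10, -2, -4, 0]`, which matches multiplying by the explicit 4x4 matrix. A GPU version would generate the same entries per tile in Shared Memory instead of looping over every index pair.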

Checklist

Added or updated unit tests for all changes (Verified passing against existing CI test suite)
Added or updated documentation for all changes (Added explanatory inline comments for PR)

aloha1357 added 7 commits June 7, 2026 18:33

feat(qdp): optimize phase kernel divergence and hoist constant mem computation

41c0f33

style(qdp): add explanatory comments for phase and iqp kernel optimizations

ca90282

feat(qdp): introduce batch throughput optimization scaffolding for TC

c7351cb

feat(qdp): introduce batch throughput optimization scaffolding for TC

60fda91

feat(qdp): introduce shared memory fused FWT for small qubit counts

3060ab9

feat(qdp): restructure FWT into Kronecker decomposition blocked architecture

38ee656

feat(qdp): implement Matrix-Free Implicit Hadamard Tensor Core engine

9e575be

aloha1357 requested review from 400Ping, guan404ming and ryankert01 as code owners June 7, 2026 19:34

aloha1357 wants to merge 7 commits into apache:main from aloha1357:pr5-implicit-hadamard-engine
