Introduce Mega-C++ to reduce CPU overhead by zhongbozhu · Pull Request #3099 · NVIDIA/TransformerEngine

zhongbozhu · 2026-06-06T07:38:24Z

Description

Assistant: GPT5.5 codex

Reduces CPU overhead in cases where CUDA Graphs are not applicable. Guarded by the NVTE_MEGACPP_GROUPED_LINEAR environment variable.

Drop-in replacement for the grouped MLP, i.e. FC1 - act - FC2. Targets BF16 grouped GEMM with the cuBLAS grouped GEMM backend.
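For readers unfamiliar with the pattern, the sketch below spells out what "FC1 - act - FC2" means for a grouped MLP: each contiguous group of token rows gets its own pair of weight matrices. It is a pure-Python reference for the semantics only. The function names, the list-of-rows layout, and the GELU stand-in are assumptions made for illustration; they are not Transformer Engine APIs, and the actual path operates on BF16 GPU tensors through cuBLAS.

```python
import math


def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(x):
    """Tanh approximation of GELU, used here only as a stand-in activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def grouped_mlp_reference(tokens, split_sizes, fc1_weights, fc2_weights, act=gelu):
    """Apply FC1 -> act -> FC2 independently to each contiguous group of rows.

    tokens:      list of rows, already sorted so that each group is contiguous
    split_sizes: number of rows belonging to each group
    fc1_weights: one (in x hidden) matrix per group
    fc2_weights: one (hidden x out) matrix per group
    """
    assert sum(split_sizes) == len(tokens)
    out, start = [], 0
    for size, w1, w2 in zip(split_sizes, fc1_weights, fc2_weights):
        group = tokens[start:start + size]
        start += size
        if not group:
            continue  # a group may receive zero rows
        hidden = matmul(group, w1)                           # FC1
        hidden = [[act(v) for v in row] for row in hidden]   # activation
        out.extend(matmul(hidden, w2))                       # FC2
    return out
```

When a loop like this is driven from Python on the host, each group and each stage costs separate host-side dispatch work, which is the kind of overhead the description aims to reduce by moving the whole sequence into one C++ call.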

In the future, we can extend this to mxfp8 / nvfp4 with the cuBLAS backend, or even to a CuteDSL grouped GEMM by calling cute.jit from C++: NVIDIA/cutlass#3289

Recommended: CUDA >= 13.2.1

TODO:

E2E training with some multimodal THD packing workloads
Attach before & after screenshots

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

for more information, see https://pre-commit.ci

zhongbozhu · 2026-06-06T07:42:34Z

/te-ci pytorch L1

timmoon10 · 2026-06-08T18:25:25Z

  m.def("te_general_grouped_gemm_for_discrete_out",
        &transformer_engine::pytorch::te_general_grouped_gemm_for_discrete_out,
        "Grouped GEMM for discrete output list");
+  m.def("megacpp_grouped_mlp_forward", &transformer_engine::pytorch::megacpp_grouped_mlp_forward,


We should expose these functions within the tex.grouped_mlp_experimental submodule:

TransformerEngine/transformer_engine/pytorch/csrc/extensions/pybind.cpp, lines 647 to 650 in 3fffa55:

  // Experimental fused grouped MLP
  auto grouped_mlp_experimental = m.def_submodule(
      "grouped_mlp_experimental",
      "Experimental helpers for the fused grouped MLP (unstable, may change or disappear).");

timmoon10 · 2026-06-08T18:33:59Z

It would make more sense to organize:

csrc/
├── extensions/
│   ├── grouped_mlp_experimental/
│   │   ├── megacpp.cpp
│   │   └── grouped_mlp_experimental.cpp
│   ├── pybind.cpp
│   └── ...

If we implement more mega-C++ impls in the future, I don't see a reason why they would be more similar to each other than to the block they are fusing.

timmoon10 · 2026-06-08T18:49:08Z

+    name: str
+    is_scaled: bool
+    is_gated: bool
+    glu_interleave_size: int


Is it worth supporting GLU interleaving in the mega-C++ path? The only benefit is to support the fused GEMM+GLU kernel, and otherwise the unnecessary memory-bound kernel means perf is a lost cause. If we can simplify our optimized code paths, then it's worth it.

"The only benefit is to support the fused GEMM+GLU kernel"

I do hope that in the future we can launch CuteDSL fused kernels from C++ with some TVM-FFI tricks; otherwise we are forced to choose between better kernel fusions and lower CPU overhead. Currently the CuteDSL fusion path is heavily CPU-bound for small models, and we rely on CUDA Graphs and paged stashing for it to work well.

timmoon10 · 2026-06-08T18:55:16Z

+# Explicit env opt-in gives megacpp first chance. Unsupported recipes intentionally
+# return the ops unchanged so lower-priority recipe-specific fusers remain the
+# fallback path.
+register_forward_fusion(fuse_forward_megacpp_ops, prepend=True)


The GEMM+act fusions provide better GPU perf, so I think they should take higher priority than mega-C++. Basically, I see mega-C++ as "we can't do any better on GPU than the unfused impl, but at least we can make the CPU overhead very small".
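To make the priority question concrete, here is a minimal sketch of how registration order decides which fuser claims a pattern. The FusionRegistry class, the string op names, and both toy fusers are hypothetical stand-ins written for this illustration. Only the idea of a prepend flag comes from the diff above; Transformer Engine's real registry may work differently.

```python
from typing import Callable, List

Fuser = Callable[[List[str]], List[str]]


class FusionRegistry:
    """Hypothetical registry: fusers run in list order, each rewriting the op list."""

    def __init__(self) -> None:
        self._fusers: List[Fuser] = []

    def register(self, fuser: Fuser, prepend: bool = False) -> None:
        # prepend=True puts the fuser at the front, so it sees the ops first.
        if prepend:
            self._fusers.insert(0, fuser)
        else:
            self._fusers.append(fuser)

    def apply(self, ops: List[str]) -> List[str]:
        for fuser in self._fusers:
            ops = fuser(ops)
        return ops


def _replace_pattern(ops: List[str], pattern: List[str], fused: str) -> List[str]:
    """Replace every occurrence of `pattern` in `ops` with the single op `fused`."""
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + len(pattern)] == pattern:
            out.append(fused)
            i += len(pattern)
        else:
            out.append(ops[i])
            i += 1
    return out


def fuse_gemm_act(ops: List[str]) -> List[str]:
    """Toy stand-in for a GEMM+activation fusion."""
    return _replace_pattern(ops, ["fc1", "act"], "fused_gemm_act")


def fuse_megacpp(ops: List[str]) -> List[str]:
    """Toy stand-in for the mega-C++ grouped MLP fusion."""
    return _replace_pattern(ops, ["fc1", "act", "fc2"], "megacpp_mlp")
```

In this toy model, whichever fuser runs first consumes the ops the other one needs: with fuse_megacpp prepended, ["fc1", "act", "fc2"] becomes ["megacpp_mlp"], while with fuse_gemm_act in front the result is ["fused_gemm_act", "fc2"] and the mega-C++ pattern no longer matches.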

The current order is as follows:

Check the env var.

If the env var is 1, check whether the recipe is supported by mega-C++: bf16 is supported, mxfp8 / nvfp4 are not.

For mxfp8 / nvfp4, mega-C++ falls back and the next fusion is checked.

The reasoning is that I do not want to be forced to choose between better fusion and lower host overhead. For future mxfp8 support, we can do the following two things:

Integrate CuteDSL directly via tvm-ffi, with cuBLAS as a backup plan.

Maybe add a new value, NVTE_MEGACPP_GROUPED_LINEAR=forced, so that users who cannot enable CUDA Graphs for some reason can force the C++ path when they know their training is host-bound.
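The three-step order described in this reply can be sketched as a single function. This is an illustration under stated assumptions, not the PR's code: the function name, the string recipe labels, the op-name list, and the rule that only the exact value "1" enables the path are all hypothetical. The proposed "forced" value is deliberately not modeled, since its semantics are still open in the discussion.

```python
import os

# From the thread: only bf16 is handled today; mxfp8 / nvfp4 fall through.
MEGACPP_SUPPORTED_RECIPES = frozenset({"bf16"})


def megacpp_enabled(env=None):
    """Step 1: check the env var (assumed here: only the exact value "1" opts in)."""
    env = os.environ if env is None else env
    return env.get("NVTE_MEGACPP_GROUPED_LINEAR", "0") == "1"


def fuse_forward_megacpp_sketch(ops, recipe, env=None):
    """Return a fused op list, or `ops` unchanged so later fusers still get a chance."""
    if not megacpp_enabled(env):
        return ops  # opt-in not set: leave everything to the other fusers
    if recipe not in MEGACPP_SUPPORTED_RECIPES:
        return ops  # steps 2 and 3: unsupported recipe, fall back to the next fusion
    if ops == ["fc1", "act", "fc2"]:
        return ["megacpp_grouped_mlp"]  # hypothetical name for the fused op
    return ops
```

Returning the ops unchanged, rather than raising, is what lets lower-priority recipe-specific fusers act as the fallback path.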

megacpp

afc993a

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Jun 6, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

1120ec9

for more information, see https://pre-commit.ci

timmoon10 reviewed Jun 8, 2026

zhongbozhu wants to merge 2 commits into NVIDIA:main from zhongbozhu:main_megacpp_grouped_mlp
