[Feature] Train infer disaggregated by jiapingW · Pull Request #523 · sgl-project/SpecForge

jiapingW · 2026-04-02T08:07:03Z

Motivation

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

[✅] Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.

gemini-code-assist

Code Review

This pull request introduces a Ray-based distributed architecture for SpecForge, enabling both colocated and disaggregated (training-inference separation) training modes. The changes include new Ray-based worker groups for rollout and training, a centralized orchestrator, and support for NCCL-based GPU-to-GPU data transfer. My feedback highlights performance bottlenecks in the rollout dispatch logic, potential runtime errors in the DataCollator initialization, risks associated with clearing the global device mesh, the need for robust error handling when waiting for distributed workers, and unnecessary synchronization in the data transfer utility.

gemini-code-assist · 2026-04-02T08:10:04Z

+            for dp_idx in range(dp_size):
+                data_batch, actual_count = self._fetch_multi_local(
+                    self._rollout_batch_size
+                )
+                if data_batch is None:
+                    break
+                per_dp_count = actual_count
+
+                send_ref = self.rollout_group.generate_and_send_single(
+                    tp_idx, data_batch, [sp_leader_ranks[dp_idx]]
+                )
+                send_refs.append(send_ref)


The current implementation performs dp_size separate forward passes on the target model per logical training step. Since the target model is typically much larger than the draft model, this creates a significant performance bottleneck, especially as the number of DP groups increases.

Consider batching all dp_size requests into a single forward pass on the RolloutWorkerGroup (with a total batch size of dp_size * rollout_batch_size), then sharding and sending the results to the respective TrainWorker groups. This would leverage GPU parallelism much more effectively for the target model inference.
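The batching idea above can be sketched in plain Python. This is a minimal illustration, not the PR's code: `batched_rollout` and `forward_fn` are hypothetical names, and lists stand in for tensors. It merges the per-DP batches, runs one forward call, and splits the outputs back in the original order.

```python
from typing import Any, Callable, List, Sequence


def batched_rollout(
    per_dp_batches: Sequence[List[Any]],
    forward_fn: Callable[[List[Any]], List[Any]],
) -> List[List[Any]]:
    """Run one forward call over all DP shards, then split the results per shard.

    `forward_fn` is a hypothetical stand-in for the target-model forward pass;
    it must return exactly one output per input sample, in order.
    """
    # Remember each shard's size so outputs can be split back in order.
    sizes = [len(batch) for batch in per_dp_batches]
    # Concatenate all shards into one large batch.
    merged = [sample for batch in per_dp_batches for sample in batch]
    # A single call replaces dp_size separate calls.
    outputs = forward_fn(merged)
    if len(outputs) != len(merged):
        raise ValueError("forward_fn must return one output per input sample")
    shards: List[List[Any]] = []
    start = 0
    for size in sizes:
        shards.append(outputs[start:start + size])
        start += size
    return shards
```

In the real system the split results would then be sent to the matching TrainWorker group; whether a batch of dp_size * rollout_batch_size fits in target-model memory is a separate question this sketch does not address.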

gemini-code-assist · 2026-04-02T08:10:04Z

+    def __init__(self, sp_degree=None, ulysses_degree=None):
+        if sp_degree is not None:
+            self.sp_degree = sp_degree
+        else:
+            self.sp_degree = torch.distributed.get_world_size(get_draft_sp_group())
+        if ulysses_degree is not None:
+            self.ulysses_degree = ulysses_degree
+        else:
+            self.ulysses_degree = torch.distributed.get_world_size(
+                get_sp_ulysses_group()
+            )


Calling torch.distributed.get_world_size() in the constructor of DataCollatorWithPadding will raise a RuntimeError if the collator is instantiated in a process where torch.distributed is not yet initialized (e.g., the driver process during dataset pre-building or in the orchestrator before workers are launched).

While the current RayOrchestrator passes these values explicitly, other utility functions like prepare_dp_dataloaders use the default constructor, which could lead to crashes if called outside a distributed context. Consider deferring the world size check until the first call to __call__ or providing safe defaults.
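One way to defer the lookup is sketched below, under stated assumptions: `LazyDegreeCollator`, `sp_degree_resolver`, and the padding behavior are hypothetical and only illustrate the pattern. In real code the resolver would wrap a call such as `torch.distributed.get_world_size(group)`, which is only safe once the process group exists.

```python
from typing import Callable, List, Optional


class LazyDegreeCollator:
    """Resolve the SP degree on first use instead of in __init__ (illustrative)."""

    def __init__(
        self,
        sp_degree: Optional[int] = None,
        sp_degree_resolver: Optional[Callable[[], int]] = None,
    ) -> None:
        self._sp_degree = sp_degree
        # Hypothetical hook; it is NOT called at construction time.
        self._resolver = sp_degree_resolver

    @property
    def sp_degree(self) -> int:
        if self._sp_degree is None:
            # Fall back to 1 when no resolver is given (safe default).
            self._sp_degree = self._resolver() if self._resolver else 1
        return self._sp_degree

    def __call__(self, sequences: List[List[int]]) -> List[List[int]]:
        # Assumed behavior: pad to a common length that is a multiple of sp_degree.
        degree = self.sp_degree
        max_len = max(len(seq) for seq in sequences)
        target = -(-max_len // degree) * degree  # ceiling to a multiple of degree
        return [seq + [0] * (target - len(seq)) for seq in sequences]
```

Constructing the collator in a driver process then has no side effects; the resolver runs once, on the first `__call__`, inside the worker.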

gemini-code-assist · 2026-04-02T08:10:05Z

+    _SP_RING_GROUP = PROCESS_GROUP.RING_PG if sp_size > 1 else my_draft_sp_group
+    _TP_DEVICE_MESH = dist.DeviceMesh.from_group(my_tp_group, device_type="cuda")
+    _DP_DEVICE_MESH = dist.DeviceMesh.from_group(my_dp_group, device_type="cuda")
+    _DEVICE_MESH = None  # 2D mesh not available in subgroup mode


Setting _DEVICE_MESH = None in init_distributed_from_subgroup might cause failures in other parts of the codebase that rely on get_device_mesh(). While 1D meshes (_TP_DEVICE_MESH, _DP_DEVICE_MESH) are initialized, some FSDP configurations or monitoring tools in the existing codebase might expect the global 2D mesh to be present.
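If the 2D mesh genuinely cannot be built in subgroup mode, one mitigation is to make the accessor fail loudly with the reason rather than hand `None` to callers. The sketch below is an assumption about how that could look; `MeshRegistry` and its methods are hypothetical and do not exist in SpecForge.

```python
from typing import Any, Optional


class MeshRegistry:
    """Hold a device mesh, or the reason why none is available (illustrative)."""

    def __init__(self) -> None:
        self._mesh: Optional[Any] = None
        self._reason = "distributed state has not been initialized"

    def set(self, mesh: Any) -> None:
        self._mesh = mesh

    def mark_unavailable(self, reason: str) -> None:
        # Record why the mesh is missing so the error below is actionable.
        self._mesh = None
        self._reason = reason

    def get(self) -> Any:
        if self._mesh is None:
            raise RuntimeError(f"global device mesh unavailable: {self._reason}")
        return self._mesh
```

A caller that needs the 2D mesh then gets an immediate, descriptive error at the call site instead of an `AttributeError` on `None` somewhere downstream.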

gemini-code-assist · 2026-04-02T08:10:05Z

+        if self._enable_perf:
+            t3 = time.perf_counter()
+
+        metrics = ray.get(train_refs[0])


ray.get(train_refs[0]) only waits for the first worker (rank 0) to complete. If any other worker in the distributed group encounters an error or is significantly slower, the orchestrator may proceed to the next step prematurely or hang in subsequent collective operations, making debugging difficult.

It is safer to wait for all workers to ensure consistency and catch exceptions occurring on non-zero ranks.

Suggested change

-        metrics = ray.get(train_refs[0])
+        metrics_list = ray.get(train_refs)
+        metrics = metrics_list[0]
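The difference between waiting on one result and waiting on all of them can be shown without a Ray cluster. The sketch below swaps in `concurrent.futures` from the standard library as an analogy, not Ray itself; `train_step` and `run_step` are hypothetical. Rank 1 fails, and the failure only surfaces when every future is collected.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def train_step(rank: int) -> Dict[str, float]:
    """Hypothetical per-worker step; rank 1 simulates a failure."""
    if rank == 1:
        raise RuntimeError(f"rank {rank} failed")
    return {"rank": float(rank), "loss": 0.5}


def run_step(world_size: int, wait_all: bool) -> Dict[str, float]:
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        futures = [pool.submit(train_step, rank) for rank in range(world_size)]
        if wait_all:
            # Collecting every result re-raises any worker's exception here.
            results = [future.result() for future in futures]
            return results[0]
        # Only rank 0 is inspected; a failure on another rank goes unnoticed.
        return futures[0].result()
```

With `wait_all=False` the call returns rank 0's metrics even though rank 1 raised; with `wait_all=True` the `RuntimeError` propagates to the caller. Calling `ray.get` on the full list of references gives the same guarantee in the orchestrator.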

gemini-code-assist · 2026-04-02T08:10:05Z

+        position_ids=_to(batch.position_ids),
+    )
+    if needs_sync:
+        torch.cuda.synchronize()


torch.cuda.synchronize() is a heavy operation that stalls the CPU until all GPU tasks are finished, which can reduce the benefits of using non_blocking=True for overlapping transfers.

Since this is called immediately before the forward pass, you can rely on the default stream's serialization or use CUDA events for more fine-grained synchronization if multiple streams are involved.

FrankLeeeee · 2026-04-02T15:26:35Z

Need to add ray to pyproject.toml.

moehanabi · 2026-06-29T07:37:57Z

Hi @jiapingW — apologies, I only came across this PR today, otherwise I'd have chimed in earlier 😂. I'd independently been working on the same train/inference disaggregation problem in #573.

The two take fairly different approaches: yours is built on Ray with orchestrated worker groups, while #573 is Ray-free and centers on a remote target-serving design with an async prefetch pipeline (configurable depth + multi-server round-robin scheduling) to overlap target inference with draft training. We measured up to ~2.37x speedup with dual-server prefetch. It also adds GPU-direct NCCL transfer with a wire-format fallback, TP broadcast, and all-to-all sharding (committed at https://github.com/moehanabi/SpecForge/tree/remote_train_sharded_nccl, not yet merged into my PR).

jiapingW · 2026-06-29T16:17:14Z

Hi, nice work! We are currently developing train/infer disaggregation and refactoring the code to make it more maintainable. This feature will be completed in the next two days, and we welcome your further optimizations at that time.

jiapingW · 2026-06-29T16:20:00Z

We also hope to decouple the system and improve the efficiency of online training by running the model via an SGL server instead of an SGL model instance.

moehanabi · 2026-06-30T06:16:06Z

Great work! Hope to see it soon!

moehanabi · 2026-06-30T06:25:23Z

I saw many runtime-related PRs, such as #618. Are they all part of this "refactoring the code" effort? Are you working on it together?

jiapingW added 2 commits April 1, 2026 03:36

draft:support train infer disaggre

fb03a7c

support train eagle3/dflash online disaggregated|colocated

98b9adb

jiapingW requested review from FlamingoPg, shuaills, sleepcoo and zyksir as code owners April 2, 2026 08:07

gemini-code-assist Bot reviewed Apr 2, 2026

Merge branch 'main' into train_infer_disaggre

a01862a

jiapingW wants to merge 3 commits into main from train_infer_disaggre

3 participants
