Move proc_states from HostMeshRef to ProcMeshRef (#4235)

Open
thomasywang wants to merge 15 commits into meta-pytorch:main from thomasywang:export-D108359910
@thomasywang thomasywang commented Jun 12, 2026

Contributor

Summary:

`HostMeshRef::proc_states` did not read `self` at all, so `ProcMeshRef::proc_states` had no reason to delegate to it instead of owning the logic itself.

This diff is a pure move of `HostMeshRef::proc_states` into `ProcMeshRef::proc_states`.

Differential Revision: D108359910

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 12, 2026
meta-codesync Bot commented Jun 12, 2026

Contributor

@thomasywang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108359910.

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request Jun 12, 2026
Summary:
Pull Request resolved: meta-pytorch#4235

Relocate the supervision poll's per-proc state query from `HostMeshRef::proc_states` to `ProcMeshRef::proc_states`, collapsing the thin `ProcMeshRef` wrapper that just delegated to it. `proc_states` used no `HostMeshRef` state (it dials each proc's host agent directly via `host_agent_ref`), so this is a pure, behavior-preserving move: same per-proc `GetState`/`KeepaliveGetState` direct posts, same recv-timeout padding with `Timeout` states, same rank-ordered `ValueMesh`, and it still returns `None` when the proc mesh has no backing host mesh. Also drops the now-unused `host_agent::ProcState` import from `host_mesh.rs`. Prep for casting this query through the host-agent cast tree in a follow-up.

Differential Revision: D108359910
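
To make the preserved behavior concrete, here is a minimal sketch of the collection pattern the summary describes: gather one reply per proc until a receive timeout, pad the ranks that never answered with a `Timeout` state, return the results in rank order, and return `None` when there is no backing host mesh. Everything in it is an assumption made for illustration (the `State` enum, the `has_host_mesh` flag, the `Vec` standing in for `ValueMesh`, and the `std::sync::mpsc` channel standing in for the direct posts); it is not the code in this diff.

```rust
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Hypothetical per-proc state; `Timeout` pads ranks that never replied.
#[derive(Debug, Clone, PartialEq)]
enum State {
    Running,
    Stopped,
    Timeout,
}

/// Collect one `(rank, state)` reply per proc until `timeout` elapses,
/// then pad the missing ranks with `State::Timeout`.
/// Returns `None` when there is no backing host mesh (modeled here as a flag).
fn proc_states(
    has_host_mesh: bool,
    num_procs: usize,
    replies: &Receiver<(usize, State)>,
    timeout: Duration,
) -> Option<Vec<State>> {
    if !has_host_mesh {
        return None;
    }
    let deadline = Instant::now() + timeout;
    // One slot per rank, so the result is rank-ordered regardless of arrival order.
    let mut slots: Vec<Option<State>> = vec![None; num_procs];
    let mut pending = num_procs;
    while pending > 0 {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match replies.recv_timeout(remaining) {
            Ok((rank, state)) if rank < num_procs => {
                // Count a rank only the first time it reports.
                if slots[rank].replace(state).is_none() {
                    pending -= 1;
                }
            }
            Ok(_) => {}      // ignore replies for ranks outside the mesh
            Err(_) => break, // deadline hit, or every sender was dropped
        }
    }
    Some(
        slots
            .into_iter()
            .map(|s| s.unwrap_or(State::Timeout))
            .collect(),
    )
}
```

Indexing the slots by rank up front is what keeps the output order independent of reply order, which is the property the rank-ordered result depends on.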
@meta-codesync meta-codesync Bot changed the title Move proc_states from HostMeshRef to ProcMeshRef Move proc_states from HostMeshRef to ProcMeshRef (#4235) Jun 12, 2026
thomasywang and others added 15 commits June 13, 2026 14:56
Summary: We want `rank` to be part of the messaging interface for messages that require it, rather than inferred from message headers, so `CastActor` will need to write the rank.

Differential Revision: D108306614
Summary: Replace `ActorMeshRef` storage of `ProcMeshRef` with `Region` plus a materialized `hyperactor_cast::CastDomainRef`, route `All` casts through that cast domain, and preserve `Choose` by materializing a singleton slice before recursing into the `All` path. `ProcMeshRef` now materializes a `CastDomainRef` for each spawned actor mesh after spawn succeeds; `ProcAgent` control-plane messages stay on a direct per-rank helper for now so `resource::Rank` rebinding remains correct until generic rank rebinding exists in `hyperactor_cast`.

Differential Revision: D105983258
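
As a rough illustration of the "materialize a singleton slice, then recurse into the `All` path" idea, the sketch below models a mesh as a slice of ranks. The `Selection` enum, the `cast_targets` function, and the deterministic `pick % len` choice are all hypothetical stand-ins; the real `Choose` selection and the cast-domain types are not shown in this PR page.

```rust
/// Hypothetical selection modes, named after the `All` / `Choose` cases in the summary.
enum Selection {
    All,
    /// Hypothetical: an index used to pick exactly one member.
    Choose(usize),
}

/// Return the ranks that a cast with `sel` would reach over `ranks`.
fn cast_targets(ranks: &[usize], sel: Selection) -> Vec<usize> {
    match sel {
        // The `All` path delivers to every member of the (possibly sliced) mesh.
        Selection::All => ranks.to_vec(),
        Selection::Choose(pick) => {
            if ranks.is_empty() {
                return Vec::new();
            }
            let i = pick % ranks.len();
            // Materialize a one-element slice and reuse the `All` path,
            // so there is only one delivery code path to maintain.
            cast_targets(&ranks[i..i + 1], Selection::All)
        }
    }
}
```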
Summary:
Pull Request resolved: meta-pytorch#4097

Add ActorMesh e2e coverage for slice cast behavior: casts reach only slice members, slice refs keep eagerly materialized cast-domain ids across clones and nested slices, and delivered `CAST_POINT` values are slice-local

Differential Revision: D105983253
Summary: Remove the legacy `CommActor` bootstrap from `ProcMesh::create` now that spawned actor meshes materialize their own `CastDomainRef` for full-mesh casts. `ProcMeshRef` no longer carries temporary root-region or root-comm-actor fields, and slicing only remaps the `ProcRef` ranks it owns.

Differential Revision: D105983252
Summary: Spawn and bind the well-known `CastActor` on host service procs alongside `HostAgent` in bootstrap, local host mesh creation, and global local-host initialization. This prepares HostAgent cast domains without changing HostMesh control-plane delivery yet.

Differential Revision: D105983255
Summary: Store a materialized `ActorMeshRef<HostAgent>` on `HostMeshRef` so host meshes have the same descriptor shape as actor meshes without yet changing HostMesh control-plane send paths. Materialization now happens at HostMeshRef construction/slicing, and HostMesh identity ignores the derived cast-domain state.

Differential Revision: D105983250
Differential Revision: D107589984
Summary:
Pull Request resolved: meta-pytorch#4101

Use the materialized HostAgent cast domain for `DrainHost` and `ShutdownHost` barriers. These messages now implement `Bind`/`Unbind` and use `OncePortRef<()>` replies so they can be carried through `CastDomainRef::cast` and acknowledged through the unit accumulator.

Differential Revision: D105983256
Summary: Send `SetClientConfig` through the HostAgent cast domain and wait on a unit-reducer acknowledgement barrier. The attach path still returns `ConfigPushError` on cast, timeout, or receiver-close failure; collective failures conservatively report every host whose acknowledgement was not observed.

Differential Revision: D105983254
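
The conservative failure reporting described above can be pictured as a set difference: any host whose acknowledgement was not observed is reported, whether or not it actually applied the config. The sketch below assumes hosts are identified by plain strings and uses a hypothetical `unacknowledged` helper; it does not reproduce `ConfigPushError` or the cast machinery.

```rust
use std::collections::BTreeSet;

/// Hosts to report as failed: every expected host with no observed acknowledgement.
/// Order follows `expected`, so the report is deterministic.
fn unacknowledged(expected: &[&str], acked: &[&str]) -> Vec<String> {
    let seen: BTreeSet<&str> = acked.iter().copied().collect();
    expected
        .iter()
        .filter(|host| !seen.contains(*host))
        .map(|host| host.to_string())
        .collect()
}
```

An empty result corresponds to a completed barrier; a non-empty one lists the hosts a caller would surface in the error.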
Summary: Reuse the HostAgent cast domain for best-effort `HostMeshShutdownGuard` drop cleanup instead of iterating direct `HostRef::shutdown` sends. This removes the direct shutdown helper while keeping explicit `HostMesh::shutdown` as the preferred deterministic teardown path.

Differential Revision: D105983251
Summary:
Pull Request resolved: meta-pytorch#4104

Replace the `SpawnProcs` shape with casted `CreateOrUpdate<ProcSpec>` messages. `HostMeshRef::spawn` now sends one cast per per-host proc slot, and each `HostAgent` derives its concrete proc name and absolute proc rank from `CAST_POINT` plus `ProcMeshSpawnContext`, while preserving the direct `CreateOrUpdate<ProcSpec>` path for already-concrete updates.

Differential Revision: D106402900
Differential Revision: D106402902
Summary: Empirically, we found that run-length encoding (RLE) the seqs makes a significant difference at scale.

Differential Revision: D108359903
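
For readers unfamiliar with the technique, here is a minimal run-length encoding sketch over a list of sequence numbers. It assumes the seqs are `u64` values and that the savings come from long runs of equal values; the actual wire format and types used in this commit are not shown here, and `rle_encode` / `rle_decode` are hypothetical names.

```rust
/// Run-length encode: collapse each run of equal values into `(value, count)`.
fn rle_encode(seqs: &[u64]) -> Vec<(u64, usize)> {
    let mut runs: Vec<(u64, usize)> = Vec::new();
    for &s in seqs {
        // Extend the current run when the value repeats.
        if let Some((value, count)) = runs.last_mut() {
            if *value == s {
                *count += 1;
                continue;
            }
        }
        // Otherwise start a new run.
        runs.push((s, 1));
    }
    runs
}

/// Inverse of `rle_encode`: expand each `(value, count)` run back into values.
fn rle_decode(runs: &[(u64, usize)]) -> Vec<u64> {
    runs.iter()
        .flat_map(|&(value, count)| std::iter::repeat(value).take(count))
        .collect()
}
```

When most entries share the same value, the encoded form grows with the number of runs rather than the number of entries, which is where the benefit at scale would come from under this assumption.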
Differential Revision: D108359902
Summary:
Pull Request resolved: meta-pytorch#4235

`HostMeshRef::proc_states` did not read `self` at all, so there was no reason why `ProcMeshRef::proc_states` had to delegate to it instead of just owning the logic itself.

This diff is just a move of `HostMeshRef::proc_states` into `ProcMeshRef::proc_states`

Differential Revision: D108359910
Labels

CLA Signed (this label is managed by the Meta Open Source bot), meta-exported, module: rocm
