Cast proc_states via GetHostProcStates through the host tree (#4238)

Open: thomasywang wants to merge 16 commits into meta-pytorch:main from thomasywang:export-D108359901


thomasywang (Contributor) commented Jun 12, 2026

Summary:

Convert `ProcMeshRef::proc_states` (the supervision poll's per-proc state query) from per-proc direct `GetState`/`KeepaliveGetState` posts, which cost O(procs) direct dials from the client, into a single cast over the host-agent mesh, so the read fans in through the cast tree instead.

Add `GetHostProcStates`: the caller casts ONE of these; each `HostAgent` combines its cast-stamped `rank` with `num_per_host` to re-derive the proc resource ids for its slots (via `proc_name`) and replies with a batch of their states. The client thus sees O(hosts) batched replies reduced up the tree (to cast actor 0) instead of O(procs) individual ones dialed directly.

The per-proc `State` builder is factored into `HostAgent::proc_state`, shared by the `GetState` and `GetHostProcStates` handlers. `GetState<ProcState>` cannot be cast this way because it carries a fully-resolved id; this message re-derives the ids host-side. If `keepalive` is `Some`, each proc's expiry is extended (same orphan-protection semantics as `KeepaliveGetState`).

Builds on the prep commit that relocated `proc_states` to `ProcMeshRef`.
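To make the fan-in concrete, here is a minimal sketch of how a host might turn its cast-stamped rank and `num_per_host` into the set of proc ranks it answers for. It is not the PR's code: `slot_ranks` and `host_reply` are hypothetical names, and the host-major layout (host `h` owns ranks `h * num_per_host` up to but excluding `(h + 1) * num_per_host`) is an assumption, since the summary only says the ids are re-derived host-side via `proc_name`.

```rust
use std::ops::Range;

/// Proc ranks owned by one host.
/// ASSUMPTION: host-major layout, so host `h` owns
/// `h * num_per_host .. (h + 1) * num_per_host`.
fn slot_ranks(host_rank: usize, num_per_host: usize) -> Range<usize> {
    let start = host_rank * num_per_host;
    start..start + num_per_host
}

/// One batched reply per host: `(absolute proc rank, state)` for every local
/// slot. `lookup` is a hypothetical stand-in for the per-proc state builder.
fn host_reply<S>(
    host_rank: usize,
    num_per_host: usize,
    lookup: impl Fn(usize) -> S,
) -> Vec<(usize, S)> {
    slot_ranks(host_rank, num_per_host)
        .map(|rank| (rank, lookup(rank)))
        .collect()
}
```

Under these assumptions, a mesh of 3 hosts with 4 procs each produces 3 batches of 4 entries rather than 12 separate replies, which is the O(hosts) versus O(procs) difference the summary describes.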

Differential Revision: D108359901

meta-cla Bot added the CLA Signed label Jun 12, 2026
meta-codesync Bot (Contributor) commented Jun 12, 2026

@thomasywang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108359901.

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request Jun 12, 2026
…torch#4238)

Summary:
Pull Request resolved: meta-pytorch#4238

Differential Revision: D108359901
meta-codesync Bot changed the title from "Cast proc_states via GetHostProcStates through the host tree" to "Cast proc_states via GetHostProcStates through the host tree (#4238)" Jun 12, 2026
thomasywang and others added 16 commits June 12, 2026 14:25
Summary: We want `rank` to be part of the messaging interface for messages that require it, rather than inferring it from message headers, so `CastActor` will need to write the rank.

Differential Revision: D108306614
Summary: Replace `ActorMeshRef` storage of `ProcMeshRef` with `Region` plus a materialized `hyperactor_cast::CastDomainRef`, route `All` casts through that cast domain, and preserve `Choose` by materializing a singleton slice before recursing into the `All` path. `ProcMeshRef` now materializes a `CastDomainRef` for each spawned actor mesh after spawn succeeds; `ProcAgent` control-plane messages stay on a direct per-rank helper for now so `resource::Rank` rebinding remains correct until generic rank rebinding exists in `hyperactor_cast`.
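The `Choose` handling described above can be pictured with a small sketch: pick one member, materialize a one-element slice, and reuse the `All` path on it. Everything here is hypothetical (the `Selection` enum, `cast_targets`, and the `pick` parameter standing in for whatever selection policy is actually used); it illustrates only the control flow, not the real `ActorMeshRef` or `CastDomainRef` API.

```rust
/// Hypothetical selection modes, named after the `All` and `Choose` casts.
enum Selection {
    All,
    Choose,
}

/// Returns the ranks a cast would reach.
/// `pick` is an assumed stand-in for the policy that selects one member.
fn cast_targets(ranks: &[usize], sel: Selection, pick: usize) -> Vec<usize> {
    match sel {
        // Full-mesh path: every rank in the (possibly sliced) region.
        Selection::All => ranks.to_vec(),
        Selection::Choose => {
            if ranks.is_empty() {
                return Vec::new();
            }
            let i = pick % ranks.len();
            // Materialize a singleton slice, then recurse into the `All` path.
            cast_targets(&ranks[i..i + 1], Selection::All, pick)
        }
    }
}
```

The point of the shape is that `Choose` needs no delivery logic of its own; it only narrows the region before the shared path runs.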

Differential Revision: D105983258
Summary:
Pull Request resolved: meta-pytorch#4097

Add `ActorMesh` e2e coverage for slice cast behavior: casts reach only slice members, slice refs keep eagerly materialized cast-domain ids across clones and nested slices, and delivered `CAST_POINT` values are slice-local.

Differential Revision: D105983253
Summary: Remove the legacy `CommActor` bootstrap from `ProcMesh::create` now that spawned actor meshes materialize their own `CastDomainRef` for full-mesh casts. `ProcMeshRef` no longer carries temporary root-region or root-comm-actor fields, and slicing only remaps the `ProcRef` ranks it owns.

Differential Revision: D105983252
Summary: Spawn and bind the well-known `CastActor` on host service procs alongside `HostAgent` in bootstrap, local host mesh creation, and global local-host initialization. This prepares HostAgent cast domains without changing HostMesh control-plane delivery yet.

Differential Revision: D105983255
Summary: Store a materialized `ActorMeshRef<HostAgent>` on `HostMeshRef` so host meshes have the same descriptor shape as actor meshes without yet changing HostMesh control-plane send paths. Materialization now happens at HostMeshRef construction/slicing, and HostMesh identity ignores the derived cast-domain state.

Differential Revision: D105983250
Differential Revision: D107589984
Summary:
Pull Request resolved: meta-pytorch#4101

Use the materialized HostAgent cast domain for `DrainHost` and `ShutdownHost` barriers. These messages now implement `Bind`/`Unbind` and use `OncePortRef<()>` replies so they can be carried through `CastDomainRef::cast` and acknowledged through the unit accumulator.

Differential Revision: D105983256
Summary: Send `SetClientConfig` through the HostAgent cast domain and wait on a unit-reducer acknowledgement barrier. The attach path still returns `ConfigPushError` on cast, timeout, or receiver-close failure; collective failures conservatively report every host whose acknowledgement was not observed.
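The conservative failure reporting above amounts to a set difference: any host whose acknowledgement was not observed is reported, whether or not it actually applied the config. A minimal sketch, with a hypothetical `unacknowledged` helper and hosts identified by plain strings (the real error type and host identifiers are not shown in this summary):

```rust
use std::collections::BTreeSet;

/// Hosts to report on a collective failure: every expected host whose
/// acknowledgement was not observed, in the order they were expected.
/// Hypothetical helper; host names are plain strings for illustration.
fn unacknowledged(expected: &[&str], acked: &[&str]) -> Vec<String> {
    let seen: BTreeSet<&str> = acked.iter().copied().collect();
    expected
        .iter()
        .copied()
        .filter(|host| !seen.contains(host))
        .map(String::from)
        .collect()
}
```

This can over-report (a host may have applied the config but its acknowledgement was lost), which is the trade-off the word "conservatively" signals.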

Differential Revision: D105983254
Summary: Reuse the HostAgent cast domain for best-effort `HostMeshShutdownGuard` drop cleanup instead of iterating direct `HostRef::shutdown` sends. This removes the direct shutdown helper while keeping explicit `HostMesh::shutdown` as the preferred deterministic teardown path.

Differential Revision: D105983251
Summary:
Pull Request resolved: meta-pytorch#4104

Replace the `SpawnProcs` shape with casted `CreateOrUpdate<ProcSpec>` messages. `HostMeshRef::spawn` now sends one cast per per-host proc slot, and each `HostAgent` derives its concrete proc name and absolute proc rank from `CAST_POINT` plus `ProcMeshSpawnContext`, while preserving the direct `CreateOrUpdate<ProcSpec>` path for already-concrete updates.

Differential Revision: D106402900
Differential Revision: D106402902
Summary: Empirically, we found that run-length encoding (RLE) the seqs makes a significant difference at scale.
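For readers unfamiliar with the technique, a generic run-length encoder is sketched below. The summary does not say what the seqs contain or how the commit encodes them, so this assumes the simplest case (runs of equal `u64` values collapsed into `(value, count)` pairs); `rle_encode` and `rle_decode` are hypothetical names, not the commit's functions.

```rust
/// Collapse runs of equal values into `(value, run_length)` pairs.
/// ASSUMPTION: seqs are `u64` values with long runs of repeats.
fn rle_encode(values: &[u64]) -> Vec<(u64, usize)> {
    let mut runs: Vec<(u64, usize)> = Vec::new();
    for &v in values {
        if let Some((last, count)) = runs.last_mut() {
            if *last == v {
                // Same value as the current run: extend it.
                *count += 1;
                continue;
            }
        }
        // New value: start a new run of length 1.
        runs.push((v, 1));
    }
    runs
}

/// Inverse of `rle_encode`: expand each run back into repeated values.
fn rle_decode(runs: &[(u64, usize)]) -> Vec<u64> {
    runs.iter()
        .flat_map(|&(v, n)| std::iter::repeat(v).take(n))
        .collect()
}
```

The encoded size grows with the number of runs rather than the number of values, which is why the saving becomes noticeable at scale when most entries repeat.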

Differential Revision: D108359903
Differential Revision: D108359902
Summary: Relocate the supervision poll's per-proc state query from `HostMeshRef::proc_states` to `ProcMeshRef::proc_states`, collapsing the thin `ProcMeshRef` wrapper that just delegated to it. `proc_states` used no `HostMeshRef` state (it dials each proc's host agent directly via `host_agent_ref`), so this is a pure, behavior-preserving move: same per-proc `GetState`/`KeepaliveGetState` direct posts, same recv-timeout padding with `Timeout` states, same rank-ordered `ValueMesh`, and it still returns `None` when the proc mesh has no backing host mesh. Also drops the now-unused `host_agent::ProcState` import from `host_mesh.rs`. Prep for casting this query through the host-agent cast tree in a follow-up.
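The "recv-timeout padding" behavior can be sketched as follows: start with every rank marked as timed out, then overwrite the entries for which a reply arrived, so the result stays rank-ordered regardless of arrival order. The `Polled` enum and `pad_with_timeouts` are hypothetical, and states are plain strings here; the real code builds a `ValueMesh` of proc states.

```rust
/// Hypothetical per-rank result: either a reported state or a timeout marker.
#[derive(Debug, Clone, PartialEq)]
enum Polled {
    State(String),
    Timeout,
}

/// Build a rank-ordered result of length `num_procs`, padding every rank
/// that did not reply before the receive timeout with `Polled::Timeout`.
fn pad_with_timeouts(num_procs: usize, replies: Vec<(usize, String)>) -> Vec<Polled> {
    let mut out = vec![Polled::Timeout; num_procs];
    for (rank, state) in replies {
        // Ignore out-of-range ranks rather than panicking (an assumption).
        if let Some(slot) = out.get_mut(rank) {
            *slot = Polled::State(state);
        }
    }
    out
}
```

With 3 procs and replies from ranks 2 and 0 only, the result is state, timeout, state in rank order, so callers can index by rank without checking which replies arrived.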

Differential Revision: D108359910
…torch#4238)

Summary:
Pull Request resolved: meta-pytorch#4238

Differential Revision: D108359901
Labels

CLA Signed (this label is managed by the Meta Open Source bot), meta-exported
