Move proc_states from HostMeshRef to ProcMeshRef (#4235)
Open
thomasywang wants to merge 15 commits into
@thomasywang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108359910.
thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request on Jun 12, 2026
Summary: Pull Request resolved: meta-pytorch#4235 Relocate the supervision poll's per-proc state query from `HostMeshRef::proc_states` to `ProcMeshRef::proc_states`, collapsing the thin `ProcMeshRef` wrapper that just delegated to it. `proc_states` used no `HostMeshRef` state (it dials each proc's host agent directly via `host_agent_ref`), so this is a pure, behavior-preserving move: same per-proc `GetState`/`KeepaliveGetState` direct posts, same recv-timeout padding with `Timeout` states, same rank-ordered `ValueMesh`, and it still returns `None` when the proc mesh has no backing host mesh. Also drops the now-unused `host_agent::ProcState` import from `host_mesh.rs`. Prep for casting this query through the host-agent cast tree in a follow-up. Differential Revision: D108359910
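The three behaviors the summary says are preserved (timeout padding, rank ordering, and `None` without a backing host mesh) can be illustrated with a minimal sketch. Everything here is hypothetical: `ProcStateSketch`, `collect_states`, and the `has_host_mesh` flag are stand-ins invented for illustration and are not the real `host_agent::ProcState`, `ValueMesh`, or `ProcMeshRef::proc_states` signatures.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a per-proc state (not the real `host_agent::ProcState`).
#[derive(Debug, Clone, PartialEq)]
enum ProcStateSketch {
    Running,
    Stopped,
    /// Placeholder recorded when no reply arrived before the receive timeout.
    Timeout,
}

/// Builds a rank-ordered list of states for `num_ranks` procs.
///
/// `replies` maps rank -> state for the procs that answered in time.
/// Ranks with no reply are padded with `Timeout`.
/// Returns `None` when there is no backing host mesh (assumed flag).
fn collect_states(
    has_host_mesh: bool,
    num_ranks: usize,
    replies: &HashMap<usize, ProcStateSketch>,
) -> Option<Vec<ProcStateSketch>> {
    if !has_host_mesh {
        return None;
    }
    Some(
        (0..num_ranks)
            .map(|rank| {
                replies
                    .get(&rank)
                    .cloned()
                    .unwrap_or(ProcStateSketch::Timeout)
            })
            .collect(),
    )
}
```

Because the output is indexed by rank rather than by arrival order, callers can line results up with the mesh shape even when some replies are missing.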
a39b16f to 1b6a680
1b6a680 to 08ea718
Summary: We want `rank` to be part of the messaging interface for messages that require it, rather than inferring it from message headers, so `CastActor` will need to write the rank. Differential Revision: D108306614
Summary: Replace `ActorMeshRef` storage of `ProcMeshRef` with `Region` plus a materialized `hyperactor_cast::CastDomainRef`, route `All` casts through that cast domain, and preserve `Choose` by materializing a singleton slice before recursing into the `All` path. `ProcMeshRef` now materializes a `CastDomainRef` for each spawned actor mesh after spawn succeeds; `ProcAgent` control-plane messages stay on a direct per-rank helper for now so `resource::Rank` rebinding remains correct until generic rank rebinding exists in `hyperactor_cast`. Differential Revision: D105983258
Summary: Pull Request resolved: meta-pytorch#4097 Add ActorMesh e2e coverage for slice cast behavior: casts reach only slice members, slice refs keep eagerly materialized cast-domain ids across clones and nested slices, and delivered `CAST_POINT` values are slice-local. Differential Revision: D105983253
Summary: Remove the legacy `CommActor` bootstrap from `ProcMesh::create` now that spawned actor meshes materialize their own `CastDomainRef` for full-mesh casts. `ProcMeshRef` no longer carries temporary root-region or root-comm-actor fields, and slicing only remaps the `ProcRef` ranks it owns. Differential Revision: D105983252
Summary: Spawn and bind the well-known `CastActor` on host service procs alongside `HostAgent` in bootstrap, local host mesh creation, and global local-host initialization. This prepares HostAgent cast domains without changing HostMesh control-plane delivery yet. Differential Revision: D105983255
Summary: Store a materialized `ActorMeshRef<HostAgent>` on `HostMeshRef` so host meshes have the same descriptor shape as actor meshes without yet changing HostMesh control-plane send paths. Materialization now happens at HostMeshRef construction/slicing, and HostMesh identity ignores the derived cast-domain state. Differential Revision: D105983250
Differential Revision: D107589984
Summary: Pull Request resolved: meta-pytorch#4101 Use the materialized HostAgent cast domain for `DrainHost` and `ShutdownHost` barriers. These messages now implement `Bind`/`Unbind` and use `OncePortRef<()>` replies so they can be carried through `CastDomainRef::cast` and acknowledged through the unit accumulator. Differential Revision: D105983256
Summary: Send `SetClientConfig` through the HostAgent cast domain and wait on a unit-reducer acknowledgement barrier. The attach path still returns `ConfigPushError` on cast, timeout, or receiver-close failure; collective failures conservatively report every host whose acknowledgement was not observed. Differential Revision: D105983254
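The conservative failure reporting described above can be sketched as a set difference: any host without an observed acknowledgement is reported as failed, whether or not it actually applied the config. This is only an illustration under stated assumptions; `unacknowledged_hosts` and the string host names are hypothetical and do not reflect how `ConfigPushError` is actually built.

```rust
use std::collections::HashSet;

/// Returns every host whose acknowledgement was not observed,
/// preserving the order of `all_hosts`.
///
/// Hypothetical helper: host identifiers are plain strings here purely
/// for illustration.
fn unacknowledged_hosts(all_hosts: &[&str], acked: &[&str]) -> Vec<String> {
    let acked: HashSet<&str> = acked.iter().copied().collect();
    all_hosts
        .iter()
        .copied()
        .filter(|host| !acked.contains(host))
        .map(String::from)
        .collect()
}
```

Under this assumption, a collective failure with no observed acknowledgements reports the full host list, which over-reports rather than risking a missed failure.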
Summary: Reuse the HostAgent cast domain for best-effort `HostMeshShutdownGuard` drop cleanup instead of iterating direct `HostRef::shutdown` sends. This removes the direct shutdown helper while keeping explicit `HostMesh::shutdown` as the preferred deterministic teardown path. Differential Revision: D105983251
Summary: Pull Request resolved: meta-pytorch#4104 Replace the `SpawnProcs` shape with casted `CreateOrUpdate<ProcSpec>` messages. `HostMeshRef::spawn` now sends one cast per per-host proc slot, and each `HostAgent` derives its concrete proc name and absolute proc rank from `CAST_POINT` plus `ProcMeshSpawnContext`, while preserving the direct `CreateOrUpdate<ProcSpec>` path for already-concrete updates. Differential Revision: D106402900
Differential Revision: D106402902
Summary: Empirically, we found that run-length encoding (RLE) the seqs makes a significant difference at scale. Differential Revision: D108359903
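For readers unfamiliar with run-length encoding, here is a minimal sketch of the general technique. It assumes the seqs are `u64` values in which equal values tend to appear in consecutive runs; the commit does not show the actual representation, so `rle_encode`, `rle_decode`, and the `(value, count)` layout are illustrative assumptions rather than the encoding used in this stack.

```rust
/// Encodes `seqs` as (value, run_length) pairs.
/// Assumed layout for illustration only.
fn rle_encode(seqs: &[u64]) -> Vec<(u64, usize)> {
    let mut runs: Vec<(u64, usize)> = Vec::new();
    for &seq in seqs {
        // Extend the current run if the value repeats.
        if let Some(last) = runs.last_mut() {
            if last.0 == seq {
                last.1 += 1;
                continue;
            }
        }
        // Otherwise start a new run of length 1.
        runs.push((seq, 1));
    }
    runs
}

/// Expands (value, run_length) pairs back into the original sequence.
fn rle_decode(runs: &[(u64, usize)]) -> Vec<u64> {
    runs.iter()
        .flat_map(|&(value, count)| std::iter::repeat(value).take(count))
        .collect()
}
```

The encoded size grows with the number of runs rather than the number of entries, which is why the saving is most noticeable when many entries share the same value.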
Differential Revision: D108359902
Summary: Pull Request resolved: meta-pytorch#4235 `HostMeshRef::proc_states` did not read `self` at all, so there was no reason for `ProcMeshRef::proc_states` to delegate to it instead of owning the logic itself. This diff is just a move of `HostMeshRef::proc_states` into `ProcMeshRef::proc_states`. Differential Revision: D108359910
08ea718 to 3eb4437
Summary:
`HostMeshRef::proc_states` did not read `self` at all, so there was no reason why `ProcMeshRef::proc_states` had to delegate to it instead of just owning the logic itself.
This diff is just a move of `HostMeshRef::proc_states` into `ProcMeshRef::proc_states`.
Differential Revision: D108359910