[Proposal] TCOD — extending slime's On-Policy Distillation to multi-turn agents

### Your Question

I'd like to bring **TCOD** (a published method, arXiv:2604.24005) to slime: on-policy distillation for **multi-turn agents** with a temporal curriculum — essentially the multi-turn extension of your existing single-turn OPD example (`examples/on_policy_distillation`). It is not yet implemented in any RL framework, so I'm asking where it best fits before writing any slime-specific code, rather than showing up with an unsolicited recipe PR. Two concrete questions:

1. **Does slime support multi-turn agent rollouts** (LLM → env step → repeat) that TCOD's curriculum would hook into? Your current OPD example is single-turn (math), so if multi-turn agent rollout isn't a supported path yet, TCOD probably belongs in a standalone repo instead of the core examples.
2. **If multi-turn rollout is supported**, would you accept TCOD as an example sibling to `examples/on_policy_distillation` (lightweight, CI/run-verifiable)? **If not**, would you be open to linking a standalone TCOD repo from the slime README — the path CONTRIBUTING points to for algorithm-style projects?

Either way I'm not proposing large refactors or new abstractions into the core.
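
To make "multi-turn agent rollout" concrete, here is a minimal sketch of the loop shape in question (LLM → env step → repeat). Every name in it (`rollout`, `Turn`, `Trajectory`, `policy`, `env_reset`, `env_step`, `max_turns`) is hypothetical and for illustration only; none of it is slime's API. The `max_turns` cap is the kind of hook a depth curriculum would need.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Turn:
    observation: str  # what the environment showed the model
    action: str       # what the model generated in response


@dataclass
class Trajectory:
    turns: List[Turn] = field(default_factory=list)
    done: bool = False  # True only if the environment signalled termination


def rollout(
    policy: Callable[[str], str],                  # hypothetical: observation -> action text
    env_reset: Callable[[], str],                  # hypothetical: returns the first observation
    env_step: Callable[[str], Tuple[str, bool]],   # hypothetical: action -> (next observation, done)
    max_turns: int,
) -> Trajectory:
    """Run at most `max_turns` LLM -> env interactions and record each turn."""
    traj = Trajectory()
    obs = env_reset()
    for _ in range(max_turns):
        action = policy(obs)
        traj.turns.append(Turn(obs, action))
        obs, done = env_step(action)
        if done:
            traj.done = True
            break
    return traj
```

If slime already exposes an equivalent loop with a configurable turn limit, that would answer question 1.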

### What I've Tried

- Read CONTRIBUTING and checked TCOD against your scope, which is why I'm opening a question first instead of a PR.
- Studied `examples/on_policy_distillation`: a Qwen3-8B student imitates a Qwen3-32B teacher by matching token-level log-probs as a KL penalty on top of the advantage estimator (Math500 76% → 94%), single-turn, with sglang / megatron teacher modes. TCOD reuses this same on-policy KL-to-teacher objective.
- TCOD itself (in the paper): applying single-turn OPD to multi-turn agents is unstable. Errors compound across turns, per-turn KL grows with turn index, trajectory KL escalates, and success rate collapses. TCOD keeps the standard KL-to-teacher objective but grows the trajectory depth `k` exposed to the student from short to long (`k = min(k_start + floor(n/η), k_max)`), in two variants:
  - **F2B**: the student rolls out only the first `k` steps (drop-in).
  - **B2F**: the teacher replays the first `L-k` steps to seed the student, then the student takes the remaining `k`.

  Reported gains: up to ~+15 SR (success rate) over vanilla OPD, stable KL, and ~32% less training time.
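
The schedule and the two variants above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: it assumes `n` is a training-iteration counter, `η` (`eta`) is a positive integer interval, and `L` is the length in turns of the trajectory being split. The function names are hypothetical.

```python
from typing import Tuple


def curriculum_depth(n: int, k_start: int, eta: int, k_max: int) -> int:
    """k = min(k_start + floor(n / eta), k_max); assumes n >= 0 and eta > 0."""
    return min(k_start + n // eta, k_max)


def f2b_student_turns(k: int, L: int) -> range:
    """F2B: the student rolls out only the first k turns (capped at L)."""
    return range(0, min(k, L))


def b2f_split(k: int, L: int) -> Tuple[range, range]:
    """B2F: the teacher replays the first L-k turns, the student takes the last k.

    Assumption: when k exceeds L it is clamped to L, so the student
    simply takes the whole trajectory.
    """
    k = min(k, L)
    teacher_prefix = range(0, L - k)
    student_suffix = range(L - k, L)
    return teacher_prefix, student_suffix
```

For example, with `k_start=2`, `eta=10`, `k_max=8`, the depth at iteration 25 is `min(2 + 2, 8) = 4`; under B2F with `L=10` the teacher would replay turns 0 to 5 and the student would take turns 6 to 9.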


### Environment (if relevant)

_No response_

### Additional Context

- Author: @kokolerk (author of the TCOD paper)
- TCOD paper: arXiv:2604.24005
- slime OPD example: https://github.com/THUDM/slime/tree/main/examples/on_policy_distillation
- slime CONTRIBUTING: https://github.com/THUDM/slime/blob/main/CONTRIBUTING.md
- Standalone TCOD repo: https://github.com/kokolerk/TCOD

### Pre-submission Checklist

- [x] I have read the [CONTRIBUTING.md](https://github.com/THUDM/slime/blob/main/CONTRIBUTING.md) and understand the collaboration scope.
- [x] I have read the [documentation](https://thudm.github.io/slime/) and [FAQ](https://thudm.github.io/slime/en/get_started/qa.html) and my question is not answered there.
- [x] I have searched for [existing issues](https://github.com/THUDM/slime/issues) and my question has not been asked before.

[Proposal] TCOD — extending slime's On-Policy Distillation to multi-turn agents #2002
