[Proposal] TCOD — extending slime's On-Policy Distillation to multi-turn agents #2002

@kokolerk

Your Question

I'd like to bring TCOD (a published method, arXiv:2604.24005) to slime: on-policy distillation for multi-turn agents with a temporal curriculum — essentially the multi-turn extension of your existing single-turn OPD example (examples/on_policy_distillation). It is not yet implemented in any RL framework, so I'm asking where it best fits before writing any slime-specific code, rather than showing up with an unsolicited recipe PR. Two concrete questions:

  1. Does slime support multi-turn agent rollouts (LLM → env step → repeat) that TCOD's curriculum would hook into? Your current OPD example is single-turn (math), so if multi-turn agent rollout isn't a supported path yet, TCOD probably belongs in a standalone repo instead of the core examples.
  2. If multi-turn rollout is supported, would you accept TCOD as an example sibling to examples/on_policy_distillation (lightweight, CI/run-verifiable)? If not, would you be open to linking a standalone TCOD repo from the slime README — the path CONTRIBUTING points to for algorithm-style projects?

Either way I'm not proposing large refactors or new abstractions into the core.

What I've Tried

  • Read CONTRIBUTING and checked TCOD against your scope, which is why I'm opening a question first instead of a PR.
  • Studied examples/on_policy_distillation: a Qwen3-8B student imitates a Qwen3-32B teacher by matching token-level log-probs as a KL penalty on top of the advantage estimator (Math500 76% → 94%), single-turn, with sglang / megatron teacher modes. TCOD reuses this same on-policy KL-to-teacher objective.
  • TCOD itself (in the paper): single-turn OPD applied to multi-turn agents is unstable. Errors compound across turns, per-turn KL grows with turn index, trajectory KL escalates, and success rate (SR) collapses. TCOD keeps the standard KL-to-teacher objective but grows the trajectory depth k exposed to the student from short to long, with k = min(k_start + floor(n/η), k_max), in two variants: F2B (the student rolls out only the first k steps; drop-in) and B2F (the teacher replays the first L-k steps to seed the student, then the student takes the remaining k). Reported gains: up to ~15 SR points over vanilla OPD, stable KL, and ~32% less training time.
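
For concreteness, the curriculum schedule and the F2B / B2F split described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the paper or from slime: the function names `curriculum_depth` and `split_turns` are hypothetical, n is assumed to be a training-step counter, η the number of steps between depth increases, and L the episode length in turns.

```python
from typing import Tuple


def curriculum_depth(n: int, k_start: int, eta: int, k_max: int) -> int:
    """Depth k at step n, following k = min(k_start + floor(n / eta), k_max).

    Assumption: n counts training steps and eta is a positive integer.
    """
    if eta <= 0:
        raise ValueError("eta must be a positive integer")
    return min(k_start + n // eta, k_max)


def split_turns(variant: str, k: int, L: int) -> Tuple[range, range]:
    """Return (teacher_turns, student_turns) as 0-based turn index ranges.

    F2B: the student acts on the first k turns and the episode stops there.
    B2F: the teacher replays the first L - k turns, the student takes the last k.
    """
    k = min(k, L)  # the exposed depth cannot exceed the episode length
    if variant == "F2B":
        return range(0), range(k)
    if variant == "B2F":
        return range(L - k), range(L - k, L)
    raise ValueError(f"unknown variant: {variant!r}")


# Example with illustrative (not paper-reported) hyperparameters.
k = curriculum_depth(n=25, k_start=2, eta=10, k_max=8)  # 2 + 25 // 10 = 4
teacher_turns, student_turns = split_turns("B2F", k, L=10)
print(k, list(teacher_turns), list(student_turns))
```

With these illustrative values the student is exposed to depth 4, so under B2F the teacher would replay turns 0 to 5 and the student would act on turns 6 to 9. The KL-to-teacher loss itself is unchanged; only which turns the student generates depends on k.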

Environment (if relevant)

No response
