Problem
Current Prime-RL recipes are single-agent by convention: one rollout client/model/sampling triple is threaded through the orchestrator, and multi-agent support has to treat that as the implicit learner target while adding per-member fixed targets beside it. That works, but the model is asymmetric: single-agent recipes are not represented as the same generation-plan object used by multi-agent recipes.
This is acceptable for a first multi-agent cut, but it is a smell before upstreaming because it makes the default path invisible and forces bridge code to know which fields are legacy fallbacks.
Proposed transition
- Introduce a runtime-internal GenerationPlan as the canonical compiled object for all rollouts.
- Compile existing single-agent configs into a one-target plan at the ingestion boundary. Do not require old TOMLs to change in the first pass.
- Compile multi-agent configs into the same plan shape, keyed by Verifiers member_id.
- Keep user-facing config minimal: existing single-agent TOMLs stay as-is; multi-agent TOMLs only add role policy (train_one) and named fixed targets.
- Once this is stable, move code that reads legacy client/model/sampling fields behind the compiler and make scheduler/env code consume only the compiled plan.
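A minimal sketch of the compile step described above, to make the "same plan shape" idea concrete. Everything here is an assumption for illustration, not Prime-RL's actual schema: the names GenerationTarget, compile_plan, and DEFAULT_MEMBER_ID, the multi_agent / learner / fixed_targets config keys, and the choice to let fixed targets inherit the legacy client and sampling settings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical key under which a single-agent recipe's only target is stored.
DEFAULT_MEMBER_ID = "default"


@dataclass(frozen=True)
class GenerationTarget:
    """One client/model/sampling triple (illustrative shape only)."""
    client: Mapping[str, Any]
    model: str
    sampling: Mapping[str, Any]
    trainable: bool = False


@dataclass(frozen=True)
class GenerationPlan:
    """Compiled plan: each member_id maps to exactly one target."""
    targets: Mapping[str, GenerationTarget]

    def target_for(self, member_id: str) -> GenerationTarget:
        return self.targets[member_id]

    @property
    def learner_ids(self) -> list[str]:
        return [m for m, t in self.targets.items() if t.trainable]


def compile_plan(config: Mapping[str, Any]) -> GenerationPlan:
    """Compile a legacy-style config dict into a GenerationPlan.

    The top-level client/model/sampling fields are treated as the learner
    target. A config without a (hypothetical) "multi_agent" block becomes
    a one-target plan, so old configs need no changes.
    """
    learner = GenerationTarget(
        client=config["client"],
        model=config["model"],
        sampling=config["sampling"],
        trainable=True,
    )
    multi = config.get("multi_agent")
    if multi is None:
        return GenerationPlan(targets={DEFAULT_MEMBER_ID: learner})

    # Assumed layout: the train_one policy names one learner member;
    # every other member is a named fixed target.
    targets = {multi["learner"]: learner}
    for member_id, fixed in multi.get("fixed_targets", {}).items():
        if member_id in targets:
            raise ValueError(f"duplicate member_id: {member_id}")
        targets[member_id] = GenerationTarget(
            client=fixed.get("client", config["client"]),
            model=fixed["model"],
            sampling=fixed.get("sampling", config["sampling"]),
        )
    return GenerationPlan(targets=targets)
```

With a shape like this, scheduler and env code would only ever call plan.target_for(member_id), and the knowledge of which fields are legacy fallbacks stays inside compile_plan.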
Non-goals
- Do not add endpoint routing to Verifiers. Verifiers should carry member_id; Prime-RL owns deployment topology.
- Do not require all existing recipes to grow multi-agent-style config blocks.
- Do not introduce a separate transport abstraction when the existing client config plus model/sampling target is sufficient.
Acceptance criteria
- Existing recipes run unchanged.
- Multi-agent train-one/fixed-opponent/fixed-judge recipes compile to the same internal plan type.
- Scheduler and env-server tests cover the compiled plan path end to end.
- Runtime modules depend on cohesive config modules, not the monolithic orchestrator schema.