WIP: Multi-agent support: Agent, Protocol, MultiAgentEnv#965

Draft
nph4rd wants to merge 21 commits into PrimeIntellect-ai:main from nph4rd:multiagent-no-opponent-conditioning

nph4rd commented Feb 27, 2026

Description

Adds foundational abstractions for multi-agent turn-based environments with support for heterogeneous (per-agent) rewards:

Core abstractions

  • Agent: Dataclass representing a participant (id, system_prompt, is_trainable)
  • Protocol: ABC defining turn order (get_initial_agent, get_next_agent)
  • RoundRobinProtocol: Concrete implementation for sequential turns
  • MultiAgentEnv: Base class extending MultiTurnEnv with multi-agent support
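
A minimal sketch of how these pieces could fit together. The field and method names come from the list above; the method signatures, the default values, and the `current_agent` state key are assumptions made for illustration and may not match the code in this PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Agent:
    """A participant in the environment (fields as listed in the description)."""
    id: str
    system_prompt: str = ""      # default is an assumption
    is_trainable: bool = True    # default is an assumption


class Protocol(ABC):
    """Turn-order policy. Taking a `state` dict is an assumed signature."""

    @abstractmethod
    def get_initial_agent(self, state: dict) -> str: ...

    @abstractmethod
    def get_next_agent(self, state: dict) -> str: ...


class RoundRobinProtocol(Protocol):
    """Cycles through agents in a fixed order."""

    def __init__(self, agent_ids: list[str]):
        if not agent_ids:
            raise ValueError("at least one agent id is required")
        self.agent_ids = list(agent_ids)

    def get_initial_agent(self, state: dict) -> str:
        return self.agent_ids[0]

    def get_next_agent(self, state: dict) -> str:
        # "current_agent" is a hypothetical state key used only in this sketch.
        idx = self.agent_ids.index(state["current_agent"])
        return self.agent_ids[(idx + 1) % len(self.agent_ids)]
```

Keeping turn order in its own object means the same `RoundRobinProtocol` can be reused across environments that differ only in prompts and rewards.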

Heterogeneous rewards:

  • MultiAgentRewardFunc: New reward function type returning dict[str, float] (agent_id → reward)
  • Per-agent reward aggregation in Rubric.score_group()
  • Per-agent advantages stored in trajectory steps
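
For illustration, a per-agent reward function and one possible way to turn a group of per-agent rewards into per-agent advantages. The mean-baseline formula, the helper name `per_agent_advantages`, and the `winner` / `agent_ids` state keys are assumptions; the description does not specify how `Rubric.score_group()` computes advantages.

```python
from collections import defaultdict
from statistics import mean


def winner_reward(state: dict) -> dict[str, float]:
    """Multi-agent reward: maps agent_id -> reward (state keys are hypothetical)."""
    return {
        agent_id: 1.0 if agent_id == state["winner"] else 0.0
        for agent_id in state["agent_ids"]
    }


def per_agent_advantages(
    group_rewards: list[dict[str, float]],
) -> list[dict[str, float]]:
    """Subtract each agent's group-mean reward from its reward in every rollout.

    This mean baseline is an assumed formula, chosen only to show the shape of
    the computation.
    """
    by_agent: dict[str, list[float]] = defaultdict(list)
    for rewards in group_rewards:
        for agent_id, reward in rewards.items():
            by_agent[agent_id].append(reward)
    baselines = {agent_id: mean(vals) for agent_id, vals in by_agent.items()}
    return [
        {agent_id: reward - baselines[agent_id] for agent_id, reward in rewards.items()}
        for rewards in group_rewards
    ]
```

Each agent is compared only against its own rewards across the group, so a zero-sum game does not collapse every advantage to zero.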

Additional fix: get_prompt_ids() now correctly handles multi-agent tokenization by finding the previous turn for the current agent instead of using the global last trajectory step. This prevents empty message errors when agents interleave turns during training.
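
The lookup described above can be sketched roughly as follows. The helper name `find_previous_step` and the `agent_id` key on trajectory steps are assumptions for illustration, not the actual `get_prompt_ids()` code.

```python
from typing import Optional


def find_previous_step(trajectory: list[dict], agent_id: str) -> Optional[dict]:
    """Return the most recent step taken by `agent_id`, or None if it has not acted.

    Walking backwards and filtering by agent avoids picking up another agent's
    step, which is what happens when the global last step is used.
    """
    for step in reversed(trajectory):
        if step.get("agent_id") == agent_id:
            return step
    return None
```

With interleaved turns such as `a, b`, the global last step belongs to `b`, while the previous turn for `a` is the first step.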

Design goals:

  • Protocol is required to encourage reusable turn-order logic
  • Minimal but extensible: simultaneous moves, state splitting, a harness, etc. can be added later

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

Note

Add multi-agent support with Agent, Protocol, and MultiAgentEnv abstractions

  • Introduces MultiAgentEnv extending MultiTurnEnv with protocol-driven turn-taking, per-agent prompt building, and per-step agent metadata tagging during rollout.
  • Adds Agent dataclass (id, system_prompt, is_trainable) and a Protocol interface with a RoundRobinProtocol implementation for turn ordering.
  • Extends Rubric to support reward functions returning dict[str, float] per-agent rewards, aggregating them into state['agent_rewards'] and computing per-agent advantages.
  • Propagates an actor_models dict through the full stack: Environment, EnvGroup, EnvClient, EnvWorker, and serve request types, allowing per-agent model routing.
  • Behavioral Change: score_rollout and score_group now check for multi-agent reward functions before group functions; non-multi-agent paths are unchanged.

Macroscope summarized f430b7f. (Automatic summaries will resume when PR exits draft mode or review begins).

nph4rd force-pushed the multiagent-no-opponent-conditioning branch from 5a0cd7d to f430b7f May 20, 2026 20:16
nph4rd added 2 commits May 20, 2026 20:08
Extends Protocol with should_spawn/get_spawn_specs. MultiAgentEnv.rollout()
runs each spawned child env concurrently, scores it via its own rubric, and
embeds the child trajectory steps into the parent's trajectory tagged with
the spec's agent_id and is_trainable. Existing per-agent advantage
computation (Rubric.score_group), trajectory splitting
(interleave_rollout(split_by_agent=True)), and per-actor trainer metrics
(MicroBatch.actor_ids) work without further changes.

The use case is proposer-solver style envs where one agent's turn fans out
into N rollouts of another agent — e.g. PrimeIntellect's general-agent
synth-solver loop where the synthesizer creates tasks and the solver
attempts them, with the synthesizer's reward depending on the solver's
pass rate.
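
A rough sketch of the fan-out described above: child rollouts run
concurrently and their steps are embedded into the parent trajectory with
the spec's tags. `SpawnSpec`, `run_child`, and `spawn_and_embed` are
hypothetical names, and the child rollout is a stub; only the agent_id and
is_trainable tagging comes from the commit message.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SpawnSpec:
    """Hypothetical spec: which agent runs the child env and on what task."""
    agent_id: str
    is_trainable: bool
    task: str


async def run_child(spec: SpawnSpec) -> list[dict]:
    """Stub child rollout; a real one would run and score a child env."""
    await asyncio.sleep(0)  # stands in for model calls
    return [{"content": f"attempt at {spec.task}", "reward": 1.0}]


async def spawn_and_embed(parent: list[dict], specs: list[SpawnSpec]) -> list[dict]:
    # Run all children concurrently; gather preserves the order of `specs`.
    children = await asyncio.gather(*(run_child(s) for s in specs))
    for spec, steps in zip(specs, children):
        for step in steps:
            # Tag each embedded step with the spec's agent_id and is_trainable.
            parent.append(
                {**step, "agent_id": spec.agent_id, "is_trainable": spec.is_trainable}
            )
    return parent
```

Because the embedded steps carry the same tags as ordinary steps, downstream
code that groups by agent_id needs no special case for spawned children.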

RoundRobinProtocol is unchanged — the spawn branch is gated on
isinstance(protocol, SpawningProtocol).

Functions defined under `from __future__ import annotations` have their
return annotations stored as strings (e.g. 'dict[str, float]'), so
`inspect.signature(func).return_annotation` returns the literal string
and `get_origin(...)` returns None. The previous check classified such
functions as INDIVIDUAL reward funcs, so calls were routed through
`_call_individual_reward_func` which coerces the dict return to 0.0
via a failing `float()` conversion.

Fix: use `typing.get_type_hints(func)` to resolve string annotations to
their actual types before the dict-origin check. Falls back to the raw
annotation if get_type_hints raises (rare — unresolved forward refs).

Caught while running general-agent-coevolve on a real model: every
solver rollout was correctly producing reward=1.0 in its own rubric,
but the parent's aggregator returned 0 because both
`solver_verify_reward` and `synth_goldilocks_reward` were declared
under `from __future__ import annotations` in the env package.
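
A small reproduction of the annotation issue and the fix, as a sketch. The
helper name `returns_dict` is hypothetical and is not the function changed
in this commit. A quoted return annotation is used here because it is
stored as a string, the same way the future import stores every annotation.

```python
import inspect
from typing import get_origin, get_type_hints


def solver_reward(state: dict) -> "dict[str, float]":
    # The quoted annotation mimics `from __future__ import annotations`.
    return {"solver": 1.0}


def returns_dict(func) -> bool:
    """True if the resolved return annotation is dict or dict[...]."""
    try:
        # Resolve string annotations to real types.
        annotation = get_type_hints(func).get("return", inspect.Signature.empty)
    except Exception:
        # Fall back to the raw annotation, e.g. on unresolved forward refs.
        annotation = inspect.signature(func).return_annotation
    return annotation is dict or get_origin(annotation) is dict


# The raw annotation is a plain string, so the origin check finds nothing.
raw = inspect.signature(solver_reward).return_annotation
raw_is_string = isinstance(raw, str)
raw_origin = get_origin(raw)
```

Here `raw_origin` is None even though the function returns a dict, while
`returns_dict(solver_reward)` resolves the string first and classifies the
function correctly.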