WIP: Multi-agent support: Agent, Protocol, MultiAgentEnv #965
Draft
nph4rd wants to merge 21 commits into
Extends Protocol with should_spawn/get_spawn_specs. MultiAgentEnv.rollout() runs each spawned child env concurrently, scores it via its own rubric, and embeds the child trajectory steps into the parent's trajectory tagged with the spec's agent_id and is_trainable. Existing per-agent advantage computation (Rubric.score_group), trajectory splitting (interleave_rollout(split_by_agent=True)), and per-actor trainer metrics (MicroBatch.actor_ids) work without further changes. The use case is proposer-solver style envs where one agent's turn fans out into N rollouts of another agent — e.g. PrimeIntellect's general-agent synth-solver loop where the synthesizer creates tasks and the solver attempts them, with the synthesizer's reward depending on the solver's pass rate. RoundRobinProtocol is unchanged — the spawn branch is gated on isinstance(protocol, SpawningProtocol).
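The fan-out described above can be sketched roughly as follows. This is a toy illustration, not the PR's implementation: `SpawnSpec`, `run_child`, and `spawn_and_embed` are hypothetical names, the scoring rule is a placeholder, and only the `agent_id` / `is_trainable` tags and the "run children concurrently, then embed their steps in the parent trajectory" flow are taken from the commit message.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SpawnSpec:
    """Hypothetical description of one child rollout to spawn."""
    agent_id: str
    is_trainable: bool
    task: str


async def run_child(spec: SpawnSpec) -> dict:
    """Stand-in for a child env rollout that is scored by its own rubric."""
    await asyncio.sleep(0)  # placeholder for real model calls
    reward = 1.0 if "easy" in spec.task else 0.0  # toy scoring rule (assumption)
    steps = [{"content": f"attempt: {spec.task}"}]
    return {"steps": steps, "reward": reward}


async def spawn_and_embed(parent_trajectory: list, specs: list) -> list:
    """Run all children concurrently, then embed their tagged steps in the parent."""
    results = await asyncio.gather(*(run_child(spec) for spec in specs))
    for spec, result in zip(specs, results):
        for step in result["steps"]:
            parent_trajectory.append(
                {**step, "agent_id": spec.agent_id, "is_trainable": spec.is_trainable}
            )
    return [result["reward"] for result in results]


def demo() -> tuple:
    # The proposer's own turn is already in the parent trajectory.
    trajectory = [
        {"content": "proposed tasks", "agent_id": "synthesizer", "is_trainable": True}
    ]
    specs = [SpawnSpec("solver", True, task) for task in ("easy sum", "hard proof")]
    rewards = asyncio.run(spawn_and_embed(trajectory, specs))
    # The proposer's reward can then depend on the solver's pass rate.
    pass_rate = sum(rewards) / len(rewards)
    return trajectory, rewards, pass_rate
```

In this sketch the parent trajectory ends up with one synthesizer step followed by one tagged solver step per spawned child, which is the shape a split-by-agent pass would need downstream.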
Functions defined under `from __future__ import annotations` have their return annotations stored as strings (e.g. 'dict[str, float]'), so `inspect.signature(func).return_annotation` returns the literal string and `get_origin(...)` returns None. The previous check classified such functions as INDIVIDUAL reward funcs, so calls were routed through `_call_individual_reward_func` which coerces the dict return to 0.0 via a failing `float()` conversion. Fix: use `typing.get_type_hints(func)` to resolve string annotations to their actual types before the dict-origin check. Falls back to the raw annotation if get_type_hints raises (rare — unresolved forward refs). Caught while running general-agent-coevolve on a real model: every solver rollout was correctly producing reward=1.0 in its own rubric, but the parent's aggregator returned 0 because both `solver_verify_reward` and `synth_goldilocks_reward` were declared under `from __future__ import annotations` in the env package.
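A minimal reproduction of the failure mode and the fix, using only the standard library. The helper names `naive_returns_dict` and `returns_dict` are hypothetical and are not the names used in the PR. The string return annotation below mimics what `from __future__ import annotations` does to every annotation in a module.

```python
import inspect
from typing import get_origin, get_type_hints


# A quoted annotation behaves like one stored under
# `from __future__ import annotations`: it is kept as a plain string.
def solver_verify_reward(state: dict) -> "dict[str, float]":
    return {"solver": 1.0}


def naive_returns_dict(func) -> bool:
    """Previous-style check: inspects the raw (possibly string) annotation."""
    annotation = inspect.signature(func).return_annotation
    return annotation is dict or get_origin(annotation) is dict


def returns_dict(func) -> bool:
    """Resolve string annotations first, falling back to the raw annotation."""
    try:
        annotation = get_type_hints(func).get("return", inspect.Signature.empty)
    except Exception:
        # For example, an unresolved forward reference.
        annotation = inspect.signature(func).return_annotation
    return annotation is dict or get_origin(annotation) is dict
```

With the string annotation, `naive_returns_dict(solver_verify_reward)` is `False` because `get_origin` of a string is `None`, while `returns_dict(solver_verify_reward)` is `True`.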
Description
Adds foundational abstractions for multi-agent turn-based environments with support for heterogeneous (per-agent) rewards:
Core abstractions
Heterogeneous rewards:
Additional fix:
get_prompt_ids() now correctly handles multi-agent tokenization by finding the previous turn for the current agent instead of using the global last trajectory step. This prevents empty message errors when agents interleave turns during training.
Design goals:
Type of Change
Testing
uv run pytest locally.
Checklist
Additional Notes
Note
Add multi-agent support with Agent, Protocol, and MultiAgentEnv abstractions
- Adds MultiAgentEnv extending MultiTurnEnv with protocol-driven turn-taking, per-agent prompt building, and per-step agent metadata tagging during rollout.
- Adds an Agent dataclass (id, system_prompt, is_trainable) and a Protocol interface with a RoundRobinProtocol implementation for turn ordering.
- Extends Rubric to support reward functions returning dict[str, float] per-agent rewards, aggregating them into state['agent_rewards'] and computing per-agent advantages.
- Threads an actor_models dict through the full stack: Environment, EnvGroup, EnvClient, EnvWorker, and serve request types, allowing per-agent model routing.
- score_rollout and score_group now check for multi-agent reward functions before group functions; non-multi-agent paths are unchanged.

Macroscope summarized f430b7f. (Automatic summaries will resume when PR exits draft mode or review begins).
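As a rough illustration of the per-agent reward handling summarized above, the sketch below aggregates several dict-returning reward functions into one `agent_rewards` mapping and derives per-agent advantages. The function names are hypothetical, and the mean-baseline advantage (reward minus the group's mean reward for the same agent) is an assumption chosen for illustration; the summary does not say how the PR computes advantages.

```python
from collections import defaultdict
from statistics import mean


def aggregate_agent_rewards(reward_outputs: list, weights: list = None) -> dict:
    """Sum weighted per-agent rewards from several dict-returning reward funcs."""
    weights = weights or [1.0] * len(reward_outputs)
    totals = defaultdict(float)
    for output, weight in zip(reward_outputs, weights):
        for agent_id, reward in output.items():
            totals[agent_id] += weight * reward
    return dict(totals)


def per_agent_advantages(group_states: list) -> list:
    """Assumed rule: advantage = reward minus the group mean for the same agent."""
    by_agent = defaultdict(list)
    for state in group_states:
        for agent_id, reward in state["agent_rewards"].items():
            by_agent[agent_id].append(reward)
    baselines = {agent_id: mean(rewards) for agent_id, rewards in by_agent.items()}
    return [
        {
            agent_id: reward - baselines[agent_id]
            for agent_id, reward in state["agent_rewards"].items()
        }
        for state in group_states
    ]
```

For example, two rollouts with `agent_rewards` of `{"a": 1.0, "b": 0.0}` and `{"a": 0.0, "b": 1.0}` give each agent a baseline of 0.5, so each agent gets a positive advantage only in the rollout where it did better than its own group average.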