WIP: Multi-agent support: Agent, Protocol, MultiAgentEnv by nph4rd · Pull Request #965 · PrimeIntellect-ai/verifiers

nph4rd · 2026-02-27T05:27:35Z

Description

Adds foundational abstractions for multi-agent turn-based environments with support for heterogeneous (per-agent) rewards:

Core abstractions

Agent: Dataclass representing a participant (id, system_prompt, is_trainable)
Protocol: ABC defining turn order (get_initial_agent, get_next_agent)
RoundRobinProtocol: Concrete implementation for sequential turns
MultiAgentEnv: Base class extending MultiTurnEnv with multi-agent support
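
As a rough illustration of how the abstractions above could fit together, here is a minimal sketch. The field and method names come from the list above; the `state` argument, the `current_agent_id` key, and the constructor of `RoundRobinProtocol` are assumptions made for this example and may not match the PR's actual signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Agent:
    """A participant in the environment (fields as listed in the description)."""
    id: str
    system_prompt: str = ""
    is_trainable: bool = True


class Protocol(ABC):
    """Decides whose turn it is. The `state` parameter is an assumption."""

    @abstractmethod
    def get_initial_agent(self, state: dict) -> str: ...

    @abstractmethod
    def get_next_agent(self, state: dict) -> str: ...


class RoundRobinProtocol(Protocol):
    """Cycles through the agent ids in a fixed order."""

    def __init__(self, agent_ids: list[str]) -> None:
        if not agent_ids:
            raise ValueError("need at least one agent id")
        self.agent_ids = list(agent_ids)

    def get_initial_agent(self, state: dict) -> str:
        return self.agent_ids[0]

    def get_next_agent(self, state: dict) -> str:
        # Hypothetical state key: the id of the agent that just acted.
        idx = self.agent_ids.index(state["current_agent_id"])
        return self.agent_ids[(idx + 1) % len(self.agent_ids)]
```

Keeping turn order in a separate object, rather than inside the environment, is what lets the same ordering logic be reused across environments.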

Heterogeneous rewards:

MultiAgentRewardFunc: New reward function type returning dict[str, float] (agent_id → reward)
Per-agent reward aggregation in Rubric.score_group()
Per-agent advantages stored in trajectory steps
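
To make the per-agent reward flow concrete, here is a small sketch. It assumes the baseline for each agent is that agent's mean reward across the rollout group; the PR may compute baselines differently. `zero_sum_reward`, `per_agent_advantages`, and the `winner`/`agent_ids` state keys are hypothetical names for this example, not part of the library.

```python
from collections import defaultdict
from statistics import fmean


def zero_sum_reward(state: dict) -> dict[str, float]:
    """Hypothetical multi-agent reward: +1 for the winner, -1 for everyone else."""
    winner = state["winner"]
    return {a: (1.0 if a == winner else -1.0) for a in state["agent_ids"]}


def per_agent_advantages(group_rewards: list[dict[str, float]]) -> list[dict[str, float]]:
    """Subtract each agent's own group-mean reward (an assumed per-agent baseline)."""
    by_agent: dict[str, list[float]] = defaultdict(list)
    for rewards in group_rewards:
        for agent_id, reward in rewards.items():
            by_agent[agent_id].append(reward)
    baselines = {agent_id: fmean(rs) for agent_id, rs in by_agent.items()}
    return [
        {agent_id: reward - baselines[agent_id] for agent_id, reward in rewards.items()}
        for rewards in group_rewards
    ]
```

Computing the baseline per agent keeps one agent's consistently high rewards from distorting the advantage of another agent in the same rollout.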

Additional fix: get_prompt_ids() now correctly handles multi-agent tokenization by finding the previous turn for the current agent instead of using the global last trajectory step. This prevents empty message errors when agents interleave turns during training.
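
The lookup described above can be sketched as follows. This is not the actual `get_prompt_ids()` code: `last_step_for_agent` is a hypothetical helper, and storing the agent id under `extras["agent_id"]` is an assumption about the trajectory layout.

```python
from typing import Optional


def last_step_for_agent(trajectory: list[dict], agent_id: str) -> Optional[dict]:
    """Return the most recent step taken by `agent_id`, or None if it has not acted yet."""
    for step in reversed(trajectory):
        # Assumed layout: each step carries its agent id under extras["agent_id"].
        if step.get("extras", {}).get("agent_id") == agent_id:
            return step
    return None
```

With interleaved turns, `trajectory[-1]` usually belongs to a different agent, which is why the global last step is the wrong starting point for the current agent's token prefix.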

Design goals:

Protocol is required to encourage reusable turn-order logic
Minimal but extensible - can add simultaneous moves, state splitting, harness etc. later

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Add multi-agent support with Agent, Protocol, and MultiAgentEnv abstractions

Introduces MultiAgentEnv extending MultiTurnEnv with protocol-driven turn-taking, per-agent prompt building, and per-step agent metadata tagging during rollout.
Adds Agent dataclass (id, system_prompt, is_trainable) and a Protocol interface with a RoundRobinProtocol implementation for turn ordering.
Extends Rubric to support reward functions returning dict[str, float] per-agent rewards, aggregating them into state['agent_rewards'] and computing per-agent advantages.
Propagates an actor_models dict through the full stack: Environment, EnvGroup, EnvClient, EnvWorker, and serve request types, allowing per-agent model routing.
Behavioral Change: score_rollout and score_group now check for multi-agent reward functions before group functions; non-multi-agent paths are unchanged.
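
For illustration only, per-agent model routing with an `actor_models` dict might look like the sketch below. `resolve_model` is a hypothetical helper, and the fallback to a default model when an agent has no entry is an assumption, not behavior confirmed by the PR.

```python
from typing import Optional


def resolve_model(
    agent_id: str,
    actor_models: Optional[dict[str, str]],
    default_model: str,
) -> str:
    """Pick the model name for an agent, falling back to the default (assumed behavior)."""
    if actor_models and agent_id in actor_models:
        return actor_models[agent_id]
    return default_model
```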

Macroscope summarized f430b7f. (Automatic summaries will resume when the PR exits draft mode or review begins.)

Extends Protocol with should_spawn/get_spawn_specs. MultiAgentEnv.rollout() runs each spawned child env concurrently, scores it via its own rubric, and embeds the child trajectory steps into the parent's trajectory, tagged with the spec's agent_id and is_trainable.

Existing per-agent advantage computation (Rubric.score_group), trajectory splitting (interleave_rollout(split_by_agent=True)), and per-actor trainer metrics (MicroBatch.actor_ids) work without further changes.

The use case is proposer-solver style envs where one agent's turn fans out into N rollouts of another agent, e.g. PrimeIntellect's general-agent synth-solver loop, where the synthesizer creates tasks and the solver attempts them, and the synthesizer's reward depends on the solver's pass rate.

RoundRobinProtocol is unchanged: the spawn branch is gated on isinstance(protocol, SpawningProtocol).
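
The fan-out described in this commit can be sketched with plain asyncio. Everything here is a stand-in: `SpawnSpec`, `run_child`, and `spawn_and_embed` are hypothetical names, and the real child rollout would call a model and a rubric rather than return a canned step.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SpawnSpec:
    """Hypothetical spec: which agent a child rollout belongs to, and its task."""
    agent_id: str
    is_trainable: bool
    task: str


async def run_child(spec: SpawnSpec) -> list[dict]:
    """Stand-in for a child env rollout; returns that rollout's trajectory steps."""
    await asyncio.sleep(0)  # placeholder for model calls and scoring
    return [{"completion": f"attempt at {spec.task}", "reward": 1.0}]


async def spawn_and_embed(parent_trajectory: list[dict], specs: list[SpawnSpec]) -> None:
    """Run all children concurrently, then append their tagged steps to the parent."""
    child_trajectories = await asyncio.gather(*(run_child(s) for s in specs))
    for spec, steps in zip(specs, child_trajectories):
        for step in steps:
            # Tag each embedded step so later per-agent splitting can find it.
            parent_trajectory.append(
                {**step, "agent_id": spec.agent_id, "is_trainable": spec.is_trainable}
            )
```

Because `asyncio.gather` returns results in the order the awaitables were passed, zipping them back against `specs` is safe even though the children run concurrently.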

Functions defined under `from __future__ import annotations` have their return annotations stored as strings (e.g. 'dict[str, float]'), so `inspect.signature(func).return_annotation` returns the literal string and `get_origin(...)` returns None. The previous check classified such functions as INDIVIDUAL reward funcs, so calls were routed through `_call_individual_reward_func`, which coerces the dict return to 0.0 via a failing `float()` conversion.

Fix: use `typing.get_type_hints(func)` to resolve string annotations to their actual types before the dict-origin check. Falls back to the raw annotation if `get_type_hints` raises (rare: unresolved forward refs).

Caught while running general-agent-coevolve on a real model: every solver rollout was correctly producing reward=1.0 in its own rubric, but the parent's aggregator returned 0 because both `solver_verify_reward` and `synth_goldilocks_reward` were declared under `from __future__ import annotations` in the env package.
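
The annotation problem is easy to reproduce in isolation. In the sketch below the return annotation is written as an explicit string, which is how `from __future__ import annotations` stores it; `solver_reward` and `returns_agent_dict` are illustrative names, not the functions in the PR.

```python
import inspect
from typing import get_origin, get_type_hints


def solver_reward(state: dict) -> "dict[str, float]":
    """Reward func whose return annotation is stored as a string."""
    return {"solver": 1.0}


def returns_agent_dict(func) -> bool:
    """True if the function's resolved return annotation is a dict type."""
    try:
        annotation = get_type_hints(func).get("return")
    except Exception:
        # Fall back to the raw (possibly string) annotation, as the commit describes.
        annotation = inspect.signature(func).return_annotation
    return annotation is dict or get_origin(annotation) is dict


# The raw annotation is just a string, so a dict-origin check on it fails.
raw = inspect.signature(solver_reward).return_annotation
print(repr(raw), get_origin(raw))
print(returns_agent_dict(solver_reward))
```

Checking the raw annotation would misclassify `solver_reward`, while the resolved annotation identifies it as returning a dict.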

nph4rd mentioned this pull request Feb 27, 2026

WIP: Support per-agent rewards in multi-agent setups PrimeIntellect-ai/prime-rl#1910

Closed

nph4rd force-pushed the multiagent-no-opponent-conditioning branch 2 times, most recently from e30195f to 9f15872, on March 7, 2026 06:00

nph4rd force-pushed the multiagent-no-opponent-conditioning branch from 754639e to 615187b on March 22, 2026 00:03

nph4rd mentioned this pull request May 20, 2026

WIP: Support per-agent rewards in multi-agent setups PrimeIntellect-ai/prime-rl#2575

Draft

nph4rd added 19 commits May 20, 2026 12:18

add MultiAgentEnv for turn-based multi-agent environments

0f7810e

rename Actor to Agent, add Protocol abstraction

a4fea39

require Protocol in MultiAgentEnv, simplify docstrings

8dab76d

update docstrings

2708169

add multi-agent reward functions for heterogeneous rewards

e8c04dc

compute per-agent advantages for multi-agent rewards

65c2853

include all rewards in per-agent rewards for multi-agent training

2ca7c72

add opponent-conditioned baselines for multi-agent advantage estimation

0c8ce5e

add debug logging for opponent-conditioned baselines

b426660

add trajectory structure debug

5bc1468

debug extras and state keys

2b37080

remove opponent-conditioned baselines for comparison test

dce56cd

add per-agent baselines for multi-agent advantage computation

5425ebb

fix score_rollout to support multi-agent reward functions

0034333

normalize messages from build_agent_prompt before storing in trajectory

902e3f7

add per-agent reward metrics for multi-agent environments

c8d3715

add per-agent model routing for multi-policy lora training

e80aab6

point textarena to fork with kuhn poker fixes

616bed6

fix rubric rollout score import after rebase

f430b7f

nph4rd force-pushed the multiagent-no-opponent-conditioning branch from 5a0cd7d to f430b7f on May 20, 2026 20:16

nph4rd added 2 commits May 20, 2026 20:08

nph4rd wants to merge 21 commits into PrimeIntellect-ai:main from nph4rd:multiagent-no-opponent-conditioning
