EnvGroup nested in outer EnvGroup silently misroutes all tasks to envs[0] #1008

## Bug: Nested EnvGroup task routing is silently broken

When an `EnvGroup` is wrapped inside another `EnvGroup` (as prime-rl does when it wraps user envs), the outer group overwrites every value in the `task` column with a single name. The inner `EnvGroup` can then no longer route, because its `env_map` is keyed by the original task names: it **silently falls back to `envs[0]`** for rollouts and returns `reward=0.0` for scoring.

This means **only the first sub-environment's rubric ever executes**. No exception is raised; the only trace is a logger warning from the scoring path, so the user can easily miss that the rest of their tasks are being misrouted or scored as zero.

## Impact

If a user builds a multi-task environment using `EnvGroup` (e.g., 17 tasks with different rubrics) and registers it as a single `[[orchestrator.env]]` in prime-rl, only the first task type gets a meaningful reward signal, and the model trains with zero reward on all other task types. Nothing fails loudly: the most visible symptom is that the wandb metrics show only the first rubric's reward name.

## Problematic code path

### 1. Task name overwriting (`EnvGroup.__init__`, lines 172-182)

```python
for env, name in zip(self.envs, self.env_names):
    add_task = make_add_task_fn(name)
    env_dataset = env.build_dataset()
    if env_dataset is not None:
        if "task" in env_dataset.column_names:
            env_dataset = env_dataset.remove_columns(["task"])  # destroys inner task names
        env_dataset = env_dataset.map(add_task, **map_kwargs)   # overwrites with outer name
```

When the sub-env is itself an `EnvGroup`, its dataset already has meaningful per-task names. The outer `EnvGroup` unconditionally destroys these and replaces them all with a single name.

### 2. Silent fallback in `get_env_for_task` (lines 318-319)

```python
def get_env_for_task(self, task: str) -> vf.Environment:
    return self.env_map.get(task, self.envs[0])  # silent fallback!
```

When the inner `EnvGroup` receives the outer name (which doesn't exist in its `env_map`), it silently falls back to `envs[0]`. All rollouts go to the first sub-environment. No warning is logged.

### 3. Scoring returns zeros for unknown tasks (`EnvGroupRubric.score_group`, lines 86-94)

```python
task = states[0].get("task", "default")
env = self.env_map.get(task)
if env is None:
    self.logger.warning(f"No environment found for task '{task}'")
    for state in states:
        state["reward"] = 0.0  # all tasks scored as zero
```

The inner `EnvGroupRubric` can't find the outer name in its `env_map`, so all states get `reward=0.0`.
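
The first two steps can be reproduced with a toy model that uses plain Python objects instead of the real classes. `ToyEnv` and `ToyGroup` are hypothetical stand-ins, not the verifiers API; the sketch only mirrors the relabel-then-fallback logic quoted above, under the assumption that the outer group relabels every row it receives.

```python
class ToyEnv:
    """Hypothetical leaf environment holding a few dataset rows."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def build_dataset(self):
        return [dict(row) for row in self.rows]

    def get_env_for_task(self, task):
        return self  # a leaf env handles whatever it is given


class ToyGroup:
    """Hypothetical stand-in mirroring the relabel and fallback logic."""

    def __init__(self, envs, env_names):
        self.envs = envs
        self.env_names = env_names
        self.env_map = dict(zip(env_names, envs))

    def build_dataset(self):
        rows = []
        for env, name in zip(self.envs, self.env_names):
            for row in env.build_dataset():
                row["task"] = name  # unconditional overwrite, as in step 1
                rows.append(row)
        return rows

    def get_env_for_task(self, task):
        # Silent fallback to envs[0], as in step 2
        return self.env_map.get(task, self.envs[0])


math_env = ToyEnv("math", [{"question": "q1"}])
code_env = ToyEnv("code", [{"question": "q2"}])
inner = ToyGroup([math_env, code_env], ["math", "code"])
outer = ToyGroup([inner], ["my_env"])

# Every row now carries the outer name, so the inner names are gone.
tasks = [row["task"] for row in outer.build_dataset()]
# Outer routes "my_env" to inner; inner has no such key and falls back.
routed = [outer.get_env_for_task(t).get_env_for_task(t).name for t in tasks]
print(tasks)
print(routed)
```

In this toy model both rows end up with the task `my_env`, and both are routed to the `math` environment, including the row that originally belonged to `code`.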

## Proposed fix

### A. Preserve inner task names when sub-env is an EnvGroup

In `EnvGroup.__init__`, when a sub-env is itself an `EnvGroup`, preserve its internal task names instead of overwriting them. Register each inner task name in the outer `env_map` pointing to the sub-env:

```python
for env, name in zip(self.envs, self.env_names):
    add_task = make_add_task_fn(name)
    env_dataset = env.build_dataset()
    if env_dataset is not None:
        if isinstance(env, EnvGroup) and "task" in env_dataset.column_names:
            # Preserve inner EnvGroup's task names for correct routing
            for inner_name in env.env_names:
                self.env_map[inner_name] = env
            # Don't overwrite task column — inner tasks are already set
        else:
            if "task" in env_dataset.column_names:
                env_dataset = env_dataset.remove_columns(["task"])
            env_dataset = env_dataset.map(add_task, **map_kwargs)
        datasets.append(env_dataset)
```

The same pattern applies to the `eval_dataset` handling.

This way:
- Inner task names are preserved in the dataset
- Outer `env_map` maps each inner task name to the inner `EnvGroup`
- Routing works: outer gets a task name, routes to inner `EnvGroup`, inner routes to correct sub-env
- `EnvGroupRubric` also gets the updated `env_map`, so scoring routes correctly
- `results_df.task.nunique() > 1`, enabling per-task logging in orchestrators like prime-rl
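
The routing described in the list above can be sketched in a toy model. `ToyEnv`, `ToyGroup`, and `resolve` are hypothetical names, not the verifiers API. As one possible variation on the proposed fix, the sketch registers the keys of the inner group's `env_map` rather than its `env_names`, on the assumption that this would also cover groups nested more than one level deep.

```python
class ToyEnv:
    """Hypothetical leaf environment holding a few dataset rows."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def build_dataset(self):
        return [dict(row) for row in self.rows]


class ToyGroup:
    """Hypothetical group that preserves inner task names."""

    def __init__(self, envs, env_names):
        self.envs = envs
        self.env_names = env_names
        self.env_map = {}
        for env, name in zip(envs, env_names):
            if isinstance(env, ToyGroup):
                # Register every task the inner group can route, so the
                # outer map stays in sync with the preserved task column.
                for inner_task in env.env_map:
                    self.env_map[inner_task] = env
            else:
                self.env_map[name] = env

    def build_dataset(self):
        rows = []
        for env, name in zip(self.envs, self.env_names):
            for row in env.build_dataset():
                if not isinstance(env, ToyGroup):
                    row["task"] = name  # only leaf envs get labelled
                rows.append(row)
        return rows

    def resolve(self, task):
        """Follow the route down to a leaf env, raising on unknown tasks."""
        env = self.env_map.get(task)
        if env is None:
            raise ValueError(f"No environment found for task '{task}'")
        return env.resolve(task) if isinstance(env, ToyGroup) else env


math_env = ToyEnv("math", [{"question": "q1"}])
code_env = ToyEnv("code", [{"question": "q2"}])
inner = ToyGroup([math_env, code_env], ["math", "code"])
middle = ToyGroup([inner], ["mid"])
outer = ToyGroup([middle], ["my_env"])

tasks = [row["task"] for row in outer.build_dataset()]
resolved = [outer.resolve(t).name for t in tasks]
print(tasks)
print(resolved)
```

One design caveat of this sketch: if two inner groups expose the same task name, the later registration overwrites the earlier one, so a real implementation would probably want to detect such collisions.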

### B. Remove silent fallback in `get_env_for_task`

```python
def get_env_for_task(self, task: str) -> vf.Environment:
    env = self.env_map.get(task)
    if env is None:
        available = list(self.env_map.keys())
        raise ValueError(
            f"No environment found for task '{task}'. "
            f"Available tasks: {available}"
        )
    return env
```

The current fallback to `envs[0]` silently masks routing failures. An explicit error makes misconfigurations immediately visible.

## Test changes

### 1. Update `test_get_env_for_task`: unknown task should raise, not fall back

```python
def test_get_env_for_task(self, mock_openai_client):
    # ... (same setup as current) ...
    env_group = EnvGroup(envs=[env1, env2], env_names=["math", "code"])

    assert env_group.get_env_for_task("math") == env1
    assert env_group.get_env_for_task("code") == env2
    # Unknown task should raise, not silently fall back
    with pytest.raises(ValueError, match="No environment found for task"):
        env_group.get_env_for_task("unknown")
```

### 2. Add `test_nested_env_group_preserves_inner_tasks`

```python
def test_nested_env_group_preserves_inner_tasks(self, mock_openai_client):
    """Test that wrapping an EnvGroup in another EnvGroup preserves inner task names."""
    env1 = SingleTurnEnv(
        client=mock_openai_client,
        model="test-model",
        dataset=Dataset.from_dict({"question": ["q1"], "answer": ["a1"]}),
        rubric=Rubric(),
    )
    env2 = SingleTurnEnv(
        client=mock_openai_client,
        model="test-model",
        dataset=Dataset.from_dict({"question": ["q2"], "answer": ["a2"]}),
        rubric=Rubric(),
    )

    inner_group = EnvGroup(envs=[env1, env2], env_names=["math", "code"])
    outer_group = EnvGroup(envs=[inner_group], env_names=["my_env"])

    # Inner task names should be preserved in the dataset
    dataset = outer_group.get_dataset()
    tasks = dataset["task"]
    assert "math" in tasks
    assert "code" in tasks
    assert "my_env" not in tasks

    # Routing should work through both levels
    assert outer_group.get_env_for_task("math") == inner_group
    assert outer_group.get_env_for_task("code") == inner_group
```

### 3. Add `test_nested_env_group_rubric_scoring`

```python
@pytest.mark.asyncio
async def test_nested_env_group_rubric_scoring(self, mock_openai_client, make_input):
    """Test that scoring routes correctly through nested EnvGroups."""
    def math_reward(completion, **kwargs):
        return 0.8

    def code_reward(completion, **kwargs):
        return 0.6

    env1 = SingleTurnEnv(
        client=mock_openai_client,
        model="test-model",
        dataset=Dataset.from_dict({"question": ["q1"], "answer": ["a1"]}),
        rubric=Rubric(funcs=[math_reward], weights=[1.0]),
    )
    env2 = SingleTurnEnv(
        client=mock_openai_client,
        model="test-model",
        dataset=Dataset.from_dict({"question": ["q2"], "answer": ["a2"]}),
        rubric=Rubric(funcs=[code_reward], weights=[1.0]),
    )

    inner_group = EnvGroup(envs=[env1, env2], env_names=["math", "code"])
    outer_group = EnvGroup(envs=[inner_group], env_names=["my_env"])

    # Score a "code" task — should route through outer -> inner -> env2's rubric
    state = State(input=make_input(prompt="Test", answer="ans", task="code"))
    state["completion"] = "Test completion"
    state["trajectory"] = []
    state["timing"] = {"generation_ms": 0.0, "scoring_ms": 0.0, "total_ms": 0.0, "start_time": 0.0}
    state["is_completed"] = False
    state["stop_condition"] = None
    state["oai_tools"] = []
    state["reward"] = None
    state["metrics"] = None

    await outer_group.rubric.score_rollout(state)

    assert state["reward"] == 0.6  # code_reward, not math_reward
    assert state["metrics"]["code_reward"] == 0.6
    assert state["metrics"]["math_reward"] == 0.0
```

## Reproduction

Any environment that returns an `EnvGroup` from its `load_environment()` and is registered as a single `[[orchestrator.env]]` in prime-rl will hit this. The inner task names get overwritten, routing breaks silently, and only the first sub-env's rubric ever scores anything.
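
A quick way to check whether a given setup is affected is to compare the task values in the dataset against the keys the inner group can route. The sketch below is a minimal, library-independent version of that check; `find_unroutable_tasks` is a hypothetical helper that takes plain iterables, on the assumption that one would pass it the dataset's `task` column and the keys of the inner group's `env_map`.

```python
from collections import Counter


def find_unroutable_tasks(dataset_tasks, routable_tasks):
    """Return {task: row_count} for task values that have no route."""
    routable = set(routable_tasks)
    counts = Counter(dataset_tasks)
    return {task: n for task, n in counts.items() if task not in routable}


# Example values modelled on the nested case described in this issue:
# the outer group has relabelled every row to "my_env".
dataset_tasks = ["my_env", "my_env", "my_env"]
inner_env_map_keys = ["math", "code"]

missing = find_unroutable_tasks(dataset_tasks, inner_env_map_keys)
print(missing)  # {'my_env': 3}
```

A non-empty result means some rows carry a task name the inner group cannot route, which is the condition that triggers the fallback to `envs[0]`.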
