Merged
35 commits
c7fd27c  exp: general-agent (mikasenghaas, May 17, 2026)
b8c33de  docs: update PR description guidance (mikasenghaas, May 17, 2026)
001711d  Revert "docs: update PR description guidance" (mikasenghaas, May 17, 2026)
1017b19  feat(general-agent): add RLM behavior learning configs (mikasenghaas, May 18, 2026)
2153495  feat(general-agent): tune behavior reward shaping (mikasenghaas, May 18, 2026)
0409c35  feat(general-agent): log behavior judge state for audits (mikasenghaas, May 18, 2026)
818957d  feat(general-agent): document split behavior rewards (mikasenghaas, May 18, 2026)
b814b5c  feat(general-agent): document pruned behavior metrics (mikasenghaas, May 18, 2026)
fabcacc  docs(general-agent): record behavior judge calibration runs (mikasenghaas, May 18, 2026)
a560650  feat(general-agent): define behavior learning ablations (mikasenghaas, May 18, 2026)
c1d7e07  chore(general-agent): move behavior learning configs (mikasenghaas, May 18, 2026)
fd80d03  exp(behavior-learning): add labels, prime monitor, remove cp from abl… (mikasenghaas, May 18, 2026)
abbe7ec  exp(behavior-learning): source ~/.env before run to fix auth (mikasenghaas, May 18, 2026)
3703dc0  fix(slurm): source ~/.env before uv run in single-node RL template (mikasenghaas, May 18, 2026)
995fd00  exp(behavior-learning): rename prefix, pin rlm_ref, prime sandbox kil… (mikasenghaas, May 18, 2026)
94f1d05  revert(slurm): drop source ~/.env from single-node RL template (mikasenghaas, May 18, 2026)
f230472  chore(deps): bump verifiers submodule to main (dd89b5e9) (mikasenghaas, May 18, 2026)
9f43bee  exp(behavior-learning): switch model to Qwen3-4B-Thinking-2507 (mikasenghaas, May 18, 2026)
4cad791  exp(behavior-learning): log behavior metrics in baseline without affe… (mikasenghaas, May 18, 2026)
13d721c  exp(behavior-learning): log behavior metrics in prompt run without af… (mikasenghaas, May 18, 2026)
d5fb965  exp(behavior-learning): cap ablations at 200 steps, log behavior metr… (mikasenghaas, May 18, 2026)
8c4f5a8  chore(deps): bump research-environments to include behavior-judge tea… (mikasenghaas, May 18, 2026)
58dc920  chore(deps): bump research-environments to include final_reward fix (mikasenghaas, May 18, 2026)
cb0fae2  exp(behavior-learning): bump context to 65K with cp=2 on all configs (mikasenghaas, May 18, 2026)
ba08c3f  chore(deps): bump research-environments to log un-gated behavior_reward (mikasenghaas, May 18, 2026)
26347cb  exp(behavior-learning): extend max_steps to 400 for phase-3 continuation (mikasenghaas, May 19, 2026)
55ddc54  exp: move general-agent and behavior-learning configs into private su… (mikasenghaas, May 21, 2026)
1c2f61d  Merge remote-tracking branch 'origin/main' into exp/general-agent (mikasenghaas, May 21, 2026)
fa63b17  chore(deps): bump configs/private to include phase-3 behavior-learnin… (mikasenghaas, May 22, 2026)
bda0d00  exp(general-agent): bump research-envs, add public RLM configs, renam… (mikasenghaas, May 22, 2026)
89156ce  exp(general-agent): rename 0p6b config, point all configs at general-… (mikasenghaas, May 22, 2026)
8a6c42c  chore(deps): bump configs/private to expand behavior-learning RESULTS (mikasenghaas, May 22, 2026)
503861e  chore(deps): bump configs/private for tightened behavior-learning RES… (mikasenghaas, May 22, 2026)
54aee40  exp(general-agent): drop num_workers/max_retries/tool_call_parser, se… (mikasenghaas, May 22, 2026)
278ed64  Merge branch 'main' into exp/general-agent (mikasenghaas, May 25, 2026)
3 changes: 1 addition & 2 deletions AGENTS.md
@@ -50,11 +50,10 @@ Write tests as plain functions with pytest fixtures. Don't use class-based tests

## Git

- **Branch prefixes**: use the following prefixes for branches: `feat/`, `fix/`, `chore/`
- **Branch prefixes**: use `feat/`, `fix/`, `chore/`; use `exp/` for experiment branches (configs, run summaries, pins, notes).

## GitHub

- **Draft PRs**: always create PRs as drafts (`gh pr create --draft`) to avoid triggering CI unnecessarily.
- **Pull requests**: do not include a "test plan" section in PR descriptions unless you actually ran tests to verify the changes or the user explicitly asked for one.
- **Keep PR descriptions in sync**: every time you push commits to a PR, also update the PR description (`gh pr edit <num> --body-file ...`) so it reflects the current state of the branch — not just what was true when the PR was opened. Preserve any auto-generated blocks (e.g. `<!-- CURSOR_SUMMARY -->`).
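
The branch-prefix rule in the Git section could be checked mechanically. The following is a minimal sketch, not something this PR adds: `has_allowed_prefix` and `ALLOWED_PREFIXES` are hypothetical names, and the requirement that a non-empty name follows the prefix is an assumption.

```python
# Hypothetical helper, not part of the repository: checks a branch name
# against the prefixes listed in AGENTS.md.
ALLOWED_PREFIXES = ("feat/", "fix/", "chore/", "exp/")


def has_allowed_prefix(branch: str) -> bool:
    """Return True if `branch` starts with an allowed prefix followed by a non-empty name."""
    return any(
        branch.startswith(prefix) and len(branch) > len(prefix)
        for prefix in ALLOWED_PREFIXES
    )


ok = has_allowed_prefix("exp/general-agent")
```

Under these assumptions, the branch named in the merge commits (`exp/general-agent`) passes, while a bare `general-agent` or an empty `exp/` would not.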

29 changes: 29 additions & 0 deletions configs/general_agent/rl_qwen3_0p6b.toml
@@ -0,0 +1,29 @@
max_steps = 5
seq_len = 8192

[wandb]
project = "general-agent-debug"
name = "qwen3-0p6b-rlm"

[model]
name = "Qwen/Qwen3-0.6B"

[orchestrator]
batch_size = 16
rollouts_per_example = 4

[orchestrator.train.sampling]
max_completion_tokens = 4096

[[orchestrator.train.env]]
id = "general-agent-solver-rlm"

[trainer]

[inference]

[inference.model]
max_model_len = 8192

[inference.parallel]
dp = 1
52 changes: 52 additions & 0 deletions configs/general_agent/rl_qwen3_30b_a3b.toml
@@ -0,0 +1,52 @@
max_steps = 400
seq_len = 32768

[slurm]
job_name = "general-agent-qwen3-30b-a3b-rlm"

[deployment]
type = "multi_node"
num_train_nodes = 1
num_infer_nodes = 1

[wandb]
project = "general-agent-debug"
name = "qwen3-30b-a3b-rlm"

[ckpt]
interval = 50
keep_last = 1

[model]
name = "Qwen/Qwen3-30B-A3B-Instruct-2507"

[trainer]

[trainer.model]
cp = 2

[trainer.model.ac]
freq = 1

[trainer.model.compile]

[orchestrator]
batch_size = 512
rollouts_per_example = 16
max_off_policy_steps = 32

[[orchestrator.train.env]]
id = "general-agent-solver-rlm"

[orchestrator.train.env.args]
min_tier = 1

[inference]
gpu_memory_utilization = 0.85

[inference.model]
max_model_len = 32768

[inference.parallel]
dp = 2
tp = 4
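
The parallelism and batch settings above imply some simple arithmetic. The sketch below assumes the usual meanings, which the config itself does not spell out: `dp` is the number of data-parallel inference replicas, `tp` is the tensor-parallel degree of each replica, and `batch_size` counts rollouts rather than examples. Both function names are hypothetical.

```python
def inference_gpus(dp: int, tp: int = 1) -> int:
    """GPUs used when `dp` replicas are each sharded across `tp` GPUs (assumed semantics)."""
    return dp * tp


def examples_per_batch(batch_size: int, rollouts_per_example: int) -> int:
    """Distinct examples per batch, assuming `batch_size` counts rollouts."""
    if batch_size % rollouts_per_example != 0:
        raise ValueError("batch_size must be a multiple of rollouts_per_example")
    return batch_size // rollouts_per_example


# Values from configs/general_agent/rl_qwen3_30b_a3b.toml.
gpus = inference_gpus(dp=2, tp=4)
examples = examples_per_batch(batch_size=512, rollouts_per_example=16)
```

With these assumptions the 30B-A3B config would use 8 inference GPUs and 32 distinct examples per batch. Eight GPUs would fill the single inference node only if nodes have 8 GPUs, which the config does not state.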
44 changes: 44 additions & 0 deletions configs/general_agent/rl_qwen3_4b.toml
@@ -0,0 +1,44 @@
max_steps = 200
seq_len = 32768

[deployment]
num_train_gpus = 4
num_infer_gpus = 4

[wandb]
project = "general-agent-debug"
name = "qwen3-4b-rlm"

[ckpt]
interval = 100
keep_last = 1

[model]
name = "Qwen/Qwen3-4B-Instruct-2507"

[trainer]

[trainer.model]
cp = 2

[trainer.model.ac]
freq = 1

[trainer.model.compile]

[orchestrator]
batch_size = 512
rollouts_per_example = 8
max_off_policy_steps = 32

[[orchestrator.train.env]]
id = "general-agent-solver-rlm"

[inference]
gpu_memory_utilization = 0.85

[inference.model]
max_model_len = 32768

[inference.parallel]
dp = 4
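
The `[ckpt]` settings interact with `max_steps`. The sketch below assumes, without confirmation from the source, that a checkpoint is written whenever the step is a multiple of `interval` and that `keep_last` retains only the most recent N checkpoints. `checkpoint_steps` and `retained_steps` are hypothetical helpers.

```python
def checkpoint_steps(max_steps: int, interval: int) -> list[int]:
    """Steps at which a checkpoint would be written (assumed: every multiple of `interval`)."""
    return list(range(interval, max_steps + 1, interval))


def retained_steps(steps: list[int], keep_last: int) -> list[int]:
    """Checkpoints still on disk at the end (assumed: only the newest `keep_last` survive)."""
    return steps[-keep_last:] if keep_last > 0 else []


# Values from configs/general_agent/rl_qwen3_4b.toml.
written_4b = checkpoint_steps(max_steps=200, interval=100)
kept_4b = retained_steps(written_4b, keep_last=1)

# Values from configs/general_agent/rl_qwen3_30b_a3b.toml.
written_30b = checkpoint_steps(max_steps=400, interval=50)
kept_30b = retained_steps(written_30b, keep_last=1)
```

Under these assumptions the 4B run would write checkpoints at steps 100 and 200 and the 30B-A3B run would write eight, with only the final one kept in each case.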
2 changes: 1 addition & 1 deletion configs/private
2 changes: 1 addition & 1 deletion deps/research-environments
Submodule research-environments updated 38 files
+3 −0 .github/workflows/publish-envs.yaml
+1 −0 environments/ddbc/README.md
+2 −0 environments/ddbc/ddbc/ddbc.py
+45 −6 environments/ddbc/ddbc/open_one.py
+1 −1 environments/ddbc/pyproject.toml
+1 −0 environments/ddbc_rlm/README.md
+2 −0 environments/ddbc_rlm/ddbc_rlm/ddbc_rlm.py
+45 −6 environments/ddbc_rlm/ddbc_rlm/open_one.py
+1 −1 environments/ddbc_rlm/pyproject.toml
+3 −0 environments/deepdive/README.md
+11 −0 environments/deepdive/deepdive/config.py
+4 −0 environments/deepdive/deepdive/deepdive.py
+45 −6 environments/deepdive/deepdive/open_one.py
+1 −1 environments/deepdive/pyproject.toml
+2 −0 environments/deepdive_rlm/README.md
+11 −0 environments/deepdive_rlm/deepdive_rlm/config.py
+4 −0 environments/deepdive_rlm/deepdive_rlm/deepdive_rlm.py
+45 −8 environments/deepdive_rlm/deepdive_rlm/open_one.py
+1 −1 environments/deepdive_rlm/pyproject.toml
+25 −0 environments/general_agent/README.md
+495 −0 environments/general_agent/general_agent/solver/rlm/behavior.py
+63 −0 environments/general_agent/general_agent/solver/rlm/env.py
+150 −0 environments/general_agent/general_agent/solver/rlm/prompts/behavior.md
+1 −1 environments/general_agent/general_agent/solver/rubric.py
+1 −1 environments/general_agent/pyproject.toml
+3 −0 environments/opencode_deepdive/README.md
+11 −1 environments/opencode_deepdive/opencode_deepdive/opencode_deepdive.py
+1 −1 environments/opencode_deepdive/pyproject.toml
+3 −0 environments/rlm_deepdive/README.md
+1 −1 environments/rlm_deepdive/pyproject.toml
+11 −0 environments/rlm_deepdive/rlm_deepdive/rlm_deepdive.py
+64 −0 environments/rlm_swe/README.md
+1 −1 environments/rlm_swe/pyproject.toml
+665 −0 environments/rlm_swe/rlm_swe/behavior.py
+220 −0 environments/rlm_swe/rlm_swe/prompts/behavior.md
+1 −0 environments/rlm_swe/rlm_swe/prompts/venv_hint.md
+75 −0 environments/rlm_swe/rlm_swe/rlm_swe.py
+2 −1 skills/env-sync-push/SKILL.md
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -66,6 +66,7 @@ envs = [
    "code-env",
    "color-codeword",
    "deepdive",
    "general-agent",
    "gpqa",
    "hle",
    "ifeval",
@@ -118,6 +119,7 @@ dev = [
    "ipywidgets>=8.1.7",
    "pre-commit>=4.2.0",
    "pytest>=8.4.1",
    "pytest-asyncio>=0.23",
    "ruff>=0.12.1",
]

@@ -137,6 +139,7 @@ members = [
    "deps/research-environments/environments/code_env",
    "deps/research-environments/environments/color_codeword",
    "deps/research-environments/environments/deepdive",
    "deps/research-environments/environments/general_agent",
    "deps/research-environments/environments/gpqa",
    "deps/research-environments/environments/hle",
    "deps/research-environments/environments/ifeval",
@@ -203,6 +206,7 @@ alphabet-sort = { workspace = true }
code-env = { workspace = true }
color-codeword = { workspace = true }
deepdive = { workspace = true }
general-agent = { workspace = true }
gpqa = { workspace = true }
hle = { workspace = true }
ifeval = { workspace = true }
11 changes: 11 additions & 0 deletions tests/unit/test_configs.py
@@ -1,3 +1,4 @@
import tomllib
from pathlib import Path
from typing import Annotated, Literal

@@ -33,9 +34,19 @@ def get_config_files() -> list[Path]:
    return config_files + example_files


def is_eval_config(path: Path) -> bool:
    """vf-eval TOMLs live under configs but are not prime-rl entrypoint configs."""
    with path.open("rb") as f:
        data = tomllib.load(f)
    return isinstance(data.get("eval"), list)


@pytest.mark.parametrize("config_file", get_config_files(), ids=lambda x: x.as_posix())
def test_load_configs(config_file: Path):
    """Tests that all config files can be loaded by at least one config class."""
    if is_eval_config(config_file):
        pytest.skip("vf-eval TOML files are not prime-rl entrypoint configs")

    could_parse = []
    for config_cls in CONFIG_CLASSES:
        try:
41 changes: 39 additions & 2 deletions uv.lock
