exp: general-agent #2525
Merged
This reverts commit b8c33de.
…ation configs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l in pre-run Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Source ~/.env directly in the shell before uv run rl instead; env vars propagate to sbatch via --export=ALL. Reverts 3703dc0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up #1395 which fixes the ParsedToolCall subscription bug in renderer_client.from_native_response — previously raised 'ParsedToolCall' object is not subscriptable on every rollout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4B-Instruct hallucinated tool names and gave up after a few errors (reward ~0.4% at step 2). Try the thinking variant which is better at structured tool-use. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cting reward Set behavior_judge_model + behavior_reward_alpha=0.0 so the judge runs and behavior_<key> metrics get logged, but final_reward stays equal to task_reward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fecting reward Same change as baseline: enable judge for metrics but alpha=0.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ics in prompt run - prompt run: enable judge with alpha=0.0 so behavior_<key> metrics get logged but final_reward stays equal to task_reward (same setup as baseline) - all four configs: max_steps 1000 → 200 to keep ablations bounded Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m-id fix Picks up PrimeIntellect-ai/research-environments@769298b1 which forwards PRIME_TEAM_ID as X-Prime-Team-ID on behavior judge requests, so the judge bills the team balance instead of the user's personal balance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up eaaabf3c which makes final_reward use state.get() so judge failures don't zero out task_reward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous runs (cp=1, 32K) saw 6-9% truncation rate and output_tokens hitting the 32K cap on long trajectories. Double the seq_len and max_model_len; cp=2 keeps per-rank activation memory flat under the 2x context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up fdca6d76 which logs behavior_reward as the raw judge mean (independent of task_reward) and moves the solution gate into final_reward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 4 phase-2 runs completed step 200 with promising trajectories. Extend max_steps to 400 to continue training from the step_200 checkpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bmodule The general-agent + behavior-learning configs are not meant for the public prime-rl repo. Move them into the research-configs submodule mounted at configs/private/ so they share access controls with the rest of our internal experiment configs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # deps/verifiers
…g RESULTS.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e eval-config helper
- Bump deps/research-environments to origin/main HEAD (general-agent 0.1.4)
- Add configs/general_agent/{rl_qwen3_0p6b_debug,rl_qwen3_4b,rl_qwen3_30b_a3b}.toml
using the general-agent-solver-rlm env
- Rename is_vf_eval_config -> is_eval_config in tests/unit/test_configs.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…agent-debug wandb project Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ULTS Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t step counts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # configs/private
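Several commits above describe running the behavior judge with behavior_reward_alpha=0.0 so that final_reward stays equal to task_reward, and reading rewards through state.get() so a judge failure does not zero out task_reward. The actual reward code lives in research-environments and is not shown in this PR, so the following is only a minimal sketch that assumes a simple additive blend; the function body, the blend formula, and the dictionary layout are hypothetical.

```python
from typing import Any, Mapping


def final_reward(state: Mapping[str, Any], alpha: float = 0.0) -> float:
    """Hypothetical blend of task and judge rewards (not the env's real code)."""
    task = float(state.get("task_reward") or 0.0)
    # Reading with .get() means a failed or skipped judge call simply
    # contributes 0.0 instead of raising and wiping out the task reward.
    behavior = float(state.get("behavior_reward") or 0.0)
    # Assumed additive blend: with alpha == 0.0 the judge is metrics-only.
    return task + alpha * behavior


ok = {"task_reward": 1.0, "behavior_reward": 0.25}
judge_failed = {"task_reward": 1.0}  # judge never wrote its key

print(final_reward(ok, alpha=0.0))            # judge logged, reward unchanged
print(final_reward(judge_failed, alpha=0.0))  # task reward survives the failure
print(final_reward(ok, alpha=0.5))            # judge would count if alpha > 0
```

Under this assumed formula, alpha=0.0 makes the judge a pure logging side channel, which matches the intent stated in the commit messages for the baseline and prompt runs.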
samsja approved these changes May 25, 2026
Summary
- Add general-agent to pyproject.toml (envs list, workspace members, uv sources) and pull pytest-asyncio into the dev group so the env's tests are runnable.
- Add configs/general_agent/ with three RLM configs using general-agent-solver-rlm, all logging to the general-agent-debug wandb project:
  - rl_qwen3_0p6b.toml: single-GPU smoke test
  - rl_qwen3_4b.toml: 4 train + 4 infer GPUs, max_steps=200
  - rl_qwen3_30b_a3b.toml: multi-node (1 train + 1 infer, dp=2 / tp=4), max_steps=400
- Bump deps/research-environments to c752781 (was origin/main at time of write; main has since advanced, so re-bump before merge if a refresh is wanted). Env version bumps:
  - ddbc 0.1.1 → 0.1.2
  - ddbc_rlm 0.1.5 → 0.1.6
  - deepdive 0.2.7 → 0.2.9
  - deepdive_rlm 0.2.11 → 0.2.13
  - general_agent 0.1.0 → 0.1.4
  - opencode_deepdive 0.1.15 → 0.1.16
  - rlm_deepdive 0.2.3 → 0.2.4
  - rlm_swe 0.3.4 → 0.4.2
- Bump the configs/private submodule pointer (now a merge commit 70c3503 that joins the PR's behavior-learning RESULTS writeups with main's rlm5 X-Session-ID header cleanup).
- Skip vf-eval configs (those with a top-level eval list) in tests/unit/test_configs.py::test_load_configs so non-entrypoint configs don't fail validation.
- Document exp/ as the branch prefix for experiment branches in AGENTS.md.
Verification
- uv sync --all-extras rebuilds general-agent==0.1.4; entry point general-agent-solver-rlm resolves and vf.load_environment("general-agent-solver-rlm") returns a ComposableEnv.
- uv run pytest tests/unit/test_configs.py: 106 passed (covers all three new configs/general_agent/*.toml).
Note
Low Risk
Mostly new TOMLs, dependency wiring, and a targeted config-test skip; no changes to core training or auth paths in this diff.
Overview
Wires the general-agent research environment into the repo and adds RL experiment configs for general-agent-solver-rlm on Qwen3 at 0.6B (smoke), 4B, and 30B-A3B scales, all targeting the general-agent-debug W&B project.
Packaging: general-agent is added to the envs extra, uv workspace members/sources, and uv.lock (new editable general-agent==0.1.4; lock also bumps deepdive / opencode-deepdive versions shown in the diff). Dev deps gain pytest-asyncio for async env tests.
Configs: New configs/general_agent/rl_qwen3_{0p6b,4b,30b_a3b}.toml tune steps, seq length, GPU/deployment layout, orchestrator batch/rollouts, and inference parallelism for each model size.
Tests / docs: tests/unit/test_configs.py skips TOMLs with a top-level eval list (vf-eval, not prime-rl entrypoints). AGENTS.md documents exp/ branch prefix for experiment work.
Reviewed by Cursor Bugbot for commit 278ed64.