Commit bf56be3

Add training results
1 parent dfb0420 commit bf56be3

1 file changed

Lines changed: 11 additions & 0 deletions

README.md

@@ -149,6 +149,17 @@ The Agent-Diff benchmark comprises **224 tasks** across four enterprise services

Per-service assertion-weighted scores (95% Bayesian CrI). No-docs baseline: agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation ablation results in the [paper](https://arxiv.org/abs/2602.11224).

### Training on Agent-Diff

Agent-Diff environments double as training infrastructure. We used the benchmark to generate rollouts and fine-tune models on API tool-calling tasks:

| Method | Model | Base | Trained (eval set) | Delta (relative) |
|---|---|---|---|---|
| RL | [Qwen3-30B-A3B](https://app.primeintellect.ai/training/shared/ww6raxtlj4hduqksmulmcmji) | 0.31 | 0.55 | **+77%** |
| SFT (LoRA) | [Ministral-3-14B](https://huggingface.co/hubertmarek/Ministral-3-14B-Agent-Diff-SFT-LoRA) | 0.28 | 0.35 | **+24%** |

The SFT pipeline keeps only high-reward Devstral rollouts (reward > 0.8), applies command flattening and error-turn removal, and trains a LoRA adapter (rank 64) with train/val splits made at the prompt level. The RL run uses Agent-Diff as a live verifier environment on [Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench).
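The data-preparation steps described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual pipeline: the rollout schema (`prompt_id`, `reward`, `turns`, and a per-turn `is_error` flag), the function names, and the hash-based split are all assumptions made for the example. Only the `reward > 0.8` threshold and the prompt-level split come from the text. Command flattening is omitted because its exact behavior is not specified here.

```python
import hashlib

REWARD_THRESHOLD = 0.8  # from the README: keep rollouts with reward > 0.8


def keep_rollout(rollout):
    """Keep only rollouts whose reward strictly exceeds the threshold."""
    return rollout["reward"] > REWARD_THRESHOLD


def drop_error_turns(rollout):
    """Return a copy of the rollout without turns flagged as errors.

    The `is_error` field is a hypothetical marker; the real rollout
    format may identify error turns differently.
    """
    turns = [t for t in rollout["turns"] if not t.get("is_error", False)]
    return {**rollout, "turns": turns}


def split_for_prompt(prompt_id, val_fraction=0.1):
    """Assign a prompt to 'train' or 'val' deterministically from its ID.

    Hashing the prompt ID (rather than sampling per rollout) keeps every
    rollout of the same prompt in the same split, which is what makes
    the split prompt-level.
    """
    digest = hashlib.sha256(prompt_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "val" if bucket < val_fraction * 10_000 else "train"


def build_sft_splits(rollouts, val_fraction=0.1):
    """Filter by reward, remove error turns, and group rollouts by split."""
    splits = {"train": [], "val": []}
    for rollout in rollouts:
        if not keep_rollout(rollout):
            continue
        cleaned = drop_error_turns(rollout)
        split = split_for_prompt(rollout["prompt_id"], val_fraction)
        splits[split].append(cleaned)
    return splits
```

Splitting by prompt rather than by rollout avoids a leak where the validation set contains other rollouts of prompts the model was already trained on, which would make validation loss look better than it should.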

## Run Agent-Diff Bench

- **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — Run evals or RL training with no setup required
