feat: ironclaw-v2 benchmark suite (85 tasks, 8 categories) by ilblackdragon · Pull Request #9 · nearai/benchmarks

ilblackdragon · 2026-03-30T05:24:33Z

Summary

Add ironclaw-v2 benchmark dataset: 85 tasks across 150 turns in 8 categories. It tests the next-gen agent loop from ironclaw#1557 (feat(engine): Unified Thread-Capability-CodeAct execution engine (v2 architecture)) and the commitment system from ironclaw#1736 (feat(skills): commitments system - active intake for personal AI assistant)
Update ironclaw dependency to staging branch (v0.22.0) and adapt harness to new API surface
Fix NEAR AI auth: skip interactive login when NEARAI_API_KEY env var is set
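
The auth fix in the last item can be pictured roughly as follows. This is a minimal sketch, not the code in src/main.rs: the `AuthMode` enum and both function names are hypothetical, and treating a blank `NEARAI_API_KEY` as unset is an assumption made for the example.

```rust
use std::env;

/// Hypothetical type for illustration: how the harness should authenticate.
#[derive(Debug, PartialEq)]
enum AuthMode {
    /// Use the key from the environment and skip the interactive login.
    ApiKey(String),
    /// No usable key was found, so fall back to the interactive login flow.
    InteractiveLogin,
}

/// Decide the auth mode from an optional key value.
/// Assumption: a blank key is treated the same as a missing one.
fn choose_auth_mode(api_key: Option<String>) -> AuthMode {
    match api_key {
        Some(key) if !key.trim().is_empty() => AuthMode::ApiKey(key),
        _ => AuthMode::InteractiveLogin,
    }
}

/// Read NEARAI_API_KEY from the process environment and decide the auth mode.
fn auth_mode_from_env() -> AuthMode {
    choose_auth_mode(env::var("NEARAI_API_KEY").ok())
}
```

Keeping the decision in a function that takes an `Option<String>` lets it be checked without mutating the process environment.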

Categories

Category	Tasks	What it tests
01-thread-lifecycle	10	Thread state machine, budget enforcement, compaction, error recovery
02-codeact	10	Code execution, FINAL(), sub-agents, error self-correction, safety
03-capabilities	10	Capability leases, policy engine, tool restriction, skill chain install
04-missions	9	Mission CRUD, cron/event triggers, self-improvement, skill extraction
05-commitments	14	Signal detection, commitment lifecycle, delegation, decisions, triage
06-memory	12	Save/recall, structured data, fuzzy search, workspace navigation
07-skill-activation	10	Missed skill triggers, /command learning, activation keyword improvement
08-expected-behavior	10	/expected root cause classification (PROMPT_GAP, WRONG_TOOL_CHOICE, etc.)
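
As a quick consistency check, the per-category counts in the table can be summed and compared against the headline figure of 85 tasks. The sketch below only restates the table; the `CATEGORIES` constant and `total_tasks` function are names invented for the example, not part of the harness.

```rust
/// Category names and task counts copied from the table above.
const CATEGORIES: [(&str, u32); 8] = [
    ("01-thread-lifecycle", 10),
    ("02-codeact", 10),
    ("03-capabilities", 10),
    ("04-missions", 9),
    ("05-commitments", 14),
    ("06-memory", 12),
    ("07-skill-activation", 10),
    ("08-expected-behavior", 10),
];

/// Sum the task counts across all categories.
fn total_tasks() -> u32 {
    CATEGORIES.iter().map(|(_, count)| count).sum()
}
```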

Baseline Result

Qwen 3.5-122B on ironclaw staging 8acdd08:

Pass rate: 29.4% | Avg score: 0.615 | Tasks: 85/85 | Cost: $0.96 | Time: 1241s

25 tasks pass, 56 partial, 4 timeout
All assertions hardened: 0% pass rate when LLM is unavailable (no vacuous passes)
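
The figures above fit together as 25 + 56 + 4 = 85 tasks and 25 / 85 ≈ 29.4%. The sketch below reproduces that arithmetic and illustrates one way a negative assertion can be hardened so that a missing response fails instead of passing vacuously. The `Outcome` enum and both functions are hypothetical, and the hardening shown is an assumption about what "no vacuous passes" means, not the suite's actual assertion code.

```rust
/// Hypothetical per-task outcome, mirroring the pass / partial / timeout split.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Outcome {
    Pass,
    Partial,
    Timeout,
}

/// Percentage of tasks that fully passed; an empty run counts as 0%.
fn pass_rate_percent(outcomes: &[Outcome]) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let passed = outcomes.iter().filter(|o| **o == Outcome::Pass).count();
    100.0 * passed as f64 / outcomes.len() as f64
}

/// A "response must not contain X" check that fails when there is no response.
/// Without the guard, a missing or blank response would trivially satisfy
/// the check, which is the vacuous pass being ruled out here.
fn response_avoids(response: Option<&str>, forbidden: &str) -> bool {
    match response {
        Some(text) if !text.trim().is_empty() => !text.contains(forbidden),
        _ => false,
    }
}
```

With this shape, a run where the LLM never answers scores 0% on such checks, which matches the behavior described above.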

Harness Changes

Cargo.toml: ironclaw → staging branch
src/main.rs: SessionConfig::default(), string-based backend check, async create_llm_provider, skip auth with API key
src/runner.rs: New AgentConfig/AgentDeps fields for v0.22, owner_id alignment, Arc<ChannelManager>
src/channel.rs: Handle new StatusUpdate variants
Suppress dead code warnings on unused-but-useful methods
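
The last two items follow a common Rust pattern, sketched below under stated assumptions. The `StatusUpdate` enum here is a hypothetical stand-in, since the PR does not list the real ironclaw variants, and `describe` is an invented helper. The catch-all arm keeps the match compiling when the upstream enum grows, and `#[allow(dead_code)]` is the standard attribute for silencing warnings on items that are intentionally unused.

```rust
/// Hypothetical stand-in for ironclaw's StatusUpdate; the real variants
/// are not listed in this PR.
#[allow(dead_code)] // some variants are kept for completeness but never built
#[derive(Debug)]
enum StatusUpdate {
    Thinking(String),
    ToolStarted { name: String },
    Compacted,
}

/// Render the updates the harness cares about; ignore everything else.
fn describe(update: &StatusUpdate) -> Option<String> {
    match update {
        StatusUpdate::Thinking(msg) => Some(format!("thinking: {msg}")),
        StatusUpdate::ToolStarted { name } => Some(format!("tool started: {name}")),
        // Catch-all arm: variants added upstream are ignored instead of
        // breaking the build.
        _ => None,
    }
}
```

The trade-off of a catch-all arm is that new variants are silently ignored, so it suits a benchmark harness that only needs a subset of status events.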

Test plan

cargo build — zero errors, zero warnings
cargo run -- run --suite trajectory --config suites/ironclaw-v2.toml — 85/85 tasks complete
Baseline recorded in baselines/ironclaw-v2/qwen3.5-122b-8acdd08/
Run with additional models for comparison

🤖 Generated with Claude Code

Comprehensive benchmark dataset for testing ironclaw's next-gen agent loop (PR #1557) and commitment system (PR #1736). Covers thread lifecycle, CodeAct execution, capability leases, missions, commitments, memory operations, skill activation self-learning, and /expected behavior gap analysis.

Baseline: Qwen 3.5-122B → 29.4% pass rate, 0.615 avg score, $0.96.

Harness changes:
- Update ironclaw dependency to staging branch (v0.22.0)
- Adapt to new AgentConfig/AgentDeps API surface
- Skip interactive NEAR AI auth when NEARAI_API_KEY is set
- Handle new StatusUpdate variants and async create_llm_provider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ilblackdragon wants to merge 1 commit into main from feat/ironclaw-v2-benchmark
