feat: ironclaw-v2 benchmark suite (85 tasks, 8 categories) #9

Open
ilblackdragon wants to merge 1 commit into main from feat/ironclaw-v2-benchmark

@ilblackdragon
Member

Summary

Categories

| Category | Tasks | What it tests |
| --- | --- | --- |
| 01-thread-lifecycle | 10 | Thread state machine, budget enforcement, compaction, error recovery |
| 02-codeact | 10 | Code execution, FINAL(), sub-agents, error self-correction, safety |
| 03-capabilities | 10 | Capability leases, policy engine, tool restriction, skill chain install |
| 04-missions | 9 | Mission CRUD, cron/event triggers, self-improvement, skill extraction |
| 05-commitments | 14 | Signal detection, commitment lifecycle, delegation, decisions, triage |
| 06-memory | 12 | Save/recall, structured data, fuzzy search, workspace navigation |
| 07-skill-activation | 10 | Missed skill triggers, /command learning, activation keyword improvement |
| 08-expected-behavior | 10 | /expected root cause classification (PROMPT_GAP, WRONG_TOOL_CHOICE, etc.) |

Baseline Result

Qwen 3.5-122B on ironclaw staging 8acdd08:

Pass rate: 29.4% | Avg score: 0.615 | Tasks: 85/85 | Cost: $0.96 | Time: 1241s
  • 25 tasks pass, 56 partial, 4 timeout
  • All assertions hardened: 0% pass rate when LLM is unavailable (no vacuous passes)
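The headline numbers and the "no vacuous passes" rule can be illustrated with a minimal Rust sketch. It is not taken from the harness: the function names `pass_rate_percent` and `response_avoids` are hypothetical, and the second function only shows one plausible way a negative ("must not mention") check can be written so that a missing or empty LLM response fails instead of passing by default.

```rust
/// Percentage of tasks that fully passed (hypothetical helper, not harness code).
fn pass_rate_percent(passed: u32, total: u32) -> f64 {
    if total == 0 {
        // No tasks ran, so report 0% rather than dividing by zero.
        return 0.0;
    }
    100.0 * f64::from(passed) / f64::from(total)
}

/// Hypothetical "must not mention" assertion.
/// A naive version would return true when there is no response at all,
/// which is a vacuous pass. This version requires a non-empty response first.
fn response_avoids(response: Option<&str>, forbidden: &str) -> bool {
    match response {
        Some(text) if !text.trim().is_empty() => !text.contains(forbidden),
        // Missing or blank response: the check fails instead of passing by default.
        _ => false,
    }
}
```

With the counts reported above, 25 passing tasks out of 85 (25 pass + 56 partial + 4 timeout) gives roughly 29.4%, and `response_avoids(None, "...")` returns `false`, which is what drives the pass rate to 0% when the LLM is unavailable.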

Harness Changes

  • Cargo.toml: point the ironclaw dependency at the staging branch (v0.22.0)
  • src/main.rs: SessionConfig::default(), string-based backend check, async create_llm_provider, skip auth with API key
  • src/runner.rs: New AgentConfig/AgentDeps fields for v0.22, owner_id alignment, Arc<ChannelManager>
  • src/channel.rs: Handle new StatusUpdate variants
  • Suppress dead code warnings on unused-but-useful methods
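As an illustration of the "handle new StatusUpdate variants" item, the sketch below shows one common way a channel can stay tolerant of variants added upstream: match the cases it cares about and fall through on everything else. The enum `DemoStatus`, its variants, and `render_status` are all hypothetical stand-ins; the real `StatusUpdate` type in ironclaw is not shown in this PR description and may look quite different.

```rust
/// Hypothetical stand-in for an upstream status enum (not the real ironclaw type).
#[allow(dead_code)]
#[derive(Debug)]
enum DemoStatus {
    Thinking(String),
    ToolStarted { name: String },
    ToolCompleted { name: String, success: bool },
    /// Stands in for a variant introduced by a newer upstream release.
    BudgetWarning { remaining_tokens: u64 },
}

/// Render the variants the benchmark channel cares about.
/// Returns None for anything it does not recognise, so a newly added
/// variant is ignored rather than breaking the match.
fn render_status(update: &DemoStatus) -> Option<String> {
    match update {
        DemoStatus::Thinking(msg) => Some(format!("thinking: {msg}")),
        DemoStatus::ToolStarted { name } => Some(format!("tool started: {name}")),
        DemoStatus::ToolCompleted { name, success } => {
            Some(format!("tool finished: {name} (ok={success})"))
        }
        // Catch-all arm: unknown or uninteresting variants are skipped.
        _ => None,
    }
}
```

The trade-off of a catch-all arm is that the compiler no longer flags newly added variants, so it suits variants the harness can safely ignore; variants that affect scoring would be better matched explicitly.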

Test plan

  • cargo build — zero errors, zero warnings
  • cargo run -- run --suite trajectory --config suites/ironclaw-v2.toml — 85/85 tasks complete
  • Baseline recorded in baselines/ironclaw-v2/qwen3.5-122b-8acdd08/
  • Run with additional models for comparison

🤖 Generated with Claude Code

Comprehensive benchmark dataset for testing ironclaw's next-gen agent
loop (PR #1557) and commitment system (PR #1736). Covers thread
lifecycle, CodeAct execution, capability leases, missions,
commitments, memory operations, skill activation self-learning, and
/expected behavior gap analysis.

Baseline: Qwen 3.5-122B → 29.4% pass rate, 0.615 avg score, $0.96.

Harness changes:
- Update ironclaw dependency to staging branch (v0.22.0)
- Adapt to new AgentConfig/AgentDeps API surface
- Skip interactive NEAR AI auth when NEARAI_API_KEY is set
- Handle new StatusUpdate variants and async create_llm_provider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
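The "skip interactive NEAR AI auth when NEARAI_API_KEY is set" change from the commit message can be sketched as follows. Only the environment variable name comes from the commit message; the helper names are hypothetical, and treating a blank value as "not set" is an assumption of this sketch, not confirmed behaviour of the harness.

```rust
use std::env;

/// True when a usable API key value is present.
/// Assumption: a blank or whitespace-only value counts as "not set".
fn api_key_present(raw: Option<&str>) -> bool {
    raw.map(|value| !value.trim().is_empty()).unwrap_or(false)
}

/// Decide whether the interactive login flow can be skipped,
/// based on the NEARAI_API_KEY environment variable.
fn should_skip_interactive_auth() -> bool {
    api_key_present(env::var("NEARAI_API_KEY").ok().as_deref())
}
```

Splitting the pure check (`api_key_present`) from the environment lookup keeps the decision testable without mutating process-wide environment variables.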