
feat: ironclaw-v2 benchmark suite (85 tasks, 8 categories)#9

Open
ilblackdragon wants to merge 14 commits into main from feat/ironclaw-v2-benchmark

Conversation


@ilblackdragon ilblackdragon commented Mar 30, 2026

Summary

Expands the ironclaw-v2 benchmark suite from 85 to 195 tasks across 10 categories (528 total conversation turns), adds A/B testing infrastructure for comparing ironclaw versions, and enriches trace capture for training-grade data.

New benchmark categories

  • 09-learning-system (60 tasks, 6 subcategories): confidence decay, confidence scoring, dedup/correction, cross-project learning, /learn management interface, false-positive learning loop
  • 10-real-world-workflows (50 tasks, 5 subcategories): financial analysis, blockchain ops, sports analytics, long-context stress tests, data engineering pipelines — all multi-turn (5-8 turns) requiring Python code execution and tool composition

A/B benchmarking across ironclaw versions

  • --ironclaw-rev <ref> flag: patches Cargo.toml, rebuilds against any branch/tag/commit, re-execs
  • build.rs auto-resolves ironclaw git SHA from Cargo.lock → framework_version field
  • scripts/bench-ab.sh for end-to-end two-ref comparison
  • Per-category breakdown in run.json and compare output
  • cfg(ironclaw_engine_v2) detection for cross-branch API compatibility
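The build.rs SHA resolution above can be sketched roughly as follows. This is a minimal std-only illustration, not the actual build.rs; the function name `resolve_ironclaw_sha` and the exact Cargo.lock scanning strategy are assumptions:

```rust
/// Illustrative sketch: scan Cargo.lock text for the `ironclaw` package
/// and pull the commit SHA out of its `source = "git+...#<sha>"` line.
/// (Hypothetical helper; the real build.rs may parse differently.)
fn resolve_ironclaw_sha(lock: &str) -> Option<String> {
    let mut in_ironclaw = false;
    for line in lock.lines() {
        let line = line.trim();
        if line == "[[package]]" {
            // A new package block resets the match state.
            in_ironclaw = false;
        } else if line == "name = \"ironclaw\"" {
            in_ironclaw = true;
        } else if in_ironclaw && line.starts_with("source = ") {
            // e.g. source = "git+https://…?branch=v2-architecture#abc123"
            if let Some((_, sha)) = line.split_once('#') {
                return Some(sha.trim_end_matches('"').to_string());
            }
        }
    }
    None
}
```

The resolved SHA would then be emitted into the `framework_version` field at compile time, which is why the field is no longer empty.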

Post-mortem failure analysis

  • CLI: nearai-bench post-mortem <run-id> — analyzes failures by stage (timeout/setup/execution/scoring), per-category failure rates, recommendations
  • Runner hook: --auto-post-mortem flag auto-generates post_mortem.json after runs with failures
  • Cached as post_mortem.json alongside run.json and tasks.jsonl
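The stage bucketing described above can be sketched as a simple decision chain. The enum and function names here are illustrative, not the actual bridge-layer types:

```rust
/// The four failure stages reported by `nearai-bench post-mortem`.
#[derive(Debug, PartialEq)]
enum FailureStage { Timeout, Setup, Execution, Scoring }

/// Hypothetical sketch of bucketing a failed task by where it died.
/// The real bridge layer works from TaskResult; these booleans are a
/// simplification for illustration.
fn classify_failure(timed_out: bool, setup_ok: bool, ran_to_completion: bool) -> FailureStage {
    if timed_out {
        FailureStage::Timeout
    } else if !setup_ok {
        FailureStage::Setup
    } else if !ran_to_completion {
        FailureStage::Execution
    } else {
        // Completed but scored below the pass threshold.
        FailureStage::Scoring
    }
}
```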

Training-grade trace capture

  • Per-LLM-call detail: individual token counts, duration, tool call flag
  • Tool call inputs/outputs (parameters + result previews, truncated to 2KB)
  • Full conversation history (all user/assistant turns)
  • System prompt / identity content
  • All new fields backward-compatible via #[serde(default)]
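The 2 KB result-preview truncation mentioned above amounts to a byte-bounded cut that must not split a UTF-8 character. A minimal sketch, assuming a hypothetical `preview` helper (not the actual trace-capture code):

```rust
/// Truncate a tool result to at most `max` bytes without splitting a
/// UTF-8 character, appending a marker when truncation occurred.
/// (Illustrative; the real capture code may differ.)
fn preview(result: &str, max: usize) -> String {
    if result.len() <= max {
        return result.to_string();
    }
    // Walk back to the nearest character boundary at or below `max`.
    let mut end = max;
    while !result.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}…[truncated]", &result[..end])
}
```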

Other improvements

  • Isolated per-task working directory (agent files don't pollute repo root)
  • Tags persisted on TaskResult for category aggregation
  • Fixed pre-existing test failures from ironclaw API changes
  • ironclaw_safety as direct dependency for v2-architecture compatibility
  • Zero warnings from cargo build, cargo clippy, cargo fmt

Benchmark run included

  • Run d01ec670: 195 tasks, Qwen 3.5-122B, v2-architecture branch
  • 24.6% pass rate, 0.592 avg score, $4.74 cost, 86min
  • Results + post-mortem in results/ironclaw/d01ec670-*/

Test plan

  • cargo test — 83 passed, 0 failed
  • cargo clippy — zero warnings
  • cargo fmt -- --check — clean
  • Full benchmark run (195 tasks) completed successfully
  • nearai-bench results <id> displays per-task and per-category tables
  • nearai-bench post-mortem <id> generates failure analysis
  • --ironclaw-rev v2-architecture builds and runs against different branch
  • Verify site build (cd site && npm ci && npm run build)

🤖 Generated with Claude Code

ilblackdragon and others added 4 commits March 29, 2026 22:24
Comprehensive benchmark dataset for testing ironclaw's next-gen agent
loop (PR #1557) and commitment system (PR #1736). Covers thread
lifecycle, CodeAct execution, capability leases, missions,
commitments, memory operations, skill activation self-learning, and
/expected behavior gap analysis.

Baseline: Qwen 3.5-122B → 29.4% pass rate, 0.615 avg score, $0.96.

Harness changes:
- Update ironclaw dependency to staging branch (v0.22.0)
- Adapt to new AgentConfig/AgentDeps API surface
- Skip interactive NEAR AI auth when NEARAI_API_KEY is set
- Handle new StatusUpdate variants and async create_llm_provider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New 09-learning-system category testing learning capabilities:
- confidence-decay (10): staleness detection, decay-aware recall
- confidence-scoring (10): 1-10 scores, adjustment, thresholds
- dedup-correction (10): key supersession, no ghost entries
- cross-project (10): shared learnings with consent/isolation
- learn-management (10): search, prune, export, stats
- fp-learning-loop (10): FP tracking, dismissal, re-evaluation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- build.rs extracts ironclaw git SHA from Cargo.lock at compile time
- Auto-populate framework_version with resolved SHA (was always empty)
- --ironclaw-rev flag: patches Cargo.toml, rebuilds, re-execs with new binary
- Per-category breakdown in run.json (pass_rate, avg_score, cost, time per tag)
- Tags persisted on TaskResult for category aggregation
- Compare output shows per-category deltas and framework versions
- scripts/bench-ab.sh for end-to-end two-ref comparison
- Fix pre-existing test failures from ironclaw API changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New 10-real-world-workflows category with complex, multi-turn scenarios
requiring Python code execution and tool composition:

- financial-analysis (10): earnings analysis, portfolio rebalance, stock
  screening, revenue forecasting, expense anomaly detection, options
  payoff, invoice reconciliation, tax optimization
- blockchain-ops (10): tx parsing, wallet tracking, DeFi yields, gas
  optimization, NFT analysis, contract audit, bridge monitoring, MEV
  detection, DAO treasury
- sports-analytics (10): player trends, fantasy draft, game prediction,
  salary cap, injury impact, trade values, season simulation, draft
  picks, betting odds, coaching strategy
- long-context-stress (10): massive log triage, multi-doc synthesis,
  codebase review marathon (8 turns), meeting consolidation, compliance
  audit, incident postmortem, contract review, KB cleanup, resume
  screening with bias resistance, context overflow recovery
- data-engineering (10): CSV insights, API ETL, log parsing, data quality,
  time series, schema migration, fuzzy multi-source join, report
  generation, PII anonymization, streaming data simulator

Total ironclaw-v2 suite: 195 tasks, 528 turns across 10 categories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon and others added 10 commits March 30, 2026 23:01
- Add ironclaw_safety as direct dependency (SafetyLayer moved to
  separate crate in v2-architecture)
- Update imports: ironclaw::safety::SafetyLayer -> ironclaw_safety::SafetyLayer
- Patch both ironclaw and ironclaw_safety deps when using --ironclaw-rev
- Switch --ironclaw-rev to debug builds to avoid OOM on constrained machines
- Add .cargo/config.toml with jobs=2 and split-debuginfo to reduce memory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix 10 tasks with response_matches as array (must be a string)
- Fix duplicate response_matches key in data-quality-report.json
- build.rs detects engine_v2 field in AgentConfig via cargo git cache
- Sets cfg(ironclaw_engine_v2) so runner.rs conditionally includes it
- Register expected cfg in Cargo.toml lints to suppress warnings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clone state before move in set_state() logging
- Replace BenchError::Other (nonexistent) with BenchError::Config
- Fix Vec<str> type error in extract_error_patterns
- Prefix unused parameter with underscore
- Use explicit PathBuf type annotation for directory entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Suppress dead_code warnings on post_mortem_mission module
- Fix unused variable (_deployment_id)
- Apply clippy suggestion: map_or -> is_none_or
- Run cargo fmt across all source files

Build, clippy, and fmt are now all clean with zero warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CLI: `nearai-bench post-mortem <run-id>` analyzes failures from a
completed benchmark run. Outputs failure stage breakdown (scoring vs
timeout vs setup), per-category failure rates, and recommendations.
Supports --format json and --force for re-analysis. Results cached
as post_mortem.json alongside run.json.

Runner: `--auto-post-mortem` flag (or `auto_post_mortem = true` in
TOML config) auto-generates post_mortem.json after any run with
failures and prints the summary.

Bridge layer converts TaskResult failures into the PostMortemMission
event model, categorizing by failure stage (timeout, setup, execution,
scoring) and aggregating root causes across all failing tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enrich TaskResult with full data needed for analysis and training:

- Per-LLM-call detail: input/output tokens, duration, had_tool_calls
  (was only aggregate totals before)
- Tool call inputs/outputs: capture parameters and result previews
  (was only name + success boolean)
- Full conversation history: all user/assistant turns in order
  (was only final response)
- System prompt field: identity/SOUL.md content used for the task

All new fields use #[serde(default, skip_serializing_if)] for
backward compatibility with existing results.

Channel now tracks tool I/O flow: ToolStarted -> ToolResult (preview)
-> ToolCompleted (parameters/error), assembling the full picture.
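An illustrative reconstruction of that event assembly, folding the three channel events for one tool call into a single trace record. The event and field names here are assumptions, not the actual channel types:

```rust
/// Assembled record for one tool call (hypothetical shape).
#[derive(Default, Debug, PartialEq)]
struct ToolTrace {
    name: String,
    parameters: String,
    result_preview: String,
    error: Option<String>,
}

/// Simplified stand-ins for the channel events.
enum ToolEvent {
    Started { name: String },
    Result { preview: String },
    Completed { parameters: String, error: Option<String> },
}

/// Fold Started -> Result -> Completed into one trace record.
fn assemble(events: Vec<ToolEvent>) -> ToolTrace {
    let mut trace = ToolTrace::default();
    for ev in events {
        match ev {
            ToolEvent::Started { name } => trace.name = name,
            ToolEvent::Result { preview } => trace.result_preview = preview,
            ToolEvent::Completed { parameters, error } => {
                trace.parameters = parameters;
                trace.error = error;
            }
        }
    }
    trace
}
```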

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each task now runs with its own temp directory as the shell's working
directory. Agent-created files (Python scripts, data outputs, etc.)
land in the temp dir instead of polluting the repo root.

The runner creates a tempdir per task and registers a ShellTool with
with_working_dir() pointing to it, replacing the adapter's default
shell. The temp dir is cleaned up automatically when the task completes.
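The lifecycle above can be sketched with std alone: create a scratch directory, hand it to the task as its working directory, and remove it afterwards. `with_task_dir` is a stand-in for the real runner + ShellTool wiring, and the directory naming is hypothetical:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Run a task body inside its own scratch directory, cleaning up after.
/// (Sketch only; the real runner registers a ShellTool via
/// with_working_dir() rather than passing a closure.)
fn with_task_dir<T>(task_id: &str, run: impl FnOnce(&Path) -> T) -> std::io::Result<T> {
    let dir: PathBuf = std::env::temp_dir().join(format!("nearai-bench-{task_id}"));
    fs::create_dir_all(&dir)?;
    // Agent-created files land here instead of the repo root.
    let out = run(&dir);
    // Cleaned up when the task completes (a panic in `run` would leak it;
    // the real runner's tempdir handle guards against that).
    fs::remove_dir_all(&dir)?;
    Ok(out)
}
```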

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run d01ec670: 24.6% pass rate, 0.592 avg score, $4.74 cost, 86min
Includes post_mortem.json with failure analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The codebase-review-marathon task had a well-known Stripe test key
(sk_live_4eC39HqLyjWD...) in its planted vulnerability data. Replaced
with an obviously fake key that won't trigger GitHub's secret scanner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>