
feat: ironclaw-v2 benchmark suite (85 tasks, 8 categories)#9

Open
ilblackdragon wants to merge 14 commits into main from feat/ironclaw-v2-benchmark

Conversation


@ilblackdragon ilblackdragon commented Mar 30, 2026

Summary

Expands the ironclaw-v2 benchmark suite from 85 to 195 tasks across 10 categories (528 total conversation turns), adds A/B testing infrastructure for comparing ironclaw versions, and enriches trace capture for training-grade data.

New benchmark categories

  • 09-learning-system (60 tasks, 6 subcategories): confidence decay, confidence scoring, dedup/correction, cross-project learning, /learn management interface, false-positive learning loop
  • 10-real-world-workflows (50 tasks, 5 subcategories): financial analysis, blockchain ops, sports analytics, long-context stress tests, data engineering pipelines — all multi-turn (5-8 turns) requiring Python code execution and tool composition

A/B benchmarking across ironclaw versions

  • --ironclaw-rev <ref> flag: patches Cargo.toml, rebuilds against any branch/tag/commit, re-execs
  • build.rs auto-resolves ironclaw git SHA from Cargo.lock → framework_version field
  • scripts/bench-ab.sh for end-to-end two-ref comparison
  • Per-category breakdown in run.json and compare output
  • cfg(ironclaw_engine_v2) detection for cross-branch API compatibility
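The build.rs SHA resolution above can be sketched roughly as follows. This is a minimal std-only illustration, not the actual build.rs; the function name `resolve_ironclaw_sha` and the exact Cargo.lock scanning strategy are assumptions:

```rust
/// Illustrative sketch: scan Cargo.lock text for the `ironclaw` package
/// and pull the commit SHA out of its `source = "git+...#<sha>"` line.
/// (Hypothetical helper; the real build.rs may parse differently.)
fn resolve_ironclaw_sha(lock: &str) -> Option<String> {
    let mut in_ironclaw = false;
    for line in lock.lines() {
        let line = line.trim();
        if line == "[[package]]" {
            // A new package block resets the match state.
            in_ironclaw = false;
        } else if line == "name = \"ironclaw\"" {
            in_ironclaw = true;
        } else if in_ironclaw && line.starts_with("source = ") {
            // e.g. source = "git+https://…?branch=v2-architecture#abc123"
            if let Some((_, sha)) = line.split_once('#') {
                return Some(sha.trim_end_matches('"').to_string());
            }
        }
    }
    None
}
```

The resolved SHA would then be emitted into the `framework_version` field at compile time, which is why the field is no longer empty.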

Post-mortem failure analysis

  • CLI: nearai-bench post-mortem <run-id> — analyzes failures by stage (timeout/setup/execution/scoring), per-category failure rates, recommendations
  • Runner hook: --auto-post-mortem flag auto-generates post_mortem.json after runs with failures
  • Cached as post_mortem.json alongside run.json and tasks.jsonl
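The stage bucketing described above can be sketched as a simple decision chain. The enum and function names here are illustrative, not the actual bridge-layer types:

```rust
/// The four failure stages reported by `nearai-bench post-mortem`.
#[derive(Debug, PartialEq)]
enum FailureStage { Timeout, Setup, Execution, Scoring }

/// Hypothetical sketch of bucketing a failed task by where it died.
/// The real bridge layer works from TaskResult; these booleans are a
/// simplification for illustration.
fn classify_failure(timed_out: bool, setup_ok: bool, ran_to_completion: bool) -> FailureStage {
    if timed_out {
        FailureStage::Timeout
    } else if !setup_ok {
        FailureStage::Setup
    } else if !ran_to_completion {
        FailureStage::Execution
    } else {
        // Completed but scored below the pass threshold.
        FailureStage::Scoring
    }
}
```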

Training-grade trace capture

  • Per-LLM-call detail: individual token counts, duration, tool call flag
  • Tool call inputs/outputs (parameters + result previews, truncated to 2KB)
  • Full conversation history (all user/assistant turns)
  • System prompt / identity content
  • All new fields backward-compatible via #[serde(default)]
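The 2 KB result-preview truncation mentioned above amounts to a byte-bounded cut that must not split a UTF-8 character. A minimal sketch, assuming a hypothetical `preview` helper (not the actual trace-capture code):

```rust
/// Truncate a tool result to at most `max` bytes without splitting a
/// UTF-8 character, appending a marker when truncation occurred.
/// (Illustrative; the real capture code may differ.)
fn preview(result: &str, max: usize) -> String {
    if result.len() <= max {
        return result.to_string();
    }
    // Walk back to the nearest character boundary at or below `max`.
    let mut end = max;
    while !result.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}…[truncated]", &result[..end])
}
```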

Other improvements

  • Isolated per-task working directory (agent files don't pollute repo root)
  • Tags persisted on TaskResult for category aggregation
  • Fixed pre-existing test failures from ironclaw API changes
  • ironclaw_safety as direct dependency for v2-architecture compatibility
  • Zero warnings from cargo build, cargo clippy, cargo fmt

Benchmark run included

  • Run d01ec670: 195 tasks, Qwen 3.5-122B, v2-architecture branch
  • 24.6% pass rate, 0.592 avg score, $4.74 cost, 86min
  • Results + post-mortem in results/ironclaw/d01ec670-*/

Test plan

  • cargo test — 83 passed, 0 failed
  • cargo clippy — zero warnings
  • cargo fmt -- --check — clean
  • Full benchmark run (195 tasks) completed successfully
  • nearai-bench results <id> displays per-task and per-category tables
  • nearai-bench post-mortem <id> generates failure analysis
  • --ironclaw-rev v2-architecture builds and runs against different branch
  • Verify site build (cd site && npm ci && npm run build)

🤖 Generated with Claude Code

ilblackdragon and others added 4 commits March 29, 2026 22:24
Comprehensive benchmark dataset for testing ironclaw's next-gen agent
loop (PR #1557) and commitment system (PR #1736). Covers thread
lifecycle, CodeAct execution, capability leases, missions,
commitments, memory operations, skill activation self-learning, and
/expected behavior gap analysis.

Baseline: Qwen 3.5-122B → 29.4% pass rate, 0.615 avg score, $0.96.

Harness changes:
- Update ironclaw dependency to staging branch (v0.22.0)
- Adapt to new AgentConfig/AgentDeps API surface
- Skip interactive NEAR AI auth when NEARAI_API_KEY is set
- Handle new StatusUpdate variants and async create_llm_provider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New 09-learning-system category testing learning capabilities:
- confidence-decay (10): staleness detection, decay-aware recall
- confidence-scoring (10): 1-10 scores, adjustment, thresholds
- dedup-correction (10): key supersession, no ghost entries
- cross-project (10): shared learnings with consent/isolation
- learn-management (10): search, prune, export, stats
- fp-learning-loop (10): FP tracking, dismissal, re-evaluation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- build.rs extracts ironclaw git SHA from Cargo.lock at compile time
- Auto-populate framework_version with resolved SHA (was always empty)
- --ironclaw-rev flag: patches Cargo.toml, rebuilds, re-execs with new binary
- Per-category breakdown in run.json (pass_rate, avg_score, cost, time per tag)
- Tags persisted on TaskResult for category aggregation
- Compare output shows per-category deltas and framework versions
- scripts/bench-ab.sh for end-to-end two-ref comparison
- Fix pre-existing test failures from ironclaw API changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New 10-real-world-workflows category with complex, multi-turn scenarios
requiring Python code execution and tool composition:

- financial-analysis (10): earnings analysis, portfolio rebalance, stock
  screening, revenue forecasting, expense anomaly detection, options
  payoff, invoice reconciliation, tax optimization
- blockchain-ops (10): tx parsing, wallet tracking, DeFi yields, gas
  optimization, NFT analysis, contract audit, bridge monitoring, MEV
  detection, DAO treasury
- sports-analytics (10): player trends, fantasy draft, game prediction,
  salary cap, injury impact, trade values, season simulation, draft
  picks, betting odds, coaching strategy
- long-context-stress (10): massive log triage, multi-doc synthesis,
  codebase review marathon (8 turns), meeting consolidation, compliance
  audit, incident postmortem, contract review, KB cleanup, resume
  screening with bias resistance, context overflow recovery
- data-engineering (10): CSV insights, API ETL, log parsing, data quality,
  time series, schema migration, fuzzy multi-source join, report
  generation, PII anonymization, streaming data simulator

Total ironclaw-v2 suite: 195 tasks, 528 turns across 10 categories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon and others added 10 commits March 30, 2026 23:01
- Add ironclaw_safety as direct dependency (SafetyLayer moved to
  separate crate in v2-architecture)
- Update imports: ironclaw::safety::SafetyLayer -> ironclaw_safety::SafetyLayer
- Patch both ironclaw and ironclaw_safety deps when using --ironclaw-rev
- Switch --ironclaw-rev to debug builds to avoid OOM on constrained machines
- Add .cargo/config.toml with jobs=2 and split-debuginfo to reduce memory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix 10 tasks with response_matches as array (must be a string)
- Fix duplicate response_matches key in data-quality-report.json
- build.rs detects engine_v2 field in AgentConfig via cargo git cache
- Sets cfg(ironclaw_engine_v2) so runner.rs conditionally includes it
- Register expected cfg in Cargo.toml lints to suppress warnings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clone state before move in set_state() logging
- Replace BenchError::Other (nonexistent) with BenchError::Config
- Fix Vec<str> type error in extract_error_patterns
- Prefix unused parameter with underscore
- Use explicit PathBuf type annotation for directory entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Suppress dead_code warnings on post_mortem_mission module
- Fix unused variable (_deployment_id)
- Apply clippy suggestion: map_or -> is_none_or
- Run cargo fmt across all source files

Build, clippy, and fmt are now all clean with zero warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CLI: `nearai-bench post-mortem <run-id>` analyzes failures from a
completed benchmark run. Outputs failure stage breakdown (scoring vs
timeout vs setup), per-category failure rates, and recommendations.
Supports --format json and --force for re-analysis. Results cached
as post_mortem.json alongside run.json.

Runner: `--auto-post-mortem` flag (or `auto_post_mortem = true` in
TOML config) auto-generates post_mortem.json after any run with
failures and prints the summary.

Bridge layer converts TaskResult failures into the PostMortemMission
event model, categorizing by failure stage (timeout, setup, execution,
scoring) and aggregating root causes across all failing tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enrich TaskResult with full data needed for analysis and training:

- Per-LLM-call detail: input/output tokens, duration, had_tool_calls
  (was only aggregate totals before)
- Tool call inputs/outputs: capture parameters and result previews
  (was only name + success boolean)
- Full conversation history: all user/assistant turns in order
  (was only final response)
- System prompt field: identity/SOUL.md content used for the task

All new fields use #[serde(default, skip_serializing_if)] for
backward compatibility with existing results.

Channel now tracks tool I/O flow: ToolStarted -> ToolResult (preview)
-> ToolCompleted (parameters/error), assembling the full picture.
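An illustrative reconstruction of that event assembly, folding the three channel events for one tool call into a single trace record. The event and field names here are assumptions, not the actual channel types:

```rust
/// Assembled record for one tool call (hypothetical shape).
#[derive(Default, Debug, PartialEq)]
struct ToolTrace {
    name: String,
    parameters: String,
    result_preview: String,
    error: Option<String>,
}

/// Simplified stand-ins for the channel events.
enum ToolEvent {
    Started { name: String },
    Result { preview: String },
    Completed { parameters: String, error: Option<String> },
}

/// Fold Started -> Result -> Completed into one trace record.
fn assemble(events: Vec<ToolEvent>) -> ToolTrace {
    let mut trace = ToolTrace::default();
    for ev in events {
        match ev {
            ToolEvent::Started { name } => trace.name = name,
            ToolEvent::Result { preview } => trace.result_preview = preview,
            ToolEvent::Completed { parameters, error } => {
                trace.parameters = parameters;
                trace.error = error;
            }
        }
    }
    trace
}
```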

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each task now runs with its own temp directory as the shell's working
directory. Agent-created files (Python scripts, data outputs, etc.)
land in the temp dir instead of polluting the repo root.

The runner creates a tempdir per task and registers a ShellTool with
with_working_dir() pointing to it, replacing the adapter's default
shell. The temp dir is cleaned up automatically when the task completes.
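The lifecycle above can be sketched with std alone: create a scratch directory, hand it to the task as its working directory, and remove it afterwards. `with_task_dir` is a stand-in for the real runner + ShellTool wiring, and the directory naming is hypothetical:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Run a task body inside its own scratch directory, cleaning up after.
/// (Sketch only; the real runner registers a ShellTool via
/// with_working_dir() rather than passing a closure.)
fn with_task_dir<T>(task_id: &str, run: impl FnOnce(&Path) -> T) -> std::io::Result<T> {
    let dir: PathBuf = std::env::temp_dir().join(format!("nearai-bench-{task_id}"));
    fs::create_dir_all(&dir)?;
    // Agent-created files land here instead of the repo root.
    let out = run(&dir);
    // Cleaned up when the task completes (a panic in `run` would leak it;
    // the real runner's tempdir handle guards against that).
    fs::remove_dir_all(&dir)?;
    Ok(out)
}
```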

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run d01ec670: 24.6% pass rate, 0.592 avg score, $4.74 cost, 86min
Includes post_mortem.json with failure analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The codebase-review-marathon task had a well-known Stripe test key
(sk_live_4eC39HqLyjWD...) in its planted vulnerability data. Replaced
with an obviously fake key that won't trigger GitHub's secret scanner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>