feat: ironclaw-v2 benchmark suite (85 tasks, 8 categories) #9
Open
ilblackdragon wants to merge 14 commits into main from
Conversation
Comprehensive benchmark dataset for testing ironclaw's next-gen agent loop (PR #1557) and commitment system (PR #1736). Covers thread lifecycle, CodeAct execution, capability leases, missions, commitments, memory operations, skill activation, self-learning, and /expected behavior gap analysis.

Baseline: Qwen 3.5-122B → 29.4% pass rate, 0.615 avg score, $0.96.

Harness changes:
- Update ironclaw dependency to staging branch (v0.22.0)
- Adapt to new AgentConfig/AgentDeps API surface
- Skip interactive NEAR AI auth when NEARAI_API_KEY is set
- Handle new StatusUpdate variants and async create_llm_provider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New 09-learning-system category testing learning capabilities:
- confidence-decay (10): staleness detection, decay-aware recall
- confidence-scoring (10): 1-10 scores, adjustment, thresholds
- dedup-correction (10): key supersession, no ghost entries
- cross-project (10): shared learnings with consent/isolation
- learn-management (10): search, prune, export, stats
- fp-learning-loop (10): FP tracking, dismissal, re-evaluation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
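The confidence-decay tasks above specify observable behavior only (staleness detection, decay-aware recall); they do not prescribe a formula. As an illustration of the kind of model such tasks exercise, here is a minimal half-life decay sketch. The function name, the half-life parameterization, and the threshold idea in the comment are all assumptions, not ironclaw's actual implementation:

```rust
/// Illustrative half-life decay for a learned fact's 1-10 confidence score.
/// Confidence halves every `half_life_days`; decay-aware recall can then
/// filter out entries that have fallen below a staleness threshold.
fn decayed_confidence(score: f64, age_days: f64, half_life_days: f64) -> f64 {
    score * 0.5_f64.powf(age_days / half_life_days)
}

fn main() {
    // A fresh entry keeps its full score; one half-life later it is halved.
    println!("{}", decayed_confidence(8.0, 0.0, 30.0));
    println!("{}", decayed_confidence(8.0, 30.0, 30.0));
}
```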
- build.rs extracts ironclaw git SHA from Cargo.lock at compile time
- Auto-populate framework_version with resolved SHA (was always empty)
- --ironclaw-rev flag: patches Cargo.toml, rebuilds, re-execs with new binary
- Per-category breakdown in run.json (pass_rate, avg_score, cost, time per tag)
- Tags persisted on TaskResult for category aggregation
- Compare output shows per-category deltas and framework versions
- scripts/bench-ab.sh for end-to-end two-ref comparison
- Fix pre-existing test failures from ironclaw API changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
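The SHA extraction in the first bullet relies on the fact that git dependencies in Cargo.lock carry their resolved commit as a URL fragment (`source = "git+https://…#<sha>"`). A minimal sketch of that parse, assuming a standard Cargo.lock layout (the actual build.rs is not shown in this PR, and the example lock content below is invented):

```rust
// Scan Cargo.lock for the [[package]] block named "ironclaw" and return the
// commit SHA from its `source` line's URL fragment.
fn ironclaw_sha(lock: &str) -> Option<String> {
    let mut in_ironclaw = false;
    for line in lock.lines() {
        let line = line.trim();
        if line == "[[package]]" {
            in_ironclaw = false; // new package block: reset the match
        }
        if line == r#"name = "ironclaw""# {
            in_ironclaw = true;
        }
        if in_ironclaw && line.starts_with("source = ") {
            // Everything after the last '#' is the resolved commit SHA.
            return line
                .rsplit('#')
                .next()
                .map(|s| s.trim_end_matches('"').to_string());
        }
    }
    None
}

fn main() {
    let lock = r#"
[[package]]
name = "ironclaw"
version = "0.22.0"
source = "git+https://github.com/example/ironclaw?branch=staging#abc1234def"
"#;
    println!("{:?}", ironclaw_sha(lock));
}
```

In a build script, the result would typically be exported via `println!("cargo:rustc-env=…")` so the runner can embed it in `framework_version`.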
New 10-real-world-workflows category with complex, multi-turn scenarios requiring Python code execution and tool composition:
- financial-analysis (10): earnings analysis, portfolio rebalance, stock screening, revenue forecasting, expense anomaly detection, options payoff, invoice reconciliation, tax optimization
- blockchain-ops (10): tx parsing, wallet tracking, DeFi yields, gas optimization, NFT analysis, contract audit, bridge monitoring, MEV detection, DAO treasury
- sports-analytics (10): player trends, fantasy draft, game prediction, salary cap, injury impact, trade values, season simulation, draft picks, betting odds, coaching strategy
- long-context-stress (10): massive log triage, multi-doc synthesis, codebase review marathon (8 turns), meeting consolidation, compliance audit, incident postmortem, contract review, KB cleanup, resume screening with bias resistance, context overflow recovery
- data-engineering (10): CSV insights, API ETL, log parsing, data quality, time series, schema migration, fuzzy multi-source join, report generation, PII anonymization, streaming data simulator

Total ironclaw-v2 suite: 195 tasks, 528 turns across 10 categories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ironclaw_safety as direct dependency (SafetyLayer moved to separate crate in v2-architecture)
- Update imports: ironclaw::safety::SafetyLayer -> ironclaw_safety::SafetyLayer
- Patch both ironclaw and ironclaw_safety deps when using --ironclaw-rev
- Switch --ironclaw-rev to debug builds to avoid OOM on constrained machines
- Add .cargo/config.toml with jobs=2 and split-debuginfo to reduce memory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
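The last bullet's `.cargo/config.toml` might look like the fragment below. Only `jobs = 2` and the use of split-debuginfo are stated in the PR; the `"unpacked"` mode is an assumption (any mode that keeps debug info out of the final link step reduces peak linker memory):

```toml
# Sketch of the memory-reducing cargo config described above.
[build]
jobs = 2                      # limit parallel rustc invocations

[profile.dev]
split-debuginfo = "unpacked"  # assumed mode; keeps debuginfo out of the link
```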
- Fix 10 tasks with response_matches as array (must be string)
- Fix duplicate response_matches key in data-quality-report.json
- build.rs detects engine_v2 field in AgentConfig via cargo git cache
- Sets cfg(ironclaw_engine_v2) so runner.rs conditionally includes it
- Register expected cfg in Cargo.toml lints to suppress warnings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clone state before move in set_state() logging
- Replace BenchError::Other (nonexistent) with BenchError::Config
- Fix Vec<str> type error in extract_error_patterns
- Prefix unused parameter with underscore
- Use explicit PathBuf type annotation for directory entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Suppress dead_code warnings on post_mortem_mission module
- Fix unused variable (_deployment_id)
- Apply clippy suggestion: map_or -> is_none_or
- Run cargo fmt across all source files

Build, clippy, and fmt are now all clean with zero warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CLI: `nearai-bench post-mortem <run-id>` analyzes failures from a completed benchmark run. Outputs failure stage breakdown (scoring vs timeout vs setup), per-category failure rates, and recommendations. Supports --format json and --force for re-analysis. Results cached as post_mortem.json alongside run.json.

Runner: `--auto-post-mortem` flag (or `auto_post_mortem = true` in TOML config) auto-generates post_mortem.json after any run with failures and prints the summary.

Bridge layer converts TaskResult failures into the PostMortemMission event model, categorizing by failure stage (timeout, setup, execution, scoring) and aggregating root causes across all failing tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
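The stage bucketing the bridge layer performs can be sketched as a small classifier over the four stages named above. The enum variants mirror the PR's stages; the string heuristics are illustrative assumptions, not the bridge's actual matching rules:

```rust
/// Failure stages mirrored from the post-mortem description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailureStage {
    Timeout,
    Setup,
    Execution,
    Scoring,
}

/// Bucket a task's error text into a stage. The keyword checks here are
/// hypothetical; a real bridge would classify from structured result data.
fn classify(error: &str) -> FailureStage {
    let e = error.to_lowercase();
    if e.contains("timed out") || e.contains("timeout") {
        FailureStage::Timeout
    } else if e.contains("setup") {
        FailureStage::Setup
    } else if e.contains("score") || e.contains("response_matches") {
        FailureStage::Scoring
    } else {
        FailureStage::Execution
    }
}

fn main() {
    println!("{:?}", classify("task timed out after 300s"));
    println!("{:?}", classify("response_matches did not match output"));
}
```

Aggregating `classify` results per category tag yields the per-category failure rates the CLI reports.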
Enrich TaskResult with full data needed for analysis and training:
- Per-LLM-call detail: input/output tokens, duration, had_tool_calls (was only aggregate totals before)
- Tool call inputs/outputs: capture parameters and result previews (was only name + success boolean)
- Full conversation history: all user/assistant turns in order (was only final response)
- System prompt field: identity/SOUL.md content used for the task

All new fields use #[serde(default, skip_serializing_if)] for backward compatibility with existing results.

Channel now tracks tool I/O flow: ToolStarted -> ToolResult (preview) -> ToolCompleted (parameters/error), assembling the full picture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each task now runs with its own temp directory as the shell's working directory. Agent-created files (Python scripts, data outputs, etc.) land in the temp dir instead of polluting the repo root.

The runner creates a tempdir per task and registers a ShellTool with with_working_dir() pointing to it, replacing the adapter's default shell. The temp dir is cleaned up automatically when the task completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
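The create-use-cleanup lifecycle can be sketched with only the standard library: a guard type creates a unique per-task directory and removes it on drop. The real runner wires the directory through ShellTool's `with_working_dir()`; the `TaskSandbox` name and path scheme below are illustrative assumptions:

```rust
use std::{env, fs, path::PathBuf, process};

/// RAII guard: a per-task working directory that is removed when the task
/// completes (i.e., when this value is dropped).
struct TaskSandbox {
    dir: PathBuf,
}

impl TaskSandbox {
    fn new(task_id: &str) -> std::io::Result<Self> {
        // Unique per task and per runner process (assumed naming scheme).
        let dir = env::temp_dir().join(format!("bench-{}-{}", task_id, process::id()));
        fs::create_dir_all(&dir)?;
        Ok(Self { dir })
    }
}

impl Drop for TaskSandbox {
    fn drop(&mut self) {
        // Best-effort cleanup; agent-created files vanish with the sandbox.
        let _ = fs::remove_dir_all(&self.dir);
    }
}

fn main() -> std::io::Result<()> {
    let sandbox = TaskSandbox::new("demo-task")?;
    println!("task cwd: {}", sandbox.dir.display());
    Ok(()) // sandbox dropped here; directory removed
}
```

Tying cleanup to `Drop` means the directory is removed even if the task errors out partway through.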
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run d01ec670: 24.6% pass rate, 0.592 avg score, $4.74 cost, 86 min. Includes post_mortem.json with failure analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The codebase-review-marathon task had a well-known Stripe test key (sk_live_4eC39HqLyjWD...) in its planted vulnerability data. Replaced with an obviously fake key that won't trigger GitHub's secret scanner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Expands the ironclaw-v2 benchmark suite from 85 to 195 tasks across 10 categories (528 total conversation turns), adds A/B testing infrastructure for comparing ironclaw versions, and enriches trace capture for training-grade data.
New benchmark categories
- `/learn` management interface, false-positive learning loop

A/B benchmarking across ironclaw versions
- `--ironclaw-rev <ref>` flag: patches Cargo.toml, rebuilds against any branch/tag/commit, re-execs
- `build.rs` auto-resolves ironclaw git SHA from Cargo.lock → `framework_version` field
- `scripts/bench-ab.sh` for end-to-end two-ref comparison
- Per-category breakdown in `run.json` and `compare` output
- `cfg(ironclaw_engine_v2)` detection for cross-branch API compatibility

Post-mortem failure analysis
- `nearai-bench post-mortem <run-id>` — analyzes failures by stage (timeout/setup/execution/scoring), per-category failure rates, recommendations
- `--auto-post-mortem` flag auto-generates `post_mortem.json` after runs with failures
- `post_mortem.json` cached alongside `run.json` and `tasks.jsonl`

Training-grade trace capture
- New fields use `#[serde(default)]` for backward compatibility with existing results

Other improvements
- `ironclaw_safety` as direct dependency for v2-architecture compatibility
- `cargo build`, `cargo clippy`, `cargo fmt` all clean

Benchmark run included
- Run d01ec670: 195 tasks, Qwen3.5-122B, v2-architecture branch
- Results in `results/ironclaw/d01ec670-*/`

Test plan

- `cargo test` — 83 passed, 0 failed
- `cargo clippy` — zero warnings
- `cargo fmt -- --check` — clean
- `nearai-bench results <id>` displays per-task and per-category tables
- `nearai-bench post-mortem <id>` generates failure analysis
- `--ironclaw-rev v2-architecture` builds and runs against a different branch
- `cd site && npm ci && npm run build`

🤖 Generated with Claude Code
cargo test— 83 passed, 0 failedcargo clippy— zero warningscargo fmt -- --check— cleannearai-bench results <id>displays per-task and per-category tablesnearai-bench post-mortem <id>generates failure analysis--ironclaw-rev v2-architecturebuilds and runs against different branchcd site && npm ci && npm run build)🤖 Generated with Claude Code