feat: ironclaw-v2 benchmark suite (85 tasks, 8 categories) #9
Open
ilblackdragon wants to merge 1 commit into main
Comprehensive benchmark dataset for testing ironclaw's next-gen agent loop (PR #1557) and commitment system (PR #1736). Covers thread lifecycle, CodeAct execution, capability leases, missions, commitments, memory operations, skill activation self-learning, and /expected behavior gap analysis.

Baseline: Qwen 3.5-122B → 29.4% pass rate, 0.615 avg score, $0.96.

Harness changes:
- Update ironclaw dependency to staging branch (v0.22.0)
- Adapt to new AgentConfig/AgentDeps API surface
- Skip interactive NEAR AI auth when NEARAI_API_KEY is set
- Handle new StatusUpdate variants and async create_llm_provider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
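The auth change in the list above can be sketched roughly as follows. This is a minimal illustration, not the harness code: `AuthMode`, `resolve_auth_mode`, and `auth_mode_from_env` are hypothetical names, and treating a blank key as unset is an assumption. Only the `NEARAI_API_KEY` variable name comes from the PR.

```rust
use std::env;

/// Hypothetical auth decision; the real harness types may differ.
#[derive(Debug, PartialEq)]
pub enum AuthMode {
    /// A key was supplied, so the interactive NEAR AI login is skipped.
    ApiKey(String),
    /// No usable key: fall back to the interactive flow.
    Interactive,
}

/// Pure decision function, kept separate from the environment lookup
/// so it can be tested without mutating process state.
/// Assumption: an empty or whitespace-only key counts as "not set".
pub fn resolve_auth_mode(api_key: Option<String>) -> AuthMode {
    match api_key {
        Some(key) if !key.trim().is_empty() => AuthMode::ApiKey(key),
        _ => AuthMode::Interactive,
    }
}

/// Reads NEARAI_API_KEY from the process environment.
/// `env::var` returns Err when the variable is missing or not valid
/// Unicode; both cases are mapped to None here.
pub fn auth_mode_from_env() -> AuthMode {
    resolve_auth_mode(env::var("NEARAI_API_KEY").ok())
}
```

In a non-interactive run (CI, for example), a caller would match on `auth_mode_from_env()` and start the login prompt only for `AuthMode::Interactive`.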
Summary
- Update ironclaw dependency to the staging branch (v0.22.0) and adapt harness to new API surface
- Skip interactive NEAR AI auth when the NEARAI_API_KEY env var is set

Categories
Baseline Result
Qwen 3.5-122B on ironclaw staging 8acdd08: 29.4% pass rate, 0.615 avg score, $0.96

Harness Changes
- Cargo.toml: ironclaw → staging branch
- src/main.rs: SessionConfig::default(), string-based backend check, async create_llm_provider, skip auth with API key
- src/runner.rs: New AgentConfig/AgentDeps fields for v0.22, owner_id alignment, Arc<ChannelManager>
- src/channel.rs: Handle new StatusUpdate variants

Test plan
- cargo build: zero errors, zero warnings
- cargo run -- run --suite trajectory --config suites/ironclaw-v2.toml: 85/85 tasks complete
- baselines/ironclaw-v2/qwen3.5-122b-8acdd08/

🤖 Generated with Claude Code
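For readers checking the baseline figures, here is a rough sketch of how a pass rate and average score might be aggregated over the 85 tasks. `TaskOutcome`, `SuiteSummary`, and `summarize` are hypothetical names, and the harness's actual pass criterion and scoring are not described in this PR. As an inference rather than a reported number, 25 passing tasks out of 85 would round to the stated 29.4%.

```rust
/// Hypothetical per-task outcome; field names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct TaskOutcome {
    pub passed: bool,
    /// Assumed to be a score in [0.0, 1.0].
    pub score: f64,
}

/// Aggregate figures for one suite run.
#[derive(Debug, PartialEq)]
pub struct SuiteSummary {
    pub total: usize,
    pub passed: usize,
    /// Fraction of tasks that passed, in [0.0, 1.0].
    pub pass_rate: f64,
    /// Arithmetic mean of per-task scores.
    pub avg_score: f64,
}

/// Returns None for an empty run to avoid dividing by zero.
pub fn summarize(outcomes: &[TaskOutcome]) -> Option<SuiteSummary> {
    if outcomes.is_empty() {
        return None;
    }
    let total = outcomes.len();
    let passed = outcomes.iter().filter(|o| o.passed).count();
    let score_sum: f64 = outcomes.iter().map(|o| o.score).sum();
    Some(SuiteSummary {
        total,
        passed,
        pass_rate: passed as f64 / total as f64,
        avg_score: score_sum / total as f64,
    })
}
```

Keeping the pass rate and the average score separate matters here: a run can score partial credit on many tasks (0.615 average) while passing comparatively few outright.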