Hardcoded to use OpenRouter for both benchmarks and judge #16

@jb510

Description

I told my OpenClaw to run this skill using /openai-codex/gpt-5.1-codex-mini as a test run of my thinking parameter #12. Rather surprisingly, it ran the tests as "openrouter/openai-codex/gpt-5.1-codex-mini", which failed and produced empty results, but it then passed the empty files on to openrouter/opus-4.5 for judging anyway, burning some unexpected tokens (otherwise I wouldn't even have realized what happened).
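For what it's worth, a fail-fast guard before the judging step would have caught this. A minimal sketch in Python, where `judge_results`, the JSON result layout, and the directory convention are all my assumptions, not the skill's actual API:

```python
from pathlib import Path


def judge_results(results_dir: str, judge_model: str) -> None:
    """Refuse to hand empty benchmark output to the judge.

    Hypothetical sketch: the function name and file layout are
    assumptions about how the skill stores benchmark results.
    """
    result_files = list(Path(results_dir).glob("*.json"))
    non_empty = [f for f in result_files if f.stat().st_size > 0]
    if not non_empty:
        # Fail fast instead of burning judge-model tokens on empty files.
        raise RuntimeError(
            f"No non-empty benchmark results in {results_dir}; "
            "refusing to run the judge."
        )
    # ... invoke judge_model on non_empty here ...
```

A check like this would have surfaced the bad model string immediately instead of after the judging pass.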

I do understand that, as an internal tool, you want it to be consistent, and hardcoding these things may make sense. However, if it's going to stay that way, it should be documented better in the README.md.

My preference, and what I'm working on a PR for right now, is the ability to pass in any provider/model that OpenClaw recognizes for benchmarking, and likewise any provider/model the user wants to act as judge.

I'd obviously keep your public results consistent, but this would allow users to run whatever tests they wanted.
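The direction I'm exploring in the PR looks roughly like this: split a user-supplied `provider/model` spec instead of prepending a hardcoded `openrouter/` prefix. The function name `resolve_model` is my own placeholder, not anything from the repo:

```python
def resolve_model(spec: str) -> tuple[str, str]:
    """Split a 'provider/model' spec at the first slash.

    Hypothetical sketch: splitting on the first '/' keeps multi-segment
    model names (e.g. 'openai-codex/gpt-5.1-codex-mini') intact.
    """
    provider, sep, model = spec.partition("/")
    if not sep or not model:
        raise ValueError(f"Expected 'provider/model', got {spec!r}")
    return provider, model
```

With this, `resolve_model("openrouter/openai-codex/gpt-5.1-codex-mini")` yields the `openrouter` provider with the rest as the model name, while a bare model string with no provider is rejected up front rather than silently routed through OpenRouter.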
