Hardcoded to use OpenRouter for both benchmarks and judge #16

@jb510

Description

I told my OpenClaw to run this skill using /openai-codex/gpt-5.1-codex-mini as a test run of my thinking parameter #12. Rather surprisingly, it ran the tests as "openrouter/openai-codex/gpt-5.1-codex-mini", which failed and produced empty results, but it then passed the empty files on to openrouter/opus-4.5 for judging anyway, burning some unexpected tokens (otherwise I wouldn't even have realized what happened).
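For what it's worth, a fail-fast guard before the judging step would have caught this. A minimal sketch in Python, where `judge_results`, the JSON result layout, and the directory convention are all my assumptions, not the skill's actual API:

```python
from pathlib import Path


def judge_results(results_dir: str, judge_model: str) -> None:
    """Refuse to hand empty benchmark output to the judge.

    Hypothetical sketch: the function name and file layout are
    assumptions about how the skill stores benchmark results.
    """
    result_files = list(Path(results_dir).glob("*.json"))
    non_empty = [f for f in result_files if f.stat().st_size > 0]
    if not non_empty:
        # Fail fast instead of burning judge-model tokens on empty files.
        raise RuntimeError(
            f"No non-empty benchmark results in {results_dir}; "
            "refusing to run the judge."
        )
    # ... invoke judge_model on non_empty here ...
```

A check like this would have surfaced the bad model string immediately instead of after the judging pass.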

I do understand that, as an internal tool, you want it to be consistent, and hardcoding these things may make sense. However, if it's going to stay that way, it should be documented better in the README.md.

My preference, and what I'm working on a PR for right now, is the ability to pass in any provider/model that OpenClaw recognizes for benchmarking, and likewise any provider/model the user wants to act as judge.

I'd obviously keep your public results consistent, but this would allow users to run whatever tests they wanted.
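The direction I'm exploring in the PR looks roughly like this: split a user-supplied `provider/model` spec instead of prepending a hardcoded `openrouter/` prefix. The function name `resolve_model` is my own placeholder, not anything from the repo:

```python
def resolve_model(spec: str) -> tuple[str, str]:
    """Split a 'provider/model' spec at the first slash.

    Hypothetical sketch: splitting on the first '/' keeps multi-segment
    model names (e.g. 'openai-codex/gpt-5.1-codex-mini') intact.
    """
    provider, sep, model = spec.partition("/")
    if not sep or not model:
        raise ValueError(f"Expected 'provider/model', got {spec!r}")
    return provider, model
```

With this, `resolve_model("openrouter/openai-codex/gpt-5.1-codex-mini")` yields the `openrouter` provider with the rest as the model name, while a bare model string with no provider is rejected up front rather than silently routed through OpenRouter.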
