Questions on the leaderboard #14

1. Could scaffolds be evaluated as well? According to SWE-Bench, the top 5 scaffolds are TRAE, Refact, Moatless, OpenHands, and Augment (the next one is SWE-Agent, which is made by the same people as SWE-Bench): https://liveswebench.ai/
2. Most open-weight models (DeepSeek v3.1, GLM 4.5, Qwen3) have thinking modes. Would enabling them boost ALE-Bench performance?
3. Also, would you consider evaluating the latest updates of these models instead of their initial releases (assuming we go by "live" rules and only include versions that came out before September 2025)?
4. Now that Qwen3 80B-A3B is out, would models with around 72B parameters be added to the benchmark? Others, such as LiveBench, have done so: https://livebench.ai/
5. How does ALE-Bench differ from LiveCodeBench GSO or METR's RE-Bench? https://livecodebench.github.io/gso.html
