Questions on the leaderboard #14

@BradKML

  1. Can scaffolding be evaluated as well? According to SWE-Bench, the top 5 scaffolds are TRAE, Refact, Moatless, OpenHands, and Augment (the next one is SWE-Agent, which is made by the same people as SWE-Bench): https://liveswebench.ai/
  2. Most open-weight models (DeepSeek v3.1, GLM 4.5, Qwen3) have thinking modes. Would this feature boost ALE-Bench performance?
  3. Also, would you consider their latest updates instead of the initial releases (assuming we are going by "live" rules and only including those that came out before September 2025)?
  4. Since Qwen3 80B-A3B came out, would models with around 72B parameters be added to the benchmark? Others like LiveBench have done so: https://livebench.ai/
  5. How is ALE different from LiveCodeBench GSO or METR's RE-Bench? https://livecodebench.github.io/gso.html
