Open
Description
- Could the benchmark also evaluate the effect of scaffolding? According to SWE-Bench, the top 5 scaffolds are TRAE, Refact, Moatless, OpenHands, and Augment (the next one is SWE-Agent, which is made by the same people as SWE-Bench): https://liveswebench.ai/
- Most open-weight models (DeepSeek v3.1, GLM 4.5, Qwen3) have thinking modes. Would this feature boost ALE-Bench performance?
- Also, would you consider evaluating these models' latest updates rather than their initial releases (assuming we are going by "live" rules and only including versions that came out before September 2025)?
- Now that Qwen3 80B-A3B is out, could models with around 72B parameters be added to the benchmark? Others, like LiveBench, have done so: https://livebench.ai/
- How is ALE-Bench different from LiveCodeBench GSO or METR's RE-Bench? https://livecodebench.github.io/gso.html