Failure to reproduce AIME score on Sky-T1-32B-Preview #38

Open
RyanMarten opened this issue Jan 21, 2025 · 8 comments

@RyanMarten

python eval.py --model NovaSky-AI/Sky-T1-32B-Preview --evals=AIME --tp=8 --output_file=results.txt --temperatures 0.7

{"acc": 0.3667}

The reported score was 43.3, which is significantly higher than what I measured.
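For scale, a quick back-of-the-envelope check (assuming the eval set here is the 30-problem AIME24 set, which is my assumption, not something stated in this thread) suggests the gap is about two problems:

# Rough arithmetic only; assumes a 30-problem AIME24 eval set.
num_problems = 30
measured = round(0.3667 * num_problems)  # -> 11 problems solved
reported = round(0.433 * num_problems)   # -> 13 problems solved
print(measured, reported)                # 11 13: a gap of roughly two problems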

@RyanMarten
Author

Looks like @fanqiwan similarly got a lower AIME score:

#33 (comment)

@2proveit

My score is similar to yours: 0.333.

@rucnyz

rucnyz commented Jan 23, 2025

My score is even worse: 0.30.

@tyler-griggs
Collaborator

Hi all, thanks for calling this out! We have also discovered some issues in our evaluation framework and are currently working to organize and refactor it in #47.

@simplespy

python -m skythought_evals.eval --model=NovaSky-AI/Sky-T1-32B-Preview --evals=livecodebench --tp=8 --result-dir ./ 2>&1 | tee mylogs.txt

Has anyone successfully reproduced the LiveCodeBench score? My results:

LiveCodeBench Acc: 276 / 511 (0.540)
easy Acc: 159 / 182 (0.874)
medium Acc: 100 / 206 (0.485)
hard Acc: 17 / 123 (0.138)

The reported medium and hard accuracies were 56.8% and 17.9%, much higher than what I got.

@kouroshHakha
Collaborator

kouroshHakha commented Feb 21, 2025

@simplespy Can you see if this PR has fixed the issue?
#67

To make sure there are no diffs, we realized that you need to evaluate in fp32; otherwise, factors like the vLLM version, batch size, etc. can affect which kernels are chosen during inference and cause fluctuations in the end results.
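As a rough illustration of the fp32 point, here is a minimal sketch against the public vLLM API (not the skythought eval code itself); the model name and --tp=8 come from the commands above, and everything else is an assumption:

# Illustrative only: load the model in full precision with vLLM so that
# dtype-dependent kernel choices don't vary between runs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="NovaSky-AI/Sky-T1-32B-Preview",
    dtype="float32",         # force fp32 instead of the default half precision
    tensor_parallel_size=8,  # matches --tp=8 from the commands above
    seed=0,                  # fix the sampling seed for repeatability
)

sampling = SamplingParams(temperature=0.7, max_tokens=16384)
outputs = llm.generate(["<AIME problem prompt here>"], sampling)
print(outputs[0].outputs[0].text)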

cc @SumanthRH for more info on this.

@SumanthRH
Collaborator

@simplespy Building on @kouroshHakha's response: we have found that at half precision there is quite a bit of variation in scores depending on inference-related settings. We are working on more standardized evaluation settings and will provide more instructions on this soon.

@simplespy


@kouroshHakha The results were produced by the latest version of the repo, so I suppose the config mentioned in the PR is already in effect.

Please let me know if any additional instructions are released for the LiveCodeBench reproduction. Thanks!
