Failure to reproduce AIME score on Sky-T1-32B-Preview #38

Open
RyanMarten opened this issue Jan 21, 2025 · 8 comments

@RyanMarten

python eval.py --model NovaSky-AI/Sky-T1-32B-Preview --evals=AIME --tp=8 --output_file=results.txt --temperatures 0.7

{"acc": 0.3667}

The reported score was 43.3, which is significantly higher than what I measured.
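For scale, a quick back-of-the-envelope check (assuming the eval set here is the 30-problem AIME24 set, which is my assumption, not something stated in this thread) suggests the gap is about two problems:

# Rough arithmetic only; assumes a 30-problem AIME24 eval set.
num_problems = 30
measured = round(0.3667 * num_problems)  # -> 11 problems solved
reported = round(0.433 * num_problems)   # -> 13 problems solved
print(measured, reported)                # 11 13: a gap of roughly two problems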

@RyanMarten
Author

Looks like @fanqiwan similarly got a lower AIME score:

#33 (comment)

@2proveit

My score is similar to yours: 0.333.

@rucnyz

rucnyz commented Jan 23, 2025

My score is even worse: 0.30.

@tyler-griggs
Collaborator

Hi all, thanks for calling this out! We have also discovered some issues in our evaluation framework and are currently working to organize and refactor it in #47.

@simplespy

python -m skythought_evals.eval --model=NovaSky-AI/Sky-T1-32B-Preview --evals=livecodebench --tp=8 --result-dir ./ 2>&1 | tee mylogs.txt

Has anyone successfully reproduced the LiveCodeBench score? My results:

LiveCodeBench Acc: 276 / 511 (0.540)
easy Acc: 159 / 182 (0.874)
medium Acc: 100 / 206 (0.485)
hard Acc: 17 / 123 (0.138)

The reported medium and hard accuracies were 56.8% and 17.9%, much higher than what I got.

@kouroshHakha
Collaborator

kouroshHakha commented Feb 21, 2025

@simplespy Can you see if this PR has fixed the issue?
#67

To make sure there are no diffs, we realized that you need to evaluate in fp32; otherwise, factors like the vLLM version, batch size, etc. can affect which kernels are chosen during inference and cause fluctuations in the end results.
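As a rough illustration of the fp32 point, here is a minimal sketch against the public vLLM API (not the skythought eval code itself); the model name and --tp=8 come from the commands above, and everything else is an assumption:

# Illustrative only: load the model in full precision with vLLM so that
# dtype-dependent kernel choices don't vary between runs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="NovaSky-AI/Sky-T1-32B-Preview",
    dtype="float32",         # force fp32 instead of the default half precision
    tensor_parallel_size=8,  # matches --tp=8 from the commands above
    seed=0,                  # fix the sampling seed for repeatability
)

sampling = SamplingParams(temperature=0.7, max_tokens=16384)
outputs = llm.generate(["<AIME problem prompt here>"], sampling)
print(outputs[0].outputs[0].text)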

cc @SumanthRH for more info on this.

@SumanthRH
Collaborator

@simplespy Building on @kouroshHakha's response: we have found that at half precision there is quite a bit of variation in scores depending on inference-related settings. We are working on more standardized evaluation settings and will provide more instructions on this soon.

@simplespy


@kouroshHakha The results were produced by the latest version of the repo, so I suppose the config mentioned in the PR is already in effect.

Please let me know if any additional instructions are released for the LiveCodeBench reproduction. Thanks!
