Failure to reproduce AIME score on Sky-T1-32B-Preview #38
Comments
Looks like @fanqiwan similarly got a lower AIME score:
my score is similar to yours, 0.333
my score is even worse, 0.30
Hi all, thanks for calling out these issues! We have also discovered some issues in our evaluation framework and are currently working to organize and refactor it at #47.
Has anyone successfully reproduced the LiveCodeBench score? My results:
The reported medium and hard accuracies were 56.8% and 17.9%, much higher than what I got.
@simplespy Can you check whether this PR has fixed the issue? We realized that, to make sure there are no diffs, you need to evaluate in fp32; otherwise, incidental factors such as the vLLM version, batch size, etc. can affect which kernels are chosen during inference and cause the end results to fluctuate. cc @SumanthRH for more info on this.
@simplespy Building on @kouroshHakha's response, we have found that at half precision, scores vary quite a bit across inference-related settings. We are working on more standardized evaluation settings and will provide more instructions on this soon.
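For readers wondering why kernel choice would matter at half precision: one common mechanism (an assumption here, not something stated in this thread) is that floating-point addition is not associative, so a kernel that reduces in a different order can round differently. The sketch below emulates fp16 accumulation with the standard library only; `to_fp16` and `fp16_sum` are hypothetical helper names, not part of vLLM or this repo.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_sum(values):
    """Accumulate in order, rounding to fp16 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


# Near 4096 the spacing between adjacent fp16 values is 4, so adding 1.0
# to 4096.0 rounds straight back to 4096.0.
values = [4096.0, 1.0, 1.0, 1.0, 1.0]

forward = fp16_sum(values)             # large value first: the small terms are lost
backward = fp16_sum(reversed(values))  # small values first: they add up to 4.0 before meeting 4096.0
exact = sum(values)                    # float64 reference

print(forward, backward, exact)
```

The same five numbers sum to different fp16 results depending only on order, while the higher-precision reference is unaffected. In a real model such differences can flip a sampled or argmax token and change the rest of the generation, which may be part of what shows up as score fluctuation.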
@kouroshHakha The results were produced by the latest version of the repo, so I suppose the config mentioned in the PR is already in effect. Please let me know if any additional instructions are released for the LiveCodeBench reproduction. Thanks!
{"acc": 0.3667}
The reported score was 43.3 (0.433 on the same scale), which is significantly different.
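To put the gap in concrete terms, the accuracies in this thread can be converted into whole problems solved. This is only a rough sketch: it assumes the benchmark is the 30-problem AIME 2024 set (AIME I and II combined), which the thread does not state, and `solved_count` is a hypothetical helper.

```python
# Assumption: the AIME benchmark has 30 problems (AIME 2024 I + II).
AIME_NUM_PROBLEMS = 30


def solved_count(acc: float, n: int = AIME_NUM_PROBLEMS) -> int:
    """Convert an accuracy in [0, 1] to the nearest whole number of solved problems."""
    return round(acc * n)


# Scores quoted in this thread, all on a 0-1 scale.
scores = {
    "reported": 0.433,
    "issue author": 0.3667,
    "commenter (0.333)": 0.333,
    "commenter (0.30)": 0.30,
}

counts = {name: solved_count(acc) for name, acc in scores.items()}
for name, solved in counts.items():
    print(f"{name}: {solved}/{AIME_NUM_PROBLEMS}")
```

Under that assumption each problem is worth about 3.3 points, so the reported score and the reproductions here differ by two to four problems.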