math_equal method is not robust for minerva-math. Need a better symbolic validation. #50

Open
kouroshHakha opened this issue Jan 27, 2025 · 4 comments

Comments

@kouroshHakha
Collaborator

The math_equal method doesn't generalize well to the answers expected for Minerva Math. Most of the math problems we have handled so far have simple boxed answers (float, integer, etc.), not complex expressions. We need a more robust implementation.
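To make the failure mode concrete, here is a minimal sketch of a normalize-then-compare check using only the Python standard library. It is not the repository's `math_equal`; the helper names (`strip_wrappers`, `to_number`, `answers_match`) and the tolerance are assumptions made up for illustration. It handles numeric answers written in different surface forms, and it deliberately shows where a purely string-based fallback stops working for expressions.

```python
import math
import re
from fractions import Fraction
from typing import Optional

# Hypothetical helpers for illustration; not the project's math_equal.
_FRAC = re.compile(r"(-?)\\d?frac\{(\d+)\}\{(\d+)\}")
_SCI = re.compile(r"(-?\d+(?:\.\d+)?)(?:\\times|\\cdot)10\^\{?(-?\d+)\}?")


def strip_wrappers(ans: str) -> str:
    """Drop wrappers that do not change the value: $...$, \\boxed{...}, spacing."""
    ans = ans.strip().strip("$").strip()
    boxed = re.fullmatch(r"\\boxed\{(.*)\}", ans)
    if boxed:
        ans = boxed.group(1)
    for token in ("\\!", "\\,", " "):
        ans = ans.replace(token, "")
    return ans


def to_number(ans: str) -> Optional[Fraction]:
    """Return an exact Fraction if the answer is a plain number, else None."""
    s = strip_wrappers(ans)
    sci = _SCI.fullmatch(s)
    if sci:  # e.g. 1.5\times10^{3}
        return Fraction(sci.group(1)) * Fraction(10) ** int(sci.group(2))
    frac = _FRAC.fullmatch(s)
    if frac and int(frac.group(3)) != 0:  # e.g. -\frac{1}{2}
        value = Fraction(int(frac.group(2)), int(frac.group(3)))
        return -value if frac.group(1) else value
    try:
        return Fraction(s)  # accepts "3/4", "0.75", "-2", "1e3"
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(pred: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse; otherwise compare strings."""
    p, g = to_number(pred), to_number(gold)
    if p is not None and g is not None:
        return math.isclose(float(p), float(g), rel_tol=rel_tol)
    # Fallback is purely textual, so reordered expressions will not match.
    return strip_wrappers(pred) == strip_wrappers(gold)
```

With this sketch, `answers_match("\\boxed{\\frac{3}{4}}", "0.75")` and `answers_match("$1.5 \\times 10^{3}$", "1500")` both succeed, but `answers_match("x^2+1", "1+x^2")` does not, which is the kind of expression-valued answer that needs a real symbolic comparison.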

@DachengLi1
Collaborator

Very interesting finding! @kouroshHakha

We also recently found a similar problem, and raised a similar issue here.

"We used gpt-4o-mini instead of Sky-T1’s parsing logic to filter out incorrect math solutions. Using gpt-4o-mini allowed us to reduce the number of false negatives, increasing the number of retained correct solutions from 25% to 73%." (from this report, which builds on our work)

Maybe we should merge these two issues and implement an LLM check?

@kouroshHakha
Collaborator Author

Yeah, I agree. My hesitation is that, for eval, LLM-as-a-judge can be expensive in these verifiable domains. Maybe there is a smarter way of writing the symbolic checks.
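One way to balance the two concerns is a tiered check: run cheap deterministic comparisons first and call the judge only when none of them can decide. The sketch below is a hypothetical illustration, not code from this repository; `tiered_equal`, `exact_match`, `numeric_match`, and the `judge` callable are all assumed names, and the judge is injected so any LLM client (or a stub) can be plugged in.

```python
from typing import Callable, Iterable, Optional, Tuple

# A cheap check returns True/False when it can decide, None when it cannot.
Check = Callable[[str, str], Optional[bool]]
Judge = Callable[[str, str], bool]


def exact_match(pred: str, gold: str) -> Optional[bool]:
    """Decide only the positive case; differing strings may still be equal."""
    return True if pred.strip() == gold.strip() else None


def numeric_match(pred: str, gold: str) -> Optional[bool]:
    """Decide when both sides are plain floats; otherwise abstain."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return None
    return abs(p - g) <= 1e-6 * max(1.0, abs(g))


def tiered_equal(
    pred: str, gold: str, cheap_checks: Iterable[Check], judge: Judge
) -> Tuple[bool, str]:
    """Return (verdict, name of the stage that produced it)."""
    for check in cheap_checks:
        verdict = check(pred, gold)
        if verdict is not None:
            return verdict, check.__name__
    # Only undecided pairs reach the (expensive) judge.
    return judge(pred, gold), "judge"
```

Passing a stub such as `lambda p, g: True` as `judge` is enough to try it out. In this arrangement, simple boxed floats and integers never reach the judge, so its cost scales with the number of expression-valued answers rather than with the whole eval set.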

@DachengLi1
Collaborator

Makes sense! But I am not sure whether there are smarter symbolic checks, given that the Qwen team has tried quite hard.

@tyler-griggs
Collaborator

Maybe Hugging Face's new Math-Verify repo could be helpful here, though I haven't tried it out: https://github.com/huggingface/Math-Verify


3 participants