The `math_equal` method doesn't perfectly generalize to the answers expected for Minerva Math. Most of the math problems we have seen so far have simple boxed answers (floats, integers, etc.), not complex expressions. We need a more robust implementation.
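One possible direction is a symbolic check. A minimal sketch (assuming `sympy` is installed; the function name `answers_equal`, the parsing fallbacks, and the numeric tolerance are illustrative, not the repo's actual `math_equal` API):

```python
# Sketch of a more robust answer-equivalence check (illustrative; not the
# repo's actual math_equal implementation). Assumes sympy is installed;
# parse_latex additionally needs the optional antlr4 runtime.
from sympy import simplify, N
from sympy.parsing.latex import parse_latex
from sympy.parsing.sympy_parser import parse_expr


def answers_equal(pred: str, gold: str, tol: float = 1e-6) -> bool:
    """Return True if two boxed answers are symbolically or numerically equal."""
    def parse(s: str):
        s = s.strip().strip("$")
        try:
            return parse_latex(s)      # handles \frac, \sqrt, etc.
        except Exception:
            return parse_expr(s)       # plain expressions like "3*x + 1"

    try:
        a, b = parse(pred), parse(gold)
        # Symbolic check first: the difference simplifies to zero.
        if simplify(a - b) == 0:
            return True
        # Numeric fallback for expressions sympy cannot fully simplify.
        return abs(float(N(a)) - float(N(b))) < tol
    except Exception:
        # Last resort: normalized string comparison if parsing fails.
        return pred.strip().strip("$") == gold.strip().strip("$")
```

Under this scheme, something like `answers_equal(r"\frac{1}{2}", "0.5")` should come back True, which is exactly the kind of case the current string-based comparison misses.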
We also recently ran into a similar problem and raised a similar issue here.
"We used gpt-4o-mini instead of Sky-T1’s parsing logic to filter out incorrect math solutions. Using gpt-4o-mini allowed us to reduce the number of false negatives, increasing the number of retained correct solutions from 25% to 73%." from this report built on us.
Maybe we should merge these two issues and implement an LLM check?
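A rough sketch of what that check could look like (assuming the `openai` Python client and an `OPENAI_API_KEY`; the prompt wording and helper name are illustrative, only the gpt-4o-mini choice comes from the report quoted above):

```python
# Sketch of an LLM-based equivalence check (illustrative; not an existing
# utility in this repo). Assumes the openai package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are grading math answers.\n"
    "Ground-truth answer: {gold}\n"
    "Model answer: {pred}\n"
    "Reply with exactly YES if they are mathematically equivalent, otherwise NO."
)


def llm_answers_equal(pred: str, gold: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a small judge model whether two answers are equivalent."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(gold=gold, pred=pred)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```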
Yeah, I agree. My hesitation is that, for eval, LLM-as-a-judge can be expensive in these verifiable domains. Maybe there is a smarter way of writing the symbolic checks.