The `math_equal` method doesn't perfectly generalize to the answers expected for Minerva Math. Most of the math problems we have seen so far have simple boxed answers (floats, integers, etc.), not complex expressions. We need a more robust implementation.
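One possible direction is a symbolic check. A minimal sketch (assuming `sympy` is installed; the function name `answers_equal`, the parsing fallbacks, and the numeric tolerance are illustrative, not the repo's actual `math_equal` API):

```python
# Sketch of a more robust answer-equivalence check (illustrative; not the
# repo's actual math_equal implementation). Assumes sympy is installed;
# parse_latex additionally needs the optional antlr4 runtime.
from sympy import simplify, N
from sympy.parsing.latex import parse_latex
from sympy.parsing.sympy_parser import parse_expr


def answers_equal(pred: str, gold: str, tol: float = 1e-6) -> bool:
    """Return True if two boxed answers are symbolically or numerically equal."""
    def parse(s: str):
        s = s.strip().strip("$")
        try:
            return parse_latex(s)      # handles \frac, \sqrt, etc.
        except Exception:
            return parse_expr(s)       # plain expressions like "3*x + 1"

    try:
        a, b = parse(pred), parse(gold)
        # Symbolic check first: the difference simplifies to zero.
        if simplify(a - b) == 0:
            return True
        # Numeric fallback for expressions sympy cannot fully simplify.
        return abs(float(N(a)) - float(N(b))) < tol
    except Exception:
        # Last resort: normalized string comparison if parsing fails.
        return pred.strip().strip("$") == gold.strip().strip("$")
```

Under this scheme, something like `answers_equal(r"\frac{1}{2}", "0.5")` should come back True, which is exactly the kind of case the current string-based comparison misses.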
We also recently ran into a similar problem and raised a similar issue here.
"We used gpt-4o-mini instead of Sky-T1’s parsing logic to filter out incorrect math solutions. Using gpt-4o-mini allowed us to reduce the number of false negatives, increasing the number of retained correct solutions from 25% to 73%." from this report built on us.
Maybe we should merge these two issues and implement an LLM check?
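A rough sketch of what that check could look like (assuming the `openai` Python client and an `OPENAI_API_KEY`; the prompt wording and helper name are illustrative, only the gpt-4o-mini choice comes from the report quoted above):

```python
# Sketch of an LLM-based equivalence check (illustrative; not an existing
# utility in this repo). Assumes the openai package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are grading math answers.\n"
    "Ground-truth answer: {gold}\n"
    "Model answer: {pred}\n"
    "Reply with exactly YES if they are mathematically equivalent, otherwise NO."
)


def llm_answers_equal(pred: str, gold: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a small judge model whether two answers are equivalent."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(gold=gold, pred=pred)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```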
Yeah, I agree. My hesitation is that, for eval, LLM-as-a-judge can be expensive in these verifiable domains. Maybe there is a smarter way of writing the symbolic checks.