Eval code

Hi,

Thanks for the interesting work! Could you share the pipeline code used for evaluation? For a fair comparison, I’d like to follow the procedure described in your paper, which mentions 'employing Qwen2.5-32B-Instruct [51] to evaluate answer correctness by comparing extracted responses with ground-truth answers (often involving complex mathematical expressions).'