Result discrepancy #1

@amirsatt

Description

I ran the code as-is for the 0.6B and 1.7B models, but the final results were lower than those reported in the paper. Also, there is no code for evaluating the base model. I evaluated it myself with a generation length of 8192 (both greedy and non-greedy decoding), and both results are much higher than the baselines reported in the paper.

Could you provide a more complete code base that includes the baseline evaluation code and the exact config you used to train relayLLM?

Here are the non-greedy and greedy baseline results for 0.6B that I got:

| folder | aime2024_score | aime2025_score | gsm8k_score | math_score | minerva_score | olympiad_score |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3_0.6b_5_run | 5.4 | 11.29 | 78.53 | 50.8 | 14.49 | 23.11 |
| qwen3_0.6b_greedy | 0.21 | 9.9 | 72.1 | 48.6 | 15.81 | 19.56 |

Here are the non-greedy and greedy baseline results for 1.7B that I got:

| folder | aime2024_score | aime2025_score | gsm8k_score | math_score | minerva_score | olympiad_score |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3_1.7b_5_run | 20.94 | 20.1 | 89.81 | 58.08 | 21.4 | 30.07 |
| qwen3_1.7b_greedy | 23.12 | 23.23 | 89.16 | 58.0 | 22.06 | 28.3 |
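For clarity on how the two settings above differ: a minimal sketch of how I aggregate the non-greedy numbers (the `_5_run` rows), assuming each benchmark is sampled 5 times and the per-run accuracies are averaged. The function name and the run values are hypothetical, not the actual logs.

```python
def mean_accuracy(run_scores):
    """Average per-run accuracy (in %) across repeated sampled runs.

    Greedy rows come from a single deterministic run (temperature 0),
    so no averaging is needed there; the `_5_run` rows are the mean
    of 5 independent sampled runs.
    """
    return round(sum(run_scores) / len(run_scores), 2)

# Illustrative per-run accuracies for one benchmark (hypothetical values).
gsm8k_runs = [78.0, 79.0, 78.5, 78.2, 78.8]
print(mean_accuracy(gsm8k_runs))  # → 78.5
```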
