Result discrepancy #1

@amirsatt

Description

I ran the code as-is for the 0.6B and 1.7B models, but the final results were lower than those reported in the paper. Also, there is no code for evaluating the base model. I evaluated it myself with a generation length of 8192 (both greedy and non-greedy decoding), and both results are much higher than the baselines reported in the paper.

Could you provide a more complete code base that includes the baseline evaluation code and the exact config you used to train relayLLM?

Here are the non-greedy and greedy baseline results for 0.6B that I got:

| folder | aime2024_score | aime2025_score | gsm8k_score | math_score | minerva_score | olympiad_score |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3_0.6b_5_run | 5.4 | 11.29 | 78.53 | 50.8 | 14.49 | 23.11 |
| qwen3_0.6b_greedy | 0.21 | 9.9 | 72.1 | 48.6 | 15.81 | 19.56 |

Here are the non-greedy and greedy baseline results for 1.7B that I got:

| folder | aime2024_score | aime2025_score | gsm8k_score | math_score | minerva_score | olympiad_score |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3_1.7b_5_run | 20.94 | 20.1 | 89.81 | 58.08 | 21.4 | 30.07 |
| qwen3_1.7b_greedy | 23.12 | 23.23 | 89.16 | 58.0 | 22.06 | 28.3 |
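For clarity on how the two settings above differ: a minimal sketch of how I aggregate the non-greedy numbers (the `_5_run` rows), assuming each benchmark is sampled 5 times and the per-run accuracies are averaged. The function name and the run values are hypothetical, not the actual logs.

```python
def mean_accuracy(run_scores):
    """Average per-run accuracy (in %) across repeated sampled runs.

    Greedy rows come from a single deterministic run (temperature 0),
    so no averaging is needed there; the `_5_run` rows are the mean
    of 5 independent sampled runs.
    """
    return round(sum(run_scores) / len(run_scores), 2)

# Illustrative per-run accuracies for one benchmark (hypothetical values).
gsm8k_runs = [78.0, 79.0, 78.5, 78.2, 78.8]
print(mean_accuracy(gsm8k_runs))  # → 78.5
```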
