Reference: https://github.com/tianlwang/eval_gsm8k. This is an implementation of batch evaluation for GSM8K.
The 8-shot prompt comes from gsm8k-cot in lm-evaluation-harness.
`python eval_gsm8k.py --model <model_name>`
Model | Accuracy | Harness Accuracy |
---|---|---|
Mistral-7B-v0.1 | | |
Llama-3-8b-hf | 0.42 | |
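
The accuracy column above is an exact-match score on the final numeric answer. The sketch below shows one way such a score could be computed. It relies on the fact that GSM8K reference solutions end with `#### <number>`; the helper names and the "last number in the generation" rule are assumptions for illustration and are not taken from `eval_gsm8k.py`.

```python
import re

# Hypothetical helpers for illustration; not the functions used by eval_gsm8k.py.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def to_number(text):
    """Convert a numeric string such as '1,234' or '$18' to float, or return None."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def extract_reference(answer_text):
    """GSM8K reference solutions put the final answer after a '####' marker."""
    return to_number(answer_text.split("####")[-1])


def extract_prediction(generation):
    """Assumption: treat the last number in the generation as the model's answer."""
    matches = _NUMBER.findall(generation)
    return to_number(matches[-1]) if matches else None


def accuracy(generations, references):
    """Fraction of generations whose extracted answer equals the reference answer."""
    correct = 0
    for generation, reference in zip(generations, references):
        predicted = extract_prediction(generation)
        if predicted is not None and predicted == extract_reference(reference):
            correct += 1
    return correct / len(references)
```

Comparing parsed floats rather than raw strings means `18` and `18.0` count as the same answer. A generation with no number at all is scored as incorrect.
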
`python eval_gsm8k.py --model <model_name> --use_majority_vote --temp 0.2 --n_votes 8`
Model | Accuracy | Harness Accuracy |
---|---|---|
Mistral-7B-v0.1 |
`python eval_gsm8k.py --model <model_name> --use_majority_vote --temp 0.4 --n_votes 8`
Model | Accuracy |
---|---|
Mistral-7B-v0.1 |
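
Judging by the flag names, `--use_majority_vote` with `--n_votes 8` presumably samples eight completions per question at the given temperature and scores the most common final answer. That reading is an assumption, not something the script confirms here. A minimal sketch of the voting step, with a hypothetical `majority_vote` helper:

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the samples.

    `answers` holds one extracted answer per sampled completion; None marks a
    completion from which no answer could be extracted and is ignored.
    Ties go to the answer that appeared first.
    """
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, `majority_vote(["18", "18", "20", None, "18", "7", "20", "18"])` returns `"18"`. A higher temperature makes the samples more diverse, which is presumably why the two runs above differ only in `--temp`.
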
Use the chain-of-thought prompt "Let's think step by step." before answering the question.
`python eval_gsm8k.py --model <model_name> --cot`
Model | Accuracy | Harness Accuracy |
---|---|---|
Mistral-7B-v0.1 |
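
As a rough illustration of the `--cot` setting, the trigger sentence can be placed at the start of the answer so the model continues with its reasoning. The `Question:` / `Answer:` layout and the `build_cot_prompt` name are assumptions; the template used by the script may differ.

```python
COT_TRIGGER = "Let's think step by step."


def build_cot_prompt(question):
    """Build a prompt whose answer begins with the chain-of-thought trigger.

    The 'Question: ... / Answer: ...' layout is an assumed template.
    """
    return f"Question: {question.strip()}\nAnswer: {COT_TRIGGER}"
```

The model's completion is then appended after the trigger, and the final answer is extracted from that completion.
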
`python eval_gsm8k.py --model <model_name> --zero-shot`
Model | Accuracy |
---|---|
Mistral-7B-v0.1 |