| ArXiv | Models | Data | Code |
- [2024/6/22] Revised the article and added attempts at automatic theorem proving tasks. Code is in MMOS-F2F.
- [2024/3/30] Updated results for MMOS-CODE 34B and MMOS-LLEMMA 34B. Note the required vllm and transformers versions.
- [2024/3/8] 🔥🔥🔥 Model MMOS-DeepSeekMath 7B shows strong performance with self-consistency (k=50)!!
- [2024/2/28] 🔥 Model MMOS-DeepSeekMath 7B shows strong performance and is released at MMOS-DeepSeekMath 7B!!
- [2024/2/27] 🔥 Model MMOS-LLEMMA 7B shows strong performance and is released at MMOS-LLEMMA 7B!!
- [2024/2/27] 🔥 Models MMOS-CODE 13B and MMOS-CODE 34B are released at MMOS-CODE 13B and MMOS-CODE 34B!!
- [2024/2/27] 🔥 Model MMOS-CODE 7B is released at MMOS-CODE 7B!!
- [2024/2/26] 🔥🔥🔥 Dataset MMOS released at 😊 HuggingFace!!
- [2024/2/23] 🔥🔥🔥 arXiv paper released: An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning ~
A Mix of Minimal Optimal Sets (MMOS) of data has two advantages for math reasoning: higher performance and lower construction cost.
Model | Size | GSM8K | SVAMP | ASDiv | MATH | Size | GSM8K | SVAMP | ASDiv | MATH | Size | GSM8K | SVAMP | ASDiv | MATH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
WizardMath | 7B | 54.9 | 57.3 | 59.1 | 10.7 | 13B | 63.9 | 64.3 | 65.8 | 14.0 | 34B | - | - | - | - |
MAMMOTH | 7B | 53.6 | 67.7 | 31.5 | - | 13B | 62.0 | 72.4 | - | 34.2 | 34B | - | - | - | - |
MetaMath | 7B | 66.5 | - | - | 19.8 | 13B | 72.3 | - | - | 22.4 | 34B | - | - | - | - |
MathCoder-L | 7B | 64.2 | 71.5 | - | 23.3 | 13B | 72.6 | 76.9 | - | 29.9 | 34B | - | - | - | - |
MathCoder-CL | 7B | 67.8 | 70.7 | - | 30.2 | 13B | 74.1 | 78.0 | - | 35.9 | 34B | - | - | - | - |
TORA | 7B | 68.8 | 68.2 | 73.9 | 40.1 | 13B | 72.7 | 72.9 | 77.2 | 43.0 | 34B | - | - | - | - |
TORA-CODE | 7B | 72.6 | 70.4 | 78.7 | 44.6 | 13B | 75.8 | 75.7 | 81.4 | 48.1 | 34B | 80.7 | 80.5 | 84.2 | 50.8 |
MMOS | 7B | 69.9 | 73.4 | 76.8 | 40.2 | 13B | 74.8 | 77.0 | 80.0 | 43.2 | 34B | - | - | - | - |
MMOS-CODE | 7B | 73.9 | 76.4 | 78.6 | 44.3 | 13B | 77.1 | 77.5 | 81.9 | 48.1 | 34B | 81.7 | 81.9 | 82.8 | 48.8 |
MMOS-MinCODE | 7B | 70.3 | 72.5 | 76.7 | 44.6 | 13B | - | - | - | - | 34B | - | - | - | - |
MMOS-LLEMMA | 7B | 76.5 | 77.7 | 81.4 | 48.8 | 13B | - | - | - | - | 34B | 82.8 | 81.8 | 84.8 | 51.3 |
MMOS-DeepSeekMath | 7B | 80.5 | 79.3 | 87.6 | 55.0 | 13B | - | - | - | - | 34B | - | - | - | - |
MMOS-DeepSeekMath(SC,k=50) | 7B | 87.2 | - | - | 63.7 | 13B | - | - | - | - | 34B | - | - | - | - |
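The (SC, k=50) row uses self-consistency: 50 solutions are sampled per problem and the most frequent final answer is taken. Below is a minimal sketch of the voting step only (not the repository's exact implementation; how the final answer is extracted from each sampled solution is assumed to come from the evaluation scripts).

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over k sampled final answers (self-consistency).

    `answers` is assumed to hold the extracted final answer from each of the
    k sampled solutions (e.g. k=50 completions per problem).
    """
    # Trivial normalization so "3.0" and "3.0 " agree; real evaluation code
    # would use a proper math-equivalence check instead.
    normalized = [a.strip() for a in answers if a is not None]
    if not normalized:
        return ""
    return Counter(normalized).most_common(1)[0][0]

# Example: predicted = self_consistency(sampled_answers)  # majority answer
```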
```bash
git clone https://github.com/cyzhh/MMOS.git
cd MMOS
conda create -n MMOS python=3.10
conda activate MMOS
pip install -r requirements.txt
```
To identify the minimal optimal set, we follow these steps (a rough sketch of the deduplication step follows the list):
- Sample a sufficient number of correct reasoning paths to form an initial set.
- Implement a deduplication algorithm to obtain its deduplicated subset.
- Conduct a statistical analysis of the relationship between the upper limit of reasoning paths per question, k, and the subset data amount, N.
- Perform SFT on several subsets to analyze the impact of removing duplicates and keeping varied reasoning paths.
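As a rough illustration of the first two steps (a sketch under simple assumptions, not the deduplication algorithm released with the paper; the normalization below is a placeholder):

```python
from collections import defaultdict

def dedup_reasoning_paths(samples, k):
    """Keep at most k distinct correct reasoning paths per question.

    `samples` is assumed to be a list of dicts with "question" and "path"
    (a correct reasoning path, e.g. generated code). Distinctness here is
    judged by a crude whitespace-insensitive normalization; the actual
    algorithm may compare paths far more carefully.
    """
    def normalize(path: str) -> str:
        return " ".join(path.split())

    kept, seen = [], defaultdict(set)
    for s in samples:
        key = normalize(s["path"])
        q = s["question"]
        if key not in seen[q] and len(seen[q]) < k:
            seen[q].add(key)
            kept.append(s)
    return kept
```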
We use the ToRA series to generate QA pairs from the open-source datasets GSM8K, MATH, and TAL-SCQ. The QA pairs are processed by our deduplication algorithm, resulting in the dataset MMOS. The total number of QA pairs is 135K.
The data, which we publish at 😊 HuggingFace, needs to be placed under the relative path ./train_data/MMOS/.
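For example, the dataset can be pulled from the Hub into that path with `huggingface_hub` (a sketch; the repo id `cyzhh/MMOS` is an assumption, so use the id shown on the 😊 HuggingFace page if it differs):

```python
from huggingface_hub import snapshot_download

# Download the MMOS dataset into the path expected by the training scripts.
# NOTE: the repo id below is assumed; replace it with the actual dataset id.
snapshot_download(
    repo_id="cyzhh/MMOS",
    repo_type="dataset",
    local_dir="./train_data/MMOS/",
)
```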
If you are interested in our work, note that we will publish details about the data processing after the paper is published.
Following scripts/generate.sh, the data construction steps are:
- Prepare your sampling results.
- Combine the results.
- Extract the true cases (a sketch of this check follows the list).
- Deduplicate the cases.
- (Optionally) filter and rerank.
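The "extract the true cases" step is essentially an answer check against the reference solution. A minimal sketch under simple assumptions (the released scripts likely use a more robust math-equivalence check):

```python
def is_true_case(predicted: str, reference: str, tol: float = 1e-4) -> bool:
    """Rough check that a sampled solution's final answer matches the reference.

    Assumes both answers can be parsed as numbers; non-numeric answers fall
    back to exact string comparison. The released scripts may differ.
    """
    try:
        return abs(float(predicted) - float(reference)) <= tol
    except (TypeError, ValueError):
        return predicted is not None and predicted.strip() == reference.strip()

# true_cases = [s for s in combined if is_true_case(s["pred"], s["answer"])]
```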
You can generate a dataset for testing the numerical robustness of model performance by executing the following script commands:
```bash
bash scripts/generate.sh
bash scripts/attack.sh
bash scripts/rerank.sh
```
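As a rough illustration of what a numerical-robustness attack does, the sketch below perturbs the numbers in a question; this is only an assumption-laden example, and `scripts/attack.sh` implements the actual perturbation (which must also update the reference answer accordingly).

```python
import random
import re

def perturb_numbers(question: str, seed: int = 0) -> str:
    """Shift each integer in a question by a small random amount.

    Illustrative only: the real attack script may choose perturbations
    differently and recomputes the gold answer for the perturbed question.
    """
    rng = random.Random(seed)

    def repl(match: re.Match) -> str:
        value = int(match.group())
        return str(max(1, value + rng.randint(1, 5)))

    return re.sub(r"\d+", repl, question)

# perturb_numbers("Tom has 3 apples and buys 12 more. How many does he have?")
```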
Due to resource constraints, we performed supervised fine-tuning on CodeLLaMA 7B, CodeLLaMA 13B, and CodeLLaMA 34B with our dataset on A100 40G GPUs. To reproduce our work from CodeLLaMA 7B/13B, you can train with the following instructions. The 34B model can be trained with the DDP script.
```bash
bash scripts/train_single.sh codellama 7b
bash scripts/train.sh codellama 34b
```
Run inference with:
```bash
bash scripts/infer.sh
```
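Inference relies on vLLM (note the vllm and transformers versions mentioned in the news above). A minimal generation sketch, with the model id and prompt format as assumptions; `scripts/infer.sh` is the reference implementation:

```python
from vllm import LLM, SamplingParams

# Placeholder model id: point this at one of the released MMOS checkpoints
# on HuggingFace (see the model links in the news section above).
llm = LLM(model="path/or/hf-id/of/MMOS-model")

params = SamplingParams(temperature=0.0, max_tokens=1024)
prompts = ["Question: What is 12 * 7?\nAnswer:"]  # prompt format is assumed

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```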
If you find this repository helpful, please consider citing our paper:
```bibtex
@misc{chen2024empirical,
      title={An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning},
      author={Zui Chen and Yezeng Chen and Jiaqi Han and Zhijie Huang and Ji Qi and Yi Zhou},
      year={2024},
      eprint={2403.00799},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```