Create a new conda environment and install the dependencies:
```bash
conda create -n attnrl python=3.10
conda activate attnrl
bash scripts/install_vllm_sglang_mcore.sh
```

### Data Preparation
The training dataset ([DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset)) is stored at `data/train/deepscaler_train.parquet` and contains 40.3k mathematical reasoning examples.
The evaluation datasets are in `data/eval/`; the `_${K}` suffix indicates the number of duplicate samples per question.
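
As a minimal sketch of the `_${K}` convention, the helper below reads `K` back out of a path. It assumes the suffix is the last underscore-separated field of the file name before the extension; the `eval_k` function and the `example_32.parquet` name are hypothetical and not part of this repository.

```bash
# Hypothetical helper: print the K encoded in a name like "<dataset>_<K>.<ext>".
# Assumption: K is the last underscore-separated field before the extension.
eval_k() {
  base="${1##*/}"    # drop the directory part
  stem="${base%.*}"  # drop the extension, if any
  printf '%s\n' "${stem##*_}"  # keep what follows the last underscore
}

# Illustrative file name only; the real names in data/eval/ may differ.
eval_k "data/eval/example_32.parquet"

# List whatever is actually present in data/eval/ with its K.
for f in data/eval/*; do
  [ -e "$f" ] || continue  # skip the literal pattern when nothing matches
  echo "$f: K=$(eval_k "$f")"
done
```

If a file name does not follow this layout, the helper prints whatever follows the last underscore, so check the output before relying on it.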
### Training
To train AttnRL with the [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) backbone on 8 H100 GPUs, run:

The evaluation scripts are the same as the training scripts; add `+trainer.val_only=True` to run evaluation only. We recommend setting `data.max_prompt_length=2048` and `data.max_response_length=32768`.
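
A hedged sketch of what an evaluation-only invocation might look like is given below. It assumes the training script forwards extra command-line arguments to the trainer (for example via `"$@"`), which this section does not confirm; `TRAIN_SCRIPT` and its default path are hypothetical placeholders, and only the three overrides come from the text above.

```bash
# Hypothetical placeholder: point this at the training script you actually use.
TRAIN_SCRIPT="${TRAIN_SCRIPT:-path/to/train_script.sh}"

# Overrides taken from the recommendation above.
EVAL_OVERRIDES="+trainer.val_only=True data.max_prompt_length=2048 data.max_response_length=32768"

# Assumption: the script passes extra arguments through to the trainer.
# The command is only printed here so it can be inspected before running.
EVAL_CMD="bash $TRAIN_SCRIPT $EVAL_OVERRIDES"
echo "$EVAL_CMD"
```

If the script does not forward arguments, the same overrides can be appended to the trainer command inside a copy of the script instead.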
## 📝 Citation
If you find this work helpful, please kindly cite our paper:
```bibtex
@article{AttnRL,
title = {Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models},
author = {Liu, Runze and Wang, Jiakang and Shi, Yuling and Xie, Zhihui and An, Chenxin and Zhang, Kaiyan and Zhao, Jian and Gu, Xiaodong and Lin, Lei and Hu, Wenping and Li, Xiu and Zhang, Fuzheng and Zhou, Guorui and Gai, Kun},
journal = {arXiv preprint arXiv:2509.26628},
year = {2025}
}
```

## 💡 Acknowledgements
Our code is based on [verl](https://github.com/volcengine/verl) ([commit](https://github.com/volcengine/verl/commit/83ebd007e01de29bbe353de112d04245b4820b47)) and [TreeRL](https://github.com/THUDM/TreeRL).
Our training dataset is from [DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset), and the rule-based verifier is based on [Skywork-OR1](https://github.com/SkyworkAI/Skywork-OR1) and [Archer](https://github.com/wizard-III/ArcherCodeR).