Commit b4d1588 (parent 53e65ba): Add a script for testing F1 Score

5 files changed: +931 −0 lines
eval/README.md

## Accuracy Testing of Sparse Methods

### Overview

We use two Chinese subsets of [LongBench](https://huggingface.co/datasets/zai-org/LongBench) to test accuracy: multifieldqa_zh (single-document QA) and dureader (multi-document QA). The F1 score is used to evaluate the accuracy of the sparse methods. For more information about LongBench, please refer to https://github.com/THUDM/LongBench.
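The F1 metric here is token-level overlap between a model prediction and a reference answer. A minimal sketch of that computation, simplified to character-level tokens (LongBench's actual scorer segments words with jieba and normalizes the text first, which is why jieba appears in the dependencies below):

```python
from collections import Counter


def qa_f1_zh(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a prediction and a reference answer.

    Simplified sketch: treats each Chinese character as a token.
    LongBench's real scorer uses jieba word segmentation plus
    answer normalization before computing the same overlap F1.
    """
    pred_tokens = list(prediction)
    gt_tokens = list(ground_truth)
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```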
### Quick Start
#### Environment Preparation
```shell
pip install jieba fuzzywuzzy rouge
```
#### Test Data Preparation
Download the LongBench dataset:

```shell
wget https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip && unzip data.zip
```
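The archive unpacks to one JSONL file per task. A sketch of reading such a file, assuming LongBench's record layout (one JSON object per line with `input`, `context`, and `answers` fields; the sample record and helper name are illustrative):

```python
import io
import json

# Hypothetical one-line sample in LongBench's JSONL record layout.
sample = '{"input": "问题", "context": "文档内容", "answers": ["答案"]}\n'


def load_jsonl(stream):
    # One JSON object per non-empty line, as in data/<task>.jsonl.
    return [json.loads(line) for line in stream if line.strip()]


records = load_jsonl(io.StringIO(sample))
```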
#### Configure Specific Sparse Method
Settings for different sparse methods are written in a JSON file, for example:

```json
{
    "ESA": {
        "init_window_sz": 1,
        "local_window_sz": 2,
        "min_blocks": 4,
        "sparse_ratio": 0.2,
        "retrieval_stride": 10
    }
}
```
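A script can read these settings with the standard `json` module. A minimal sketch, with the config inlined as a string so the example is self-contained (in practice it would be read from the file passed via `--config`; the range check is illustrative):

```python
import json

# The ESA config from above, inlined for a self-contained example.
config_text = """
{
    "ESA": {
        "init_window_sz": 1,
        "local_window_sz": 2,
        "min_blocks": 4,
        "sparse_ratio": 0.2,
        "retrieval_stride": 10
    }
}
"""

esa = json.loads(config_text)["ESA"]
# Illustrative sanity check: sparse_ratio is a fraction in (0, 1].
assert 0.0 < esa["sparse_ratio"] <= 1.0
```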

Run accuracy testing with:

```shell
cd eval

# Run with default settings: Qwen2.5-14B-Instruct, batch=20
bash eval_inference_F1.sh

# Run with custom parameters
# --strip_think: extract the text after </think> from model predictions
# --batch: number of requests processed per batch
bash eval_inference_F1.sh \
    --model /home/models/QwQ-32B \
    --config ./eval/ucm_sparse_config_esa.json \
    --data ./eval/data \
    --strip_think 1 \
    --batch 1
```
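The `--strip_think` option discards the reasoning prefix that models such as QwQ emit before their final answer. A sketch of that post-processing step (the helper name is hypothetical; the script's own implementation may differ):

```python
def strip_think(prediction: str) -> str:
    # Keep only the text after the last </think> tag; return the
    # prediction unchanged (whitespace-stripped) when no tag is present.
    _, sep, tail = prediction.rpartition("</think>")
    return tail.strip() if sep else prediction.strip()
```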
The result files will be saved in the `eval/ucm_sparse_predictions` folder.

### Results

Test results of Full Attention (Qwen2.5-14B-Instruct):

| Dataset         | F1-Score |
|-----------------|---------:|
| multifieldqa_zh |     66.6 |
| dureader        |    29.33 |