
Commit d5e954e

gpengzhi, liuyike-xiaomi, and liuyike3 authored

Merge dev into main

* guievalkit reconstructed (#13)

  Co-authored-by: liuyike3 <liuyike3@xiaomi.com>

* feat profiler (#14)

  * guievalkit reconstructed
  * feat: profiler
  * fix
  * Update results.md

---------

Co-authored-by: liuyike-xiaomi <liuyike_xiaomi@163.com>
Co-authored-by: liuyike3 <liuyike3@xiaomi.com>
1 parent 14e3cb3 commit d5e954e

128 files changed: 10912 additions & 4557 deletions


.gitignore

Lines changed: 4 additions & 2 deletions
```diff
@@ -48,6 +48,7 @@ coverage.xml
 *.cover
 .hypothesis/
 .pytest_cache/
+test/
 
 # Translations
 *.mo
@@ -117,10 +118,11 @@ dmypy.json
 # customize
 /data/android_control/
 /data/android_in_the_zoo/
-/data/cagui_agent/
+cagui_agent/
 /data/gui_odyssey/
-/data/hypertrack/
 /outputs
+/models
+/logs
 .idea/
 .DS_Store
 .vscode/
```

README.md

Lines changed: 126 additions & 21 deletions
````diff
@@ -6,15 +6,31 @@ GUIEvalKit is an open-source evaluation toolkit for GUI agents, allowing practit
 
 This work has been tested in the following environment:
 * `python == 3.10.12`
-* `torch == 2.7.1+cu126`
-* `transformers == 4.56.1`
-* `vllm == 0.10.1`
+* `torch == 2.8.1+cu128`
+* `transformers == 4.57.1`
+* `vllm == 0.11.0`
+
+### Installation
+
+Install the required dependencies:
+
+```bash
+pip install uv
+
+pushd ./guievalkit/
+uv venv ur_venv
+source ur_venv/bin/activate
+uv pip install -r requirements.txt -i accessible_url  # uv won't read accessible url from pip.conf
+```
+
+Make sure you have CUDA 12.8.x installed for GPU acceleration with vLLM.
 
 ## Supported Models
 
 | Model | Model Name | Organization |
 |---------------------------------------------------------|-----------------------------------------------|--------------|
 | [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) | `qwen2.5-vl-3/7/32/72b-instruct` | Alibaba |
+| [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) | `qwen3-vl-4/8b-instruct/thinking` | Alibaba |
 | [GUI-Owl](https://github.com/X-PLUG/MobileAgent) | `gui-owl-7/32b` | Alibaba |
 | [UI-Venus](https://github.com/inclusionAI/UI-Venus) | `ui-venus-navi-7/72b` | Ant Group |
 | [UI-TARS](https://github.com/bytedance/UI-TARS) | `ui-tars-2/7/72b-sft`, `ui-tars-7/72b-dpo` | Bytedance |
@@ -38,29 +54,118 @@ This work has been tested in the following environment:
 
 Please follow the [instructions](./data/README.md) to download and preprocess the datasets.
 
-## Data & Model Registration
+## Development
+
+### Configuration
 
-Please update the files [dataset_info.json](./config/dataset_info.json) and [model_info.json](./config/model_info.json) with your own information.
+Please update the configuration files or objects with your own information:
+- **[dataset_info.json](./config/dataset_info.json)**: Configure dataset paths and settings
+- **[guieval/config.py](./guieval/config.py)**: `DATASET` for clear type notation and static checking
+- **[model_paths.json](./config/model_paths.json)**: Configure default model paths for supported models
+
+### Model Core Implementation
+- **[ur_model.py](./guieval/models/ur_model.py)**: Implement your model's core methods
+- **[__init__.py](./guieval/models/__init__.py)**: Register your model
 
 ## Evaluation
 
+### Quick Start
+
+You can use the provided `run.sh` script as a template, or run directly with Python:
+
 ```bash
-python3 run.py \
-    --model agentcpm-gui-8b \
-    --dataset cagui_agent \
-    --mode all \
-    --outputs outputs/agentcpm-gui-8b/cagui_agent \
-    --use-vllm
+python3 run.py all \
+    --setup.datasets cagui_agent \
+    --setup.model.model_name agentcpm-gui-8b \
+    --setup.eval_mode offline_rule \
+    --setup.vllm_mode online
 ```
-**Arguments:**
-- `--model (str)`: Set the model name that is supported in GUIEvalKit (defined in `config/model_info.json`).
-- `--dataset (str)`: Set the benchmark name that is supported in GUIEvalKit (defined in `config/dataset_info.json`).
-- `--mode (str, default to 'all', choices are ['all', 'infer', 'eval'])`: When `mode` set to `all`, will perform both inference and evaluation; when set to `infer`, will only perform the inference; when set to `eval`, will only perform the evaluation.
-- `--outputs (str, default to './outputs')`: The directory to save evaluation results.
-- `--batch-size (int, default to 64)`: The batch size used for inference.
-- `--no-think`: Use this argument if you want to disable the thinking mode (if applicable).
-- `--use-vllm`: Use this argument if you want to inference with `vllm`, otherwise `transformers` will be adopted.
-- `--over-size`: Use this argument for deploying large models on four GPUs and inferring with `vllm`.
+
+### Command Structure
+
+The evaluation command follows this structure:
+
+```bash
+python3 run.py <mode> [--setup.<config_path> <value> ...]
+```
+
+**Mode Options:**
+- `all`: Perform both inference and evaluation (default)
+- `infer`: Only perform inference
+- `eval`: Only perform evaluation (currently not implemented)
+
+### Configuration Options
+
+#### Dataset Configuration
+- `--setup.datasets (str | list)`: Comma-separated list of datasets to evaluate. Supported datasets: `androidcontrol_low`, `androidcontrol_high`, `cagui_agent`, `gui_odyssey`, `aitz`
+
+#### Model Configuration
+- `--setup.model.model_name (str)`: Model name from the supported models list (required)
+- `--setup.model.model (str)`: Custom model path (optional, defaults to the path in `model_paths.json`)
+- `--setup.model.model_alias (str)`: Human-readable model identifier for logs (optional, defaults to `model_name`)
+- `--setup.model.max_model_len (int)`: Maximum context length (default: 8192)
+- `--setup.model.tensor_parallel_size (int)`: Number of GPUs for tensor parallelism (default: 1)
+- `--setup.model.data_parallel_size (int)`: Number of GPUs for data parallelism (default: 1)
+- `--setup.model.pipeline_parallel_size (int)`: Number of GPUs for pipeline parallelism (default: 1)
+- `--setup.model.max_num_batched_tokens (int)`: Maximum batched tokens per inference (default: 4096)
+- `--setup.model.max_num_seqs (int)`: Maximum sequences per inference (default: 32)
+- `--setup.model.image_limit (int)`: Maximum images per prompt (default: 3)
+
+#### Evaluation Configuration
+- `--setup.eval_mode (str)`: Evaluation mode (default: `offline_rule`)
+  - `offline_rule`: Evaluate the model off-policy based on predefined rules
+  - `semi_online`: Evaluate on-policy with the model's own outputs when the task succeeds
+  - ...
+- `--setup.vllm_mode (str)`: vLLM inference mode (default: `online`)
+  - `online`: Use vLLM online serving for concurrent generation
+  - `offline`: Use vLLM batched generation
+- `--setup.enable_thinking (bool)`: Enable thinking mode for models that support it (default: `true`)
+- `--setup.batch_size (int)`: Task batch size for offline vLLM mode (default: 64)
+- `--setup.max_concurrent_tasks (int)`: Maximum concurrent tasks for online vLLM mode (default: 128)
+
+#### Output Configuration
+- `--setup.output_dir (str)`: Directory to save evaluation results (default: `./outputs`)
+- `--setup.log_dir (str)`: Directory to save logs (default: `./logs/guieval`)
+
+### Example: Using run.sh
+
+You can modify `run.sh` to customize your evaluation:
+
+```bash
+datasets=androidcontrol_high,gui_odyssey,cagui_agent
+model="ui-tars-1.5-7b"
+model_path="None"   # or /path/to/specific_model
+model_alias="None"  # or custom alias
+mode=all
+vllm_mode=online
+max_model_len=40960
+tp=1
+dp=8
+pp=1
+tokens_batch_size=16384
+seq_box=32
+image_limit=1
+concurrent=32
+eval_mode=offline_rule
+enable_thinking=false
+
+python3 run.py ${mode} \
+    --setup.datasets ${datasets} \
+    --setup.model.model_name ${model} \
+    --setup.model.model_alias ${model_alias} \
+    --setup.model.model ${model_path} \
+    --setup.model.max_model_len ${max_model_len} \
+    --setup.model.tensor_parallel_size ${tp} \
+    --setup.model.data_parallel_size ${dp} \
+    --setup.model.pipeline_parallel_size ${pp} \
+    --setup.model.max_num_batched_tokens ${tokens_batch_size} \
+    --setup.model.max_num_seqs ${seq_box} \
+    --setup.model.image_limit ${image_limit} \
+    --setup.eval_mode ${eval_mode} \
+    --setup.vllm_mode ${vllm_mode} \
+    --setup.max_concurrent_tasks ${concurrent} \
+    --setup.enable_thinking ${enable_thinking}
+```
 
 **Please check [here](./docs/results.md) for the detailed evaluation results.**
 
@@ -74,4 +179,4 @@ To add new GUI agents and benchmarks to GUIEvalKit, please refer to the [Develop
 
 ## Acknowledgement
 
-This repo benefits from [AgentCPM-GUI/eval](https://github.com/OpenBMB/AgentCPM-GUI/tree/main/eval) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). Thanks for their wonderful works.
\ No newline at end of file
+This repo benefits from [AgentCPM-GUI/eval](https://github.com/OpenBMB/AgentCPM-GUI/tree/main/eval) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). Thanks for their wonderful works.
````
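The README diff introduces a CLI that takes dotted `--setup.<config_path>` overrides. As an illustration of how such dotted keys can fold into a nested configuration, here is a small sketch; the `apply_setup_overrides` helper is hypothetical, not GUIEvalKit's actual parser:

```python
def apply_setup_overrides(pairs: list[tuple[str, str]]) -> dict:
    """Fold dotted --setup.* keys into a nested dict (illustrative only)."""
    config: dict = {}
    for dotted, value in pairs:
        *parents, leaf = dotted.split(".")
        node = config
        for key in parents:
            # Create intermediate dicts on demand, e.g. config["setup"]["model"].
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


cfg = apply_setup_overrides([
    ("setup.datasets", "cagui_agent"),
    ("setup.model.model_name", "agentcpm-gui-8b"),
    ("setup.eval_mode", "offline_rule"),
])
print(cfg["setup"]["model"]["model_name"])  # agentcpm-gui-8b
```

The dotted form keeps flat shell variables (as in `run.sh`) while still yielding grouped model/eval/output settings.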

config/__init__.py

Lines changed: 12 additions & 0 deletions
```diff
@@ -0,0 +1,12 @@
+import json
+
+from config.config_utils import CONFIG_BASE, model_config_handler
+
+MODEL_PATH_FILE = CONFIG_BASE / 'model_paths.json'
+MODEL_PATHS: dict = json.loads(MODEL_PATH_FILE.read_text())
+
+
+__all__ = ['CONFIG_BASE',
+           'model_config_handler',
+           'MODEL_PATH_FILE',
+           'MODEL_PATHS']
```
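`config/__init__.py` joins a package-relative base with `/` and parses the JSON via `read_text`. A self-contained sketch of the same pattern, using `pathlib` against a throwaway directory (the file contents below are illustrative, not the repo's real `model_paths.json`):

```python
import json
import tempfile
from pathlib import Path

# Stand-in for CONFIG_BASE: importlib.resources.files('config') returns a
# Traversable supporting the same '/' join and read_text() calls as Path.
config_base = Path(tempfile.mkdtemp())

# Illustrative contents mirroring the shape of model_paths.json.
(config_base / "model_paths.json").write_text(
    json.dumps({"agentcpm-gui-8b": {"folder_name": "models/AgentCPM-GUI"}})
)

model_path_file = config_base / "model_paths.json"
model_paths: dict = json.loads(model_path_file.read_text())
print(model_paths["agentcpm-gui-8b"]["folder_name"])  # models/AgentCPM-GUI
```

Using `importlib.resources` in the real module means the JSON is found relative to the installed `config` package, not the current working directory.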

config/config_utils.py

Lines changed: 78 additions & 0 deletions
```diff
@@ -0,0 +1,78 @@
+try:
+    import flash_attn  # noqa: F401
+    hf_attn_implementation = "flash_attention_2"
+except Exception:
+    hf_attn_implementation = "sdpa"
+
+import importlib.resources as res
+
+from pydantic import BaseModel
+from typing import Any
+from transformers import (AutoProcessor, AutoTokenizer, AutoModelForCausalLM,
+                          Qwen2VLForConditionalGeneration, Qwen2VLProcessor, Qwen2_5_VLForConditionalGeneration,
+                          Qwen3VLForConditionalGeneration,
+                          Glm4vForConditionalGeneration, Glm4vMoeForConditionalGeneration)
+
+CONFIG_BASE = res.files('config')
+
+
+class ModelConfig(BaseModel):
+    llm_class: Any
+    tokenizer_class: Any
+    attn_implementation: str | Any
+
+
+MODEL_CONFIGS: dict[tuple[str], ModelConfig] = {
+    ("agentcpm-gui-8b", ): ModelConfig(
+        llm_class=AutoModelForCausalLM,
+        tokenizer_class=AutoTokenizer,
+        attn_implementation="sdpa"
+    ),
+    ("qwen2.5-vl-3b-instruct", "qwen2.5-vl-7b-instruct",
+     "ui-tars-1.5-7b",
+     "mimo-vl-7b-sft", "mimo-vl-7b-rl", "mimo-vl-7b-sft-2508", "mimo-vl-7b-rl-2508",
+     "gui-owl-7b", "gui-owl-32b",
+     "ui-venus-navi-7b", "ui-venus-navi-72b"): ModelConfig(
+        llm_class=Qwen2_5_VLForConditionalGeneration,
+        tokenizer_class=AutoProcessor,
+        attn_implementation=hf_attn_implementation
+    ),
+    ("qwen3-vl-4b-instruct", "qwen3-vl-4b-thinking", "qwen3-vl-8b-instruct", "qwen3-vl-8b-thinking"): ModelConfig(
+        llm_class=Qwen3VLForConditionalGeneration,
+        tokenizer_class=AutoProcessor,
+        attn_implementation=hf_attn_implementation
+    ),
+    ("ui-tars-2b-sft", "ui-tars-7b-sft", "ui-tars-7b-dpo", "ui-tars-72b-sft", "ui-tars-72b-dpo"): ModelConfig(
+        llm_class=Qwen2VLForConditionalGeneration,
+        tokenizer_class=AutoProcessor,
+        attn_implementation=hf_attn_implementation
+    ),
+    ("glm-4.1v-9b-thinking", ): ModelConfig(
+        llm_class=Glm4vForConditionalGeneration,
+        tokenizer_class=AutoProcessor,
+        attn_implementation=hf_attn_implementation
+    ),
+    ("glm-4.5v", ): ModelConfig(
+        llm_class=Glm4vMoeForConditionalGeneration,
+        tokenizer_class=AutoProcessor,
+        attn_implementation=hf_attn_implementation
+    ),
+    ("magicgui-cpt", "magicgui-rft"): ModelConfig(
+        llm_class=Qwen2VLForConditionalGeneration,
+        tokenizer_class=Qwen2VLProcessor,
+        attn_implementation=hf_attn_implementation
+    ),
+}
+
+
+def model_config_handler(model_name: str) -> ModelConfig:
+    '''
+    Get the model config for a given model name.
+
+    If the model name is not found, a ValueError will be raised.
+    '''
+    for _names, _config in MODEL_CONFIGS.items():
+        if model_name in _names:
+            return _config
+    else:
+        raise ValueError(f"Model {model_name} not found in {MODEL_CONFIGS}")
```
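`model_config_handler` relies on Python's `for ... else`: the `else` branch runs only when the loop finishes without returning. A standalone sketch of that lookup, with plain strings standing in for the transformers classes and the pydantic model:

```python
# Tuple-keyed registry, as in MODEL_CONFIGS; values simplified to strings.
REGISTRY = {
    ("agentcpm-gui-8b",): "AutoModelForCausalLM",
    ("qwen2.5-vl-7b-instruct", "ui-tars-1.5-7b"): "Qwen2_5_VLForConditionalGeneration",
}


def lookup(model_name: str) -> str:
    for names, config in REGISTRY.items():
        if model_name in names:
            return config
    else:
        # Reached only if no key tuple contained model_name
        # (the loop completed without hitting the return).
        raise ValueError(f"Model {model_name} not found")


print(lookup("ui-tars-1.5-7b"))  # Qwen2_5_VLForConditionalGeneration
```

Grouping aliases into one tuple key lets many model names share a single `ModelConfig` without duplicating entries.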

config/dataset_info.json

Lines changed: 6 additions & 6 deletions
```diff
@@ -1,27 +1,27 @@
 {
     "aitz": {
-        "folder_name": "data/android_in_the_zoo",
+        "folder_name": "android_in_the_zoo",
         "split": "test",
         "subset": ["general", "google_apps", "install", "web_shopping"]
     },
     "androidcontrol_high":{
-        "folder_name": "data/android_control",
+        "folder_name": "android_control",
         "split": "test",
         "subset": ["android_control"]
     },
     "androidcontrol_low":{
-        "folder_name": "data/android_control",
+        "folder_name": "android_control",
         "split": "test",
         "subset": ["android_control"]
     },
     "cagui_agent": {
-        "folder_name": "data/cagui_agent",
+        "folder_name": "cagui_agent",
         "split": "test",
         "subset": ["domestic"]
     },
     "gui_odyssey": {
-        "folder_name": "data/gui_odyssey",
+        "folder_name": "gui_odyssey",
         "split": "test",
         "subset": ["gui_odyssey"]
     }
-}
\ No newline at end of file
+}
```
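The dataset_info.json change strips the hard-coded `data/` prefix from every `folder_name`, so entries can compose with a configurable data root. A sketch of that resolution; the `resolve_dataset_dir` helper and its `data_root` default are assumptions, not code from the repo:

```python
from pathlib import Path

# Subset of dataset_info.json after the change: folder names are relative.
DATASET_INFO = {
    "aitz": {"folder_name": "android_in_the_zoo", "split": "test"},
    "cagui_agent": {"folder_name": "cagui_agent", "split": "test"},
}


def resolve_dataset_dir(name: str, data_root: str = "data") -> Path:
    # A relative folder_name joins cleanly with any root directory,
    # which a hard-coded "data/" prefix would prevent.
    return Path(data_root) / DATASET_INFO[name]["folder_name"]


print(resolve_dataset_dir("aitz").as_posix())         # data/android_in_the_zoo
print(resolve_dataset_dir("aitz", "/mnt").as_posix())  # /mnt/android_in_the_zoo
```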
config/model_paths.json

Lines changed: 14 additions & 2 deletions
```diff
@@ -3,7 +3,7 @@
         "folder_name": "models/AgentCPM-GUI"
     },
     "ui-tars-1.5-7b": {
-        "folder_name": "models/UI-TARS-1.5-7B"
+        "folder_name": "models/UI-TARS/UI-TARS-1.5-7B"
     },
     "ui-tars-2b-sft": {
         "folder_name": "models/UI-TARS-2B-SFT"
@@ -26,6 +26,18 @@
     "qwen2.5-vl-7b-instruct": {
         "folder_name": "models/Qwen2.5-VL-7B-Instruct"
     },
+    "qwen3-vl-4b-instruct": {
+        "folder_name": "models/Qwen3-VL-4B-Instruct"
+    },
+    "qwen3-vl-4b-thinking": {
+        "folder_name": "models/Qwen3-VL-4B-Thinking"
+    },
+    "qwen3-vl-8b-instruct": {
+        "folder_name": "models/Qwen3-VL-8B-Instruct"
+    },
+    "qwen3-vl-8b-thinking": {
+        "folder_name": "models/Qwen3-VL-8B-Thinking"
+    },
     "mimo-vl-7b-sft": {
         "folder_name": "models/MiMo-VL-7B-SFT"
     },
@@ -62,4 +74,4 @@
     "magicgui-rft":{
         "folder_name": "models/MagicGUI_RFT"
     }
-}
\ No newline at end of file
+}
```

0 commit comments
