xiaomi-research/guievalkit

A Unified Toolkit for Evaluating GUI Agents

GUIEvalKit is an open-source toolkit for evaluating GUI agents on a range of offline benchmarks. It aims to keep evaluation simple for researchers and developers and to make results easy to reproduce.

Requirements and Installation

This work has been tested in the following environment:

  • python == 3.10.12
  • torch == 2.8.1+cu128
  • transformers == 4.57.1
  • vllm == 0.11.0

Installation

Install the required dependencies:

pip install uv

pushd ./guievalkit/
uv venv ur_venv
source ur_venv/bin/activate
uv pip install -r requirements.txt -i accessible_url  # replace accessible_url with a reachable index URL; uv does not read it from pip.conf

Make sure you have CUDA 12.8.x installed for GPU acceleration with vLLM.
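Before running anything, it can help to compare the local environment with the tested versions listed above. The following is a minimal sketch, not part of GUIEvalKit: the helper names `installed_version` and `compare_environment` are hypothetical, and the exact string comparison is an assumption that may flag builds that differ only in their local tag (for example a `+cu128` suffix).

```python
import sys
from importlib import metadata

# Versions reported as tested in the list above.
TESTED_VERSIONS = {
    "torch": "2.8.1+cu128",
    "transformers": "4.57.1",
    "vllm": "0.11.0",
}


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_environment(expected=TESTED_VERSIONS):
    """Map each package name to (tested, installed, exact_match)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    current = ".".join(map(str, sys.version_info[:3]))
    print(f"python: tested 3.10.12, running {current}")
    for package, (wanted, found, ok) in compare_environment().items():
        status = "ok" if ok else "differs"
        print(f"{package}: tested {wanted}, installed {found} [{status}]")
```

A mismatch is not necessarily a problem; the list above only records the environment the toolkit was tested in.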

Supported Models

| Model | Model Name | Organization |
|---|---|---|
| Qwen2.5-VL | qwen2.5-vl-3/7/32/72b-instruct | Alibaba |
| Qwen3-VL | qwen3-vl-4/8b-instruct/thinking | Alibaba |
| GUI-Owl | gui-owl-7/32b | Alibaba |
| UI-Venus | ui-venus-navi-7/72b | Ant Group |
| UI-TARS | ui-tars-2/7/72b-sft, ui-tars-7/72b-dpo | Bytedance |
| UI-TARS-1.5 | ui-tars-1.5-7b | Bytedance |
| MagicGUI | magicgui-cpt/rft | Honor |
| AgentCPM-GUI | agentcpm-gui-8b | ModelBest |
| MiMo-VL | mimo-vl-7b-sft/rl, mimo-vl-7b-sft/rl-2508 | Xiaomi |
| GLM-V | glm-4.1v-9b-thinking, glm-4.5v | Zhipu AI |

Supported Benchmarks

| Dataset | Task Name | Task | Description |
|---|---|---|---|
| AndroidControl | androidcontrol_low/high | Agent | 1680 episodes, (10814 - 653) steps |
| CAGUI | cagui_agent | Agent | 600 episodes, 4516 steps |
| GUI Odyssey | gui_odyssey | Agent | 1933 episodes, 29426 steps |
| AiTZ | aitz | Agent | 506 episodes, 4724 steps |
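To get a feel for how long episodes are in each benchmark, the figures above can be turned into an average number of steps per episode. This is only an illustrative sketch: the numbers are copied from the table, the AndroidControl subtraction is evaluated on the assumption that 653 steps are excluded from evaluation, and `steps_per_episode` is a hypothetical helper.

```python
# (episodes, steps) per dataset, copied from the table above.
# AndroidControl is listed as "(10814 - 653) steps"; the subtraction is
# evaluated here on the assumption that 653 steps are excluded.
BENCHMARKS = {
    "AndroidControl": (1680, 10814 - 653),
    "CAGUI": (600, 4516),
    "GUI Odyssey": (1933, 29426),
    "AiTZ": (506, 4724),
}


def steps_per_episode(benchmarks=BENCHMARKS):
    """Return the average number of steps per episode for each dataset."""
    return {
        name: steps / episodes
        for name, (episodes, steps) in benchmarks.items()
    }


if __name__ == "__main__":
    for name, average in steps_per_episode().items():
        episodes, steps = BENCHMARKS[name]
        print(f"{name}: {episodes} episodes, {steps} steps, "
              f"{average:.1f} steps/episode")
```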

Data Preparation

Please follow the instructions to download and preprocess the datasets.

Development

Configuration

Please update the configuration files or objects with your own information:

Model Core Implementation

Evaluation

Quick Start

You can use the provided run.sh script as a template, or run directly with Python:

python3 run.py all \
    --setup.datasets cagui_agent \
    --setup.model.model_name agentcpm-gui-8b \
    --setup.eval_mode offline_rule \
    --setup.vllm_mode online

Command Structure

The evaluation command follows this structure:

python3 run.py <mode> [--setup.<config_path> <value> ...]

Mode Options:

  • all: Perform both inference and evaluation (default)
  • infer: Only perform inference
  • eval: Only perform evaluation (currently not implemented)
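When launching evaluations from another script, the dotted `--setup.<config_path> <value>` options can be generated from a nested dictionary. The sketch below assumes only the command structure shown above; `flatten` and `build_command` are hypothetical helpers, and joining lists with commas and lower-casing booleans are assumptions based on the documented option formats.

```python
import shlex


def flatten(config, prefix="setup"):
    """Yield (dotted_path, value) pairs from a nested dict."""
    for key, value in config.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def build_command(mode, config):
    """Build the argv list for `python3 run.py <mode> --setup.<path> <value> ...`."""
    argv = ["python3", "run.py", mode]
    for path, value in flatten(config):
        if isinstance(value, (list, tuple)):
            value = ",".join(map(str, value))   # comma-separated list
        elif isinstance(value, bool):
            value = str(value).lower()          # true / false
        argv += [f"--{path}", str(value)]
    return argv


config = {
    "datasets": ["cagui_agent"],
    "model": {"model_name": "agentcpm-gui-8b"},
    "eval_mode": "offline_rule",
    "vllm_mode": "online",
}
command = build_command("all", config)
print(shlex.join(command))
```

With this configuration the printed command matches the Quick Start invocation. The resulting list can be passed to `subprocess.run(command)` from the repository root.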

Configuration Options

Dataset Configuration

  • --setup.datasets (str | list): Comma-separated list of datasets to evaluate. Supported datasets: androidcontrol_low, androidcontrol_high, cagui_agent, gui_odyssey, aitz
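Because `--setup.datasets` accepts either a comma-separated string or a list, a wrapper script may want to normalize and validate the value before starting a long run. This is a minimal sketch of such a check, not the toolkit's own parser; `parse_datasets` is a hypothetical name and the supported names are taken from the option description above.

```python
SUPPORTED_DATASETS = (
    "androidcontrol_low",
    "androidcontrol_high",
    "cagui_agent",
    "gui_odyssey",
    "aitz",
)


def parse_datasets(value):
    """Accept a comma-separated string or a list and return validated names."""
    if isinstance(value, str):
        names = [name.strip() for name in value.split(",") if name.strip()]
    else:
        names = [str(name).strip() for name in value]
    unknown = [name for name in names if name not in SUPPORTED_DATASETS]
    if unknown:
        raise ValueError(f"unsupported datasets: {', '.join(unknown)}")
    return names


print(parse_datasets("androidcontrol_high, gui_odyssey,cagui_agent"))
```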

Model Configuration

  • --setup.model.model_name (str): Model name from the supported models list (required)
  • --setup.model.model (str): Custom model path (optional, defaults to path in model_paths.json)
  • --setup.model.model_alias (str): Human-readable model identifier for logs (optional, defaults to model_name)
  • --setup.model.max_model_len (int): Maximum context length (default: 8192)
  • --setup.model.tensor_parallel_size (int): Number of GPUs for tensor parallelism (default: 1)
  • --setup.model.data_parallel_size (int): Number of GPUs for data parallelism (default: 1)
  • --setup.model.pipeline_parallel_size (int): Number of GPUs for pipeline parallelism (default: 1)
  • --setup.model.max_num_batched_tokens (int): Maximum batched tokens per inference (default: 4096)
  • --setup.model.max_num_seqs (int): Maximum sequences per inference (default: 32)
  • --setup.model.image_limit (int): Maximum images per prompt (default: 3)

Evaluation Configuration

  • --setup.eval_mode (str): Evaluation mode (default: offline_rule)
    • offline_rule: Evaluate the model off-policy against predefined rules
    • semi_online: Evaluate on-policy, using the model's own outputs when the task succeeds
    • ...
  • --setup.vllm_mode (str): vLLM inference mode (default: online)
    • online: Use vLLM online serving for concurrent generation
    • offline: Use vLLM batched generation
  • --setup.enable_thinking (bool): Enable thinking mode for models that support it (default: true)
  • --setup.batch_size (int): Task batch size for offline vLLM mode (default: 64)
  • --setup.max_concurrent_tasks (int): Maximum concurrent tasks for online vLLM mode (default: 128)
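The two throughput options apply to different vLLM modes: `batch_size` is documented for offline mode and `max_concurrent_tasks` for online mode. The sketch below makes that pairing explicit; `throughput_setting` is a hypothetical helper, and the defaults are the ones listed above.

```python
VLLM_MODES = ("online", "offline")


def throughput_setting(vllm_mode="online", batch_size=64, max_concurrent_tasks=128):
    """Return the (option, value) pair documented for the given vLLM mode."""
    if vllm_mode not in VLLM_MODES:
        raise ValueError(f"vllm_mode must be one of {VLLM_MODES}, got {vllm_mode!r}")
    if vllm_mode == "online":
        # Online serving: tasks are generated concurrently.
        return ("max_concurrent_tasks", max_concurrent_tasks)
    # Offline mode: tasks are processed in batches.
    return ("batch_size", batch_size)


for mode in VLLM_MODES:
    option, value = throughput_setting(mode)
    print(f"{mode}: --setup.{option} {value}")
```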

Output Configuration

  • --setup.output_dir (str): Directory to save evaluation results (default: ./outputs)
  • --setup.log_dir (str): Directory to save logs (default: ./logs/guieval)

Example: Using run.sh

You can modify run.sh to customize your evaluation:

datasets=androidcontrol_high,gui_odyssey,cagui_agent
model="ui-tars-1.5-7b"
model_path="None"  # or /path/to/specific_model
model_alias="None"  # or custom alias
mode=all
vllm_mode=online
max_model_len=40960
tp=1
dp=8
pp=1
tokens_batch_size=16384
seq_box=32
image_limit=1
concurrent=32
eval_mode=offline_rule
enable_thinking=false

python3 run.py ${mode} \
    --setup.datasets ${datasets} \
    --setup.model.model_name ${model} \
    --setup.model.model_alias ${model_alias} \
    --setup.model.model ${model_path} \
    --setup.model.max_model_len ${max_model_len} \
    --setup.model.tensor_parallel_size ${tp} \
    --setup.model.data_parallel_size ${dp} \
    --setup.model.pipeline_parallel_size ${pp} \
    --setup.model.max_num_batched_tokens ${tokens_batch_size} \
    --setup.model.max_num_seqs ${seq_box} \
    --setup.model.image_limit ${image_limit} \
    --setup.eval_mode ${eval_mode} \
    --setup.vllm_mode ${vllm_mode} \
    --setup.max_concurrent_tasks ${concurrent} \
    --setup.enable_thinking ${enable_thinking}

Please check here for the detailed evaluation results.

Acknowledgement

This repo benefits from AgentCPM-GUI/eval and VLMEvalKit. Thanks to their authors for the wonderful work.
