# AI Agents for Computer Use

An awesome list of computer control agents (GUI automation of desktop and mobile devices) 🚀.

Please have a look at our website for more information.

## Repository Contents

### Agents

- **Abukadah et al.** - [Mapping Natural Language Intents to User Interfaces through Vision-Language Models]
- **Bishop et al.** - [Latent State Estimation Helps UI Agents to Reason]
- **Bonatti et al.** - [Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale]
- **Branavan et al.** - [Reinforcement Learning for Mapping Instructions to Actions]
- **Chae et al.** - [Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation]
- **Cheng et al.** - [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents]
- **Cho et al.** - [CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only]
- **Deng et al.** - [Mind2Web: Towards a Generalist Agent for the Web]
- **Deng et al.** - [Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents]
- **Deng et al.** - [On the Multi-turn Instruction Following for Conversational Web Agents]
- **Ding et al.** - [MobileAgent: enhancing mobile control via human-machine interaction and SOP integration]
- **Dorka et al.** - [Training a Vision Language Model as Smartphone Assistant]
- **Fereidouni et al.** - [Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning]
- **Furuta et al.** - [Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web]
- **Furuta et al.** - [Multimodal Web Navigation with Instruction-Finetuned Foundation Models]
- **Gao et al.** - [ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation]
- **Guan et al.** - [Intelligent Virtual Assistants with LLM-based Process Automation]
- **Guo et al.** - [PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion]
- **Gur et al.** - [A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis]
- **Gur et al.** - [Environment Generation for Zero-Shot Compositional Reinforcement Learning]
- **Gur et al.** - [Learning to Navigate the Web]
- **Gur et al.** - [Understanding HTML with Large Language Models]
- **He et al.** - [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models]
- **Hong et al.** - [CogAgent: A Visual Language Model for GUI Agents]
- **Humphreys et al.** - [A data-driven approach for learning to control computers]
- **Iki et al.** - [Do BERTs Learn to Use Browser User Interface? Exploring Multi-Step Tasks with Unified Vision-and-Language BERTs]
- **Jia et al.** - [DOM-Q-NET: Grounded RL on Structured Language]
- **Kil et al.** - [Dual-View Visual Contextualization for Web Navigation]
- **Kim et al.** - [Language Models can Solve Computer Tasks]
- **Koh et al.** - [Tree Search For Language Model Agents]
- **Lai et al.** - [AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent]
- **Lee et al.** - [Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation]
- **Li** - [Learning UI Navigation through Demonstrations composed of Macro Actions]
- **Li et al.** - [A Zero-Shot Language Agent for Computer Control with Structured Reflection]
- **Li et al.** - [AppAgent v2: Advanced Agent for Flexible Mobile Interactions]
- **Li et al.** - [Glider: A Reinforcement Learning Approach to Extract UI Scripts from Websites]
- **Li et al.** - [Interactive Task Learning from GUI-Grounded Natural Language Instructions and Demonstrations]
- **Li et al.** - [Mapping Natural Language Instructions to Mobile UI Action Sequences]
- **Li et al.** - [On the Effects of Data Scale on Computer Control Agents]
- **Li et al.** - [UINav: A Practical Approach to Train On-Device Automation Agents]
- **Lin et al.** - [Automating Web-based Infrastructure Management via Contextual Imitation Learning]
- **Liu et al.** - [Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration]
- **Lo et al.** - [Hierarchical Prompting Assists Large Language Model on Web Navigation]
- **Lu et al.** - [GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices]
- **Lu et al.** - [OmniParser for Pure Vision Based GUI Agent]
- **Lu et al.** - [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue]
- **Lutz et al.** - [WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents]
- **Ma et al.** - [CoCo-Agent: Comprehensive Cognitive LLM Agent for Smartphone GUI Automation]
- **Ma et al.** - [LASER: LLM Agent with State-Space Exploration for Web Navigation]
- **Mazumder et al.** - [FLIN: A Flexible Natural Language Interface for Web Navigation]
- **Murty et al.** - [BAGEL: Bootstrapping Agents by Guiding Exploration with Language]
- **Nakano et al.** - [WebGPT: Browser-assisted question-answering with human feedback]
- **Niu et al.** - [ScreenAgent: A Vision Language Model-driven Computer Control Agent]
- **Nong et al.** - [MobileFlow: A Multimodal LLM For Mobile GUI Agent]
- **Pan et al.** - [Autonomous Evaluation and Refinement of Digital Agents]
- **Putta et al.** - [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents]
- **Rahman et al.** - [V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM]
- **Rawles et al.** - [Android in the Wild: A Large-Scale Dataset for Android Device Control]
- **Shaw et al.** - [From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces]
- **Shi et al.** - [World of Bits: An Open-Domain Platform for Web-Based Agents]
- **Sodhi et al.** - [HeaP: Hierarchical Policies for Web Actions using LLMs]
- **Song et al.** - [MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot]
- **Song et al.** - [Navigating Interfaces with AI for Enhanced User Interaction]
- **Song et al.** - [RestGPT: Connecting Large Language Models with Real-World RESTful APIs]
- **Song et al.** - [VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning]
- **Sun et al.** - [AdaPlanner: Adaptive Planning from Feedback with Language Models]
- **Sun et al.** - [META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI]
- **Tao et al.** - [WebWISE: Web Interface Control and Sequential Exploration with Large Language Models]
- **Wang et al.** - [Enabling Conversational Interaction with Mobile UI using Large Language Models]
- **Wang et al.** - [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception]
- **Wang et al.** - [OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation]
- **Wen et al.** - [AutoDroid: LLM-powered Task Automation in Android]
- **Wen et al.** - [DroidBot-GPT: GPT-powered UI Automation for Android]
- **Wu et al.** - [MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding]
- **Wu et al.** - [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement]
- **Xu et al.** - [Grounding Open-Domain Instructions to Automate Web Support Tasks]

### Datasets

- **Shi et al.** - [World of Bits: An Open-Domain Platform for Web-Based Agents]
- **Liu et al.** - [Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration]
- **Xu et al.** - [Grounding Open-Domain Instructions to Automate Web Support Tasks]
- **Gur et al.** - [Environment Generation for Zero-Shot Compositional Reinforcement Learning]
- **Yao et al.** - [WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents]
- **Deng et al.** - [Mind2Web: Towards a Generalist Agent for the Web]
- **Koroglu et al.** - [QBE: QLearning-Based Exploration of Android Applications]
- **Rawles et al.** - [Android in the Wild: A Large-Scale Dataset for Android Device Control]
- **Zhou et al.** - [WebArena: A Realistic Web Environment for Building Autonomous Agents]
- **Li et al.** - [Mapping Natural Language Instructions to Mobile UI Action Sequences]
- **Toyama et al.** - [AndroidEnv: A Reinforcement Learning Platform for Android]
- **Burns et al.** - [A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility]
- **Xie et al.** - [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments]
- **Shvo et al.** - [AppBuddy: Learning to Accomplish Tasks in Mobile Apps via Reinforcement Learning]
- **Sun et al.** - [META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI]
- **Liu et al.** - [AgentBench: Evaluating LLMs as Agents]
- **Chen et al.** - [WebVLN: Vision-and-Language Navigation on Websites]
- **Song et al.** - [RestGPT: Connecting Large Language Models with Real-World RESTful APIs]
- **Koh et al.** - [VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks]
- **Deng et al.** - [On the Multi-turn Instruction Following for Conversational Web Agents]
- **Kapoor et al.** - [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web]
- **Wen et al.** - [Empowering LLM to use Smartphone for Intelligent Task Automation]
- **Gao et al.** - [ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation]
- **Niu et al.** - [ScreenAgent: A Vision Language Model-driven Computer Control Agent]
- **Drouin et al.** - [WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?]
- **Lai et al.** - [AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent]
- **Zhang et al.** - [Android in the Zoo: Chain-of-Action-Thought for GUI Agents]
- **Chen et al.** - [GUICourse: From General Vision Language Models to Versatile GUI Agents]
- **Guo et al.** - [PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion]
- **Venkatesh et al.** - [UGIF: UI Grounded Instruction Following]
- **Zheng et al.** - [AgentStudio: A Toolkit for Building General Virtual Agents]
- **Zhang et al.** - [Mobile-Env: An Evaluation Platform and Benchmark for LLM-GUI Interaction]
- **Chen et al.** - [GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents]
- **Chai et al.** - [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents]

## Citation

If you find this work helpful, please cite:

```bibtex
@misc{sager_cca_2025,
  title={AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants},
  author={Pascal J. Sager and Benjamin Meyer and Peng Yan and Rebekka von Wartburg-Kottler and Layan Etaiwi and Aref Enayati and Gabriel Nobel and Ahmed Abdulkadir and Benjamin F. Grewe and Thilo Stadelmann},
  year={2025},
  eprint={2501.16150},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2501.16150},
}
```

## Website License

This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).