GPT-Monkey: Enhancing Automated GUI Testing for Android Apps via LLM-Driven Interface Understanding and Function Segmentation
With the rising popularity of mobile apps, automated Graphical User Interface (GUI) testing is crucial for quality assurance. However, both the traditional Monkey tool and optimized model-based or learning-based GUI testing methods still lack deep GUI understanding and analysis of user test requirements, wasting numerous test events when only specific functions need testing. To address this problem, we propose GPT-Monkey, which integrates Monkey’s randomness with the powerful understanding capability of LLMs. GPT-Monkey establishes global interface associations, performs cross-modal alignment between the layout and screenshot of the target function’s entry interface, and then encodes them along with user requirements for iterative feedback. GPT-Monkey enables function segmentation and tailored parameter generation guided by Parameter-RAG, which are subsequently decoded into executable Monkey-based scripts to perform the targeted testing. Compared to the optimal baseline, GPT-Monkey improves crash detection by 7.7% and testing efficiency by 34.5%, achieves 95% function segmentation accuracy on 1000 Google Play apps, and uncovers 397 crashes.
- Zhanhui Yuan – School of Cryptography Engineering, PLA Information Engineering University, Zhengzhou, China.
- Kai Chen – Institute of Information Engineering, CAS, Beijing, China.
- Zhi Yang – School of Cryptography Engineering, PLA Information Engineering University, Zhengzhou, China.
- Dongxue Jiang – CRIStAL, Centrale Lille Institut, Villeneuve d’Ascq, France.
- Jinglei Tan – School of Cryptography Engineering, PLA Information Engineering University, Zhengzhou, China.
- Hongqi Zhang – School of Cryptography Engineering, PLA Information Engineering University, Zhengzhou, China.
GPT-Monkey is implemented as a fully automated GUI testing tool and has been evaluated on Windows with Android emulators. Below is a step-by-step guide to deploy and run GPT-Monkey on a Windows machine with an Android emulator:
- **System Requirements:** A Windows 10/11 (64-bit) PC is recommended. Ensure you have Python 3.8+ installed. GPT-Monkey was tested on a Windows 11 machine with an Intel Core i7 CPU.
- **Android Emulator/Device:** Install the Android SDK (via Android Studio or the command-line tools) to get an emulator and ADB (Android Debug Bridge). GPT-Monkey supports Android 5.0 to 11.0 devices/emulators. You can create an emulator (AVD) using Android Studio: for example, a Google Pixel 6 emulator running Android 7.1.1 (as used in our experiments). Allocate sufficient resources (e.g., 2 GB RAM, 1 GB SD card, 1080×2400 resolution) for the emulator. Launch the emulator and ensure it appears in `adb devices`. (For real devices, enable USB debugging and connect via ADB.)
- **Clone the Repository:**

  ```shell
  git clone https://github.com/Project-YZH/GPT-Monkey.git
  cd GPT-Monkey
  ```
- **Install Dependencies:** GPT-Monkey’s Python dependencies are listed in `requirements.txt`. Install them using pip:

  ```shell
  pip install -r requirements.txt
  ```

  This installs all required packages (e.g., the OpenAI API client, UIAutomator2). The key dependencies include:

  ```
  lxml==5.4.0
  numpy==2.2.4
  openai==0.27.8
  pandas==2.2.3
  Requests==2.32.3
  scikit_learn==1.7.0
  uiautomator2==2.16.25
  ```

  Ensure the installation succeeds with no errors.
- **Configuration:** GPT-Monkey uses the OpenAI GPT-4 model by default. You may need an OpenAI API key or equivalent access; check the repository documentation for how to configure API credentials (if required). Also make sure `adb` is in your `PATH` so that the tool can invoke Monkey/ADB commands.
- **Connect to the Emulator:** In the GPT-Monkey interface, make sure the device/emulator is connected. Some tools (like `uiautomator2`) require using `adb connect`. If needed, follow the repository’s instructions to connect the UIAutomator2 client to the emulator.
- **Running GPT-Monkey:** Launch the GPT-Monkey GUI application via the provided launch script:

  ```shell
  python gpt_monkey.py
  ```

  This should open the GPT-Monkey interface. Note: the emulator/device must be running, and the target app must be installed and open to the desired screen, before you begin testing.
Following these steps, GPT-Monkey should be ready to use. You will interact with its GUI to specify testing requirements and initiate tests. The tool will utilize UIAutomator2 to obtain the UI layout and screenshot, DroidBot to gather global interface structures, and a Monkey-based engine (we use Maxim, a modified Monkey) to execute tests.
GPT-Monkey provides a user-friendly GUI with three main interfaces (screens) for configuring and running tests:
This is the starting screen for GPT-Monkey, where the user specifies the test goal in natural language. Interface 1 contains a text input field for the test requirement and a Confirm button to submit it. For example, the user might input:
“I want to test the add note function on the current interface for 2 minutes.”
Once the user clicks Confirm, GPT-Monkey will capture the current UI layout and screenshot, retrieve the global app interface graph, and compile this information with the user’s request into a prompt for the LLM.

*Interface 1 – Requirement Input*
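The prompt-compilation step can be sketched in plain Python. The function name, template wording, and sample layout below are illustrative assumptions, not GPT-Monkey’s exact prompt:

```python
# Sketch of how Interface 1 might combine the user's requirement, the
# simplified layout, and the global interface graph into one LLM prompt.
# (In GPT-Monkey the screenshot is also encoded alongside the layout.)

def build_prompt(requirement: str, layout_xml: str, interface_graph: dict) -> str:
    """Assemble a single prompt string from the captured UI context."""
    graph_lines = "\n".join(
        f"{src} -> {', '.join(dsts)}" for src, dsts in interface_graph.items()
    )
    return (
        "You are analyzing an Android app UI for targeted Monkey testing.\n"
        f"User requirement: {requirement}\n"
        f"Current interface layout (simplified XML):\n{layout_xml}\n"
        f"Global interface associations:\n{graph_lines}\n"
        "Identify the entry element of the target function and propose "
        "basic Monkey test parameters."
    )

prompt = build_prompt(
    "I want to test the add note function on the current interface for 2 minutes.",
    "<node class='Button' text='Add note' bounds='[880,1650][1040,1810]'/>",
    {"NoteListActivity": ["EditNoteActivity"]},
)
```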
After LLM analysis, GPT-Monkey moves to Interface 2.
In this interface, GPT-Monkey displays the basic test parameters generated by the LLM based on the user’s request. These parameters typically include the target device ID, test duration, log level/verbosity, output report path, whether to perform function segmentation, and the target function name or identifier.
The user can review these suggested parameters and modify any if needed (e.g., adjust the duration or toggle the “segment function” option). This interactive step allows the tester to correct any misunderstandings or refine the test configuration. Once the parameters look good, the user clicks Confirm in Interface 2, and GPT-Monkey will send the updated parameters along with the UI info back to the LLM for a second round of analysis.

*Interface 2 – Parameter Editing*
In this second LLM call, the model uses the confirmed parameters to perform function segmentation and finalize the Monkey test instructions.
The final interface presents the segmentation results and test instructions. Here, GPT-Monkey shows the identified function entry coordinates (the bounding box of the UI element that starts the target function) and the list of associated interfaces (activities/screens) that belong to that function. It also displays the constructed Monkey command based on the tailored parameters.
The user can review this information, then press the Send Instruction button to launch the Monkey-based testing on the target function. GPT-Monkey will execute the Monkey test (via ADB) with the specified parameters and focus, and the results (logs, crashes, coverage, etc.) are collected. After the test run, Interface 3 will update to show any crash reports or observations from the run.
Uniquely, GPT-Monkey’s LLM will analyze the test logs behind the scenes and automatically suggest an optimized set of parameters for the next iteration, which the interface presents to the user as a new Monkey instruction (the user can then simply click Send again for another round). This iterative cycle can continue, forming a feedback loop to improve testing.

*Interface 3 – Results & Instruction*
Under the hood: GPT-Monkey uses Android’s UI Automator to extract the UI hierarchy XML and screenshot of the current screen. It uses DroidBot to traverse the app and build a global state graph of interfaces (this helps identify associated screens for the target function). The Monkey-based engine Maxim is then driven by a generated script that directs events preferentially to the target function’s UI. All these details are abstracted away by the three interfaces, making GPT-Monkey easy to use for testers without requiring them to write any code or scripts.
A cornerstone of GPT-Monkey is interface segmentation, which means isolating the part of the app relevant to the user-specified function. In this context, a function is defined as a set of related interfaces that together achieve a certain user goal (for example, “Add Note” might involve a note list screen and an edit-note screen). GPT-Monkey represents a function as a tuple (E, A), where E is the entry UI element (the coordinates of the GUI component that triggers the function) and A is the set of associated interface screens/activities that belong to that function’s workflow.
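The (E, A) representation can be written down as a minimal data structure; the class and field names here are our own illustration, not GPT-Monkey’s code:

```python
# Minimal sketch of the (E, A) function representation described above.
from dataclasses import dataclass

@dataclass
class Function:
    entry_bounds: tuple   # E: bounding box (left, top, right, bottom) of the entry element
    associated: set       # A: interfaces/activities in the function's workflow

# Hypothetical "Add Note" function: entry button plus its two screens.
add_note = Function(
    entry_bounds=(880, 1650, 1040, 1810),
    associated={"NoteListActivity", "EditNoteActivity"},
)
```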
To perform segmentation, GPT-Monkey leverages the LLM’s understanding of the app’s UI content and structure. It first uses UIAutomator to capture the current screen’s layout hierarchy and a screenshot, and simplifies this UI tree to essential attributes (component type, text, resource ID, bounds) to make it easier for the LLM to parse. It then establishes a global interface association graph for the app using DroidBot, which crawls the app to discover what screens are reachable from where. Given the user’s target description, GPT-Monkey’s LLM analyzes the current interface layout and screenshot (with a cross-modal alignment of text and vision) to determine which UI element is likely the function’s entry point. It identifies that element’s bounding box (E) and uses the global interface graph to collect all interfaces likely to be involved when that function is executed (A). The result of segmentation is essentially focusing the test on interfaces E + A and ignoring unrelated parts of the app.
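The UI-tree simplification step can be sketched with `lxml` (a listed dependency). The hierarchy dump below is a hand-made example, and the attribute selection mirrors the four essentials named above:

```python
# Sketch: reduce a UIAutomator hierarchy dump to the essential attributes
# (component type, text, resource ID, bounds) for the LLM to parse.
from lxml import etree

def simplify_layout(xml_text: str) -> list:
    """Flatten a UI hierarchy dump into a list of attribute dicts."""
    root = etree.fromstring(xml_text.encode())
    nodes = []
    for node in root.iter("node"):
        nodes.append({
            "class": node.get("class"),
            "text": node.get("text"),
            "resource-id": node.get("resource-id"),
            "bounds": node.get("bounds"),
        })
    return nodes

dump = """<hierarchy>
  <node class="android.widget.Button" text="Add note"
        resource-id="com.example:id/fab_add" bounds="[880,1650][1040,1810]"
        enabled="true" focusable="true" clickable="true"/>
</hierarchy>"""
simplified = simplify_layout(dump)
```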
Specifying the right Monkey parameters for a given test scenario is crucial for effectiveness. GPT-Monkey introduces a Parameter Retrieval-Augmented Generation (Parameter-RAG) module to enhance the LLM’s ability to generate valid and optimal test parameters.
GPT-Monkey maintains an external knowledge base describing Monkey’s command-line parameters and their effects in natural language. When the LLM is tasked with producing test parameters, the Parameter-RAG module retrieves relevant knowledge snippets (using embedding-based similarity search) and feeds them into the LLM’s prompt. This ensures the generated Monkey command is both syntactically correct and semantically appropriate, reducing invalid or nonsensical parameters.
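The retrieval step can be sketched as follows. GPT-Monkey uses embedding-based similarity search; here TF-IDF cosine similarity (via scikit-learn, a listed dependency) stands in for the embedding model, and the knowledge snippets are our own paraphrases of Monkey’s documentation:

```python
# Sketch of Parameter-RAG retrieval: rank knowledge-base snippets by
# similarity to the query and return the most relevant ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "--pct-touch <percent>: adjust the percentage of touch events.",
    "--pct-motion <percent>: adjust the percentage of motion (swipe) events.",
    "--throttle <ms>: insert a fixed delay between events.",
    "--ignore-crashes: continue sending events after an app crash.",
]

def retrieve(query: str, top_k: int = 2) -> list:
    """Return the top_k snippets most similar to the query."""
    vec = TfidfVectorizer()
    matrix = vec.fit_transform(knowledge_base + [query])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sims.argsort()[::-1][:top_k]
    return [knowledge_base[i] for i in ranked]

snippets = retrieve("increase the proportion of touch events")
```

The retrieved snippets are then appended to the LLM prompt, grounding the generated flags in documented behavior.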
GPT-Monkey’s testing process involves two feedback loops that refine the test parameters:
- **Interactive Feedback Loop:** After the LLM first suggests basic parameters, the user reviews them (Interface 2) and can adjust anything that looks off. The updated parameters are fed back into the LLM to generate the final optimized configuration (including function segmentation and the final Monkey script).
- **Test-Log Analysis Feedback Loop:** After a test run, GPT-Monkey analyzes execution logs, crash reports, and coverage information. If the target function was not adequately exercised, the system adjusts parameters (e.g., event counts, navigation events), and the LLM generates a refined Monkey instruction for the next iteration.
Together, these loops make GPT-Monkey’s testing process adaptive and reliable, mitigating LLM errors and coping with unexpected app behaviors.
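The second loop can be sketched as a pure-Python adjustment heuristic. In GPT-Monkey the LLM makes this decision; the threshold and step sizes below are illustrative assumptions:

```python
# Sketch of the test-log feedback loop: bump parameters when the target
# function was under-exercised in the previous run.
def refine_parameters(params: dict, coverage_of_target: float) -> dict:
    """Return an adjusted Monkey parameter dict for the next iteration."""
    refined = dict(params)
    if coverage_of_target < 0.5:
        # Target barely reached: add navigation (app-switch) events and
        # shift weight from motion toward touch events on the entry area.
        refined["pct-appswitch"] = refined.get("pct-appswitch", 0) + 5
        refined["pct-touch"] = min(refined.get("pct-touch", 0) + 10, 90)
        refined["pct-motion"] = max(refined.get("pct-motion", 0) - 10, 0)
    return refined

next_params = refine_parameters(
    {"pct-touch": 60, "pct-motion": 30, "pct-syskeys": 5},
    coverage_of_target=0.3,
)
```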
Here we demonstrate how GPT-Monkey can be used to test a specific app function using natural language:
- **Natural Language Prompt (Interface 1):** “I want to test the add note function on the current interface for 2 minutes.”
- **LLM-Generated Parameters (Interface 2):** GPT-Monkey displays a draft configuration based on the prompt (e.g., target package name, ~120 s duration, segmentation enabled, target function “Add Note”, suggested event percentages such as 60% touch, 30% motion, 10% other). The user reviews/tweaks and confirms.
- **Generated Monkey Command (Interface 3):** After final analysis, GPT-Monkey produces a tailored Monkey command. For instance:

  ```shell
  adb shell CLASSPATH=/sdcard/monkey.jar:/sdcard/framework.jar exec app_process /system/bin tv.panda.test.monkey.Monkey -p com.ichi2.anki --uiautomatormix --running-minutes 2 --pct-touch 60 --pct-motion 30 --pct-syskeys 5 --ignore-crashes --ignore-timeouts --throttle 200 --output-directory C:/output/
  ```

  This command starts the custom Monkey (Maxim) runner via `app_process`, targets `com.ichi2.anki`, runs for 2 minutes, focuses on touch/swipe events, and ignores crashes/timeouts to continue execution. The entry-area coordinates for Add Note are prioritized by the generated script.
- **During Execution:** Monkey (Maxim) drives the app, primarily exercising the Add Note path. Results are collected after completion or upon crash.
- **Results and Next Steps:** If the first run misses some sub-features, GPT-Monkey analyzes logs and adjusts parameters (e.g., more navigation events or specific actions). The user can run the new instruction for another round; subsequent runs may uncover crashes (e.g., saving a note with an empty title).
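Assembling the final command from a parameter dict is mechanical and can be sketched as follows. The function is our own illustration; the flag names and paths follow the example command in the walkthrough above:

```python
# Sketch: build the Maxim Monkey command string from tailored parameters.
def build_monkey_command(package: str, params: dict) -> str:
    """Assemble an adb shell command for the Maxim Monkey runner."""
    parts = [
        "adb shell",
        "CLASSPATH=/sdcard/monkey.jar:/sdcard/framework.jar",
        "exec app_process /system/bin tv.panda.test.monkey.Monkey",
        f"-p {package}",
        "--uiautomatormix",
    ]
    for flag, value in params.items():
        # Boolean flags (e.g., --ignore-crashes) take no value.
        parts.append(f"--{flag}" if value is True else f"--{flag} {value}")
    return " ".join(parts)

cmd = build_monkey_command(
    "com.ichi2.anki",
    {"running-minutes": 2, "pct-touch": 60, "pct-motion": 30,
     "pct-syskeys": 5, "ignore-crashes": True, "ignore-timeouts": True,
     "throttle": 200},
)
```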
To recap, GPT-Monkey requires a Python 3 environment with the following key libraries installed (from requirements.txt): lxml, numpy, openai, pandas, requests, scikit-learn, and uiautomator2 (see the Installation & Setup section above for exact versions). Ensure that you have these dependencies, as well as the Android SDK platform-tools (for ADB) and an emulator or device running Android 5–11. The tool is designed to be OS-independent, but primary testing was on Windows 10/11.
In general, the workflow can be automated after configuring environment variables (like OPENAI_API_KEY for the LLM) and ensuring an Android device is connected. You can invoke GPT-Monkey’s main module to perform a test and output results to log files.
If you prefer not to use the LLM-driven process, you can still use the Monkey command that GPT-Monkey generates (as seen in Interface 3). You can copy that command and run it manually via ADB or a shell script for regression testing. GPT-Monkey’s enhancements (Parameter-RAG and feedback loops) can integrate with traditional testing pipelines by providing better Monkey parameters.
The GPT-Monkey source code is openly available on GitHub: Project-YZH/GPT-Monkey. You can find detailed documentation, example config files, and issues/solutions discussed by the community there. Contributions and feedback are welcome via the repository.
If you use GPT-Monkey in your research or testing projects, please consider citing our work.
```bibtex
@article{yuan2025gptmonkey,
  title={{GPT-Monkey}: Enhancing Automated GUI Testing for Android Apps via LLM-Driven Interface Understanding and Function Segmentation},
  author={Yuan, Zhanhui and Chen, Kai and Yang, Zhi and Jiang, Dongxue and Tan, Jinglei and Zhang, Hongqi},
  journal={Manuscript submitted for publication},
  year={2025}
}
```

Yuan et al. (2025), “GPT-Monkey: Enhancing Automated GUI Testing for Android Apps via LLM-Driven Interface Understanding and Function Segmentation.”
We hope GPT-Monkey proves useful for your GUI testing needs. For any questions or support, please open an issue on the GitHub repository. Happy testing!