Routine-Gen is a crucial capability for LLM-based agents, enabling them to effectively orchestrate IT assets, such as APIs, to accomplish tasks. However, there is a lack of systems for evaluating how effectively LLMs generate routines in a business context.
4-One Bench is a lightweight evaluation system that helps users quickly assess the Routine-Gen capabilities of LLMs.
We have evaluated the Routine-Gen accuracy of models from OpenAI, Zhipu, Alibaba Cloud, ByteDance, and DeepSeek; the full list of evaluated models appears below.
The architecture of 4-One Bench follows a Generator-Verifier design pattern: the Generator converts tasks into routines guided by predefined knowledge graphs, while the Verifier employs LLMs to validate the generated routines.
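The loop below is a minimal sketch of this pattern, assuming simple prompt templates. The names involved (`call_llm`, `GEN_PROMPT`, `VERIFY_PROMPT`, `evaluate_task`) are illustrative placeholders, not the actual 4-One Bench API.

```python
# Minimal sketch of the Generator-Verifier loop. All names here
# (call_llm, GEN_PROMPT, VERIFY_PROMPT, evaluate_task) are illustrative
# placeholders, not the actual 4-One Bench API.

GEN_PROMPT = (
    "Given the knowledge graph below, write a routine that accomplishes the task.\n"
    "Knowledge graph:\n{graph}\nTask: {task}\nRoutine:"
)
VERIFY_PROMPT = (
    "Does the routine below accomplish the task using only assets from the\n"
    "knowledge graph? Answer PASS or FAIL.\n"
    "Knowledge graph:\n{graph}\nTask: {task}\nRoutine:\n{routine}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an LLM."""
    raise NotImplementedError("wire this to the model under test / the judge model")

def evaluate_task(task: str, graph: str) -> bool:
    # Generator: the evaluated LLM turns the one-sentence task into a routine,
    # guided by the predefined knowledge graph.
    routine = call_llm(GEN_PROMPT.format(graph=graph, task=task))
    # Verifier: an LLM judges whether the generated routine is valid.
    verdict = call_llm(VERIFY_PROMPT.format(graph=graph, task=task, routine=routine))
    return verdict.strip().upper().startswith("PASS")
```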
Furthermore, the evaluation system is built around four "One" features:
- One-sentence tasks: the dataset contains 51 tasks, each expressed in a single sentence.
- One knowledge graph: a knowledge graph describing IT-asset relationships guides the LLM during routine generation; users can also define their own knowledge graph.
- One attempt: since response time and accuracy are both critical in real business environments, 4-One Bench evaluates only the likelihood that an LLM generates a correct routine in a single attempt (i.e., pass@1).
- One syntax: 4-One Bench defines its own syntax for orchestrating IT assets, and users also have the flexibility to create a custom syntax. A hypothetical example of a task, graph fragment, and routine follows this list.
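To make the four features concrete, here is a hypothetical evaluation item. The task wording, graph triples, and step-based routine syntax are assumptions for illustration; the real 51-task dataset and the proprietary syntax may look different.

```python
# Hypothetical evaluation item. The task text, graph triples, and
# step-based routine syntax below are illustrative assumptions only.

# A "one-sentence task".
task = "Notify the on-call engineer when CPU usage on server-01 exceeds 90%."

# A fragment of a knowledge graph describing IT-asset relationships,
# encoded as (source_asset, relation, target_asset) triples.
knowledge_graph = [
    ("monitoring_api", "reads_metric_of", "server-01"),
    ("alerting_api", "sends_message_to", "oncall_engineer"),
]

# A routine the Generator might produce, written in a made-up step syntax.
routine = """
step 1: cpu = monitoring_api.get_metric(asset="server-01", metric="cpu")
step 2: if cpu > 90 then alerting_api.notify(target="oncall_engineer")
"""
```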
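Under the one-attempt rule, the benchmark score reduces to single-attempt (pass@1) accuracy: the fraction of tasks whose first generated routine passes verification. A minimal sketch, reusing the hypothetical `evaluate_task` from the Generator-Verifier example above:

```python
# Single-attempt (pass@1) scoring: each task is tried exactly once and the
# score is the fraction of tasks whose routine passes verification.
# evaluate_task is the hypothetical function from the sketch above.

def single_attempt_accuracy(tasks: list[str], graph: str) -> float:
    passed = sum(evaluate_task(task, graph) for task in tasks)
    return passed / len(tasks)
```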
- OpenAI: gpt-4o, gpt-4o-mini
- Zhipu: glm-4-plus, glm-4-0520, glm-4-flash, glm-4-air
- Alibaba Cloud: qwen-max, qwen-plus
- ByteDance: doubao-pro-32k
- DeepSeek: deepseek-chat
- Create and activate a new virtual environment:

      python -m venv .venv
      source .venv/bin/activate

- Install dependencies:

      pip install -r requirements.txt

- Launch the evaluation with the following command:

      streamlit run app/app.py
Follow our WeChat public account to stay updated, or send an email to [email protected] with any questions.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.