4onebench

Routine generation (Routine-Gen) is a crucial capability for LLM-based agents, enabling them to orchestrate IT assets, such as APIs, to accomplish tasks. However, few systems exist for evaluating how well LLMs generate routines in a business context.

The purpose of 4-One Bench is to create a lightweight evaluation system that can help users quickly assess the Routine-Gen capabilities of LLMs.

Evaluation

We evaluated models from OpenAI, Zhipu, Ali Cloud, and ByteDance (Doubao). The results below report their Routine-Gen accuracy:

Features and Architecture

4-One Bench uses a Generator-Verifier design pattern: the Generator converts tasks into routines based on predefined knowledge graphs, while the Verifier employs LLMs to validate the generated routines.
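The Generator-Verifier split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `evaluate_task`, `EvalResult`, `fake_generate`, and `fake_verify` are hypothetical names, and the two stand-in functions replace what would be LLM calls in a real run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signatures; the real project may structure these differently.
Generator = Callable[[str, Dict[str, List[str]]], str]  # (task, knowledge graph) -> routine
Verifier = Callable[[str, str], bool]                   # (task, routine) -> is it valid?


@dataclass
class EvalResult:
    task: str
    routine: str
    passed: bool


def evaluate_task(task: str, knowledge_graph: Dict[str, List[str]],
                  generate: Generator, verify: Verifier) -> EvalResult:
    """Generate one routine for the task, then ask the verifier to judge it."""
    routine = generate(task, knowledge_graph)
    return EvalResult(task=task, routine=routine, passed=verify(task, routine))


# Stand-ins for LLM calls, for illustration only.
def fake_generate(task: str, knowledge_graph: Dict[str, List[str]]) -> str:
    # Join the assets linked to the task into a simple step chain.
    return " -> ".join(knowledge_graph.get(task, []))


def fake_verify(task: str, routine: str) -> bool:
    # A real verifier would prompt an LLM; here any non-empty routine passes.
    return bool(routine)


result = evaluate_task(
    "reset password",
    {"reset password": ["lookup_user", "send_reset_link"]},
    fake_generate,
    fake_verify,
)
```

Keeping the generator and verifier as plain callables makes it easy to swap in a different model for either role without touching the evaluation loop.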

Furthermore, the evaluation system has 4 "One" features:

One-Query

Our dataset includes 51 "one-sentence tasks".

One-Knowledge Graph

Based on the tasks, a knowledge graph that describes the relationships among IT assets guides LLMs in generating routines. Users can also define their own knowledge graph.
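As a rough illustration of what a user-defined knowledge graph of IT asset relationships might look like, the sketch below stores edges in a plain dictionary and flattens them into text lines that could be placed in a prompt. The asset names, relation labels, and the `to_prompt_lines` helper are all hypothetical; the format 4-One Bench actually expects is not shown in this README.

```python
from typing import Dict, List, Tuple

# Hypothetical asset names and relation labels, for illustration only.
knowledge_graph: Dict[str, List[Tuple[str, str]]] = {
    "crm_api": [("provides", "customer_record")],
    "customer_record": [("required_by", "billing_api")],
    "billing_api": [("provides", "invoice")],
}


def to_prompt_lines(graph: Dict[str, List[Tuple[str, str]]]) -> List[str]:
    """Flatten the graph into 'source --relation--> target' lines."""
    return [
        f"{source} --{relation}--> {target}"
        for source, edges in graph.items()
        for relation, target in edges
    ]


prompt_context = "\n".join(to_prompt_lines(knowledge_graph))
```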

One-Shot

Because response time and accuracy are critical in real business environments, 4-One Bench evaluates only the likelihood that an LLM successfully generates a routine in a single attempt.
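Under this one-shot rule, a model's accuracy reduces to the fraction of tasks whose single generated routine passes verification. The sketch below assumes that definition; `one_shot_accuracy` is a hypothetical helper and the sample outcomes are made up for illustration, not measured results.

```python
from typing import Sequence


def one_shot_accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks whose single attempt passed verification.

    `outcomes` holds one bool per task; no retries are counted.
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for four tasks: three passed on the first attempt.
accuracy = one_shot_accuracy([True, False, True, True])
```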

One-Syntax

4-One Bench has developed a proprietary syntax for defining the orchestration of IT assets. Users also have the flexibility to create their own custom syntax.

LLMs Supported

OpenAI: gpt-4o, gpt-4o-mini
Zhipu: glm-4-plus, glm-4-0520, glm-4-flash, glm-4-air
Ali Cloud: qwen-max, qwen-plus
ByteDance: doubao-pro-32k
DeepSeek: deepseek-chat

Demo

Launch the evaluation

Create and activate a new virtual environment

python -m venv .venv
source .venv/bin/activate

Install dependencies

pip install -r requirements.txt

Run the following command to launch the evaluation:

streamlit run app/app.py

Contact

Follow my public WeChat account to stay updated:

or send an email to uncleyu89@gmail.com with any questions.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
