Routine-Gen is a crucial capability for LLM-based agents, enabling them to effectively orchestrate IT assets, such as APIs, to accomplish tasks. However, there is a lack of systems for evaluating how effectively LLMs generate routines in a business context.
4-One Bench is a lightweight evaluation system that helps users quickly assess the Routine-Gen capabilities of LLMs.
We have evaluated the Routine-Gen accuracy of models from OpenAI, Zhipu, Alibaba Cloud, ByteDance, and DeepSeek; the full list of evaluated models appears below.
The architecture of 4-One Bench follows a Generator-Verifier design pattern: the Generator converts tasks into routines guided by predefined knowledge graphs, while the Verifier employs LLMs to validate the generated routines.
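The loop below is a minimal sketch of this pattern, assuming simple prompt templates. The names involved (`call_llm`, `GEN_PROMPT`, `VERIFY_PROMPT`, `evaluate_task`) are illustrative placeholders, not the actual 4-One Bench API.

```python
# Minimal sketch of the Generator-Verifier loop. All names here
# (call_llm, GEN_PROMPT, VERIFY_PROMPT, evaluate_task) are illustrative
# placeholders, not the actual 4-One Bench API.

GEN_PROMPT = (
    "Given the knowledge graph below, write a routine that accomplishes the task.\n"
    "Knowledge graph:\n{graph}\nTask: {task}\nRoutine:"
)
VERIFY_PROMPT = (
    "Does the routine below accomplish the task using only assets from the\n"
    "knowledge graph? Answer PASS or FAIL.\n"
    "Knowledge graph:\n{graph}\nTask: {task}\nRoutine:\n{routine}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an LLM."""
    raise NotImplementedError("wire this to the model under test / the judge model")

def evaluate_task(task: str, graph: str) -> bool:
    # Generator: the evaluated LLM turns the one-sentence task into a routine,
    # guided by the predefined knowledge graph.
    routine = call_llm(GEN_PROMPT.format(graph=graph, task=task))
    # Verifier: an LLM judges whether the generated routine is valid.
    verdict = call_llm(VERIFY_PROMPT.format(graph=graph, task=task, routine=routine))
    return verdict.strip().upper().startswith("PASS")
```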
Furthermore, the evaluation system is built around four "One" features:
- One-sentence tasks: the dataset contains 51 tasks, each expressed in a single sentence.
- One knowledge graph: a knowledge graph describing IT-asset relationships guides the LLM during routine generation; users can also define their own knowledge graph.
- One attempt: since response time and accuracy are both critical in real business environments, 4-One Bench evaluates only the likelihood that an LLM generates a correct routine in a single attempt (i.e., pass@1).
- One syntax: 4-One Bench defines its own syntax for orchestrating IT assets, and users also have the flexibility to create a custom syntax. A hypothetical example of a task, graph fragment, and routine follows this list.
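To make the four features concrete, here is a hypothetical evaluation item. The task wording, graph triples, and step-based routine syntax are assumptions for illustration; the real 51-task dataset and the proprietary syntax may look different.

```python
# Hypothetical evaluation item. The task text, graph triples, and
# step-based routine syntax below are illustrative assumptions only.

# A "one-sentence task".
task = "Notify the on-call engineer when CPU usage on server-01 exceeds 90%."

# A fragment of a knowledge graph describing IT-asset relationships,
# encoded as (source_asset, relation, target_asset) triples.
knowledge_graph = [
    ("monitoring_api", "reads_metric_of", "server-01"),
    ("alerting_api", "sends_message_to", "oncall_engineer"),
]

# A routine the Generator might produce, written in a made-up step syntax.
routine = """
step 1: cpu = monitoring_api.get_metric(asset="server-01", metric="cpu")
step 2: if cpu > 90 then alerting_api.notify(target="oncall_engineer")
"""
```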
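Under the one-attempt rule, the benchmark score reduces to single-attempt (pass@1) accuracy: the fraction of tasks whose first generated routine passes verification. A minimal sketch, reusing the hypothetical `evaluate_task` from the Generator-Verifier example above:

```python
# Single-attempt (pass@1) scoring: each task is tried exactly once and the
# score is the fraction of tasks whose routine passes verification.
# evaluate_task is the hypothetical function from the sketch above.

def single_attempt_accuracy(tasks: list[str], graph: str) -> float:
    passed = sum(evaluate_task(task, graph) for task in tasks)
    return passed / len(tasks)
```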
- OpenAI: gpt-4o, gpt-4o-mini
- Zhipu: glm-4-plus, glm-4-0520, glm-4-flash, glm-4-air
- Alibaba Cloud: qwen-max, qwen-plus
- ByteDance: doubao-pro-32k
- DeepSeek: deepseek-chat
- Create and activate a new virtual environment:

      python -m venv .venv
      source .venv/bin/activate

- Install dependencies:

      pip install -r requirements.txt

- Launch the evaluation with the following command:

      streamlit run app/app.py
Follow our WeChat public account to stay updated, or send an email to [email protected] with any questions.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.