This is the leaderboard repository of the benchmark proposed in AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents (ACL 2024).
The project's main repository is here, and leaderboard UI is here. This repository stores bundled (encrypted) experiment outputs from participanting models (including our baselines) and the raw leaderboard data JSON which is dynamically rendered in the UI. You can use this repository to:
- Download and locally view experiment outputs from other participanting methods.
- Submit your own agent's experiment outputs to be included on the leaderboard via a PR.
For both cases, you first need to
- Install
appworld
:pip install appworld && appworld install
. - Install Git LFS and clone the
appworld-leaderboard
repository.
First, pack your agent's test_normal
and test_challenge
experiment outputs:
::Click:: Experiment outputs refresher
Your experiment outputs are located in ./experiments/outputs/{experiment_name}
relative to the APPWORLD_ROOT
, which as we discussed earlier, defaults to .
, but can be configured by passing APPWORLD_ROOT
environment variable or --root
in CLI.
For the leaderboard, experiment names must be alphanumeric with optional hyphens and underscores, and they must end with the dataset name, i.e., _test_normal
or _test_challenge
, e.g., react_gpt4o_test_normal
. You should have two experiment outputs, one for each dataset. The prefix portion of their names must be the same, e.g., react_gpt4o_test_normal
and react_gpt4o_test_challenge
. Rename the directory accordingly if necessary.
Now, pack
the two experiments, individually, with the following commands, and same metadata.
appworld pack {test_normal_experiment_name|test_challenge_experiment_name} \
# The method name used in the experiment:
--method_name METHOD_NAME \
# A brief additional note about the method:
--method_tooltip METHOD_TOOLTIP \
# The LLM name used in the experiment:
--llm_name LLM_NAME \
# A brief additional note about the LLM
--llm_tooltip \
# URL to find more information about this submission
-url URL
# Example:
# appworld pack react_gpt4o_test_normal \
# --method_name react
# --method_tooltip 'Reason + Act'
# --llm_name 'GPT4-o'
# --url 'https://appworld.dev/'
The pack command compresses and encrypts the experiment outputs in leaderboard.bundle
files within your experiment output directories. E.g., ./experiments/outputs/react_gpt4o_test_normal/leaderboard.bundle
.
Caution
Do NOT put your experiment outptus in an unencrypted format publicly accessible on the internet.
Next, Copy the two bundle files at the following locations relative to appworld-leaderboard
repo's root directory.
./experiments/outputs/{test_normal_experiment_name}/leaderboard.bundle
./experiments/outputs/{test_challenge_experiment_name}/leaderboard.bundle
And then create a PR with these two files. The Github actions will run evaluations and post your results on the PR. We will review the submission and merge it. When merged, the leaderboard on the website will update automatically.
Important
Reminder: We track experiment outputs in encrypted .bundle
files to reduce the risk of it becoming part of the training corpora of LLMs. So please do NOT post it here (or anywhere else publicly on the interent) in unencrypted or uncompressed format. See license.
CD
into the root of this repository. The experiment outputs are available in this format:
experiments/outputs/{experiment_name}/leaderboard.bundle
To unpack the bundle, run the following based on the experiment name you want to open.
appworld unpack {experiment_name}
The bundle will be unpacked in the directory:
experiments/outputs/{experiment_name}/