IDA-Bench is a benchmark for evaluating Large Language Models (LLMs) as data analysis agents in interactive, multi-round scenarios that reflect the iterative nature of real-world data analysis. Tasks are derived from Kaggle notebooks, with an LLM-simulated user providing sequential natural language instructions.
| Agent | Valid Submission (%) | Baseline Achieved (%) | Baseline Achieved / Valid Submission (%) | Avg Time (s) | Avg Turns | Avg Code Snippets |
|---|---|---|---|---|---|---|
| Gemini-2.5-pro | 88 | 40 | 45.45 | 711.63 | 18.24 | 11.80 |
| DeepSeek-V3 | 96 | 24 | 25.00 | 463.02 | 9.08 | 12.32 |
| DeepSeek-R1 | 68 | 12 | 17.65 | 567.79 | 7.24 | 12.16 |
| OpenAI o3 | 12 | 4 | 33.33 | 321.49 | 9.72 | 1.08 |
| OpenAI o4-mini | 96 | 40 | 41.67 | 224.02 | 9.16 | 7.04 |
| Claude-3.7-Sonnet-Thinking | 100 | 40 | 40.00 | 627.46 | 5.32 | 8.96 |
This leaderboard shows the performance of different LLM agents on the IDA-Bench benchmark. It compares several models: Gemini-2.5-pro, DeepSeek-V3, DeepSeek-R1, OpenAI o3, OpenAI o4-mini, and Claude-3.7-Sonnet-Thinking. Evaluation metrics include the valid submission rate, the percentage of runs that achieved the baseline, the percentage of valid submissions that achieved the baseline, average time taken, average conversation turns, and average number of code snippets. These initial results reveal limitations in even state-of-the-art coding agents during multi-round tests that are not apparent in single-turn tests. This underscores the necessity of enhancing the multi-round interactive capabilities of LLMs for building more reliable data analysis agents.
Before setting up the benchmark, make sure the following are installed:
- git. See this link for installation instructions.
- gcc. If you are using Windows, you can refer to this link for installation instructions. Usually, Mac and Linux come with gcc pre-installed. If not, you can install it via your package manager. For example, on Ubuntu, you can run:
```bash
sudo apt-get install build-essential
```
- uv. See this link for installation instructions.
- Docker. See this link for installation instructions.
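To quickly confirm that these prerequisites are available on your PATH, you can run the following; each command should print a version string:

```bash
git --version
gcc --version
uv --version
docker --version
```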
Then, you can set up the benchmark as follows:
- Clone this repository:
```bash
git clone https://github.com/lhydave/ida-bench && cd ./ida-bench
```
- Sync all dependencies:
```bash
uv sync
```
The datasets used in IDA-Bench are sourced from 25 publicly available, high-quality Jupyter notebooks from Kaggle. These notebooks and their corresponding datasets were selected to cover a diverse range of topics, including manufacturing, business, psychology, and weather prediction. The data are hosted on Kaggle and can be downloaded from here.
NOTE: You need a Kaggle API key to download the benchmark data. You can follow the instructions here to set up your Kaggle API key.
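For reference, on Mac and Linux the Kaggle API looks for your token at `~/.kaggle/kaggle.json`. Assuming you have downloaded `kaggle.json` from your Kaggle account settings to `~/Downloads`, a typical setup looks like:

```bash
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json  # keep the token readable only by you
```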
To download the benchmark data, you can run the following command in the root directory of the repository:
```bash
uv run run_benchmark.py --download_benchmark
```
NOTE: downloading the benchmark data can take a while; please be patient.
This will download the benchmark data and store it in the `benchmark_final` directory. For the organization of the benchmark data, please refer to the Kaggle website.
Before running the benchmark, you need to configure several files:
Create or modify `configs/base_config.toml` to set up the benchmark environment:
- Specify user LLM agent configuration
- Define benchmark paths (benchmark data, checkpoints, results, logs)
- Configure concurrent test execution via `max_workers`
- Optionally limit testing to specific test cases using `test_cases`
You can copy and modify the sample provided in `configs/sample_base_config_default.toml`.
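For orientation, here is a minimal sketch of what such a config might contain. Apart from `max_workers` and `test_cases`, which are named above, every key in this sketch is hypothetical, so always start from the sample file for the real format:

```toml
# Illustrative sketch only; copy configs/sample_base_config_default.toml for the actual keys.
max_workers = 4                       # number of tests executed concurrently
test_cases = ["case_01", "case_02"]   # optional: restrict the run to specific test cases

[user]                                # hypothetical section: the LLM that simulates the user
model = "gemini-2.5-pro"
api_key = "YOUR_API_KEY"

[paths]                               # hypothetical section: where data and outputs live
benchmark_data = "benchmark_final"
checkpoints = "checkpoints"
results = "results"
logs = "logs"
```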
NOTE: If you want to test using the shard user, you can copy and modify the sample provided in `configs/sample_base_config_shards.toml`.
Prepare agent configuration files in the `agent_configs/` directory:
- For multiple agents: Create separate `.toml` files for each agent in this directory
- For a single agent: Create `agent_configs/agent_config.toml`
Each agent config should specify:
- Agent name/id
- API keys for LLM services
- Model selection
- Agent framework details
Refer to `configs/sample_agent_config.toml` for the expected format.
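As a rough illustration of the fields listed above, an agent config might look like the sketch below. The key names here are guesses; only `base-agent` as a framework value is documented in this README, so treat the sample file as authoritative:

```toml
# Illustrative sketch only; see configs/sample_agent_config.toml for the actual format.
name = "my-agent"            # agent name/id
framework = "base-agent"     # agent framework details; base-agent is the only supported one

[llm]
model = "gpt-4o"             # model selection (hypothetical value)
api_key = "YOUR_API_KEY"     # API key for the LLM service
```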
We currently only support `base-agent` as the agent framework. It uses a modified version of open-interpreter as the Python interpreter backend. To use `base-agent`, please prepare `configs/interpreter_config.toml`. You can copy the sample from `configs/sample_interpreter_config.toml`.
Warning
Our modification fixes compatibility bugs in open-interpreter when working with Gemini. However, the system still suffers from race conditions and deadlock problems that arise from the asynchronous interactions between LLMs and the IPython interpreter.
As a consequence, `base-agent` occasionally gets stuck during code execution. We are currently working on re-implementing a more robust and reliable LLM-Python interactor.
To run the benchmark, you need to start the Docker service. If you use Docker Desktop (for example, on Windows or Mac), you can start it from the application. If you are using a Linux terminal, you can run the following commands to start the Docker service:
```bash
sudo systemctl start docker
sudo systemctl enable docker
```
NOTE: If you are using Docker Desktop, make sure that you have enabled host networking in the Docker settings.
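To verify that the Docker daemon is actually reachable before starting a run, you can use:

```bash
docker info  # prints server details if the daemon is running, errors out otherwise
```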
There are several modes for running the benchmark:
Run the benchmark with all configured agents and test cases with default configuration paths:
```bash
uv run run_benchmark.py
```
Specify custom paths for configurations:
```bash
uv run run_benchmark.py --config_path path/to/config --agent_config_path path/to/agent_config
```
Evaluate existing test results without running new tests:
```bash
uv run run_benchmark.py --evaluate_only
```
This will process results from the checkpoint directory defined in your base config.
Adjust logging verbosity with the `--log_level` or `-l` flag:
```bash
uv run run_benchmark.py --log_level DEBUG
```
Available levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
Each benchmark run produces:
- Logs: Individual test logs in the configured log directory
- Checkpoints: Complete interaction records in JSON format
- Submissions: Agent output data for evaluation
- Results: Evaluation metrics and scores for each test
- Summary: Overall benchmark performance summary
To test your own agent:
- Implement the `AgentClass` Protocol (refer to the `LLMInteract` implementation) in the `llms` file
- Add your agent to the `agent_dict` in your code (see the sketch after this list)
- Create appropriate configuration files as described above
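The snippet below sketches what the first two steps might look like. The real `AgentClass` Protocol lives in the `llms` file and may define different methods; the `chat` method and the shape of `agent_dict` here are assumptions for illustration only:

```python
# Hypothetical sketch: check the AgentClass Protocol in the llms file for the real interface.
from typing import Protocol


class AgentClass(Protocol):
    """Illustrative protocol; the actual method names and signatures may differ."""

    def chat(self, message: str) -> str:
        """Consume one user instruction and return the agent's reply."""
        ...


class MyAgent:
    """A minimal agent satisfying the illustrative protocol above."""

    def __init__(self, model: str, api_key: str) -> None:
        self.model = model
        self.api_key = api_key

    def chat(self, message: str) -> str:
        # Call your LLM service and execute generated code here;
        # echoing the instruction back is just a placeholder.
        return f"[{self.model}] received: {message}"


# Step 2: register the agent so the runner can look it up by name.
# agent_dict is mentioned in this README; its exact location and shape may differ.
agent_dict = {"my-agent": MyAgent}
```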
If you want to construct your own benchmark from Jupyter notebooks, you can use the independent scripts provided in the `scripts/` directory and the example use cases in the `examples/` directory. These scripts are designed to help you convert Jupyter notebooks into the format used in IDA-Bench. Please refer to `scripts/README.md` for more details about the scripts and `examples/README.md` for example use cases.
This project is licensed under the MIT License. See the LICENSE file for details.