This repository contains the evaluation harness for 2501 engines and mixtures of models. It uses the `evaluate.py` script to evaluate tasks defined in a JSONL file.
The script processes each task by unzipping corresponding files, executing commands, and running tests.
- Python 3.x
- 2501 CLI, available as an NPM package
On macOS 15+, we recommend creating a virtual environment with the following commands:

```bash
python3 -m venv venv
source venv/bin/activate
```

Then install the required packages:

```bash
pip3 install -r requirements.txt
```
If the installation of `psycopg2` fails, you may need to export the following environment variables:

```bash
export LDFLAGS="-L/opt/homebrew/opt/openssl/lib"
export CPPFLAGS="-I/opt/homebrew/opt/openssl/include"
```

This is a common problem with `psycopg2`; you can find more information on Stack Overflow.
- `evaluate.py`: The main script to process and evaluate tasks.
- `honest_benchmark.jsonl`: A JSONL file containing tasks to be evaluated.
- `files/`: A directory containing zip files and other necessary files for the tasks.
- Ensure you have Python 3.x installed on your system.
- Place the `config/honest_benchmark.jsonl` file in the same directory as `evaluate.py`.
- Create a `datasets/` directory in the same location and place the corresponding zip files there.
- Run the `evaluate.py` script:
```bash
python evaluate.py                              # Reads tasks from honest_benchmark.jsonl
python evaluate.py myfile.jsonl                 # Reads tasks from myfile.jsonl
python evaluate.py --test honest_24             # Runs a specific task by ID
python evaluate.py --agent-config CODING_AGENT  # Runs all tasks for a specific agent config
python evaluate.py --from honest_24             # Runs all tasks starting from a specific task ID
```
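For illustration, the CLI flags above could be handled with `argparse` roughly as follows. This is a hypothetical sketch, not the actual argument handling in `evaluate.py`; the flag names match the usage above, but the attribute names (`start_from`, etc.) are assumptions.

```python
# Hypothetical sketch of evaluate.py's CLI surface; the real script may differ.
import argparse

parser = argparse.ArgumentParser(description="Run benchmark tasks from a JSONL file")
parser.add_argument("jsonl", nargs="?", default="honest_benchmark.jsonl",
                    help="JSONL file of tasks (defaults to honest_benchmark.jsonl)")
parser.add_argument("--test", help="run a single task by ID, e.g. honest_24")
parser.add_argument("--agent-config", help="run all tasks for one agent config")
parser.add_argument("--from", dest="start_from",  # 'from' is a Python keyword
                    help="run all tasks starting from this task ID")

# Example: equivalent of `python evaluate.py --test honest_24`
args = parser.parse_args(["--test", "honest_24"])
print(args.jsonl, args.test)  # honest_benchmark.jsonl honest_24
```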
Each line in the `honest_benchmark.jsonl` file should be a valid JSON object with the following keys:

- `id`: A unique identifier for the task.
- `input`: The shell command to be executed.
- `test`: The test command to validate the task.
Example:

```json
{"id": "honest_1", "input": "echo 'Hello, World!'", "test": "assert 'Hello, World!' in command_output"}
```
The script will produce a `****_result.jsonl` file containing the results of each test, with the variable `passed=True|False` added to each line.
- The script checks if the `files/` directory exists and creates it if it doesn't.
- It reads the `honest_benchmark.jsonl` file line by line.
- For each task, it:
  - Parses the JSON line.
  - Constructs the zip file name based on the task ID.
  - Attempts to unzip the corresponding zip file in the `files/` directory.
  - Executes the shell command specified in the `input` key using `subprocess.run()`.
  - Runs the test command specified in the `test` key using Python's `exec()` function.
  - Prints the results of the task execution and test.
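The per-task steps above can be sketched as follows. This is a minimal illustration, not the actual `evaluate.py` code; the helper name, zip naming scheme, and `command_output` wiring are assumptions based on the description and the example task.

```python
# Minimal sketch of the per-task flow: parse, unzip, run command, run test.
import json
import os
import subprocess
import zipfile

FILES_DIR = "files"  # assumed location of task zip files

def run_task(line: str) -> bool:
    task = json.loads(line)                                   # parse the JSON line
    zip_path = os.path.join(FILES_DIR, f"{task['id']}.zip")   # zip name from task ID (assumed scheme)
    if os.path.exists(zip_path):                              # unzip if present
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(FILES_DIR)
    proc = subprocess.run(task["input"], shell=True,          # execute the "input" command
                          capture_output=True, text=True)
    command_output = proc.stdout
    try:
        # run the "test" command, exposing command_output to it
        exec(task["test"], {"command_output": command_output})
        return True
    except AssertionError:
        return False

ok = run_task('{"id": "honest_1", "input": "echo Hello, World!", '
              '"test": "assert \'Hello\' in command_output"}')
print("passed" if ok else "failed")  # passed
```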
- The script uses `subprocess.run()` with `shell=True`, which can be a security risk if the input is not properly sanitized. Ensure that the `input` commands in the JSONL file are from trusted sources.
- The `test` commands are executed using Python's `exec()` function, which can also be a security risk. Make sure the test commands are safe and from trusted sources.
- The script assumes that all necessary dependencies for running the tasks are already installed on the system.
- There's no built-in timeout mechanism for long-running tasks, which could potentially cause the script to hang.
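One way to mitigate the missing timeout (not currently in the script; the wrapper below is a hypothetical sketch) is to pass `timeout` to `subprocess.run()` and treat expiry as a failure:

```python
# Sketch: guarding a task's shell command with a timeout to prevent hangs.
import subprocess

def run_with_timeout(cmd: str, timeout_s: float = 30.0):
    """Run a shell command; return (stdout, timed_out)."""
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout_s)
        return proc.stdout, False
    except subprocess.TimeoutExpired:
        return "", True  # command exceeded the time limit

out, timed_out = run_with_timeout("echo ok", timeout_s=5)
print(out.strip(), timed_out)  # ok False
```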
Contributions to improve the script or documentation are welcome. Please submit a pull request or open an issue to discuss proposed changes.
This project is open-source and available under the MIT License.