LREC-COLING 2024 | HumanEval-XL: An Execution-based Multilingual Code Generation Benchmark Across 23 Natural Languages and 12 Programming Languages
This repository contains data and evaluation code for the paper "HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization".
- 26 February, 2024: 🎉 We release the official codebase and data! [GitHub, 🤗dataset] 🔥
- 19 February, 2024: 🎉 Our work has been accepted to LREC-COLING 2024! ✨
Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
The data is stored in data/program_language/natural_language/
. We have 80 parallel problems in 23 different natural languages and 12 programming languages.
23 NLs are: "English", "Russian", "Chinese", "German", "Spanish", "French", "Italian", "Portuguese", "Greek", "Hungarian", "Dutch", "Finnish", "Indonesian", "Turkish", "Arabic", "Vietnamese", "Bulgarian", "Persian", "Malay", "Hebrew", "Estonian", "Tagalog", "Afrikaans"
12 PLs are: "python", "java", "javascript", "csharp", "go", "kotlin", "perl", "php", "ruby", "scala", "swift", "typescript"
You can also use 🤗HuggingFace datasets to load a specific dataset and language of our dataset!!!
from datasets import load_dataset
dataset = load_dataset("FloatAI/HumanEval-XL", "python")
DatasetDict({
English: Dataset({
features: ['task_id', 'language', 'prompt', 'description', 'test', 'entry_point', 'canonical_solution', 'natural_language'],
num_rows: 80
})
Russian: Dataset({
features: ['task_id', 'language', 'prompt', 'description', 'test', 'entry_point', 'canonical_solution', 'natural_language'],
num_rows: 80
})
Chinese: Dataset({
features: ['task_id', 'language', 'prompt', 'description', 'test', 'entry_point', 'canonical_solution', 'natural_language'],
num_rows: 80
})
⋮
Afrikaans: Dataset({
features: ['task_id', 'language', 'prompt', 'description', 'test', 'entry_point', 'canonical_solution', 'natural_language'],
num_rows: 80
})
})
An example of a dataset instance (In python split with Chinese prompts - dataset["Chinese"][0]):
{
'task_id': 'python/0',
'language': 'python',
'prompt': 'from typing import List\n\n\ndef below_zero(operations: List[int]) -> bool:\n """ 你会得到一个银行账户的存款和取款操作列表,该账户从零余额开始。你的任务是检测账户余额是否在任何时候降至零以下,并在该点返回True。否则应返回False。\n \n >>> below_zero([1, 2, 3])\n False\n >>> below_zero([1, 2, -4, 5])\n True\n """\n',
'description': '你会得到一个银行账户的存款和取款操作列表,该账户从零余额开始。你的任务是检测账户余额是否在任何时候降至零以下,并在该点返回True。否则应返回False。\n ',
'test': "\n\nMETADATA = {\n 'author': 'jt',\n 'dataset': 'test'\n}\n\n\ndef check(candidate):\n assert candidate([]) == False\n assert candidate([1, 2, -3, 1, 2, -3]) == False\n assert candidate([1, 2, -4, 5, 6]) == True\n assert candidate([1, -1, 2, -2, 5, -5, 4, -4]) == False\n assert candidate([1, -1, 2, -2, 5, -5, 4, -5]) == True\n assert candidate([1, -2, 2, -2, 5, -5, 4, -4]) == True\n",
'entry_point': 'below_zero',
'canonical_solution': ' balance = 0\n\n for op in operations:\n balance += op\n if balance < 0:\n return True\n\n return False\n',
'natural_language': 'Chinese'
}
task_id
: identifier for the data sampleprompt
: input for the model containing function header and docstringscanonical_solution
: solution for the problem in theprompt
description
: task descriptiontest
: contains function to test generated code for correctnessentry_point
: entry point for testlanguage
: programming lanuage identifier to call the appropriate subprocess call for program executionnatural_language
: natural language identifier to show the language the prompt is in
programming languages are used to speicify splits:
- python
- java
- javascript
- csharp
- go
- kotlin
- php
- perl
- ruby
- swift
- scala
- typescript
Check out and install this repository:
git clone [email protected]:FloatAI/humaneval-xl.git
cd mxeval
pip install -e mxeval
We provide scripts to help set up programming language dependencies that are used to execute and evaluate using dataset. (We use the same scripts from https://github.com/amazon-science/mxeval for code generation evaluation)
bash language_setup/amazon_linux_ami.sh
bash language_setup/ubuntu.sh
This program exists to run untrusted model-generated code. Users are strongly
encouraged not to do so outside of a robust security sandbox. See the comment in
execution.py
for more information and instructions.
(We use the same scripts from https://github.com/amazon-science/mxeval for code generation evaluation)
Each sample is formatted into a single line:
{"task_id": "Corresponding task ID", "completion": "Completion only without the prompt",
"language": "programming language name"}
We provide python_chinese_generated_samples.jsonl
to illustrate the format.
Here is nearly functional example code (you just have to provide
generate_one_completion
to make it work) that saves generated completions to
samples.jsonl
.
from mxeval.data import write_jsonl, read_problems
problems = read_problems()
num_samples_per_task = 200
samples = [
dict(task_id=task_id, language=problems[task_id]["language"], completion=generate_one_completion(problems[task_id]["prompt"]))
for task_id in problems
for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
To evaluate the samples for, e.g., Python, Chinese evaluation, run
evaluate_functional_correctness python_chinese_generated_samples.jsonl --problem_file data/python/Chinese.jsonl
Note: Because there is no unbiased way of estimating pass@k when there are fewer
samples than k, the script does not evaluate pass@k for these cases. To
evaluate with other k values, pass --k <comma-separated-values-here>
. For
other options, see
$ evaluate_functional_correctness --help
However, we recommend that you use the default values for the rest.
We adapted Amazon-science's mxeval package (https://github.com/amazon-science/mxeval) for the evaluation. We thank Amazon for their pioneering effort in this field including the release of the dataset and evaluation code.
@inproceedings{peng-etal-2024-humaneval-xl,
title = "{H}uman{E}val-{XL}: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization",
author = "Peng, Qiwei and
Chai, Yekun and
Li, Xuhong",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.735",
pages = "8383--8394",
abstract = "Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.",
}