This is the repository for the TOSEM'22 paper "Seeing the Whole Elephant: Systematically Understanding and Uncovering Evaluation Biases in Automated Program Repair".
Bias in APR evaluation is an important issue for the APR community, as it can result in serious consequences (i.e., incorrect conclusions). One representative fact is that bias studies have accompanied APR techniques since the earliest ones and have become increasingly popular in recent years. However, bias in APR evaluation is still not resolved. Specifically, there are three major challenges:
- Uncovered bias understanding: since the first APR techniques were proposed, how many biases have been uncovered? What is their impact on APR evaluation? And have they really been eliminated by subsequent evaluations?
- New bias discovery: we observed that existing bias studies are still very dispersed. That is, researchers in the APR community mainly rely on their own experience to discover new biases residing in APR evaluation. A challenge therefore naturally arises: is there any methodology to support new bias discovery?
- Bias validation and elimination: the lack of infrastructure hinders bias validation and elimination in APR evaluation. To the best of our knowledge, there is only one executable framework (i.e., RepairThemAll) for repairing general bugs, and it mainly focuses on the validation and elimination of a single evaluation bias (i.e., benchmark overfitting). A new infrastructure is needed to cover existing biases.
This strongly motivates us to conduct the following three tasks:
- We perform a systematic literature review (an evidence-based software engineering methodology) to understand uncovered biases;
- We propose a tree-based taxonomy to discover new biases;
- We design and implement APRConfig, an executable framework, to support bias validation and elimination.
All artifacts of our study "Seeing the Whole Elephant: Systematically Understanding and Uncovering Evaluation Biases in Automated Program Repair" are available in this repository.
The overview of our APRConfig framework is as follows:
Our motivation for designing, implementing, and releasing APRConfig is to facilitate bias validation and elimination, as well as a fairer evaluation for the APR community. The goal of APRConfig is to make APR evaluation more reliable, less complex, and less expensive. It currently integrates two datasets, one standard fault localization module, four patch generation algorithms, and three patch validation strategies. We hope that APRConfig will integrate more datasets and APR modules in the future. Any contribution to this repository or any usage of APRConfig is very welcome. Let's fight for a better APR!
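As a minimal sketch of this decoupled design (fault localization, patch generation, patch validation), the Python snippet below illustrates how the three stages fit together. The class and method names (`FaultLocalizer`, `PatchGenerator`, `PatchValidator`, `repair`) are hypothetical and only convey the architecture; they are not the actual APRConfig API.

```python
# Hypothetical sketch of APRConfig's decoupled pipeline (not the real API).
from dataclasses import dataclass
from typing import List


@dataclass
class Patch:
    file: str
    diff: str


class FaultLocalizer:      # stands in for the integrated fault localization module
    def localize(self, bug_id: str) -> List[str]:
        raise NotImplementedError


class PatchGenerator:      # stands in for one of the four patch generation algorithms
    def generate(self, suspicious_lines: List[str]) -> List[Patch]:
        raise NotImplementedError


class PatchValidator:      # stands in for one of the three patch validation strategies
    def validate(self, patch: Patch) -> bool:
        raise NotImplementedError


def repair(bug_id: str, fl: FaultLocalizer, gen: PatchGenerator, val: PatchValidator) -> List[Patch]:
    """Run the three decoupled stages and return the plausible patches."""
    suspicious = fl.localize(bug_id)           # 1) fault localization
    candidates = gen.generate(suspicious)      # 2) patch generation
    return [p for p in candidates if val.validate(p)]  # 3) patch validation
```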
This repository is dedicated to serving the following groups:
- practitioners who want to put APR techniques into practice. They can learn from our study how to keep biases from escaping into industrial usage.
- researchers who aim to obtain trustworthy conclusions when evaluating the strengths and weaknesses of a (newly proposed) APR technique.
- other potential users who are interested in APR research and want to obtain a comprehensive understanding of existing biases in APR evaluation.
The experiments of our study were performed with:
- OS: Ubuntu 16.04
- JDK: JDK 7 and JDK 8
- Python 3 (with Conda environment)
# 1) Clone this repo
git clone https://github.com/DehengYang/APRConfig.git
# 2) Init repo
./init.sh
# 3) Run the repair experiments on Defects4J and QuixBugs
cd APRConfig
./run_d4j_1.sh
./run_quixbugs.sh
To obtain statistics, run:
cd APRConfig
# 1) to parse execution logs
python parser/Main_parse.py
# 2) to obtain plausible patches
python statistics/Get_patch.py
# 3) to calculate time cost
python statistics/Count_repair_time.py
# 4) to gather data (compressed via lrzip for a high compression ratio)
python result_processor/Result_processor.py
Or you can directly run:
cd APRConfig
./run_parser.sh
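To give a rough idea of what the statistics step computes, the sketch below counts plausible patches per tool by scanning the raw result directories. The assumed layout (`results_defects4j/<tool>/<bug_id>/patch.diff`) is an illustrative assumption, not the actual output format; the real logic lives in parser/Main_parse.py and statistics/Get_patch.py.

```python
# Hypothetical sketch: count plausible patches per tool from the raw result
# directories. The layout results_defects4j/<tool>/<bug_id>/patch.diff is an
# assumption for illustration only.
import os
from collections import Counter


def count_plausible(results_dir: str = "results_defects4j") -> Counter:
    counts = Counter()
    for tool in sorted(os.listdir(results_dir)):
        tool_dir = os.path.join(results_dir, tool)
        if not os.path.isdir(tool_dir):
            continue
        for bug_id in os.listdir(tool_dir):
            # Count a bug as plausibly repaired if a patch file was emitted.
            if os.path.exists(os.path.join(tool_dir, bug_id, "patch.diff")):
                counts[tool] += 1
    return counts


if __name__ == "__main__":
    for tool, n in count_plausible().items():
        print(f"{tool}: {n} plausible patches")
```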
To obtain the figures shown in our paper, run:
cd APRConfig
# 1) to gather results into the results_defects4j/merge dir
python parser_plot/Prepare_data.py
# 2) to analyze repair effectiveness
python parser_plot/Plot_impact_effectiveness.py
# 3) to analyze repair efficiency
python parser_plot/Plot_impact_efficiency.py
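For reference, the minimal matplotlib sketch below shows one possible way to visualize such a comparison. The configuration names and patch counts are placeholders, not results from the paper; the actual figures are produced by the Plot_impact_* scripts above.

```python
# Illustrative plotting sketch with placeholder data (not results from the paper).
import matplotlib.pyplot as plt

configs = ["config A", "config B", "config C"]   # hypothetical configurations
plausible = [10, 14, 12]                          # placeholder patch counts

plt.bar(configs, plausible)
plt.ylabel("# plausible patches")
plt.title("Impact of configuration on repair effectiveness (illustrative)")
plt.tight_layout()
plt.savefig("effectiveness_example.pdf")
```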
To obtain the results of statistical tests presented in our paper, run:
python parser_plot/Statistical_tests.py
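As an illustration of this style of analysis, the sketch below runs a paired non-parametric test (Wilcoxon signed-rank) on placeholder data. The numbers are made up and the test choice is only an example; the tests reported in the paper are implemented in parser_plot/Statistical_tests.py.

```python
# Illustrative paired non-parametric test on placeholder data (not paper results).
from scipy.stats import wilcoxon

# Hypothetical per-bug repair times (seconds) under two configurations.
times_config_a = [120, 340, 95, 410, 230, 180]
times_config_b = [110, 360, 90, 380, 220, 200]

stat, p_value = wilcoxon(times_config_a, times_config_b)
print(f"Wilcoxon signed-rank: statistic={stat:.2f}, p={p_value:.3f}")
```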
├── APRConfig: source code and execution scripts of APRConfig
├── apr_tools: submodule for patch generation
├── datasets: submodule for dataset preparation
├── fl_modules: submodule for fault localization
├── patch_validator: submodule for patch validation
├── package.sh: script for packaging all submodules of APRConfig
├── INSTALL: scripts for environment configuration
├── results_bears: raw data of repair experiment on Bears
├── results_defects4j: raw data of repair experiment on Defects4J
├── results_quixbugs: raw data of repair experiment on QuixBugs
├── doc: data of the systematic literature review (SLR)
├── LICENSE
└── README.md
To facilitate usage by potential users, we plan to continuously maintain this repository. Accordingly, we provide a guideline, available at How_to_extend.md, on how to extend APRConfig with more datasets, fault localizers, patch generation algorithms, and patch validators. Any question or contribution is very welcome.
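As a rough illustration only (the authoritative extension instructions are in How_to_extend.md), plugging in a new module might amount to implementing the corresponding stage interface. Everything below, including the class and method names, is hypothetical.

```python
# Hypothetical sketch of adding a new patch validator; the real extension points
# and registration mechanism are described in How_to_extend.md.
class MyPatchValidator:
    """Example validator: accept a patch if the patched project builds and its tests pass."""

    def validate(self, patch, project_dir: str) -> bool:
        return self._compile(project_dir) and self._run_tests(project_dir)

    def _compile(self, project_dir: str) -> bool:
        ...  # placeholder: invoke the project's build system, e.g., via subprocess

    def _run_tests(self, project_dir: str) -> bool:
        ...  # placeholder: run the (originally failing) tests against the patched program
```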
We would like to sincerely thank Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu for contributing a great framework (i.e., RepairThemAll), which served as a very useful reference point for constructing an executable framework. We also adopted some of the software design patterns used in RepairThemAll and built a new executable framework that properly decouples the APR implementation into three sub-modules (fault localization, patch generation, and patch validation) to support bias validation and mitigation, as well as further exploration of APR evaluation by potential end users.
The repository is licensed under the GNU GPLv3 license. See LICENSE for details.