This repository contains the code and instructions to reproduce the experiments and results presented in the paper Fair-OBNC: Correcting Label Noise for Fairer Datasets.
This section details how to replicate our experiments and obtain the results reported in the paper.
The first step is to install the Aequitas Flow package:
pip install git+https://github.com/dssg/aequitas.git
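As a quick sanity check of the installation, the package can be imported directly. This is optional; we assume the install exposes the aequitas module and, in recent releases, the aequitas.flow submodule used by Aequitas Flow:
# Optional: verify the installation
>>> import aequitas
>>> import aequitas.flow  # assumed available in releases that include Aequitas Flow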
Then, one can download the necessary data by running:
# To store the necessary data
>>> from generate_data import generate_data
>>> generate_data({"BankAccountFraud": ["TypeII"]})
Finally, we include in this repository the configuration files used in our experiments, so the only remaining step is to run the fairobnc_experiment.py script:
# To run the experiments with the multiple injected noise scenarios
python -m fairobnc_experiment baf typeii noise_injection_experiment --noise_injection
# To run the experiments without noise injection
python -m fairobnc_experiment baf typeii noise_injection_experiment
If you wish to evaluate our method in additional scenarios, the framework can be used to set up further cases.
The generate_data function loads the desired datasets from Aequitas, generates their IID versions, and injects noise into the labels, storing the files needed by the IIDDataset and NoisyDataset classes.
# To store the necessary data
>>> from generate_data import generate_data
>>> generate_data({"BankAccountFraud": ["TypeII"]})
# To load an IID dataset
>>> from datasets import IIDDataset
>>> iid_dataset = IIDDataset("BankAccountFraud", "TypeII")
>>> iid_dataset.load_data()
>>> iid_dataset.create_splits()
# To load a noisy dataset, where noise is applied only to instances of the negative class, flipping 5% of the instances in the negative sensitive group and 20% of those in the positive group
>>> from datasets import NoisyDataset
>>> noisy_dataset = NoisyDataset("BankAccountFraud", "TypeII", {0:0.05, 1:0.20}, [0])
>>> noisy_dataset.load_data()
>>> noisy_dataset.create_splits()
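Other noise configurations can be set up the same way. Assuming the third argument maps each sensitive group to its label flip rate and the fourth argument lists the classes noise is applied to, a symmetric setting over both classes could look as follows (a sketch for illustration, not one of the configurations used in the paper):
# Hypothetical example: flip 10% of the labels in both sensitive groups,
# applying noise to both the negative (0) and positive (1) classes
>>> symmetric_noisy = NoisyDataset("BankAccountFraud", "TypeII", {0: 0.10, 1: 0.10}, [0, 1])
>>> symmetric_noisy.load_data()
>>> symmetric_noisy.create_splits()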
The configs folder is organized into 2 subfolders, following the Aequitas experiment logic:
- methods, which contains the config files for each of the preprocessing methods being analyzed
- datasets, which contains the config files for each noisy version of the used datasets

The dataset configs can be automatically generated by calling the generate_dataset_configs function:
>>> from generate_configs import generate_dataset_configs
>>> generate_dataset_configs({"BankAccountFraud": ["TypeII"]})
Each specific type of injected noise must be run as a separate experiment so that the same hyperparameters are sampled in each trial.
The experiment config files can be generated using the generate_experiment_files function:
>>> from generate_configs import generate_experiment_files
>>> generate_experiment_files(
... methods = ["lightgbm", "OBNC", "Fair-OBNC", "PrevalenceSampling"],
... variants = {"BankAccountFraud":["TypeII"]},
... noise_injection = True,
... n_trials = 50,
... )
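For the runs without injected noise, we assume the same function can be called with noise_injection set to False; the following is a sketch under that assumption, mirroring the call above:
>>> from generate_configs import generate_experiment_files
>>> generate_experiment_files(
...     methods = ["lightgbm", "OBNC", "Fair-OBNC", "PrevalenceSampling"],
...     variants = {"BankAccountFraud": ["TypeII"]},
...     noise_injection = False,
...     n_trials = 50,
... )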
After setting up all the data and config files, one can run the fairobnc_experiment.py script to launch the experiments:
python -m fairobnc_experiment baf typeii noise_injection_experiment --noise_injection
The result_analysis.py file contains the functions used to analyze the obtained results and generate the plot presented in the paper.
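The snippet below is not part of the repository; it is a generic, hypothetical sketch of how aggregated results could be inspected once exported to a CSV. The file name and column names (results.csv, method, noise_scenario, tpr, fpr_ratio) are placeholders, not the actual output format of the experiments:
# Hypothetical sketch: average metrics per method and noise scenario
>>> import pandas as pd
>>> results = pd.read_csv("results.csv")  # placeholder path
>>> summary = results.groupby(["method", "noise_scenario"])[["tpr", "fpr_ratio"]].mean()
>>> print(summary)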