Databalancer is a Python library used in machine learning applications to balance imbalanced text classification datasets before model training.
- Databalancer can balance any imbalanced text classification dataset.
- When an imbalanced dataset is balanced, no existing data is removed; new data is generated and added to the dataset.
- For a given class, the newly generated data consists of paraphrases of the existing data in that class.
- By default, these paraphrases are generated using the ramsrigouthamg/t5_paraphraser model (you can read more about the model in the official Huggingface documentation).
- The current version can generate sentence paraphrases using multiple methods: T5 models, NLPAUG, and Textattack.
- The user can select the balancing method by passing the balance_method parameter when calling the balanceDataset method:
  - balance_method=1 for ramsrigouthamg/t5_paraphraser T5 model based balancing (default; for more info, check t5_paraphraser)
  - balance_method=2 for ramsrigouthamg/t5-large-paraphraser-diverse-high-quality T5 model based balancing (for more info, check t5-large-paraphraser-diverse-high-quality)
  - balance_method=3 for nlpaug based balancing (for more info, check nlpaug)
  - balance_method=4 for textattack based balancing (for more info, check textattack)
- The model argument of the balanceDataset method is only applicable when balance_method is set to 3. Through it, the user can pass the name of a transformer model from Huggingface to generate paraphrases using NLPAUG.
- If the user sets quantize=True in balanceDataset, the T5 models (balance_method=1 and balance_method=2) are quantized using fastT5 before inference, which reduces model inference time.
- By default, the quantize parameter is set to False, because quantization requires more RAM and more CPU processing power.
- Databalancer also provides another method, classCountVisualization, to show the class count distribution of a dataset.
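The balancing idea described above (keep all existing rows, then add paraphrases to the smaller classes) can be sketched in a few lines. This is only an illustration under stated assumptions, not Databalancer's actual implementation: the function names, the (text, label) row layout, the choice to grow every class to the size of the largest one, and the placeholder paraphraser are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Row = Tuple[str, str]  # (text, label); an assumed layout for this sketch


def placeholder_paraphrase(text: str, variant: int) -> str:
    # Hypothetical stand-in for a real paraphraser (T5, nlpaug, textattack).
    # It only tags the text so the sketch runs without downloading a model.
    return f"{text} (paraphrase {variant})"


def balance_by_paraphrasing(
    rows: List[Row],
    paraphrase: Callable[[str, int], str] = placeholder_paraphrase,
) -> List[Row]:
    """Return the original rows plus generated rows for the smaller classes."""
    if not rows:
        return []

    counts = Counter(label for _, label in rows)
    target = max(counts.values())  # assumed target: size of the largest class

    texts_by_label = defaultdict(list)
    for text, label in rows:
        texts_by_label[label].append(text)

    balanced = list(rows)  # existing data is kept, never removed
    for label, texts in texts_by_label.items():
        missing = target - counts[label]
        for i in range(missing):
            source = texts[i % len(texts)]  # cycle through existing samples
            balanced.append((paraphrase(source, i + 1), label))
    return balanced


rows = [
    ("great movie", "pos"),
    ("loved it", "pos"),
    ("really enjoyable", "pos"),
    ("terrible plot", "neg"),
]
balanced = balance_by_paraphrasing(rows)
print(Counter(label for _, label in balanced))
```

Passing the paraphraser in as a callable keeps the counting logic separate from the generation backend, which mirrors how a balance_method switch could select between different paraphrase generators.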
Install the databalancer package with pip:
pip install databalancer
Databalancer is only compatible with Python 3.6.9 or above.
The databalancer library provides two functions:
1 - classCountVisualization
2 - balanceDataset
# Import classCountVisualization from the databalancer module
from databalancer import classCountVisualization
# Pass the dataset file name (here traindata.csv) to the function
classCountVisualization("traindata.csv")
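To make the idea of a class count distribution concrete, the sketch below counts labels in CSV data and prints a small text bar chart using only the standard library. It is not how classCountVisualization is implemented. The helper names and the "label" column name are assumptions made for this example, since the expected CSV layout is not described here.

```python
import csv
import io
from collections import Counter


def class_counts(csv_file, label_column="label"):
    # label_column is an assumed column name, not taken from the library.
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


def text_bar_chart(counts, width=20):
    # Render one line per class, scaled so the largest class fills `width`.
    if not counts:
        return ""
    largest = max(counts.values())
    lines = []
    for label, count in counts.most_common():
        bar = "#" * max(1, round(width * count / largest))
        lines.append(f"{label:<10} {bar} {count}")
    return "\n".join(lines)


# In-memory CSV so the sketch runs without any file on disk.
sample = io.StringIO(
    "text,label\n"
    "great movie,pos\n"
    "loved it,pos\n"
    "terrible plot,neg\n"
)
counts = class_counts(sample)
print(text_bar_chart(counts))
```

A large gap between the longest and shortest bars is the signal that a dataset is imbalanced and may benefit from balancing.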
# Import balanceDataset from the databalancer module
from databalancer import balanceDataset
# Pass the name of the dataset to be balanced (here traindata.csv) to the balanceDataset function
balanceDataset("traindata.csv", balance_method=1)
The above code balances the dataset and saves the balanced dataset (balanced_data.csv) on the local machine.
# Import balanceDataset from the databalancer module
from databalancer import balanceDataset
# Pass the name of the dataset to be balanced (here traindata.csv) with balance_method=2 and quantization enabled
balanceDataset("traindata.csv", balance_method=2, quantize=True)
The above code balances the dataset using balance_method=2 with quantization and saves the balanced dataset (balanced_data.csv) on the local machine.
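Because model applies only to balance_method=3 and quantize affects only the T5 methods (1 and 2), it can help to check the combination before starting a long run. The helper below is a hypothetical sketch, not part of Databalancer. The name build_balance_kwargs and the choice to raise an error on an inapplicable combination are assumptions, and the library itself may handle such arguments differently.

```python
from typing import Any, Dict, Optional

# Method numbers and their backends, as listed in this README.
BALANCE_METHODS = {
    1: "ramsrigouthamg/t5_paraphraser",
    2: "ramsrigouthamg/t5-large-paraphraser-diverse-high-quality",
    3: "nlpaug",
    4: "textattack",
}


def build_balance_kwargs(
    balance_method: int = 1,
    model: Optional[str] = None,
    quantize: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: validate an argument combination and return kwargs."""
    if balance_method not in BALANCE_METHODS:
        raise ValueError(f"balance_method must be one of {sorted(BALANCE_METHODS)}")
    if model is not None and balance_method != 3:
        raise ValueError("model only applies when balance_method=3 (nlpaug)")
    if quantize and balance_method not in (1, 2):
        raise ValueError("quantize only affects the T5 methods (1 and 2)")

    kwargs: Dict[str, Any] = {"balance_method": balance_method, "quantize": quantize}
    if model is not None:
        kwargs["model"] = model
    return kwargs


kwargs = build_balance_kwargs(balance_method=2, quantize=True)
print(kwargs)
# The result could then be forwarded, for example:
# balanceDataset("traindata.csv", **kwargs)
```

Failing early on a mismatched combination is a design choice of this sketch; it avoids discovering after a slow model download that an argument had no effect.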
To show the balanced dataset class count distribution, run the code below.
from databalancer import classCountVisualization
classCountVisualization("balanced_data.csv")