This repository contains the code for running the training and evaluation experiments reported in the submission titled "Document Image Cleaning using Budget-Aware Black-Box Approximation". A large part of the code is derived from Gradient-Approx-to-improve-OCR.
Create a Python virtual environment and install the required packages using:

    pip3 install -r requirements.txt

The dataset links are as follows:
Train, Val and Test splits should be extracted and placed in a folder called "data".
An example command to train a preprocessor using the POS dataset is shown below -

    python -u train_nn_patch.py --epoch $EPOCH --data_base_path $DATA_PATH --crnn_model $CRNN_MODEL_PATH --exp_base_path $EXP_BASE_PATH --minibatch_subset TopKCER --minibatch_subset_prop 0.95 --inner_limit 1 --inner_limit_skip --cers_ocr_path $CER_JSON_PATH --ocr $OCR

Relevant arguments are explained here:
- data_base_path: Path to the folder containing the train, val and test sets.
- crnn_model: Path to the pre-trained CRNN model.
- exp_base_path: Path for saving model checkpoints.
- minibatch_subset: Used to specify different selection algorithms. (Random=random, TopKCER=TopKCER, UniformCER=rangeCER)
- minibatch_subset_prop: Proportion of samples for which the OCR is not queried. Here, 0.95 indicates skipping about 95-96% of samples, so the OCR is queried for only about 4% of samples.
- inner_limit: Number of times the images are jittered. If inner_limit_skip is specified, label tracking is enabled and images are not jittered at all.
- cers_ocr_path: Path to a JSON file used to initialize the sample CERs. E.g. VGG, POS
- ocr: Specify the OCR - Tesseract / EasyOCR
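
To make the roles of minibatch_subset and minibatch_subset_prop more concrete, here is a minimal sketch of budget-aware sample selection. It is not the repository's implementation. The function name select_query_indices is hypothetical, and two readings are assumptions made for illustration: that TopKCER means "query the samples with the highest tracked CER", and that rangeCER means "query samples spread evenly across the sorted CER range".

```python
import random


def select_query_indices(cers, skip_prop, strategy="TopKCER", seed=0):
    """Return the indices of the samples for which the OCR would be queried.

    Hypothetical helper, not taken from the repository.
    cers      : list of tracked per-sample character error rates
    skip_prop : proportion of samples for which the OCR is NOT queried
    strategy  : "random", "TopKCER" or "rangeCER" (semantics assumed here)
    """
    n = len(cers)
    if n == 0:
        return []
    # Samples left after skipping; always query at least one sample.
    num_query = min(n, max(1, n - round(n * skip_prop)))
    # Sample indices ordered from highest to lowest CER.
    by_cer = sorted(range(n), key=lambda i: cers[i], reverse=True)

    if strategy == "random":
        return sorted(random.Random(seed).sample(range(n), num_query))
    if strategy == "TopKCER":
        return sorted(by_cer[:num_query])
    if strategy == "rangeCER":
        if num_query == 1:
            return [by_cer[n // 2]]
        # Evenly spaced positions along the CER-sorted order.
        positions = [(i * (n - 1)) // (num_query - 1) for i in range(num_query)]
        return sorted(by_cer[p] for p in positions)
    raise ValueError(f"unknown strategy: {strategy}")


cers = [i / 100 for i in range(100)]  # toy CERs: 0.00, 0.01, ..., 0.99
queried = select_query_indices(cers, skip_prop=0.95)
print(queried)  # [95, 96, 97, 98, 99]
```

With 100 samples and a skip proportion of 0.95, this sketch queries 5 samples. The repository's own rounding may differ, since the argument description mentions roughly 4% of samples being queried.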
To train a preprocessor with the VGG dataset, use train_nn_area.py with the same arguments as train_nn_patch.py.
An example command to train a CRNN model is shown below -

    python -u train_crnn.py --batch_size $BATCH_SIZE --epoch $EPOCH --crnn_model_path $CRNN_MODEL_PATH --dataset vgg --data_base_path $DATA_PATH --ocr EasyOCR

eval_prep.py is used for evaluating a trained preprocessor.

    python -u eval_prep.py --prep_path $PREP_PATH --dataset pos --prep_model_name $PREP_MODEL_NAME --data_base_path $DATA_PATH --ocr EasyOCR

- prep_path specifies folder path containing preprocessor checkpoints.
- prep_model_name specifies name of specific model checkpoint to be evaluated.
- dataset specifies pos/vgg dataset.
The directory pretrained_models contains trained preprocessors and pretrained CRNN models from some experiments. The preprocessor directory contains models with name n_model where n can be 4, 8 or 100 (indicating the query budget). The models in the preprocessor directory were obtained using the POS dataset and Tesseract OCR engine.
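
As a convenience, the eval_prep.py command can be generated once per query budget. The sketch below only builds and prints the commands; it does not run them. The checkpoint names 4_model, 8_model and 100_model, the pretrained_models/preprocessor path, and the data path are assumptions inferred from the description above, so adjust them to the actual directory contents.

```python
import shlex

BUDGETS = [4, 8, 100]  # query budgets named in the README


def build_eval_command(budget, prep_path, data_path, dataset="pos", ocr="Tesseract"):
    """Build the eval_prep.py argument list for one budget (hypothetical helper)."""
    return [
        "python", "-u", "eval_prep.py",
        "--prep_path", prep_path,
        "--dataset", dataset,
        "--prep_model_name", f"{budget}_model",  # assumed reading of "n_model"
        "--data_base_path", data_path,
        "--ocr", ocr,
    ]


# Assumed locations: preprocessor checkpoints and the "data" folder.
commands = [
    build_eval_command(b, "pretrained_models/preprocessor", "data")
    for b in BUDGETS
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run once the paths have been checked.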
- Trained Models
- Add colab link