Exploratory data analysis and data preparation scripts for model training, done on the `ug-language-parallel-text` dataset.
- Concatenate any new chunks of data into the publicly available JSON dataset by running the `Data EDA and cleaning.ipynb` notebook
- Create the model training dataset files and put them in the right folder structure by running the `Model Training Data Prep.ipynb` and `Multilingual Data Prep.ipynb` notebooks, in that order
- Upload the dataset to the `sunbird-translate` bucket on AWS S3 (as a versioned dataset, for example `v4-dataset.zip`) and make sure the resource is public (a minimal boto3 sketch follows this list)
- Update the dataset URL in the SunbirdAI/datasets repository with the new dataset link (as shown in the next section of this README). The code in this file picks up the data from the S3 bucket whose link we added as the dataset URL
- Run the SunbirdAI language model training notebook on AWS SageMaker. The training notebook refers to SunbirdAI/datasets for the datasets to be used in training (see the loading sketch after this list)
- Save the checkpoints and upload them to the `sunbird-translate` AWS S3 bucket, in the `models` folder (also covered in the boto3 sketch)
- Upload the models to Hugging Face (see the `huggingface_hub` sketch after this list)
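
A minimal boto3 sketch of the two S3 upload steps above, assuming AWS credentials are already configured locally; the archive name and checkpoint path are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Upload the versioned dataset archive and make it publicly readable
s3.upload_file(
    "v4-dataset.zip",
    "sunbird-translate",
    "v4-dataset.zip",
    ExtraArgs={"ACL": "public-read"},
)

# After training: upload a checkpoint into the models/ folder
s3.upload_file(
    "checkpoints/model.ckpt",  # placeholder local checkpoint path
    "sunbird-translate",
    "models/model.ckpt",
)
```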
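
The training notebook consumes the datasets through the loading script in SunbirdAI/datasets. A hedged sketch of loading it with the `datasets` library, assuming the repository has been cloned locally (the local path is an assumption):

```python
from datasets import load_dataset

# Point load_dataset at the directory containing sunbird.py
# (adjust the path to wherever SunbirdAI/datasets is cloned)
dataset = load_dataset("./datasets/sunbird")
print(dataset)
```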
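
For the final step, one way to upload the trained models is with the `huggingface_hub` client (authenticate first with `huggingface-cli login`; the repo id and folder path below are hypothetical):

```python
from huggingface_hub import HfApi

api = HfApi()

# Push the local checkpoint folder to a model repo on the Hub
api.upload_folder(
    folder_path="checkpoints/best-model",  # hypothetical local path
    repo_id="SunbirdAI/translate",         # hypothetical repo id
    repo_type="model",
)
```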
Find the `_URL` constant in the `datasets/sunbird/sunbird.py` file on the `init-sunbird-dataset` branch of the SunbirdAI/datasets repository.
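
The constant is a plain module-level string pointing at the public dataset archive; a hypothetical value for the versioned dataset from the steps above (the exact bucket URL is an assumption) might look like:

```python
# Hypothetical value of _URL in datasets/sunbird/sunbird.py;
# replace with the public S3 link to the new dataset version
_URL = "https://sunbird-translate.s3.amazonaws.com/v4-dataset.zip"
```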
The image below shows an example of this: