Reusable workflows for ML teams at skit.ai, built using Kubeflow components.
A component does one thing really well. As an example, if you want to download a dataset and train a model, you would:

1. Query a database.
2. Save the results as a file.
3. Prepare train/test datasets for training.
4. Run a program to train the model.
5. Save the model once training is complete.
6. Evaluate the model on the test set to benchmark performance.
7. Repeat steps 1-6 till results are favourable.
8. Persist the best model on the cloud.
9. Persist the best results on the cloud.
Each step here is a component. As long as components ensure single responsibility, we can build complex pipelines conveniently.
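As a rough sketch of what one such step could look like as a Kubeflow component (the function name, parameters, and base image here are illustrative, not the project's actual code):

```python
from kfp.components import create_component_from_func


def fetch_dataset(query: str, output_path: str) -> str:
    """Query a database and save the results as a file (illustrative only)."""
    # ... run the query and write the rows to output_path ...
    return output_path


# Wrapping the function makes it a reusable Kubeflow component
# that can be dropped into any pipeline.
fetch_dataset_op = create_component_from_func(fetch_dataset, base_image="python:3.10")
```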
Attention!
If a component trains a model after performing a 70-30 split on a given dataset, it becomes very hard to reuse when a dataset should go entirely into training: the component will always drop 30% of the data.
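One way to avoid baking the split into a component is to expose the ratio as a parameter with a sensible default. A minimal sketch (shown as a plain function for brevity; names are illustrative):

```python
import pandas as pd


def prepare_datasets(dataset_path: str, train_split: float = 0.7):
    """Split a dataset into train/test frames; train_split=1.0 keeps everything for training."""
    df = pd.read_csv(dataset_path)
    train = df.sample(frac=train_split, random_state=42)
    test = df.drop(train.index)
    return train, test
```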
Pipelines are complex ML workflows that are required regularly, like training a model, sampling data, getting data annotated, producing metrics, etc.
A list of official pipelines supported by Skit can be found here.
Understand the directory structure.
```
.
├── build
│   └── fetch_calls_pipeline.yaml
├── ... (skipping other files)
├── skit_pipelines
│   ├── components
│   │   ├── fetch_calls.py
│   │   ├── __init__.py
│   │   └── upload2s3.py
│   ├── constants.py
│   ├── __init__.py
│   ├── pipelines
│   │   ├── fetch_calls_pipeline.py
│   │   └── __init__.py
│   └── utils
│       └── __init__.py
└── tests
```
- We have a components module; each file corresponds to exactly one component.
- We have a pipelines module; each file corresponds to exactly one pipeline.
- Reusable functions should go to utils.
- We have a constants.py file that contains constants to prevent typos and assist with code completion.
- build houses our pipeline yamls. These will become more important in a later section.
It is necessary to understand the anatomy of a Kubeflow component and pipeline before contributing to this project.
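As a rough sketch of that anatomy (the component signatures and pipeline parameters below are assumptions for illustration, not the project's actual definitions):

```python
import kfp
import kfp.dsl as dsl

# Illustrative imports: assume these modules expose component factories.
from skit_pipelines.components import fetch_calls_op, upload2s3_op


@dsl.pipeline(name="fetch-calls-pipeline", description="Fetch calls and persist them on S3.")
def fetch_calls_pipeline(org_id: str, start_date: str, end_date: str):
    calls = fetch_calls_op(org_id=org_id, start_date=start_date, end_date=end_date)
    upload2s3_op(calls.output)


if __name__ == "__main__":
    # Compiling produces the YAML that lives under build/.
    kfp.compiler.Compiler().compile(fetch_calls_pipeline, "build/fetch_calls_pipeline.yaml")
```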
Once a new pipeline and its pre-requisite components are ready:

- Add an entry to the `CHANGELOG.md`.
- Create a new tag with updated semver and push; our GitHub Actions take care of pushing the image to our private ECR.
- Run `make all`. This will rebuild all the pipeline yamls and update the docs. It will also create a secrets dir; this doesn't work if you don't have s3 credentials.
- Run `source secrets/env.sh`. You may not have this if you aren't part of skit.ai.
- Upload the yamls to the Kubeflow UI or use them via the SDK (a sketch follows this list).
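For the SDK route, something along these lines should work with the kfp client (the host, pipeline name, and arguments are placeholders):

```python
import kfp

client = kfp.Client(host="https://kubeflow.skit.ai/pipeline")

# Register the compiled YAML as a pipeline on Kubeflow...
client.upload_pipeline(
    pipeline_package_path="build/fetch_calls_pipeline.yaml",
    pipeline_name="fetch-calls-pipeline",
)

# ...or start a run from the package directly.
client.create_run_from_pipeline_package(
    pipeline_file="build/fetch_calls_pipeline.yaml",
    arguments={"org_id": "..."},  # placeholder pipeline arguments
)
```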
- This project is based on Python 3.10. You would require an environment set up for the same; using miniconda is recommended.
- `make mac`
- `poetry`
Source secrets:

```shell
dvc pull && source secrets/env.sh
```

Run:

```shell
uvicorn skit_pipelines.api.endpoints:app \
    --proxy-headers --host 0.0.0.0 \
    --port 9991 \
    --workers 1 \
    --reload
```
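With the server up, a run can be triggered over HTTP. A minimal sketch, assuming a hypothetical endpoint path and payload (the real routes and fields live in skit_pipelines.api.endpoints):

```python
import requests

# Hypothetical endpoint and payload: treat the path and fields as placeholders
# and check skit_pipelines.api.endpoints for the actual routes.
resp = requests.post(
    "http://localhost:9991/run/train-voicebot-xlmr",  # placeholder path
    json={"org_id": "..."},                           # placeholder payload
)
print(resp.json())
```

A successful trigger returns a payload like the one below.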
{ "status":"ok", "response":{ "message":"Pipeline run created successfully.", "name":"train-voicebot-xlmr", "run_id":"e33879a1-xxxxx", "run_url":"https://kubeflow.skit.ai/pipeline/?ns=..." } }
When the run completes successfully:

```json
{
  "status": "ok",
  "response": {
    "message": "Run completed successfully.",
    "run_id": "662b9909-d251-45f8-a8xxxxx",
    "run_url": "https://kubeflow.skit.ai/pipeline/?ns=...",
    "file_path": "/tmp/outputs/Output/data",
    "s3_path": "<artifact s3_path tar file>",
    "webhook": true
  }
}
```

When the run fails:

```json
{
  "status": "error",
  "response": {
    "message": "Run failed.",
    "run_id": "662b9909-d251-45f8xxxxxxxx",
    "run_url": "https://kubeflow.skit.ai/pipeline/?ns=...",
    "file_path": null,
    "s3_path": null,
    "webhook": true
  }
}
```

While the run is still in progress:

```json
{
  "status": "pending",
  "response": {
    "message": "Run in progress.",
    "run_id": "662b9909-d251-45f8-axxxxxxxxx",
    "run_url": "https://kubeflow.skit.ai/pipeline/?ns=...",
    "file_path": null,
    "s3_path": null,
    "webhook": true
  }
}
```
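However these payloads reach you (polling or a webhook), handling them is a matter of branching on status. A small sketch using only the fields shown above:

```python
from typing import Optional


def handle_run_payload(payload: dict) -> Optional[str]:
    """Return the artifact's s3_path on success, None otherwise."""
    status = payload["status"]
    response = payload["response"]

    if status == "ok":
        print(f"Run {response['run_id']} finished: {response['message']}")
        return response.get("s3_path")
    if status == "error":
        print(f"Run {response['run_id']} failed, see {response['run_url']}")
        return None
    # status == "pending": the run is still in progress.
    print(f"Run {response['run_id']} still in progress.")
    return None
```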