Reusable workflows for ML teams at skit.ai, built using Kubeflow components.
A component does one thing really well. As an example, if you want to download a dataset and train a model, you would:

1. Query a database.
2. Save the results as a file.
3. Prepare train/test datasets for training.
4. Run a program to train the model.
5. Save the model once training is complete.
6. Evaluate the model on the test set to benchmark performance.
7. Repeat steps 1-6 till results are favourable.
8. Persist the best model on the cloud.
9. Persist the best results on the cloud.
Each step here is a component. As long as components ensure single responsibility, we can build complex pipelines conveniently.
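As a rough sketch of what one such step could look like as a Kubeflow component (the function name, parameters, and base image here are illustrative, not the project's actual code):

```python
from kfp.components import create_component_from_func


def fetch_dataset(query: str, output_path: str) -> str:
    """Query a database and save the results as a file (illustrative only)."""
    # ... run the query and write the rows to output_path ...
    return output_path


# Wrapping the function makes it a reusable Kubeflow component
# that can be dropped into any pipeline.
fetch_dataset_op = create_component_from_func(fetch_dataset, base_image="python:3.10")
```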
Attention!
If a component trains a model after performing a 70-30 split on a given dataset, it becomes very hard to reuse when a dataset should go entirely into training: the component will always drop 30% of the data.
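One way to avoid baking the split into a component is to expose the ratio as a parameter with a sensible default. A minimal sketch (shown as a plain function for brevity; names are illustrative):

```python
import pandas as pd


def prepare_datasets(dataset_path: str, train_split: float = 0.7):
    """Split a dataset into train/test frames; train_split=1.0 keeps everything for training."""
    df = pd.read_csv(dataset_path)
    train = df.sample(frac=train_split, random_state=42)
    test = df.drop(train.index)
    return train, test
```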
Pipelines are complex ML workflows that are required regularly, like training a model, sampling data, getting data annotated, producing metrics, etc.
A list of official pipelines supported by Skit can be found here.
Understand the directory structure.
```
.
├── build
│   └── fetch_calls_pipeline.yaml
├── ... (skipping other files)
├── skit_pipelines
│   ├── components
│   │   ├── fetch_calls.py
│   │   ├── __init__.py
│   │   └── upload2s3.py
│   ├── constants.py
│   ├── __init__.py
│   ├── pipelines
│   │   ├── fetch_calls_pipeline.py
│   │   └── __init__.py
│   └── utils
│       └── __init__.py
└── tests
```
- We have a components module; each file corresponds to exactly one component.
- We have a pipelines module; each file corresponds to exactly one pipeline.
- Reusable functions should go to utils.
- We have a constants.py file that contains constants to prevent typos and assist with code completion.
- build houses our pipeline yamls. These will become more important in a later section.
It is necessary to understand the anatomy of a Kubeflow component and pipeline before contributing to this project.
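As a rough sketch of that anatomy (the component signatures and pipeline parameters below are assumptions for illustration, not the project's actual definitions):

```python
import kfp
import kfp.dsl as dsl

# Illustrative imports: assume these modules expose component factories.
from skit_pipelines.components import fetch_calls_op, upload2s3_op


@dsl.pipeline(name="fetch-calls-pipeline", description="Fetch calls and persist them on S3.")
def fetch_calls_pipeline(org_id: str, start_date: str, end_date: str):
    calls = fetch_calls_op(org_id=org_id, start_date=start_date, end_date=end_date)
    upload2s3_op(calls.output)


if __name__ == "__main__":
    # Compiling produces the YAML that lives under build/.
    kfp.compiler.Compiler().compile(fetch_calls_pipeline, "build/fetch_calls_pipeline.yaml")
```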
Once a new pipeline and its pre-requisite components are ready:

- Add an entry to the `CHANGELOG.md`.
- Create a new tag with updated semver and push; our GitHub Actions take care of pushing the image to our private ECR.
- Run `make all`. This will rebuild all the pipeline yamls and update the docs. It will also create a secrets dir; this doesn't work if you don't have s3 credentials.
- Run `source secrets/env.sh`. You may not have this if you aren't part of skit.ai.
- Upload the yamls to the Kubeflow UI or use them via the SDK (a sketch follows this list).
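For the SDK route, something along these lines should work with the kfp client (the host, pipeline name, and arguments are placeholders):

```python
import kfp

client = kfp.Client(host="https://kubeflow.skit.ai/pipeline")

# Register the compiled YAML as a pipeline on Kubeflow...
client.upload_pipeline(
    pipeline_package_path="build/fetch_calls_pipeline.yaml",
    pipeline_name="fetch-calls-pipeline",
)

# ...or start a run from the package directly.
client.create_run_from_pipeline_package(
    pipeline_file="build/fetch_calls_pipeline.yaml",
    arguments={"org_id": "..."},  # placeholder pipeline arguments
)
```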
- This project is based on Python 3.10. You would require an environment set up for the same; using miniconda is recommended.
- `make mac`
- `poetry`
Source secrets:

```shell
dvc pull && source secrets/env.sh
```

Run:

```shell
uvicorn skit_pipelines.api.endpoints:app \
    --proxy-headers --host 0.0.0.0 \
    --port 9991 \
    --workers 1 \
    --reload
```
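With the server up, a run can be triggered over HTTP. A minimal sketch, assuming a hypothetical endpoint path and payload (the real routes and fields live in skit_pipelines.api.endpoints):

```python
import requests

# Hypothetical endpoint and payload: treat the path and fields as placeholders
# and check skit_pipelines.api.endpoints for the actual routes.
resp = requests.post(
    "http://localhost:9991/run/train-voicebot-xlmr",  # placeholder path
    json={"org_id": "..."},                           # placeholder payload
)
print(resp.json())
```

A successful trigger returns a payload like the one below.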
{ "status":"ok", "response":{ "message":"Pipeline run created successfully.", "name":"train-voicebot-xlmr", "run_id":"e33879a1-xxxxx", "run_url":"https://kubeflow.skit.ai/pipeline/?ns=..." } }
When the run completes successfully:

```json
{
  "status": "ok",
  "response": {
    "message": "Run completed successfully.",
    "run_id": "662b9909-d251-45f8-a8xxxxx",
    "run_url": "https://kubeflow.skit.ai/pipeline/?ns=...",
    "file_path": "/tmp/outputs/Output/data",
    "s3_path": "<artifact s3_path tar file>",
    "webhook": true
  }
}
```

When the run fails:

```json
{
  "status": "error",
  "response": {
    "message": "Run failed.",
    "run_id": "662b9909-d251-45f8xxxxxxxx",
    "run_url": "https://kubeflow.skit.ai/pipeline/?ns=...",
    "file_path": null,
    "s3_path": null,
    "webhook": true
  }
}
```

While the run is still in progress:

```json
{
  "status": "pending",
  "response": {
    "message": "Run in progress.",
    "run_id": "662b9909-d251-45f8-axxxxxxxxx",
    "run_url": "https://kubeflow.skit.ai/pipeline/?ns=...",
    "file_path": null,
    "s3_path": null,
    "webhook": true
  }
}
```
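However these payloads reach you (polling or a webhook), handling them is a matter of branching on status. A small sketch using only the fields shown above:

```python
from typing import Optional


def handle_run_payload(payload: dict) -> Optional[str]:
    """Return the artifact's s3_path on success, None otherwise."""
    status = payload["status"]
    response = payload["response"]

    if status == "ok":
        print(f"Run {response['run_id']} finished: {response['message']}")
        return response.get("s3_path")
    if status == "error":
        print(f"Run {response['run_id']} failed, see {response['run_url']}")
        return None
    # status == "pending": the run is still in progress.
    print(f"Run {response['run_id']} still in progress.")
    return None
```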