Kubeflow Training Operator

Overview

Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, TensorFlow, HuggingFace, JAX, DeepSpeed, XGBoost, PaddlePaddle and others.

You can run high-performance computing (HPC) tasks with the Training Operator and MPIJob, because the operator supports running Message Passing Interface (MPI) workloads on Kubernetes, and MPI is heavily used for HPC. The Training Operator implements the V1 API version of the MPI Operator. For the MPI Operator V2 version, please follow this guide to install MPI Operator V2.

The Training Operator lets you train large models on Kubernetes workloads, either through the Kubernetes Custom Resource APIs or through the Training Operator Python SDK.

Prerequisites

Please check the official Kubeflow documentation for prerequisites to install the Training Operator.

Installation

Please follow the Kubeflow Training Operator guide for detailed instructions on how to install the Training Operator.

Installing the Control Plane

Run the following command to install the latest stable release (v1.8.0) of the Training Operator control plane:

kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.0"

Run the following command to install the latest changes of the Training Operator control plane:

kubectl apply --server-side -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
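The pinned command above selects a release tag through the `?ref=` query parameter of the kustomize target. As a minimal sketch of how that target is put together (the helper names `standalone_overlay_url` and `install_command` are hypothetical and not part of any Kubeflow tooling), the argument list could be built like this:

```python
from typing import List, Optional

# Base kustomize target, copied from the pinned install command above.
BASE = "github.com/kubeflow/training-operator.git/manifests/overlays/standalone"


def standalone_overlay_url(ref: Optional[str] = None) -> str:
    """Return the kustomize target, pinned to `ref` when one is given."""
    # With no ref, the target is left unpinned, as in the "latest changes" command.
    return f"{BASE}?ref={ref}" if ref else BASE


def install_command(ref: Optional[str] = None) -> List[str]:
    """Build the kubectl argument list. This only constructs it; nothing is executed."""
    return ["kubectl", "apply", "--server-side", "-k", standalone_overlay_url(ref)]


if __name__ == "__main__":
    print(" ".join(install_command("v1.8.0")))
```

Pinning a tag keeps an installation reproducible, while the unpinned form follows whatever the repository currently contains.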

Installing the Python SDK

The Training Operator implements a Python SDK to simplify creation of distributed training and fine-tuning jobs for Data Scientists.

Run the following command to install the latest stable release of the Training SDK:

pip install -U kubeflow-training
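To confirm which SDK version ended up in the active environment, a small check along these lines may help. It is a sketch that uses only the standard library; the function name `installed_version` is a hypothetical helper, not part of the SDK.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "kubeflow-training") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "kubeflow-training is not installed")
```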

Getting Started

Please refer to the getting started guide to quickly create your first distributed training job using the Python SDK.
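As a rough illustration of what creating a job through the SDK can look like, the sketch below wraps a call to `TrainingClient().create_job`. It assumes the `kubeflow-training` package is installed and that a cluster with the Training Operator is reachable. The job name, the worker count, and the helpers `job_arguments` and `submit` are illustrative assumptions; consult the getting started guide for the authoritative usage.

```python
def train_func():
    """Placeholder training function that the SDK is expected to run in each worker."""
    # Imports live inside the function because only the function body is shipped.
    import os

    # RANK is assumed to be set for each replica by the operator.
    print("worker rank:", os.environ.get("RANK", "unknown"))


def job_arguments(name: str = "pytorch-ddp", num_workers: int = 3) -> dict:
    """Collect the keyword arguments passed to create_job (values are examples)."""
    return {
        "name": name,
        "train_func": train_func,
        "num_workers": num_workers,
        "job_kind": "PyTorchJob",
    }


def submit(**overrides) -> None:
    """Submit the job; requires the SDK and access to a cluster."""
    # Imported lazily so this module still loads where the SDK is absent.
    from kubeflow.training import TrainingClient

    TrainingClient().create_job(**job_arguments(**overrides))
```

Calling `submit()` would create a PyTorchJob named `pytorch-ddp` with three workers, under the assumptions stated above.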

If you want to work directly with Kubernetes Custom Resources provided by Training Operator, follow the PyTorchJob MNIST guide.
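For orientation, a PyTorchJob custom resource has roughly the shape produced by the sketch below, which builds the manifest as a plain dictionary and prints it as JSON (which `kubectl apply -f` also accepts). The job name, namespace, and image are placeholders chosen for illustration, and the helper `pytorchjob_manifest` is hypothetical; the PyTorchJob MNIST guide remains the reference for a working example.

```python
import json


def pytorchjob_manifest(name: str, image: str, workers: int = 2,
                        namespace: str = "kubeflow") -> dict:
    """Build a minimal PyTorchJob manifest with one master and `workers` workers."""

    def replica(count: int) -> dict:
        # Each replica spec wraps an ordinary pod template.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {"containers": [{"name": "pytorch", "image": image}]}
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


if __name__ == "__main__":
    # The image below is a placeholder, not a published image.
    manifest = pytorchjob_manifest("pytorch-simple", "example.com/my-training-image:latest")
    print(json.dumps(manifest, indent=2))
```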

Community

The following links provide information on how to get involved in the community:

Attend the bi-weekly AutoML and Training Working Group community meeting.
Join our #kubeflow-training Slack channel.
Check out who is using the Training Operator.

This project is part of Kubeflow, so please see the README in kubeflow/kubeflow to get in touch with the community.

Contributing

Please refer to the CONTRIBUTING guide.

Change Log

Please refer to the CHANGELOG.

Release

⚠️ EXTREMELY IMPORTANT ⚠️

Whenever you rebase this fork onto a new upstream release, you must update the version in component_metadata.yaml. This version is displayed to customers via the DataScienceCluster resource, so it must remain accurate.

If a new ODH release is planned, ensure the updated version is also posted in the release tracker. See example: Issue #170.

Version Matrix

The following table lists the most recent operator versions, the API version each one serves, and the minimum Kubernetes version each one requires.

| Operator Version | API Version | Kubernetes Version |
| --- | --- | --- |
| `v1.4.x` | `v1` | 1.23+ |
| `v1.5.x` | `v1` | 1.23+ |
| `v1.6.x` | `v1` | 1.23+ |
| `v1.7.x` | `v1` | 1.25+ |
| `v1.8.x` | `v1` | 1.27+ |
| `latest` (master HEAD) | `v1` | 1.27+ |
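The matrix above can be turned into a simple compatibility check. The sketch below encodes the table rows as data; the function names `min_kubernetes` and `is_supported` are hypothetical helpers, and versions not listed in the table are rejected rather than guessed.

```python
from typing import Dict, Tuple

# Minimum Kubernetes (major, minor) per operator release line, from the table above.
MIN_KUBERNETES: Dict[str, Tuple[int, int]] = {
    "v1.4": (1, 23),
    "v1.5": (1, 23),
    "v1.6": (1, 23),
    "v1.7": (1, 25),
    "v1.8": (1, 27),
}


def min_kubernetes(operator_version: str) -> Tuple[int, int]:
    """Return the minimum Kubernetes version for an operator version such as 'v1.8.0'."""
    release_line = ".".join(operator_version.split(".")[:2])
    if release_line not in MIN_KUBERNETES:
        raise KeyError(f"operator version {operator_version!r} is not in the matrix")
    return MIN_KUBERNETES[release_line]


def is_supported(operator_version: str, cluster_version: Tuple[int, int]) -> bool:
    """Check whether a cluster at `cluster_version` meets the operator's minimum."""
    return cluster_version >= min_kubernetes(operator_version)
```

For example, `is_supported("v1.8.0", (1, 26))` evaluates to `False`, because the `v1.8.x` line requires Kubernetes 1.27 or newer.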

Reference

For a complete reference of the custom resource definitions, please refer to the API Definition.

For details on the Training Operator custom resource APIs, refer to the API documentation.

Acknowledgement

This project started as a distributed training operator for TensorFlow. We later merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to everyone who filed issues or helped resolve them, asked and answered questions, and took part in inspiring discussions. We would also like to thank everyone who contributed to and maintained the original operators.

PyTorch Operator: list of contributors and maintainers.
MPI Operator: list of contributors and maintainers.
XGBoost Operator: list of contributors and maintainers.
Common library: list of contributors and maintainers.
