@abhijeet-dhumal
Contributor

@abhijeet-dhumal abhijeet-dhumal commented Sep 12, 2025

What this PR does / why we need it:

Fixes #87, #92, #116

Quick start:

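from kubeflow.trainer import TrainerClient, CustomTrainer
from kubeflow.trainer.options import Name, Labels


def my_function():
    """Placeholder training function; replace with your own training code."""
    print("training...")


client = TrainerClient()
client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(func=my_function),
    # Options are applied to the TrainJob before it is submitted.
    options=[
        Name(name="my-job"),
        Labels(labels={"team": "ml"}),
    ],
)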

Common Options:

from kubeflow.trainer.options import Name

Kubernetes Options

from kubeflow.trainer.options.kubernetes import (
    Name,                # Job name
    Labels,              # Job metadata labels
    Annotations,         # Job metadata annotations
    SpecLabels,          # Labels for pods/jobsets
    SpecAnnotations,     # Annotations for pods/jobsets
    PodTemplateOverrides,  # Pod customization
    TrainerCommand,      # Override container command (CustomTrainerContainer only)
    TrainerArgs,         # Override container args (CustomTrainerContainer only)
)

Basic config:

from kubeflow.trainer.options import Name, Labels

options=[
    Name(name="my-job"),
    Labels(labels={"team": "ml-team", "project": "nlp"})
]

Custom container specific options:

from kubeflow.trainer.options import TrainerCommand, TrainerArgs

options=[
    TrainerCommand(["python", "-m", "torch.distributed.run"]),
    TrainerArgs(["train.py", "--epochs=10"])
]

Pod customization:

from kubeflow.trainer.options import (
    PodTemplateOverride,
    PodTemplateOverrides,
    PodSpecOverride
)

options=[
    PodTemplateOverrides(
        PodTemplateOverride(
            target_jobs=["node"],
            spec=PodSpecOverride(
                node_selector={"gpu-type": "a100"},
                volumes=[{
                    "name": "data",
                    "persistentVolumeClaim": {"claimName": "my-data"}
                }]
            )
        )
    )
]

Creating custom options:

from dataclasses import dataclass
from typing import Any

@dataclass
class MyCustomOption:
    """My custom option for Kubernetes jobs."""
    
    timeout_seconds: int
    
    def __call__(self, job_spec: dict[str, Any], trainer, backend) -> None:
        # Validate backend compatibility
        from kubeflow.trainer.backends.kubernetes.backend import KubernetesBackend
        if not isinstance(backend, KubernetesBackend):
            raise ValueError("MyCustomOption only works with Kubernetes")
        
        # Apply your changes
        spec = job_spec.setdefault("spec", {})
        spec["activeDeadlineSeconds"] = self.timeout_seconds

client.train(
    trainer=my_trainer,
    options=[MyCustomOption(timeout_seconds=3600)]
)
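To make the calling convention concrete, here is a minimal, self-contained sketch of how callable options could be applied to a job spec. It does not use the SDK: `KubernetesBackendStub`, `LocalProcessBackendStub`, `TimeoutOption`, and `apply_options` are hypothetical stand-ins written for illustration only, and the real backend may apply options differently.

```python
from dataclasses import dataclass
from typing import Any


class KubernetesBackendStub:
    """Stand-in for a Kubernetes backend (hypothetical, not the SDK class)."""


class LocalProcessBackendStub:
    """Stand-in for a local-process backend (hypothetical, not the SDK class)."""


@dataclass
class TimeoutOption:
    """Illustrative option that sets a deadline on the job spec."""

    timeout_seconds: int

    def __call__(self, job_spec: dict[str, Any], trainer: Any, backend: Any) -> None:
        # Reject backends this option does not support.
        if not isinstance(backend, KubernetesBackendStub):
            raise ValueError("TimeoutOption only works with Kubernetes")
        # Mutate the shared job spec in place.
        job_spec.setdefault("spec", {})["activeDeadlineSeconds"] = self.timeout_seconds


def apply_options(options: list[Any], trainer: Any, backend: Any) -> dict[str, Any]:
    """Call each option in order against a fresh job spec and return it."""
    job_spec: dict[str, Any] = {}
    for option in options:
        option(job_spec, trainer, backend)
    return job_spec


spec = apply_options([TimeoutOption(3600)], trainer=None, backend=KubernetesBackendStub())
print(spec)  # {'spec': {'activeDeadlineSeconds': 3600}}
```

Passing `LocalProcessBackendStub()` instead raises `ValueError`, which is the backend-compatibility check from the example above in action.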

Overrides:

PodTemplateOverride Fields

PodTemplateOverride(
    target_jobs=["node", "launcher"],  # Which pods to apply to
    metadata={...},                     # Pod labels/annotations
    spec=PodSpecOverride(...)          # Pod spec customizations
)

PodSpecOverride Fields

PodSpecOverride(
    service_account_name="...",
    node_selector={...},
    tolerations=[...],
    affinity={...},
    volumes=[...],
    init_containers=[...],
    containers=[...],
    scheduling_gates=[...],
    image_pull_secrets=[...]
)

ContainerOverride Fields

ContainerOverride(
    name="trainer",
    env=[...],
    volume_mounts=[...]
)
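When `volumes` on a PodSpecOverride are combined with `volume_mounts` on a ContainerOverride, Kubernetes requires every mount name to match a volume declared in the same pod spec. A small pre-submission check can catch mismatches early. This is a sketch on plain dicts in Kubernetes format; the helper `find_unmatched_mounts` is hypothetical and not part of the SDK, and it assumes mounts are passed as dicts with a `name` key.

```python
from typing import Any


def find_unmatched_mounts(
    volumes: list[dict[str, Any]],
    volume_mounts: list[dict[str, Any]],
) -> list[str]:
    """Return the names of mounts that reference no declared volume."""
    declared = {volume["name"] for volume in volumes}
    return [mount["name"] for mount in volume_mounts if mount["name"] not in declared]


volumes = [{"name": "data", "persistentVolumeClaim": {"claimName": "my-data"}}]
mounts = [
    {"name": "data", "mountPath": "/data"},
    {"name": "cache", "mountPath": "/cache"},  # no matching volume declared
]

print(find_unmatched_mounts(volumes, mounts))  # ['cache']
```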

Checklist:

  • Docs included if any changes are user-facing

@abhijeet-dhumal abhijeet-dhumal changed the title Add labels and annotations support for train client feast: Add labels and annotations support for train client Sep 12, 2025
@abhijeet-dhumal abhijeet-dhumal changed the title feast: Add labels and annotations support for train client feat: Add labels and annotations support for train client Sep 15, 2025
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 67d12d8 to b39b364 Compare September 15, 2025 05:22
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 2 times, most recently from 2d7a8e6 to dbba135 Compare September 15, 2025 05:44
@abhijeet-dhumal abhijeet-dhumal changed the title feat: Add labels and annotations support for train client feat: Implement Training Options pattern with WithLabels, WithAnnotations, and WithPodSpecOverrides for flexible TrainJob customization Sep 15, 2025
@abhijeet-dhumal abhijeet-dhumal changed the title feat: Implement Training Options pattern with WithLabels, WithAnnotations, and WithPodSpecOverrides for flexible TrainJob customization feat: Implement Training Options pattern for flexible TrainJob customization Sep 15, 2025
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 36c0160 to 95155f6 Compare September 15, 2025 10:41
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review September 15, 2025 11:23
Member

@szaher szaher left a comment

Thanks @abhijeet-dhumal

I believe we need to decide how options are applied across backends: either we ignore options for the local process backend, or we make options target a specific backend.

@abhijeet-dhumal abhijeet-dhumal marked this pull request as draft September 17, 2025 13:17
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 3 times, most recently from f092845 to 03994e7 Compare September 24, 2025 11:40
@coveralls

coveralls commented Sep 24, 2025

Pull Request Test Coverage Report for Build 19096824715

Details

  • 390 of 463 (84.23%) changed or added relevant lines in 11 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-11.6%) to 68.007%

Changes Missing Coverage

File | Covered Lines | Changed/Added Lines | %
kubeflow/trainer/api/trainer_client.py | 0 | 1 | 0.0%
kubeflow/trainer/options/kubernetes_test.py | 85 | 86 | 98.84%
kubeflow/trainer/backends/localprocess/backend.py | 17 | 19 | 89.47%
kubeflow/trainer/backends/localprocess/backend_test.py | 111 | 114 | 97.37%
kubeflow/trainer/backends/kubernetes/backend.py | 25 | 32 | 78.13%
kubeflow/trainer/options/kubernetes.py | 117 | 176 | 66.48%

Files with Coverage Reduction

File | New Missed Lines | %
kubeflow/trainer/backends/kubernetes/backend.py | 1 | 79.92%

Totals

Change from base Build 19075264540: -11.6%
Covered Lines: 2487
Relevant Lines: 3657

💛 - Coveralls

@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 7e16a13 to b796722 Compare November 5, 2025 09:04
…h podTemplateOverrides

Signed-off-by: Abhijeet Dhumal <[email protected]>
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from b796722 to dc0fb12 Compare November 5, 2025 09:07
@astefanutti
Contributor

Thanks @abhijeet-dhumal for this outstanding contribution (and your patience 😅)!

/lgtm

/assign @andreyvelich @kramaranya

Contributor

@kramaranya kramaranya left a comment

Thank you for this amazing work, @abhijeet-dhumal! 🎉
/lgtm

Also, we just beat the record for the most commented PR :)

@andreyvelich
Member

Thank you for this amazing contribution @abhijeet-dhumal!
/lgtm
/approve
/hold cancel

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 0531999 into kubeflow:main Nov 5, 2025
14 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.2 milestone Nov 5, 2025
@abhijeet-dhumal
Contributor Author

Thanks @andreyvelich @astefanutti @kramaranya, I am delighted to have made this contribution. Thank you to all the reviewers for their help.
As a next step, I will also open a PR in https://github.com/kubeflow/website/tree/master/content/en/docs/components/trainer/user-guides to document the training options available via the SDK 🏁
Thanks 🥳

Development

Successfully merging this pull request may close these issues.

Expose TrainJob labels and annotations in the SDK

8 participants