Conversation

@kramaranya
Contributor

I've added get_job_logs API to the OptimizerClient

Working example:

# Imports for TrainJobTemplate, CustomTrainer, and TrainerClient added;
# module paths assumed from the Kubeflow SDK layout.
from kubeflow.optimizer import OptimizerClient, Search, Objective, TrialConfig, TrainJobTemplate
from kubeflow.trainer import CustomTrainer, TrainerClient

def get_torch_dist(learning_rate: str, num_epochs: str):
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print("PyTorch Distributed Environment")
    print(f"WORLD_SIZE: {dist.get_world_size()}")
    print(f"RANK: {dist.get_rank()}")
    print(f"LOCAL_RANK: {os.environ['LOCAL_RANK']}")

    lr = float(learning_rate)
    epochs = int(num_epochs)
    loss = 1.0 - (lr * 2) - (epochs * 0.01)

    if dist.get_rank() == 0:
        print(f"loss={loss}")
    
    dist.barrier()

template = TrainJobTemplate(
    trainer=CustomTrainer(
        func=get_torch_dist,
        func_args={"learning_rate": "0.01", "num_epochs": "5"},
        num_nodes=2,
        resources_per_node={"gpu": 1},
    ),
    runtime=TrainerClient().get_runtime("torch-distributed"),
)

job_id = OptimizerClient().optimize(
    trial_template=template,
    trial_config=TrialConfig(num_trials=10, parallel_trials=2),
    search_space={
        "learning_rate": Search.loguniform(0.001, 0.1),
        "num_epochs": Search.choice([5, 10, 15]),
    },
)

print(f"OptimizationJob created: {job_id}")

print("\n".join(OptimizerClient().get_job_logs(name=job_id)))

/assign @kubeflow/kubeflow-sdk-team

@coveralls

coveralls commented Nov 6, 2025

Pull Request Test Coverage Report for Build 19122484801

Details

  • 8 of 39 (20.51%) changed or added relevant lines in 5 files are covered.
  • 3 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.5%) to 66.827%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---:|---:|---:|
| kubeflow/optimizer/constants/constants.py | 0 | 1 | 0.0% |
| kubeflow/trainer/backends/kubernetes/backend.py | 8 | 10 | 80.0% |
| kubeflow/optimizer/api/optimizer_client.py | 0 | 3 | 0.0% |
| kubeflow/optimizer/backends/base.py | 0 | 4 | 0.0% |
| kubeflow/optimizer/backends/kubernetes/backend.py | 0 | 21 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---:|---:|
| kubeflow/optimizer/api/optimizer_client.py | 1 | 0.0% |
| kubeflow/optimizer/backends/base.py | 1 | 0.0% |
| kubeflow/optimizer/backends/kubernetes/backend.py | 1 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 19117385888 | -0.5% |
| Covered Lines | 2506 |
| Relevant Lines | 3750 |

💛 - Coveralls

def get_job_logs(
self,
name: str,
trial: Optional[str] = None,
Member

Can we use trial_name here?

Suggested change
trial: Optional[str] = None,
trial_name: Optional[str] = None,

def get_job_logs(
self,
name: str,
trial: Optional[str],
Member

Suggested change
trial: Optional[str],
trial_name: Optional[str],

if trial is None:
# Get logs from the best current trial.
best_trial = self.get_best_trial(name)
if best_trial is None:
Member

As we discussed, if the best Trial is empty, let's take the first Trial from the OptimizationJob if the list is not empty.

Contributor Author

Sure, updated in 66927dc
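
For reference, a minimal sketch of that fallback (the get_trials accessor here is hypothetical; the actual change is in 66927dc):

```python
# Sketch only: fall back to the first trial when no best trial exists yet.
if trial is None:
    best_trial = self.get_best_trial(name)
    if best_trial is not None:
        trial = best_trial.name
    else:
        trials = self.get_trials(name)  # hypothetical accessor
        if not trials:
            return
        trial = trials[0].name
```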

# TODO (kramaranya): Consider waiting for best trial when follow=True
return
trial = best_trial.name
logger.info(f"Getting logs from best trial: {trial}")
Member

Let's use .debug here, since we don't use .info logging in the SDK for now.

Suggested change
logger.info(f"Getting logs from best trial: {trial}")
logger.debug(f"Getting logs from best trial: {trial}")

name: str,
trial: Optional[str] = None,
follow: bool = False,
step: str = trainer_constants.NODE + "-0",
Member

Maybe for now, we should remove step from this API?
The problem is that for other Steps (e.g. Pods), the container name is not metrics-logger-and-collector.
As a workaround, users can always use TrainerClient() to get logs for other steps of the TrainJob
(e.g. TrainJob name == Trial name); see the sketch below.
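
A minimal sketch of that workaround, assuming TrainerClient exposes the get_job_logs signature with the step parameter shown in this diff (the trial name is a placeholder):

```python
from kubeflow.trainer import TrainerClient

# Sketch only: the Trial name matches the underlying TrainJob name,
# so other steps can be read through the Trainer client directly.
trial_name = "<trial-name>"  # placeholder
for line in TrainerClient().get_job_logs(name=trial_name, step="node-0"):
    print(line)
```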

Contributor Author

I was thinking about this too, makes sense to me, updated in e1a00b8

return

container_name = constants.METRICS_COLLECTOR_CONTAINER
try:
Member

@andreyvelich andreyvelich Nov 6, 2025

To reduce code duplication, can you wrap this code in a helper function, self.__read_pod_logs(pod_name: str, container_name: str, follow: bool), in the Trainer client and use it as:

yield from self.trainer_backend.__read_pod_logs(
        pod_name=pod_name,
        container_name=container_name,
        follow=follow
    )

Contributor Author

Updated in 49b4da0
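
A rough sketch of what such a helper could look like, assuming the Kubernetes trainer backend holds a CoreV1Api client as self.core_api and a self.namespace attribute (both names assumed here):

```python
from kubernetes import watch

def __read_pod_logs(self, pod_name: str, container_name: str, follow: bool):
    """Sketch only: yield pod log lines via the Kubernetes client."""
    if follow:
        # Stream log lines as they arrive.
        yield from watch.Watch().stream(
            self.core_api.read_namespaced_pod_log,
            name=pod_name,
            namespace=self.namespace,  # assumed attribute
            container=container_name,
        )
    else:
        logs = self.core_api.read_namespaced_pod_log(
            name=pod_name,
            namespace=self.namespace,
            container=container_name,
        )
        yield from logs.splitlines()
```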

Member

@andreyvelich andreyvelich left a comment

@astefanutti
Contributor

Thanks @kramaranya

/lgtm
/approve

I think we need to improve how Katib instruments the train nodes for StdOutCollector and FileCollector metrics, to avoid depending on the metric collector sidecar and to make it work for other configurations like TfEventCollector or PrometheusMetricCollector.

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit d3d2e5b into kubeflow:main on Nov 6, 2025
14 checks passed
google-oss-prow bot added this to the v0.2 milestone on Nov 6, 2025