SageMaker HyperPod command-line interface

The Amazon SageMaker HyperPod command-line interface (HyperPod CLI) is a tool that helps manage clusters, training jobs, and inference endpoints on the SageMaker HyperPod clusters orchestrated by Amazon EKS.

This documentation serves as a reference for the available HyperPod CLI commands. For a comprehensive user guide, see Orchestrating SageMaker HyperPod clusters with Amazon EKS in the Amazon SageMaker Developer Guide.

Note: The older HyperPod CLI V2 has been moved to the release_v2 branch. Please refer to the release_v2 branch for its usage.

Overview

The SageMaker HyperPod CLI is a tool that helps create training jobs and inference endpoint deployments on Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides a set of commands for managing the full lifecycle of jobs, including create, describe, list, and delete operations, as well as access to pod and operator logs where applicable. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core job-management actions on SageMaker HyperPod clusters orchestrated by Amazon EKS.

Prerequisites

Region Configuration

Important: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.

Prerequisites for Training

  • The HyperPod CLI currently supports starting PyTorchJobs. To start a job, you need to install the Training Operator first.

Prerequisites for Inference

  • The HyperPod CLI supports creating inference endpoints from SageMaker JumpStart models and from custom endpoint configurations.

Platform Support

The SageMaker HyperPod CLI currently supports Linux and macOS. Windows is not supported at this time.

ML Framework Support

The SageMaker HyperPod CLI currently supports starting training jobs with:

  • PyTorch. Version requirement: PyTorch >= 1.10

Installation

  1. Make sure that your local Python version is 3.8, 3.9, 3.10, or 3.11.

  2. Install the sagemaker-hyperpod package.

    pip install sagemaker-hyperpod
  3. Verify that the installation succeeded by running the following command.

    hyp --help

Usage

The HyperPod CLI provides the following commands:

Getting Started

Getting Cluster information

This command lists the available SageMaker HyperPod clusters and their capacity information.

hyp list-cluster
Option Type Description
--region <region> Optional The region where the SageMaker HyperPod and EKS clusters are located. If not specified, the region from the current AWS account credentials is used.
--namespace <namespace> Optional The namespace to check quota against. Only SageMaker-managed namespaces are supported.
--output <json|table> Optional The output format. Available values are table and json. The default value is json.
--debug Optional Enable debug mode for detailed logging.
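
For example, a minimal invocation that combines these options (the region value is a placeholder, not one taken from this guide):

hyp list-cluster --region us-west-2 --output table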

Connecting to a Cluster

This command configures the local kubectl environment to interact with the specified SageMaker HyperPod cluster and namespace.

hyp set-cluster-context --cluster-name <cluster-name>
Option Type Description
--cluster-name <cluster-name> Required The SageMaker HyperPod cluster name to configure with.
--namespace <namespace> Optional The namespace that you want to connect to. If not specified, HyperPod CLI commands will auto-discover an accessible namespace.
--region <region> Optional The AWS region where the HyperPod cluster resides.
--debug Optional Enable debug mode for detailed logging.
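
For example, to connect to a cluster in a specific namespace and region (all values below are placeholders):

hyp set-cluster-context --cluster-name my-hyperpod-cluster --namespace my-namespace --region us-west-2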

Getting Cluster Context

Get all context related to the currently configured cluster.

hyp get-cluster-context
Option Type Description
--debug Optional Enable debug mode for detailed logging.

CLI

Cluster Management

Important: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.

Cluster stack names must be unique within each AWS region. If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.

Initialize Cluster Configuration

Initialize a new cluster configuration in the current directory:

hyp init cluster-stack

Important: The resource_name_prefix parameter in the generated config.yaml file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. This prefix is automatically appended with a unique identifier during cluster creation to ensure resource uniqueness.

Configure Cluster Parameters

Configure cluster parameters interactively or via command line:

hyp configure --resource-name-prefix my-cluster --stage prod

Validate Configuration

Validate the configuration file syntax:

hyp validate

Create Cluster Stack

Create the cluster stack using the configured parameters:

hyp create --region <region>

Note: The region flag is optional. If not provided, the command will use the default region from your AWS credentials configuration.

List Cluster Stacks

hyp list cluster-stack
Option Type Description
--region <region> Optional The AWS region to list stacks from.
--status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']" Optional Filter by stack status.
--debug Optional Enable debug mode for detailed logging.
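
For example, to list only successfully created or updated stacks in a specific region (the region value is a placeholder), pass the status filter using the list syntax shown above:

hyp list cluster-stack --region us-west-2 --status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']"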

Describe Cluster Stack

hyp describe cluster-stack <stack-name>
Option Type Description
--region <region> Optional The AWS region where the stack exists.
--debug Optional Enable debug mode for detailed logging.

Delete Cluster Stack

Delete a HyperPod cluster stack. Removes the specified CloudFormation stack and all associated AWS resources. This operation cannot be undone.

hyp delete cluster-stack <stack-name>
Option Type Description
--region <region> Required The AWS region where the stack exists.
--retain-resources S3Bucket-TrainingData,EFSFileSystem-Models Optional Comma-separated list of logical resource IDs to retain during deletion (only works on DELETE_FAILED stacks). Resource names are shown in the failed deletion output, or use the AWS CLI: aws cloudformation list-stack-resources --stack-name STACK_NAME --region REGION.
--debug Optional Enable debug mode for detailed logging.
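
For example, to retry deleting a stack that previously failed while retaining selected resources (the stack name and logical resource IDs are placeholders):

hyp delete cluster-stack my-stack --region us-west-2 --retain-resources S3Bucket-TrainingData,EFSFileSystem-Models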

Update Existing Cluster

hyp update cluster --cluster-name my-cluster \
    --instance-groups '[{"InstanceCount":2,"InstanceGroupName":"worker-nodes","InstanceType":"ml.m5.large"}]' \
    --node-recovery Automatic

Reset Configuration

Reset configuration to default values:

hyp reset

Training

Option 1: Create a PyTorch job through the init experience

Initialize PyTorch Job Configuration

Initialize a new PyTorch job configuration in the current directory:

hyp init hyp-pytorch-job

Configure PyTorch Job Parameters

Configure PyTorch job parameters interactively or via command line:

hyp configure --job-name my-pytorch-job

Validate Configuration

Validate the configuration file syntax:

hyp validate

Create PyTorch Job

Create the PyTorch job using the configured parameters:

hyp create

Option 2: Create a PyTorch job through the create command

hyp create hyp-pytorch-job \
    --version 1.0 \
    --job-name test-pytorch-job \
    --image pytorch/pytorch:latest \
    --command '[python, train.py]' \
    --args '[--epochs=10, --batch-size=32]' \
    --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.p4d.24xlarge \
    --tasks-per-node 8 \
    --label-selector '{"accelerator": "nvidia", "network": "efa"}' \
    --deep-health-check-passed-nodes-only true \
    --scheduler-type "kueue" \
    --queue-name "training-queue" \
    --priority "high" \
    --max-retry 3 \
    --accelerators 8 \
    --vcpu 96.0 \
    --memory 1152.0 \
    --accelerators-limit 8 \
    --vcpu-limit 96.0 \
    --memory-limit 1152.0 \
    --preferred-topology "topology.kubernetes.io/zone=us-west-2a" \
    --volume name=model-data,type=hostPath,mount_path=/data,path=/data \
    --volume name=training-output,type=pvc,mount_path=/data2,claim_name=my-pvc,read_only=false
Parameter Type Required Description
--job-name TEXT Yes Unique name for the training job (1-63 characters, alphanumeric with hyphens)
--image TEXT Yes Docker image URI containing your training code
--namespace TEXT No Kubernetes namespace
--command ARRAY No Command to run in the container (array of strings)
--args ARRAY No Arguments for the entry script (array of strings)
--environment OBJECT No Environment variables as key-value pairs
--pull-policy TEXT No Image pull policy (Always, Never, IfNotPresent)
--instance-type TEXT No Instance type for training
--node-count INTEGER No Number of nodes (minimum: 1)
--tasks-per-node INTEGER No Number of tasks per node (minimum: 1)
--label-selector OBJECT No Node label selector as key-value pairs
--deep-health-check-passed-nodes-only BOOLEAN No Schedule pods only on nodes that passed deep health check (default: false)
--scheduler-type TEXT No Scheduler type
--queue-name TEXT No Queue name for job scheduling (1-63 characters, alphanumeric with hyphens)
--priority TEXT No Priority class for job scheduling
--max-retry INTEGER No Maximum number of job retries (minimum: 0)
--volume ARRAY No List of volume configurations (refer to Volume Configuration for detailed parameter info)
--service-account-name TEXT No Service account name
--accelerators INTEGER No Number of accelerators (GPUs or Trainium chips)
--vcpu FLOAT No Number of vCPUs
--memory FLOAT No Amount of memory in GiB
--accelerators-limit INTEGER No Limit on the number of accelerators (GPUs or Trainium chips)
--vcpu-limit FLOAT No Limit for the number of vCPUs
--memory-limit FLOAT No Limit for the amount of memory in GiB
--preferred-topology TEXT No Preferred topology annotation for scheduling
--required-topology TEXT No Required topology annotation for scheduling
--debug FLAG No Enable debug mode (default: false)
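
A minimal sketch using only the required parameters plus --node-count (the job name, image, and node count are placeholder values):

hyp create hyp-pytorch-job \
    --job-name simple-pytorch-job \
    --image pytorch/pytorch:latest \
    --node-count 2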

List Training Jobs

hyp list hyp-pytorch-job

Describe a Training Job

hyp describe hyp-pytorch-job --job-name <job-name>

Listing Pods

This command lists all the pods associated with a specific training job.

hyp list-pods hyp-pytorch-job --job-name <job-name>
  • job-name (string) - Required. The name of the job to list pods for.

Accessing Logs

This command retrieves the logs for a specific pod within a training job.

hyp get-logs hyp-pytorch-job --pod-name <pod-name> --job-name <job-name>
Parameter Required Description
--job-name Yes The name of the job to get the log for.
--pod-name Yes The name of the pod to get the log from.
--namespace No The namespace of the job. Defaults to 'default'.
--container No The container name to get logs from.
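
For example, with placeholder job and pod names and an explicit namespace:

hyp get-logs hyp-pytorch-job \
    --job-name test-pytorch-job \
    --pod-name test-pytorch-job-pod-0 \
    --namespace default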

Get Operator Logs

hyp get-operator-logs hyp-pytorch-job --since-hours 0.5

Delete a Training Job

hyp delete hyp-pytorch-job --job-name <job-name>

Inference

JumpStart Endpoint Creation

Option 1: Create a JumpStart endpoint through the init experience

Initialize JumpStart Endpoint Configuration

Initialize a new JumpStart endpoint configuration in the current directory:

hyp init hyp-jumpstart-endpoint

Configure JumpStart Endpoint Parameters

Configure JumpStart endpoint parameters interactively or via command line:

hyp configure --endpoint-name my-jumpstart-endpoint

Validate Configuration

Validate the configuration file syntax:

hyp validate

Create JumpStart Endpoint

Create the JumpStart endpoint using the configured parameters:

hyp create

Option 2: Create a JumpStart endpoint through the create command

Pre-trained JumpStart model IDs can be found at https://sagemaker.readthedocs.io/en/v2.82.0/doc_utils/jumpstart.html and passed to the endpoint creation call.

hyp create hyp-jumpstart-endpoint \
    --version 1.0 \
    --model-id jumpstart-model-id \
    --instance-type ml.g5.8xlarge \
    --endpoint-name endpoint-jumpstart
Parameter Type Required Description
--model-id TEXT Yes JumpStart model identifier (1-63 characters, alphanumeric with hyphens)
--instance-type TEXT Yes EC2 instance type for inference (must start with "ml.")
--namespace TEXT No Kubernetes namespace
--metadata-name TEXT No Name of the JumpStart endpoint object
--accept-eula BOOLEAN No Whether model terms of use have been accepted (default: false)
--model-version TEXT No Semantic version of the model (e.g., "1.0.0", 5-14 characters)
--endpoint-name TEXT No Name of SageMaker endpoint (1-63 characters, alphanumeric with hyphens)
--tls-certificate-output-s3-uri TEXT No S3 URI to write the TLS certificate (optional)
--debug FLAG No Enable debug mode (default: false)
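
For gated models, the table above also allows accepting the model EULA and pinning a model version. A hedged variant of the earlier command (the model ID and version are placeholders):

hyp create hyp-jumpstart-endpoint \
    --model-id jumpstart-model-id \
    --model-version 1.0.0 \
    --accept-eula true \
    --instance-type ml.g5.8xlarge \
    --endpoint-name endpoint-jumpstart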

Invoke a JumpStart Model Endpoint

hyp invoke hyp-jumpstart-endpoint \
    --endpoint-name endpoint-jumpstart \
    --body '{"inputs":"What is the capital of USA?"}'

Managing an Endpoint

hyp list hyp-jumpstart-endpoint
hyp describe hyp-jumpstart-endpoint --name endpoint-jumpstart

List Pods

hyp list-pods hyp-jumpstart-endpoint

Get Logs

hyp get-logs hyp-jumpstart-endpoint --pod-name <pod-name>

Get Operator Logs

hyp get-operator-logs hyp-jumpstart-endpoint --since-hours 0.5

Deleting an Endpoint

hyp delete hyp-jumpstart-endpoint --name endpoint-jumpstart

Custom Endpoint Creation

Option 1: Create a custom endpoint through the init experience

Initialize Custom Endpoint Configuration

Initialize a new custom endpoint configuration in the current directory:

hyp init hyp-custom-endpoint

Configure Custom Endpoint Parameters

Configure custom endpoint parameters interactively or via command line:

hyp configure --endpoint-name my-custom-endpoint

Validate Configuration

Validate the configuration file syntax:

hyp validate

Create Custom Endpoint

Create the custom endpoint using the configured parameters:

hyp create

Option 2: Create a custom endpoint through the create command

hyp create hyp-custom-endpoint \
    --version 1.0 \
    --endpoint-name endpoint-custom \
    --model-name my-pytorch-model \
    --model-source-type s3 \
    --model-location my-pytorch-training \
    --model-volume-mount-name test-volume \
    --s3-bucket-name your-bucket \
    --s3-region us-east-1 \
    --instance-type ml.g5.8xlarge \
    --image-uri 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:latest \
    --container-port 8080
Parameter Type Required Description
--instance-type TEXT Yes EC2 instance type for inference (must start with "ml.")
--model-name TEXT Yes Name of model to create on SageMaker (1-63 characters, alphanumeric with hyphens)
--model-source-type TEXT Yes Model source type ("s3" or "fsx")
--image-uri TEXT Yes Docker image URI for inference
--container-port INTEGER Yes Port on which model server listens (1-65535)
--model-volume-mount-name TEXT Yes Name of the model volume mount
--namespace TEXT No Kubernetes namespace
--metadata-name TEXT No Name of the custom endpoint object
--endpoint-name TEXT No Name of SageMaker endpoint (1-63 characters, alphanumeric with hyphens)
--env OBJECT No Environment variables as key-value pairs
--metrics-enabled BOOLEAN No Enable metrics collection (default: false)
--model-version TEXT No Version of the model (semantic version format)
--model-location TEXT No Specific model data location
--prefetch-enabled BOOLEAN No Whether to pre-fetch model data (default: false)
--tls-certificate-output-s3-uri TEXT No S3 URI for TLS certificate output
--fsx-dns-name TEXT No FSx File System DNS Name
--fsx-file-system-id TEXT No FSx File System ID
--fsx-mount-name TEXT No FSx File System Mount Name
--s3-bucket-name TEXT No S3 bucket location
--s3-region TEXT No S3 bucket region
--model-volume-mount-path TEXT No Path inside container for model volume (default: "/opt/ml/model")
--resources-limits OBJECT No Resource limits for the worker
--resources-requests OBJECT No Resource requests for the worker
--dimensions OBJECT No CloudWatch Metric dimensions as key-value pairs
--metric-collection-period INTEGER No Period for CloudWatch query (default: 300)
--metric-collection-start-time INTEGER No StartTime for CloudWatch query (default: 300)
--metric-name TEXT No Metric name to query for CloudWatch trigger
--metric-stat TEXT No Statistics metric for CloudWatch (default: "Average")
--metric-type TEXT No Type of metric for HPA ("Value" or "Average", default: "Average")
--min-value NUMBER No Minimum metric value for empty CloudWatch response (default: 0)
--cloud-watch-trigger-name TEXT No Name for the CloudWatch trigger
--cloud-watch-trigger-namespace TEXT No AWS CloudWatch namespace for the metric
--target-value NUMBER No Target value for the CloudWatch metric
--use-cached-metrics BOOLEAN No Enable caching of metric values (default: true)
--invocation-endpoint TEXT No Invocation endpoint path (default: "invocations")
--debug FLAG No Enable debug mode (default: false)

Invoke a Custom Inference Endpoint

hyp invoke hyp-custom-endpoint \
    --endpoint-name endpoint-custom-pytorch \
    --body '{"inputs":"What is the capital of USA?"}'

Managing an Endpoint

hyp list hyp-custom-endpoint
hyp describe hyp-custom-endpoint --name endpoint-custom

List Pods

hyp list-pods hyp-custom-endpoint

Get Logs

hyp get-logs hyp-custom-endpoint --pod-name <pod-name>

Get Operator Logs

hyp get-operator-logs hyp-custom-endpoint --since-hours 0.5

Deleting an Endpoint

hyp delete hyp-custom-endpoint --name endpoint-custom

SDK

Along with the CLI, an SDK is available that provides the same cluster management, training, and inference functionality.

Cluster Management SDK

Creating a Cluster Stack

from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

# Initialize cluster stack configuration
cluster_stack = HpClusterStack(
    stage="prod",
    resource_name_prefix="my-hyperpod",
    hyperpod_cluster_name="my-hyperpod-cluster",
    eks_cluster_name="my-hyperpod-eks",
    
    # Infrastructure components
    create_vpc_stack=True,
    create_eks_cluster_stack=True,
    create_hyperpod_cluster_stack=True,
    
    # Network configuration
    vpc_cidr="10.192.0.0/16",
    availability_zone_ids=["use2-az1", "use2-az2"],
    
    # Instance group configuration
    instance_group_settings=[
        {
            "InstanceCount": 1,
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.t3.medium",
            "TargetAvailabilityZoneId": "use2-az2"
        }
    ]
)

# Create the cluster stack
response = cluster_stack.create(region="us-east-2")

Listing Cluster Stacks

# List all cluster stacks
stacks = HpClusterStack.list(region="us-east-2")
print(f"Found {len(stacks['StackSummaries'])} stacks")

Describing a Cluster Stack

# Describe a specific cluster stack
stack_info = HpClusterStack.describe("my-stack-name", region="us-east-2")
print(f"Stack status: {stack_info['Stacks'][0]['StackStatus']}")

Monitoring Cluster Status

from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

# Create a cluster stack, then poll its deployment status
stack = HpClusterStack()
response = stack.create(region="us-west-2")
status = stack.get_status(region="us-west-2")
print(status)

Training SDK

Creating a Training Job

from sagemaker.hyperpod.training.hyperpod_pytorch_job import HyperPodPytorchJob
from sagemaker.hyperpod.training.config.hyperpod_pytorch_job_unified_config import (
    ReplicaSpec, Template, Spec, Containers, Resources, RunPolicy
)
from sagemaker.hyperpod.common.config.metadata import Metadata

# Define job specifications
nproc_per_node = "1"  # Number of processes per node

replica_specs = [
    ReplicaSpec(
        name="pod",  # Replica name
        template=Template(
            spec=Spec(
                containers=[
                    Containers(
                        # Container name
                        name="container-name",
                        # Training image
                        image="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest",
                        # Always pull image
                        image_pull_policy="Always",
                        resources=Resources(
                            # No GPUs requested
                            requests={"nvidia.com/gpu": "0"},
                            # No GPU limit
                            limits={"nvidia.com/gpu": "0"},
                        ),
                        # Command to run
                        command=["python", "train.py"],
                        # Script arguments
                        args=["--epochs", "10", "--batch-size", "32"],
                    )
                ]
            )
        ),
    )
]

# Keep pods after completion
run_policy = RunPolicy(clean_pod_policy="None")

# Create and start the PyTorch job
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="demo"),  # Job name
    nproc_per_node=nproc_per_node,   # Processes per node
    replica_specs=replica_specs,     # Replica specifications
    run_policy=run_policy,           # Run policy
)

# Launch the job
pytorch_job.create()

List Training Jobs

from sagemaker.hyperpod.training import HyperPodPytorchJob
import yaml

# List all PyTorch jobs
jobs = HyperPodPytorchJob.list()
print(yaml.dump(jobs))

Describe a Training Job

from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get an existing job
job = HyperPodPytorchJob.get(name="my-pytorch-job")

print(job)

List Pods for a Training Job

from sagemaker.hyperpod.training import HyperPodPytorchJob

# List Pods for an existing job
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.list_pods())

Get Logs from a Pod

from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get pod logs for a job
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.get_logs_from_pod("pod-name"))

Get Training Operator Logs

from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get training operator logs
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.get_operator_logs(since_hours=0.1))

Delete a Training Job

from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get an existing job
job = HyperPodPytorchJob.get(name="my-pytorch-job")

# Delete the job
job.delete()

Inference SDK

Creating a JumpStart Model Endpoint

Pre-trained JumpStart model IDs can be found at https://sagemaker.readthedocs.io/en/v2.82.0/doc_utils/jumpstart.html and passed to the endpoint creation call.

from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

model=Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b'
)
server=Server(
    instance_type='ml.g5.8xlarge',
)
endpoint_name=SageMakerEndpoint(name='<my-endpoint-name>')

js_endpoint=HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name
)

js_endpoint.create()

Creating a Custom Inference Endpoint (with S3)

from sagemaker.hyperpod.inference.config.hp_endpoint_config import CloudWatchTrigger, Dimensions, AutoScalingSpec, Metrics, S3Storage, ModelSourceConfig, TlsConfig, EnvironmentVariables, ModelInvocationPort, ModelVolumeMount, Resources, Worker
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

model_source_config = ModelSourceConfig(
    model_source_type='s3',
    model_location="<my-model-folder-in-s3>",
    s3_storage=S3Storage(
        bucket_name='<my-model-artifacts-bucket>',
        region='us-east-2',
    ),
)

environment_variables = [
    EnvironmentVariables(name="HF_MODEL_ID", value="/opt/ml/model"),
    EnvironmentVariables(name="SAGEMAKER_PROGRAM", value="inference.py"),
    EnvironmentVariables(name="SAGEMAKER_SUBMIT_DIRECTORY", value="/opt/ml/model/code"),
    EnvironmentVariables(name="MODEL_CACHE_ROOT", value="/opt/ml/model"),
    EnvironmentVariables(name="SAGEMAKER_ENV", value="1"),
]

worker = Worker(
    image='763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0',
    model_volume_mount=ModelVolumeMount(
        name='model-weights',
    ),
    model_invocation_port=ModelInvocationPort(container_port=8080),
    resources=Resources(
            requests={"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
            limits={"nvidia.com/gpu": 1}
    ),
    environment_variables=environment_variables,
)

tls_config=TlsConfig(tls_certificate_output_s3_uri='s3://<my-tls-bucket-name>')

custom_endpoint = HPEndpoint(
    endpoint_name='<my-endpoint-name>',
    instance_type='ml.g5.8xlarge',
    model_name='deepseek15b-test-model-name',  
    tls_config=tls_config,
    model_source_config=model_source_config,
    worker=worker,
)

custom_endpoint.create()

List Endpoints

from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# List JumpStart endpoints
jumpstart_endpoints = HPJumpStartEndpoint.list()
print(jumpstart_endpoints)

# List custom endpoints
custom_endpoints = HPEndpoint.list()
print(custom_endpoints)

Describe an Endpoint

from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Get JumpStart endpoint details
jumpstart_endpoint = HPJumpStartEndpoint.get(name="js-endpoint-name", namespace="test")
print(jumpstart_endpoint)

# Get custom endpoint details
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
print(custom_endpoint)

Invoke an Endpoint

from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

data = '{"inputs":"What is the capital of USA?"}'
jumpstart_endpoint = HPJumpStartEndpoint.get(name="endpoint-jumpstart")
response = jumpstart_endpoint.invoke(body=data).body.read()
print(response)

custom_endpoint = HPEndpoint.get(name="endpoint-custom")
response = custom_endpoint.invoke(body=data).body.read()
print(response)

List Pods

from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# List pods 
js_pods = HPJumpStartEndpoint.list_pods()
print(js_pods)

c_pods = HPEndpoint.list_pods()
print(c_pods)

Get Logs

from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Get logs from a pod
js_logs = HPJumpStartEndpoint.get_logs(pod="<pod-name>")
print(js_logs)

c_logs = HPEndpoint.get_logs(pod="<pod-name>")
print(c_logs)

Get Operator Logs

from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Get operator logs for JumpStart endpoints
print(HPJumpStartEndpoint.get_operator_logs(since_hours=0.1))

# Get operator logs for custom endpoints
print(HPEndpoint.get_operator_logs(since_hours=0.1))

Delete an Endpoint

from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Delete JumpStart endpoint
jumpstart_endpoint = HPJumpStartEndpoint.get(name="endpoint-jumpstart")
jumpstart_endpoint.delete()

# Delete custom endpoint
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
custom_endpoint.delete()

Observability - Getting Monitoring Information

from sagemaker.hyperpod.observability.utils import get_monitoring_config
monitor_config = get_monitoring_config()

Examples

Cluster Management Example Notebooks

CLI Cluster Management Example

SDK Cluster Management Example

Training Example Notebooks

CLI Training Init Experience Example

CLI Training Example

SDK Training Example

Inference Example Notebooks

CLI

CLI Inference Jumpstart Model Init Experience Example

CLI Inference JumpStart Model Example

CLI Inference FSX Model Example

CLI Inference S3 Model Init Experience Example

CLI Inference S3 Model Example

SDK

SDK Inference JumpStart Model Example

SDK Inference FSX Model Example

SDK Inference S3 Model Example

Disclaimer

  • The CLI and SDK require access to the user's file system to set and get context and to function properly. They need to read configuration files such as your kubeconfig to establish the necessary environment settings.

Working behind a proxy server?

  • Follow the steps here to set up HTTP proxy connections.
