
Cost effective and Scalable Model Inference on AWS Graviton with Ray on EKS

Overview

The solution implements a scalable ML inference architecture on Amazon EKS, using AWS Graviton processors for compute. Models are served with Ray Serve, deployed as containerized workloads in the Kubernetes cluster.

Prerequisites

  1. An EKS cluster with the KubeRay Operator installed
  2. A Karpenter node pool set up for Graviton instances; in this example the node pool carries the label "kubernetes.io/arch: arm64" (see the sketch after this list)
  3. Run all of the following commands from the llamacpp-rayserve-graviton directory
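
A minimal sketch of such a Karpenter node pool, assuming the Karpenter v1 API; the pool name, capacity type, and EC2NodeClass reference are placeholders and should match your own Karpenter setup:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: graviton-pool                      # placeholder name
spec:
  template:
    spec:
      requirements:
        # Restrict provisioning to Graviton (arm64) instances; provisioned nodes
        # then carry the kubernetes.io/arch: arm64 label used in this example.
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]            # assumption; adjust to your needs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                      # placeholder EC2NodeClass name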

Deployment

Deploy an elastic Ray service hosting the Llama 3.2 model on Graviton:

1. Set your Hugging Face token as the HUGGING_FACE_HUB_TOKEN environment variable in the secret section of ray-service-llamacpp.yaml
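
If the file's secret section follows the usual Kubernetes Secret layout, the edit might look roughly like the sketch below; the secret name and field structure here are assumptions, so keep whatever names ray-service-llamacpp.yaml already uses and only replace the token value:

apiVersion: v1
kind: Secret
metadata:
  name: hf-token                  # hypothetical name; keep the one already in the file
type: Opaque
stringData:
  HUGGING_FACE_HUB_TOKEN: <your_hugging_face_token>   # paste your token here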

2. Configure the model and inference parameters in the YAML file (an illustrative excerpt follows the note below):

  • MODEL_ID: Hugging Face model repository
  • MODEL_FILENAME: Model file name in the Hugging Face repo
  • N_THREADS: Number of threads for inference (recommended: match host EC2 instance vCPU count)
  • CMAKE_ARGS: C/C++ compile flags for llama.cpp on Graviton

Note: The example model uses GGUF format, optimized for llama.cpp. See GGUF documentation for details.
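
An illustrative excerpt of how these variables might appear in the container env section of ray-service-llamacpp.yaml; the values shown are placeholders and assumptions (in particular the CMAKE_ARGS flags), so check the shipped file for the actual defaults:

env:
  - name: MODEL_ID
    value: "<hugging_face_repo_id>"            # e.g. a Llama 3.2 GGUF repository
  - name: MODEL_FILENAME
    value: "<model_file>.gguf"                 # GGUF file inside that repository
  - name: N_THREADS
    value: "64"                                # match the vCPU count of the Graviton instance
  - name: CMAKE_ARGS
    value: "-DCMAKE_CXX_FLAGS='-mcpu=native'"  # assumption: flags tuned for Graviton; keep the repo's defaults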

3. Create the Kubernetes service:

kubectl create -f ray-service-llamacpp.yaml
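
Optionally, watch the rollout while the Ray pods start; the commands below assume only the KubeRay RayService CRD from the prerequisites, and the pods may take several minutes to pull the image and download the model before they become ready:

kubectl get rayservice
kubectl get pods -w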

4. Get the Kubernetes service name:

kubectl get svc

How we measure

Our client program generates 20 different prompts per run and issues them at varying concurrency levels. Each run assembles common GenAI-related prompts into standard HTTP requests, and the number of concurrent calls keeps increasing until CPU usage approaches 100%. We capture the total time from when an HTTP request is initiated to when the HTTP response is received as the latency metric, and the number of output tokens generated per second as the throughput metric. The test aims to drive the worker pods to maximum CPU utilization to assess concurrency performance.
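
For reference, a single request of the kind the client assembles can be reproduced with curl once the port-forward from the steps below is active; this assumes the Serve application exposes the OpenAI-compatible /v1/chat/completions route used in the benchmark configuration, and the model name is a placeholder:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model_name>",
        "messages": [{"role": "user", "content": "Explain what a GGUF model file is in one sentence."}],
        "max_tokens": 128
      }'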

Follow the steps below to set up the environment and replicate the experiment.

1. Launch load generator instance

For optimal performance testing, launch a client EC2 instance in the same Availability Zone (AZ) as your Ray cluster. To generate sufficient load, use a compute-optimized instance such as c6i.16xlarge. If worker node CPU utilization remains flat while you increase concurrent requests, your test client is likely reaching its own capacity limits; in that case, launch additional EC2 instances to generate a higher concurrent load.

2. Port-forward the Ray service

kubectl port-forward service/ray-service-llamacpp-serve-svc 8000:8000

3. Configure environment

Install Go on the client EC2 instance (refer to the official Go installation guidance). Then set the following environment variables as the test configuration.

export URL=http://localhost:8000/v1/chat/completions
export REQUESTS_PER_PROMPT=<The_number_of_concurrent_calls>
export NUM_WARMUP_REQUESTS=<The_number_of_warmup_requests>

4. Run test

Run the Go performance test script; the results appear in its output.

go run perf_benchmark.go

Contact

Please contact wangaws@ or fmamazon@ if you want to know more and/or contribute.
