The solution implements a scalable ML inference architecture on Amazon EKS, leveraging AWS Graviton processors. It uses Ray Serve for model serving, deployed as containerized workloads within the Kubernetes environment.
- EKS cluster with KubeRay Operator installed
- Karpenter node pool set up for Graviton instances; in this example the node pool carries the label "kubernetes.io/arch: arm64" (see the sketch after this list)
- Make sure to run the following commands from the llamacpp-rayserve-graviton directory
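For reference, a Karpenter node pool pinned to Graviton might look like the minimal sketch below. This is an illustrative assumption (it uses the karpenter.sh/v1 API and assumes an existing EC2NodeClass named "default"), not the exact manifest used in this sample:

```yaml
# Minimal illustrative NodePool restricted to arm64 (Graviton); adjust names and limits to your cluster.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: graviton
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]              # Graviton instances are arm64
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws         # assumes an EC2NodeClass named "default" already exists
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "256"
```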
Deploy an elastic Ray service hosting the Llama 3.2 model on Graviton:
1. Edit the HUGGING_FACE_HUB_TOKEN environment variable in the secret section of ray-service-llamacpp.yaml to use your Hugging Face token.
2. Review the remaining environment variables in ray-service-llamacpp.yaml and adjust them if needed:
- MODEL_ID: Hugging Face model repository
- MODEL_FILENAME: Model file name in the Hugging Face repo
- N_THREADS: Number of threads for inference (recommended: match the host EC2 instance vCPU count)
- CMAKE_ARGS: C/C++ compile flags for llama.cpp on Graviton
Note: The example model uses the GGUF format, which is optimized for llama.cpp. See the GGUF documentation for details.
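For orientation, these variables typically sit in the Ray container spec of ray-service-llamacpp.yaml roughly as in the sketch below. The values and the secret name are illustrative assumptions; the manifest in this repository is authoritative:

```yaml
# Illustrative excerpt only; values are example assumptions, not the repository defaults.
env:
  - name: MODEL_ID
    value: "bartowski/Llama-3.2-1B-Instruct-GGUF"   # Hugging Face repo to pull the model from
  - name: MODEL_FILENAME
    value: "Llama-3.2-1B-Instruct-Q4_K_M.gguf"      # GGUF file inside that repo
  - name: N_THREADS
    value: "32"                                     # match the vCPU count of the Graviton instance
  - name: CMAKE_ARGS
    value: "-DCMAKE_C_FLAGS=-mcpu=native -DCMAKE_CXX_FLAGS=-mcpu=native"  # llama.cpp build flags for Graviton
  - name: HUGGING_FACE_HUB_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token                              # assumed Secret name; see the secret section of the manifest
        key: hf_token
```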
kubectl create -f ray-service-llamacpp.yaml
kubectl get svc
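Once the deployment is ready, the serve service used for port-forwarding below should be listed. You can also check the pods and the RayService status reported by the KubeRay operator:

```bash
kubectl get pods          # Ray head and worker pods should reach Running
kubectl get rayservice    # overall RayService status
```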
Our client program generates 20 different prompts and issues them at a different concurrency level in each run. Every run assembles common GenAI-related prompts into standard HTTP requests, and the number of concurrent calls keeps increasing until CPU usage on the worker pods reaches nearly 100%. We capture the total time from when an HTTP request is initiated to when the HTTP response is received as the latency metric of model performance, and the number of output tokens generated per second as the throughput metric. The test aims to reach maximum CPU utilization on the worker pods to assess concurrency performance.
Follow the steps below if you want to set this up and replicate the experiment:
Launch an EC2 instance as the test client in the same Availability Zone as the Ray cluster. To generate sufficient load, use a compute-optimized instance such as c6i.16xlarge. If worker node CPU utilization remains flat despite increasing concurrent requests, the test client is likely reaching its own capacity limits; in that case, launch additional client EC2 instances to generate higher concurrent load.
kubectl port-forward service/ray-service-llamacpp-serve-svc 8000:8000
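Before running the full benchmark, you can sanity-check the endpoint with a single request. The payload below is an illustrative example of the OpenAI-compatible chat completions format; field values such as the model name are assumptions and may need adjusting to the deployed app:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2",
        "messages": [{"role": "user", "content": "What is Amazon EKS?"}],
        "max_tokens": 128
      }'
```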
Install the Golang environment on the client EC2 instance (please refer to this for Golang installation guidance). Specify the following environment variables as the test configuration.
export URL=http://localhost:8000/v1/chat/completions
export REQUESTS_PER_PROMPT=<The_number_of_concurrent_calls>
export NUM_WARMUP_REQUESTS=<The_number_of_warmup_requests>
Run the performance test Go script; the results are reported in its output.
go run perf_benchmark.go
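To reproduce the increasing-concurrency sweep described above, one option is simply to rerun the script with a growing REQUESTS_PER_PROMPT until worker CPU utilization approaches 100%. The loop and values below are an illustrative sketch, not part of the repository:

```bash
export NUM_WARMUP_REQUESTS=5          # illustrative value
for c in 8 16 32 64 128; do           # illustrative concurrency levels
  export REQUESTS_PER_PROMPT=$c
  go run perf_benchmark.go
done
```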
Please contact wangaws@ or fmamazon@ if you want to know more and/or contribute.