diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/README.md b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/README.md new file mode 100644 index 0000000..e0b977d --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/README.md @@ -0,0 +1,533 @@ +# DeepSeek R1/V3 Multi-host Inference on TPU v6e with JetStream, MaxText and Pathways on Cloud with GKE Cluster + +This recipe outlines the steps to benchmark [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) or [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) 671B model using [JetStream](https://github.com/AI-Hypercomputer/JetStream/tree/main) \+ [MaxText](https://github.com/AI-Hypercomputer/maxtext) inference engine deployed on a GKE cluster with multi-host [TPU v6e slices](https://cloud.google.com/kubernetes-engine) utilizing [Pathways on Cloud](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro). + +* [Jetstream](https://github.com/AI-Hypercomputer/JetStream) is a throughput and memory-optimized engine for LLM inference on XLA devices, primarily TPUs written in JAX. +* [MaxText](https://github.com/AI-Hypercomputer/maxtext) is an open-source LLM project by Google, written in JAX and designed to be highly performant and scalable, running efficiently on Google Cloud TPUs and GPUs. +* [Pathways](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro) is a system that simplifies large-scale ML computations by enabling a single JAX client to orchestrate workloads across multiple large TPU slices, spanning thousands of TPU chips. +* [TPUs](https://cloud.google.com/tpu/docs/v6e) are Google's custom-developed accelerator for ML and AI models built using frameworks such as TensorFlow, PyTorch, and JAX. TPU v6e is Cloud TPU's latest generation AI accelerator. + +## Outline + +1. [Ensure prerequisites are met.](#prerequisites) +2. 
[Setup development environment.](#setup-your-local-environment) +3. [Provision a GKE Cluster with TPU v6e and CPU nodepools](#create-gke-cluster-with-tpu-v6e-nodepool-using-xpk) +4. [Configure service account for access](#configure-a-service-account-for-access) +5. [Create container image with dependencies](#build-jetstreammaxtext-container-image-to-deploy-the-workload) +6. [Checkpoint conversion](#checkpoint-conversion) + - Download model weights from HuggingFace + - Convert Hugging Face checkpoint from FP8 to BF16 + - Convert Hugging Face BF16 checkpoint to MaxText compatible checkpoint +7. [Deploy JetStream and Pathways](#deploy-jetstream-and-pathways) +8. [Run MMLU benchmark](#run-mmlu-benchmark) + +## Prerequisites + +1. Verify that your project has enough quota in your region of choice for: + * A Cloud TPU slice, for example v6e-64 (`TPUS_PER_TPU_FAMILY`) + * Compute Engine API quota for M1 machine configuration for 160 chips (`M1_CPUS`) +2. Required IAM Permissions + Make sure that you have the following roles on the project: + * Compute Admin (`roles/compute.admin`) + * Kubernetes Engine Admin (`roles/container.admin`) + * Storage Admin (`roles/storage.admin`) + * Logging Admin (`roles/logging.admin`) + * Monitoring Admin (`roles/monitoring.admin`) + * Artifact Registry Writer (`roles/artifactregistry.writer`) + * Service Account Admin (`roles/iam.serviceAccountAdmin`) + * Project IAM Admin (`roles/resourcemanager.projectIamAdmin`) +3. Access to Pathways Container Images. + * You run a Pathways cluster on GKE in one of the [Pathways container images](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro#pathways-components). +4. Access to DeepSeek models on Hugging Face. + To access the [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) or [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) model through Hugging Face, you'll need a Hugging Face token. 
Follow these steps to generate a new token if you don't have one already: + * Create a [Hugging Face account](https://huggingface.co/), if you don't already have one. + * Click **Your Profile \> Settings \> Access Tokens**. + * Select **New Token**. + * Specify a Name and a Role of at least Read. + * Select **Generate a token**. + * Copy the generated token to your clipboard. + +## Setup your local environment + +We recommend running this recipe from [Cloud Shell](https://console.cloud.google.com/?cloudshell=true) or a client workstation with the following pre-installed: + +* [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) +* [Helm](https://helm.sh/docs/intro/install/) +* [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/) + +Install the [xpk](https://github.com/AI-Hypercomputer/xpk) toolkit, which lets you create pre-configured GKE clusters that support Pathways-based workloads: + +``` bash +git clone https://github.com/AI-Hypercomputer/xpk.git ~/xpk +cd ~/xpk +make install && export PATH=$PATH:$PWD/bin +``` + +### Clone the recipe + +From your client, clone the [`tpu-recipes`](https://github.com/AI-Hypercomputer/tpu-recipes) repository and set a reference to the recipe folder.
+ +``` bash +git clone https://github.com/ai-hypercomputer/tpu-recipes.git +cd tpu-recipes +export REPO_ROOT=`git rev-parse --show-toplevel` +export RECIPE_ROOT=$REPO_ROOT/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B +``` + +### Configure environment settings + +Define the following environment variables with values appropriate to your workload: + +``` bash +# Required variables to be set +export PROJECT_ID= +export REGION= +export CLUSTER_NAME= +export CLUSTER_ZONE= +export GCS_BUCKET= +export TPU_RESERVATION= + +# Required variables with default values +export TPU_TYPE=v6e-64 +export NUM_SLICES=1 +export CLUSTER_CPU_MACHINE_TYPE=n2d-standard-32 +export CLUSTER_CKPT_NODEPOOL_NAME=ckpt-conversion-node-pool-0 +export CLUSTER_CKPT_NODE_MACHINE_TYPE=m1-ultramem-160 +export CLUSTER_CKPT_NODE_REGION=us-east4 +export CLUSTER_CKPT_NODE_DISK_SIZE=3000 +export CLUSTER_CKPT_NUM_NODES=1 +export ARTIFACT_REGISTRY_REPO_NAME=jetstream-maxtext-ar +export ARTIFACT_REGISTRY=${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_REGISTRY_REPO_NAME} +export JETSTREAM_MAXTEXT_IMAGE=jetstream-maxtext +export JETSTREAM_MAXTEXT_VERSION=latest +export HF_MODEL_NAME="deepseek-ai/DeepSeek-R1" +export MODEL_NAME=deepseek3-671b +export GCS_CKPT_PATH_BF16=gs://${GCS_BUCKET}/models/${MODEL_NAME}/bf16 +export GCS_CKPT_PATH_UNSCANNED=gs://${GCS_BUCKET}/models/${MODEL_NAME}/unscanned +``` + +The following required variables must be set: + +- `PROJECT_ID`: your Google Cloud project ID +- `REGION`: the region where you want to run Cloud Build +- `CLUSTER_NAME`: the name of your GKE cluster +- `CLUSTER_ZONE`: the zone where your cluster is located +- `GCS_BUCKET`: the name of your Cloud Storage bucket. Do not include the gs:// prefix +- `TPU_RESERVATION`: the name of the TPU reservation + +The following required variables have default values already set: + +- `TPU_TYPE`: TPU accelerator type supported by TPU v6e. Refer to the [supported list](https://cloud.google.com/tpu/docs/v6e#configurations).
- `NUM_SLICES`: The number of slices to use +- `CLUSTER_CPU_MACHINE_TYPE`: The CPU nodepool machine type +- `CLUSTER_CKPT_NODEPOOL_NAME`: The name of the CPU nodepool used for checkpoint conversion +- `CLUSTER_CKPT_NODE_MACHINE_TYPE`: The machine type of the CPU nodepool used for checkpoint conversion +- `CLUSTER_CKPT_NODE_REGION`: The region where the checkpoint-conversion Cloud Batch job runs +- `CLUSTER_CKPT_NODE_DISK_SIZE`: The disk size of the CPU nodepool used for checkpoint conversion. For this recipe, a minimum of 3TB is suggested. +- `CLUSTER_CKPT_NUM_NODES`: The number of nodes used for checkpoint conversion +- `ARTIFACT_REGISTRY_REPO_NAME`: the name of your Artifact Registry repository +- `ARTIFACT_REGISTRY`: the full name of your Artifact Registry in the following format: *LOCATION*\-docker.pkg.dev/*PROJECT\_ID*/*REPOSITORY* +- `JETSTREAM_MAXTEXT_IMAGE`: the name of the JetStream MaxText image +- `JETSTREAM_MAXTEXT_VERSION`: the version of the JetStream MaxText image + +Set the default project: + +``` bash +gcloud config set project $PROJECT_ID +``` + +## Create GKE Cluster with TPU v6e nodepool using xpk + +Use a custom network for better performance and to avoid overloading the default network. Refer to the [network performance optimizations](https://cloud.google.com/tpu/docs/v6e-intro/#network_performance_optimizations) for more details. + +``` bash +export NETWORK_NAME_1=${CLUSTER_NAME}-mtu9k-1 +export NETWORK_FW_NAME_1=${NETWORK_NAME_1}-fw-1 + +# Use a custom network for better performance and to avoid overloading the default network. +gcloud compute networks create ${NETWORK_NAME_1} --mtu=8896 --project=${PROJECT_ID} --subnet-mode=auto --bgp-routing-mode=regional +gcloud compute firewall-rules create ${NETWORK_FW_NAME_1} --network ${NETWORK_NAME_1} --allow tcp,icmp,udp --project=${PROJECT_ID} + +# Secondary subnet for the multi-NIC setup. Requires custom IP routing distinct from the first network's subnet.
+export NETWORK_NAME_2=${CLUSTER_NAME}-privatenetwork-4 +export SUBNET_NAME_2=${CLUSTER_NAME}-privatesubnet-4 +export FIREWALL_RULE_NAME=${CLUSTER_NAME}-privatefirewall-4 +export ROUTER_NAME=${CLUSTER_NAME}-network-4 +export NAT_CONFIG=${CLUSTER_NAME}-natconfig-4 + +# Create networks +gcloud compute networks create "${NETWORK_NAME_2}" --mtu=8896 --bgp-routing-mode=regional --subnet-mode=custom --project=${PROJECT_ID} + +# Create subnets +gcloud compute networks subnets create "${SUBNET_NAME_2}" --network="${NETWORK_NAME_2}" --range=10.10.0.0/18 --region="${REGION}" --project=${PROJECT_ID} + +# Create Firewall rules +gcloud compute firewall-rules create "${FIREWALL_RULE_NAME}" --network "${NETWORK_NAME_2}" --allow tcp,icmp,udp --project="${PROJECT_ID}" + +# Create router +gcloud compute routers create "${ROUTER_NAME}" \ + --project="${PROJECT_ID}" \ + --network="${NETWORK_NAME_2}" \ + --region="${REGION}" + +# Create NAT +gcloud compute routers nats create "${NAT_CONFIG}" \ + --router="${ROUTER_NAME}" \ + --region="${REGION}" \ + --auto-allocate-nat-external-ips \ + --nat-all-subnet-ip-ranges \ + --project="${PROJECT_ID}" \ + --enable-logging +``` + +Create GKE cluster using xpk toolkit with custom network and TPU v6e nodepool + +``` bash +export CLUSTER_ARGUMENTS="--enable-dataplane-v2 --enable-ip-alias --enable-multi-networking --network=${NETWORK_NAME_1} --subnetwork=${NETWORK_NAME_1} --scopes cloud-platform" + +export NODE_POOL_ARGUMENTS="--additional-node-network network=${NETWORK_NAME_2},subnetwork=${SUBNET_NAME_2} --scopes cloud-platform --workload-metadata=GCE_METADATA --placement-type=COMPACT" + +python3 ~/xpk/xpk.py cluster create \ + --cluster $CLUSTER_NAME \ + --default-pool-cpu-machine-type=$CLUSTER_CPU_MACHINE_TYPE \ + --num-slices=$NUM_SLICES \ + --tpu-type=$TPU_TYPE \ + --zone=${CLUSTER_ZONE} \ + --project=${PROJECT_ID} \ + --reservation=${TPU_RESERVATION} \ + --custom-cluster-arguments="${CLUSTER_ARGUMENTS}" \ + 
--custom-nodepool-arguments="${NODE_POOL_ARGUMENTS}" +``` + +## Create a Cloud Storage bucket to store checkpoints and temporary files + +Create a Cloud Storage bucket to store the model checkpoint and Pathways temporary files, such as the compilation cache. It's recommended to create the bucket in the same region as the TPU nodepool. + +``` bash +gcloud storage buckets create gs://$GCS_BUCKET --location=$REGION +``` + +## Configure a service account for access + +Configure a Kubernetes service account to act as an IAM service account. + +* Create an IAM service account for your application: + +``` bash +gcloud iam service-accounts create jetstream-pathways +``` + +* Add IAM policy bindings so that your IAM service account can manage Cloud Storage. These grant access to the storage bucket where your checkpoint will be stored: + +``` bash +gcloud projects add-iam-policy-binding ${PROJECT_ID} \ + --member "serviceAccount:jetstream-pathways@${PROJECT_ID}.iam.gserviceaccount.com" \ + --role roles/storage.objectUser + +gcloud projects add-iam-policy-binding ${PROJECT_ID} \ + --member "serviceAccount:jetstream-pathways@${PROJECT_ID}.iam.gserviceaccount.com" \ + --role roles/storage.insightsCollectorService +``` + +* Annotate the Kubernetes service account with the email address of the IAM service account.
+ +``` bash +kubectl annotate serviceaccount default \ +iam.gke.io/gcp-service-account=jetstream-pathways@${PROJECT_ID}.iam.gserviceaccount.com +``` + +## Build JetStream/MaxText container image to deploy the workload + +### Create Artifact Registry repository to store Docker images + +``` bash +gcloud artifacts repositories create ${ARTIFACT_REGISTRY_REPO_NAME} \ + --repository-format=docker \ + --location=${REGION} \ + --description="Repository for JetStream/MaxText container images" \ + --project=${PROJECT_ID} +``` + +### Configure Docker to authenticate to Artifact Registry + +[Configure Docker](https://cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) to authenticate to Artifact Registry so that you can pull the allowlisted Pathways images: + +``` bash +gcloud auth configure-docker ${REGION}-docker.pkg.dev +``` + +### Build and push the Docker container image to Artifact Registry + +To build and push the container image, submit a Cloud Build job by running the following command from your client: + +``` bash +cd $RECIPE_ROOT/docker +gcloud builds submit \ + --project=${PROJECT_ID} \ + --region=${REGION} \ + --config cloudbuild.yml \ + --substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY,_JETSTREAM_MAXTEXT_IMAGE=$JETSTREAM_MAXTEXT_IMAGE,_JETSTREAM_MAXTEXT_VERSION=$JETSTREAM_MAXTEXT_VERSION \ + --timeout "2h" \ + --machine-type=e2-highcpu-32 \ + --disk-size=1000 \ + --quiet \ + --async +``` + +This command outputs the build ID. You can monitor the build progress by streaming its logs. To do this, run the following command with `BUILD_ID` set to your build ID: + +``` bash +BUILD_ID= +gcloud beta builds log $BUILD_ID --region=$REGION +``` + +## Checkpoint conversion + +This step requires an `m1-ultramem-160` (memory-optimized) machine with 3TB of storage. The recipe uses a [Cloud Batch job](https://cloud.google.com/batch/docs/get-started) to run the conversion.
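As a rough cross-check on these requirements, the disk footprint of the two checkpoint copies can be estimated from the parameter count. This is a back-of-envelope sketch, assuming roughly 671 billion parameters at 1 byte each for FP8 and 2 bytes each for BF16, and ignoring tokenizer and metadata files:

``` bash
# Rough sizing of the conversion workspace (assumption: ~671B parameters).
PARAMS_B=671                     # parameter count, in billions
FP8_GB=$((PARAMS_B * 1))         # FP8 source checkpoint: ~1 byte per parameter
BF16_GB=$((PARAMS_B * 2))        # BF16 intermediate copy: ~2 bytes per parameter
TOTAL_GB=$((FP8_GB + BF16_GB))   # both copies sit on disk during conversion
echo "fp8=~${FP8_GB}GB bf16=~${BF16_GB}GB total=~${TOTAL_GB}GB"
```

That comes to roughly 2TB for the two copies combined, which is why the recipe suggests at least a 3TB data disk and a memory-optimized host for the conversion.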
The job performs the following steps: +- Downloads the DeepSeek-V3 or DeepSeek-R1 weights (selected by `HF_MODEL_NAME`) from Hugging Face. +- Converts the Hugging Face checkpoint weights from FP8 to BF16. +- Converts the BF16 weights to a MaxText-compatible format (unscanned checkpoint) for efficient serving. + +Submit the Cloud Batch job. This step can take more than 2 hours. + +``` bash +cd $RECIPE_ROOT/prepare-model +gcloud batch jobs submit convert-ckpt-to-unscanned-$(date +%Y%m%d-%H%M%S) \ + --project ${PROJECT_ID} \ + --location ${CLUSTER_CKPT_NODE_REGION} \ + --config - < Router -> Firewall Rule -> Subnet -> Network + +echo "Deleting NAT Gateway: ${NAT_CONFIG}..." +gcloud compute routers nats delete ${NAT_CONFIG} \ + --router=${ROUTER_NAME} \ + --region=${REGION} \ + --project=${PROJECT_ID} \ + --quiet || echo "Warning: Failed to delete NAT ${NAT_CONFIG} (may already be deleted or dependencies exist)." + +echo "Deleting Router: ${ROUTER_NAME}..." +gcloud compute routers delete ${ROUTER_NAME} \ + --region=${REGION} \ + --project=${PROJECT_ID} \ + --quiet || echo "Warning: Failed to delete Router ${ROUTER_NAME} (may already be deleted or dependencies exist)." + +echo "Deleting Firewall Rule: ${FIREWALL_RULE_NAME}..." +gcloud compute firewall-rules delete ${FIREWALL_RULE_NAME} \ + --project=${PROJECT_ID} \ + --quiet || echo "Warning: Failed to delete Firewall Rule ${FIREWALL_RULE_NAME} (may already be deleted)." + +echo "Deleting Subnet: ${SUBNET_NAME_2}..." +gcloud compute networks subnets delete ${SUBNET_NAME_2} \ + --region=${REGION} \ + --project=${PROJECT_ID} \ + --quiet || echo "Warning: Failed to delete Subnet ${SUBNET_NAME_2} (may already be deleted or dependencies exist)." + +echo "Deleting Network: ${NETWORK_NAME_2}..." +gcloud compute networks delete ${NETWORK_NAME_2} \ + --project=${PROJECT_ID} \ + --quiet || echo "Warning: Failed to delete Network ${NETWORK_NAME_2} (may already be deleted or dependencies exist)."
+ +# --- Delete Resources for Network 1 (${NETWORK_NAME_1}) --- +# Order: Firewall Rule -> Network (Auto-created subnets are deleted with the network if empty) + +echo "Deleting Firewall Rule: ${NETWORK_FW_NAME_1}..." +gcloud compute firewall-rules delete ${NETWORK_FW_NAME_1} \ + --project=${PROJECT_ID} \ + --quiet || echo "Warning: Failed to delete Firewall Rule ${NETWORK_FW_NAME_1} (may already be deleted)." + +echo "Deleting Network: ${NETWORK_NAME_1}..." +gcloud compute networks delete ${NETWORK_NAME_1} \ + --project=${PROJECT_ID} \ + --quiet || echo "Warning: Failed to delete Network ${NETWORK_NAME_1} (may already be deleted or dependencies exist)." +``` \ No newline at end of file diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/docker/Dockerfile b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/docker/Dockerfile new file mode 100644 index 0000000..9efcb66 --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/docker/Dockerfile @@ -0,0 +1,59 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +FROM ubuntu:22.04 + +ENV DEBIAN_FRONTEND=noninteractive + +# Install dependencies +RUN apt -y update && apt install -y --no-install-recommends \ + apt-transport-https ca-certificates gnupg git wget \ + python3.10 python3-pip curl nano vim + +RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 + +# Install google cloud sdk +RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \ + | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \ + && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \ + | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg \ + && apt-get update -y \ + && apt-get install google-cloud-sdk -y + +# Install pip +RUN python3 -m pip install --upgrade pip + +RUN pip install "huggingface_hub[cli]" hf_transfer + +# Set environment variables +ENV JAX_PLATFORMS=proxy +ENV JAX_BACKEND_TARGET=grpc://localhost:38681 +ENV XCLOUD_ENVIRONMENT=GCP + +# Install JetStream and MaxText + +RUN git clone https://github.com/AI-Hypercomputer/JetStream.git && \ +git clone https://github.com/AI-Hypercomputer/maxtext.git && \ +git clone https://github.com/google/aqt.git + +RUN cd /maxtext && bash setup.sh && pip install torch --index-url https://download.pytorch.org/whl/cpu + +RUN pip install safetensors setuptools fastapi uvicorn rouge_score scikit-learn + +RUN cd /JetStream && pip install -e . 
+ +RUN apt -y update && apt-get -y install python3-dev && apt-get -y install build-essential +RUN cp -r /aqt/aqt/* /usr/local/lib/python3.10/dist-packages/aqt/ + +ENTRYPOINT [ "/bin/bash" ] \ No newline at end of file diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/docker/cloudbuild.yml b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/docker/cloudbuild.yml new file mode 100644 index 0000000..7a4c060 --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/docker/cloudbuild.yml @@ -0,0 +1,25 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +steps: +- name: 'gcr.io/cloud-builders/docker' + args: + - 'build' + - '--tag=${_ARTIFACT_REGISTRY}/${_JETSTREAM_MAXTEXT_IMAGE}:${_JETSTREAM_MAXTEXT_VERSION}' + - '--file=Dockerfile' + - '.' 
+ automapSubstitutions: true + +images: +- ${_ARTIFACT_REGISTRY}/${_JETSTREAM_MAXTEXT_IMAGE}:${_JETSTREAM_MAXTEXT_VERSION} \ No newline at end of file diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/pathways.png b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/pathways.png new file mode 100644 index 0000000..720cba0 Binary files /dev/null and b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/pathways.png differ diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/prepare-model/batch_job.yaml b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/prepare-model/batch_job.yaml new file mode 100644 index 0000000..57cc44b --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/prepare-model/batch_job.yaml @@ -0,0 +1,47 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +taskGroups: + - taskSpec: + runnables: + - container: + imageUri: ${ARTIFACT_REGISTRY}/${JETSTREAM_MAXTEXT_IMAGE}:${JETSTREAM_MAXTEXT_VERSION} + entrypoint: "/bin/sh" + commands: + - "-c" + - mkdir -p /mnt/disks/persist/models/ && echo "Downloading model ${HF_MODEL_NAME}" && huggingface-cli download ${HF_MODEL_NAME} --local-dir /mnt/disks/persist/models/fp8 && cd /maxtext && echo "Converting checkpoint from fp8 to bf16" && python3 -m MaxText.deepseek_fp8_to_bf16 --input-fp8-hf-path /mnt/disks/persist/models/fp8 --output-bf16-hf-path /mnt/disks/persist/models/bf16 --cache-file-num 16 && echo "Converting checkpoint from bf16 to maxtext/unscanned format" && JAX_PLATFORMS='' python3 -m MaxText.convert_deepseek_unscanned_ckpt --base_model_path /mnt/disks/persist/models/bf16 --maxtext_model_path ${GCS_CKPT_PATH_UNSCANNED} --model_size $MODEL_NAME --use-zarr3 false --use-ocdbt false && echo "Completed checkpoint conversion. Unscanned checkpoint saved at ${GCS_CKPT_PATH_UNSCANNED}" + volumes: + - deviceName: persist + mountPath: /mnt/disks/persist + mountOptions: rw,async + computeResource: + cpuMilli: 160000 + memoryMib: 3936256 +# Define the allocation policy for provisioning VMs +allocationPolicy: + location: + allowedLocations: ["regions/${CLUSTER_CKPT_NODE_REGION}"] + instances: + - policy: + machineType: ${CLUSTER_CKPT_NODE_MACHINE_TYPE} + bootDisk: + type: pd-ssd + sizeGb: 500 + disks: + newDisk: + sizeGb: 3000 + type: pd-ssd + deviceName: persist +logsPolicy: + destination: CLOUD_LOGGING \ No newline at end of file diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/Chart.yaml b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/Chart.yaml new file mode 100644 index 0000000..0d98d5e --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file 
except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: trillium-pathways-jetstream-maxtext-serve-model +description: trillium-pathways-jetstream-maxtext-serve-model +type: application +version: 0.1.0 +appVersion: "1.16.0" \ No newline at end of file diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-configmap.yaml b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-configmap.yaml new file mode 100644 index 0000000..b5c475e --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-configmap.yaml @@ -0,0 +1,23 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}" +data: + maxtext-configuration.yaml: |- + {{- range $key, $value := .Values.maxtext_config }} + {{ $key }}: {{ $value }} + {{- end }} diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-launcher.yaml b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-launcher.yaml new file mode 100644 index 0000000..582f8d1 --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-launcher.yaml @@ -0,0 +1,197 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{- $root := . 
}} + +apiVersion: leaderworkerset.x-k8s.io/v1 +kind: LeaderWorkerSet +metadata: + name: {{ .Release.Name }} + annotations: + leaderworkerset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool +spec: + replicas: 1 + leaderWorkerTemplate: + leaderTemplate: + metadata: + labels: + role: leader + app: {{ .Release.Name }} + spec: + nodeSelector: + cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice + cloud.google.com/gke-tpu-topology: 8x8 + tolerations: + - key: "google.com/tpu" + operator: "Exists" + effect: "NoSchedule" + volumes: + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + - name: workload-configuration + configMap: + name: "{{.Release.Name}}" + containers: + - name: pathways-proxy + image: "{{ .Values.job.pathways_proxy_image.repository }}:{{ .Values.job.pathways_proxy_image.tag }}" + args: + - --resource_manager_address=$(LWS_LEADER_ADDRESS):38677 + - --server_port=38681 + {{- with (index .Values.volumes.gcsMounts 0) }} + - --gcs_scratch_location=gs://{{ .bucketName }}/tmp + {{- end }} + imagePullPolicy: Always + ports: + - containerPort: 38681 + + - name: pathways-rm + env: + - name: HOST_ADDRESS + value: "$(LWS_LEADER_ADDRESS)" + - name: TPU_SKIP_MDS_QUERY + value: "true" + image: "{{ .Values.job.pathways_rm_image.repository }}:{{ .Values.job.pathways_rm_image.tag }}" + args: + - --server_port=38677 + - --node_type=resource_manager + - --instance_count=1 + - --instance_type=tpuv6e:8x8 + {{- with (index .Values.volumes.gcsMounts 0) }} + - --gcs_scratch_location=gs://{{ .bucketName }}/tmp + {{- end }} + imagePullPolicy: Always + ports: + - containerPort: 38677 + + - name: jax-tpu + image: "{{ .Values.job.jax_tpu_image.repository }}:{{ .Values.job.jax_tpu_image.tag }}" + imagePullPolicy: Always + env: + - name: ENABLE_PATHWAYS_PERSISTENCE + value: "1" + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: "{{ .Values.huggingface.secretName 
}}" + key: "{{ .Values.huggingface.secretData.token }}" + workingDir: /workspace + command: ["/bin/bash", "-c"] + args: + - | + set -eux + # Parse server configurations from values file + echo "MaxText configuration file:" + sed 's/^/| /' /etc/workload-configuration/maxtext-configuration.yaml + echo "" + + OPTIONS=() + while IFS= read -r line || [[ -n "$line" ]]; do + # Skip empty lines and comments + [[ -z "$line" || "$line" =~ ^[[:space:]]*# ]] && continue + + key=$(echo "$line" | cut -d':' -f1 | tr -d '[:space:]') + value=$(echo "$line" | cut -d':' -f2- | sed 's/^[[:space:]]*//') + + # Handle environment variable expansion + if [[ "$value" == \$* ]]; then + var_name=${value#\$} + + if [[ -z "$var_name" ]]; then + expanded_value="$" + else + expanded_value="${!var_name:-$value}" + fi + + OPTIONS+=("$key=$expanded_value") + else + OPTIONS+=("$key=$value") + fi + done < /etc/workload-configuration/maxtext-configuration.yaml + + echo "===== MaxText Configuration =====" + echo "${OPTIONS[@]}" + + cd /maxtext + python3 -m MaxText.maxengine_server \ + /maxtext/MaxText/configs/base.yml \ + "${OPTIONS[@]}" + + ports: + - containerPort: {{ .Values.jetstream.service.ports.grpc }} + startupProbe: + httpGet: + path: /healthcheck + port: {{ .Values.jetstream.service.ports.http }} + scheme: HTTP + periodSeconds: 1 + initialDelaySeconds: 600 + failureThreshold: 10000 + livenessProbe: + httpGet: + path: /healthcheck + port: {{ .Values.jetstream.service.ports.http }} + scheme: HTTP + periodSeconds: 60 + failureThreshold: 10 + readinessProbe: + httpGet: + path: /healthcheck + port: {{ .Values.jetstream.service.ports.http }} + scheme: HTTP + periodSeconds: 60 + failureThreshold: 10 + volumeMounts: + - name: shared-memory + mountPath: /dev/shm + - name: workload-configuration + mountPath: /etc/workload-configuration + - name: local-ssd + mountPath: {{ .Values.volumes.ssdMountPath }} + + - name: jetstream-http + image: "{{ .Values.job.jetstream_http_image.repository }}:{{ 
.Values.job.jetstream_http_image.tag }}" + imagePullPolicy: Always + ports: + - containerPort: 8000 + size: 17 + + workerTemplate: + spec: + nodeSelector: + cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice + cloud.google.com/gke-tpu-topology: 8x8 + tolerations: + - key: "google.com/tpu" + operator: "Exists" + effect: "NoSchedule" + containers: + - name: worker + args: + - --server_port=38679 + - --resource_manager_address=$(LWS_LEADER_ADDRESS):38677 + {{- with (index .Values.volumes.gcsMounts 0) }} + - --gcs_scratch_location=gs://{{ .bucketName }}/tmp + {{- end }} + image: "{{ .Values.job.pathways_rm_image.repository }}:{{ .Values.job.pathways_rm_image.tag }}" + imagePullPolicy: Always + ports: + - containerPort: 38679 + resources: + limits: + google.com/tpu: "4" diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-svc.yaml b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-svc.yaml new file mode 100644 index 0000000..7f4579f --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/serve-model/templates/model-serve-svc.yaml @@ -0,0 +1,26 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: jetstream-svc +spec: + selector: + app: jetstream-pathways + ports: + - protocol: TCP + name: jetstream-http + port: {{ .Values.jetstream.service.ports.http }} + targetPort: {{ .Values.jetstream.service.ports.http }} \ No newline at end of file diff --git a/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/values.yaml b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/values.yaml new file mode 100644 index 0000000..eae79e5 --- /dev/null +++ b/inference/trillium/JetStream-Maxtext/DeepSeek-R1-671B/values.yaml @@ -0,0 +1,82 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +clusterName: + +huggingface: + secretName: hf-secret + secretData: + token: "hf_api_token" + +model: + name: &model-name deepseek3-671b + hf_model_name: &hf-model-name deepseek-ai/DeepSeek-R1 + +job: + jax_tpu_image: + repository: + tag: + jetstream_http_image: + repository: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http + tag: v0.2.3 + pathways_proxy_image: + repository: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server + tag: latest + pathways_rm_image: + repository: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server + tag: latest + + +volumes: + ssdMountPath: "/ssd" + gcsMounts: + - bucketName: + mountPath: "/gcs" + +jetstream: + service: + ports: + http: 8000 + grpc: 9000 + +convert_hf_ckpt: true + +maxtext_config: + allow_split_physical_axes: true + tokenizer_type: huggingface + hf_access_token: $HF_TOKEN + tokenizer_path: *hf-model-name + model_name: *model-name + use_chat_template: false + load_parameters_path: + max_prefill_predict_length: 1024 + max_target_length: 1536 + async_checkpointing: false + steps: 1 + ici_fsdp_parallelism: 1 + ici_autoregressive_parallelism: 1 + ici_expert_parallelism: 1 + ici_tensor_parallelism: 64 + scan_layers: false + weight_dtype: bfloat16 + per_device_batch_size: 1 + enable_single_controller: true + megablox: false + sparse_matmul: false + capacity_factor: -1.0 + attention: "dot_product" + quantize_kvcache: true + kv_quant_dtype: int8 + enable_model_warmup: true \ No newline at end of file