Skip to content

Conversation

@atchernych
Copy link
Contributor

@atchernych atchernych commented Oct 23, 2025

Overview:

DEP-423

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Chores
    • Added new automated deployment workflow for GAIE integration with Dynamo runtime
    • Implements end-to-end build, deployment, and validation pipeline with automated cleanup processes

Signed-off-by: Anna Tchernych <[email protected]>
Signed-off-by: Anna Tchernych <[email protected]>
@atchernych atchernych requested a review from a team as a code owner October 23, 2025 20:13
@github-actions github-actions bot added the feat label Oct 23, 2025
Signed-off-by: Anna Tchernych <[email protected]>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 23, 2025

Walkthrough

A new GitHub Actions workflow file is introduced that automates GAIE deployment with Dynamo using vllm. The workflow orchestrates Docker image builds, Kubernetes cluster configuration, Helm chart installations, and coordinates multiple container registries (Azure ACR, AWS ECR, GitLab) alongside infrastructure deployment and cleanup operations.

Changes

Cohort / File(s) Summary
GitHub Actions Deployment Workflow
.github/workflows/gaie.yml
New workflow file that defines a comprehensive CI/CD pipeline triggered on push to dep-423-test and bi-monthly schedule. Orchestrates environment setup, multi-image Docker builds (Operator and vllm runtime), Kubernetes namespace creation, Helm chart deployments (Dynamo Operator, vllm backend, Gateway API, GAIE), GitLab image registry integration, deployment readiness validation, and full infrastructure cleanup with cluster-scoped resource removal.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

A single, dense workflow file containing multiple distinct deployment stages with intricate orchestration across Kubernetes, Helm, and multiple cloud registries. Review requires careful evaluation of credential handling (HF_TOKEN, ECR/ACR/GitLab authentication), image build and push logic, namespace and CRD management, and rollout validation sequences.

Poem

🐰 A workflow springs forth with purpose and grace,
Building images across every space—
From Azure to Lambda, through Kubernetes gates,
Dynamo and vllm orchestrate their fates!
With Helm charts and secrets in perfect array,
The rabbit deploys what's needed today.

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The PR description is largely incomplete despite having the correct template structure. The Overview section contains only "DEP-423" (a ticket reference) without explanation of its significance or the change context. The Details section lacks any description of what was actually changed, and the Where should the reviewer start section contains no file callouts or guidance for reviewers. The Related Issues section still contains the placeholder "#xxx" rather than an actual issue number. While the template structure is present, the substantive content required to understand the changeset is missing. Please complete the PR description by filling in: a meaningful Overview explaining the purpose and context of DEP-423, a Details section describing the changes made (the new GAIE deployment workflow), a Where should the reviewer start section highlighting that .github/workflows/gaie.yml is the key file to review, and the actual related GitHub issue number in the Related Issues section. This will help reviewers understand the scope and intent of the changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The PR title "feat: Create GAIE EPP Integration Test" is partially related to the changeset. The summary shows the workflow does include GAIE EPP deployment and integration testing components (readiness checks, pod status evaluation), which aligns with the title. However, the main focus of the changeset is a comprehensive deployment pipeline that encompasses building multiple container images, Kubernetes configuration, Helm chart installations, and orchestrating multiple cloud registries. The title emphasizes only the integration test aspect, which is one part of a broader deployment workflow, making it incomplete rather than misleading.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (8)
.github/workflows/gaie.yml (8)

189-189: Remove redundant sleep before kubectl wait.

Line 189 uses a hardcoded sleep 20 before calling kubectl wait on line 190. The kubectl wait command already polls with exponential backoff, making the sleep unnecessary and delaying job completion.

-          kubectl apply -n ${KUBE_NS} -f ${DEPLOYMENT_FILE}
-          sleep 20
-          kubectl wait --for=condition=ready pod -l "nvidia.com/dynamo-graph-deployment-name=${GRAPH_NAME}" -n ${KUBE_NS} --timeout=1000s
+          kubectl apply -n ${KUBE_NS} -f ${DEPLOYMENT_FILE}
+          kubectl wait --for=condition=ready pod -l "nvidia.com/dynamo-graph-deployment-name=${GRAPH_NAME}" -n ${KUBE_NS} --timeout=1000s

249-249: Hardcoded GCP image reference should use versioned environment variable.

Line 249 tags an image with a hardcoded GCP reference us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.5.1-dirty. The version suffix is pinned to v0.5.1-dirty, but INFERENCE_EXTENSION_VERSION is already defined (line 36) as v0.5.1. The -dirty suffix is unusual and may cause confusion.

Clarify the intent by either:

  1. Parameterizing the tag to match the environment variable, or
  2. Adding a clear comment explaining why -dirty is needed.
-         docker tag us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.5.1-dirty gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo/epp-inference-extension-dynamo:${{ github.run_id }}
+         # Note: v0.5.1-dirty reflects the GAIE development build from build-epp-dynamo.sh
+         docker tag us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:${INFERENCE_EXTENSION_VERSION}-dirty gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo/epp-inference-extension-dynamo:${{ github.run_id }}

51-59: Pin package versions in apt-get and external tool downloads.

Line 51 installs packages without version constraints: sudo apt-get install -y curl bash openssl gettext git jq. Additionally, lines 52–59 download tools (yq, Helm, kubectl) from upstream release pages without pinning exact versions. Future runs may pull incompatible versions, leading to silent failures or behavioral changes.

Pin versions explicitly:

-         sudo apt-get update && sudo apt-get install -y curl bash openssl gettext git jq
+         sudo apt-get update && sudo apt-get install -y curl=7.* bash=5.* openssl git jq
-         curl -L https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -o yq
+         curl -L https://github.com/mikefarah/yq/releases/download/v4.35.1/yq_linux_amd64 -o yq
-         KUBECTL_VER=$(curl -L -s https://dl.k8s.io/release/stable.txt)
+         KUBECTL_VER=v1.30.0  # or retrieve and cache the stable version

304-304: Complex jq query for pod status checking is fragile.

Line 304 uses a complex multiline jq query to detect crash conditions. If the pod status structure differs (e.g., due to Kubernetes version changes or missing fields), the query may produce unexpected results or fail silently.

Add explicit error handling and simplify the query:

# Simpler approach: check for specific conditions independently
if kubectl get pods -n ${NS} -l "${SELECTOR}" -o jsonpath='{.items[*].status.containerStatuses[*].state.waiting.reason}' | grep -qE "CrashLoopBackOff|ImagePullBackOff|ErrImagePull"; then
  echo "Detected pod failure states in qwen-epp" >&2
  exit 1
fi

Or use a defensive jq approach:

if kubectl get pods -n ${NS} -l "${SELECTOR}" -o json | jq -e '
  .items[] | 
  select(.status.containerStatuses != null) | 
  .status.containerStatuses[] | 
  select(.state.waiting != null and (.state.waiting.reason == "CrashLoopBackOff" or .state.waiting.reason == "ImagePullBackOff" or .state.waiting.reason == "ErrImagePull"))
' >/dev/null 2>&1; then
  echo "Detected pod crash or image pull failure" >&2
  exit 1
fi

230-236: Git clone lacks retry logic for network resilience.

Line 234 clones the GAIE repository without any retry mechanism. If the network is temporarily unavailable or the remote is slow, the entire workflow fails.

Add retry logic:

GAIE_CLONE_DIR="${{ github.workspace }}/external/gateway-api-inference-extension"
rm -rf "${GAIE_CLONE_DIR}"
mkdir -p "$(dirname "${GAIE_CLONE_DIR}")"

MAX_RETRIES=3
for i in $(seq 1 ${MAX_RETRIES}); do
  if git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension.git "${GAIE_CLONE_DIR}"; then
    break
  fi
  if [[ $i -lt ${MAX_RETRIES} ]]; then
    echo "Clone attempt $i failed, retrying..."
    sleep $((2 ** i))
  else
    echo "Failed to clone after ${MAX_RETRIES} attempts" >&2
    exit 1
  fi
done

cd "${GAIE_CLONE_DIR}"
git checkout ${INFERENCE_EXTENSION_VERSION}

336-350: Commented-out email notification should be removed or completed.

Lines 336–350 contain a commented-out email notification step with a TODO marker. This creates maintenance debt and suggests incomplete work.

Either:

  1. Remove the commented code if email notifications are not yet needed.
  2. Implement the feature properly if needed for failure alerting.
  3. Open a tracked issue if this is a known limitation (reference it in a comment).

For now, remove the block:

-      # - name: Email on failure
-      # TODO
-      #   if: failure()
-      #   continue-on-error: true
-      #   uses: dawidd6/action-send-mail@v3
-      #   with:
-      #     server_address: ${{ secrets.SMTP_SERVER }}
-      #     server_port: ${{ secrets.SMTP_PORT }}
-      #     username: ${{ secrets.SMTP_USER }}
-      #     password: ${{ secrets.SMTP_PASS }}
-      #     subject: GAIE scheduled run failed
-      #     to: [email protected]
-      #     from: ${{ secrets.SMTP_FROM }}
-      #     secure: true
-      #     html_body: |
-      #       <p>GAIE scheduled run failed.</p>
-      #       <p>Run: <a href="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}">details</a></p>

143-150: String sanitization inconsistency: consider using yq for all YAML-related operations.

Line 143 uses bash string substitution (PROFILE_SANITIZED="${PROFILE//_/-}") to sanitize the profile name for use in Kubernetes namespace naming. While this works, the workflow uses yq for YAML parsing elsewhere (lines 52–53, 186–187), suggesting a preference for dedicated tools. For consistency and robustness, consider standardizing the approach.

This is a minor point, but if additional sanitization is needed in the future, a dedicated tool or function would be clearer. For now, the current approach is acceptable; just document the intent:

# Sanitize profile name for Kubernetes namespace (replace underscores with hyphens)
PROFILE_SANITIZED="${PROFILE//_/-}"

7-13: Commented-out workflow triggers reduce flexibility.

Lines 7, 11–13 contain commented-out workflow_dispatch and pull_request triggers. This limits the ability to manually trigger the workflow or test it in pull requests.

Consider enabling these triggers to improve developer experience:

on:
  workflow_dispatch:  # Allow manual trigger
  push:
    branches:
      - dep-423-test
  pull_request:       # Allow testing in PRs
    paths:
      - '.github/workflows/gaie.yml'
      - 'deploy/cloud/**'
      - 'deploy/inference-gateway/**'
  schedule:
    - cron: "0 3 1,15 * *"

If there are reasons to keep these commented out (e.g., preventing accidental runs), document them clearly.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8de469c and a0aabf0.

📒 Files selected for processing (1)
  • .github/workflows/gaie.yml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/gaie.yml

27-27: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


95-95: name is required in action metadata "/home/jailuser/git/.github/actions/docker-tag-push/action.yml"

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: Mirror Repository to GitLab
🔇 Additional comments (2)
.github/workflows/gaie.yml (2)

27-27: Verify runner label is configured for self-hosted runners.

The custom runner label cpu-amd-m5-2xlarge is not a standard GitHub runner label. If this label is not configured in the repository's self-hosted runner setup, the workflow will hang indefinitely waiting for a matching runner.

Confirm that this runner is properly configured by checking the repository's runner settings, or document the setup requirement clearly in the repository's CI/CD documentation. Alternatively, if this should work with standard runners, use one from the GitHub runner list.


95-95: Ensure custom action has proper action metadata.

The custom action ./.github/actions/docker-tag-push is missing required metadata per the actionlint hint. Custom actions must define an action.yml (or action.yaml) file with at minimum a name field.

Verify that .github/actions/docker-tag-push/action.yml exists and includes:

name: 'Docker Tag and Push'
description: 'Tag and push Docker image to registries'
# ... inputs, outputs, runs, etc.

If the file is missing, the workflow will fail at runtime.

Signed-off-by: Anna Tchernych <[email protected]>
Signed-off-by: Anna Tchernych <[email protected]>
Signed-off-by: Anna Tchernych <[email protected]>
Signed-off-by: Anna Tchernych <[email protected]>
Signed-off-by: Anna Tchernych <[email protected]>
Signed-off-by: Anna Tchernych <[email protected]>
Signed-off-by: Anna Tchernych <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant