feat: Create GAIE EPP Integration Test #3852

atchernych · 2025-10-23T20:13:29Z

Overview:

DEP-423

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

Chores
- Added new automated deployment workflow for GAIE integration with Dynamo runtime
- Implements end-to-end build, deployment, and validation pipeline with automated cleanup processes

Signed-off-by: Anna Tchernych <[email protected]>

coderabbitai · 2025-10-23T20:16:57Z

Walkthrough

A new GitHub Actions workflow file is introduced that automates GAIE deployment with Dynamo using vllm. The workflow orchestrates Docker image builds, Kubernetes cluster configuration, Helm chart installations, and coordinates multiple container registries (Azure ACR, AWS ECR, GitLab) alongside infrastructure deployment and cleanup operations.

Changes

Cohort / File(s)	Summary
GitHub Actions Deployment Workflow `.github/workflows/gaie.yml`	New workflow file that defines a comprehensive CI/CD pipeline triggered on push to `dep-423-test` and bi-monthly schedule. Orchestrates environment setup, multi-image Docker builds (Operator and vllm runtime), Kubernetes namespace creation, Helm chart deployments (Dynamo Operator, vllm backend, Gateway API, GAIE), GitLab image registry integration, deployment readiness validation, and full infrastructure cleanup with cluster-scoped resource removal.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

A single, dense workflow file containing multiple distinct deployment stages with intricate orchestration across Kubernetes, Helm, and multiple cloud registries. Review requires careful evaluation of credential handling (HF_TOKEN, ECR/ACR/GitLab authentication), image build and push logic, namespace and CRD management, and rollout validation sequences.

Poem

🐰 A workflow springs forth with purpose and grace,
Building images across every space—
From Azure to Lambda, through Kubernetes gates,
Dynamo and vllm orchestrate their fates!
With Helm charts and secrets in perfect array,
The rabbit deploys what's needed today.

Pre-merge checks

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The PR description is largely incomplete despite having the correct template structure. The Overview section contains only "DEP-423" (a ticket reference) without explanation of its significance or the change context. The Details section lacks any description of what was actually changed, and the Where should the reviewer start section contains no file callouts or guidance for reviewers. The Related Issues section still contains the placeholder "#xxx" rather than an actual issue number. While the template structure is present, the substantive content required to understand the changeset is missing.	Please complete the PR description by filling in: a meaningful Overview explaining the purpose and context of DEP-423, a Details section describing the changes made (the new GAIE deployment workflow), a Where should the reviewer start section highlighting that .github/workflows/gaie.yml is the key file to review, and the actual related GitHub issue number in the Related Issues section. This will help reviewers understand the scope and intent of the changes.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The PR title "feat: Create GAIE EPP Integration Test" is partially related to the changeset. The summary shows the workflow does include GAIE EPP deployment and integration testing components (readiness checks, pod status evaluation), which aligns with the title. However, the main focus of the changeset is a comprehensive deployment pipeline that encompasses building multiple container images, Kubernetes configuration, Helm chart installations, and orchestrating multiple cloud registries. The title emphasizes only the integration test aspect, which is one part of a broader deployment workflow, making it incomplete rather than misleading.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (8)

.github/workflows/gaie.yml (8)
189-189: Remove redundant sleep before kubectl wait.

Line 189 uses a hardcoded sleep 20 before calling kubectl wait on line 190. The kubectl wait command already polls with exponential backoff, making the sleep unnecessary and delaying job completion.
-          kubectl apply -n ${KUBE_NS} -f ${DEPLOYMENT_FILE}
-          sleep 20
-          kubectl wait --for=condition=ready pod -l "nvidia.com/dynamo-graph-deployment-name=${GRAPH_NAME}" -n ${KUBE_NS} --timeout=1000s
+          kubectl apply -n ${KUBE_NS} -f ${DEPLOYMENT_FILE}
+          kubectl wait --for=condition=ready pod -l "nvidia.com/dynamo-graph-deployment-name=${GRAPH_NAME}" -n ${KUBE_NS} --timeout=1000s
249-249: Hardcoded GCP image reference should use versioned environment variable.

Line 249 tags an image with a hardcoded GCP reference us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.5.1-dirty. The version suffix is pinned to v0.5.1-dirty, but INFERENCE_EXTENSION_VERSION is already defined (line 36) as v0.5.1. The -dirty suffix is unusual and may cause confusion.

Clarify the intent by either:

Parameterizing the tag to match the environment variable, or

Adding a clear comment explaining why -dirty is needed.
-         docker tag us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.5.1-dirty gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo/epp-inference-extension-dynamo:${{ github.run_id }}
+         # Note: v0.5.1-dirty reflects the GAIE development build from build-epp-dynamo.sh
+         docker tag us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:${INFERENCE_EXTENSION_VERSION}-dirty gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo/epp-inference-extension-dynamo:${{ github.run_id }}
51-59: Pin package versions in apt-get and external tool downloads.

Line 51 installs packages without version constraints: sudo apt-get install -y curl bash openssl gettext git jq. Additionally, lines 52–59 download tools (yq, Helm, kubectl) from upstream release pages without pinning exact versions. Future runs may pull incompatible versions, leading to silent failures or behavioral changes.

Pin versions explicitly:
-         sudo apt-get update && sudo apt-get install -y curl bash openssl gettext git jq
+         sudo apt-get update && sudo apt-get install -y curl=7.* bash=5.* openssl git jq
-         curl -L https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -o yq
+         curl -L https://github.com/mikefarah/yq/releases/download/v4.35.1/yq_linux_amd64 -o yq
-         KUBECTL_VER=$(curl -L -s https://dl.k8s.io/release/stable.txt)
+         KUBECTL_VER=v1.30.0  # or retrieve and cache the stable version
304-304: Complex jq query for pod status checking is fragile.

Line 304 uses a complex multiline jq query to detect crash conditions. If the pod status structure differs (e.g., due to Kubernetes version changes or missing fields), the query may produce unexpected results or fail silently.

Add explicit error handling and simplify the query:
# Simpler approach: check for specific conditions independently
if kubectl get pods -n ${NS} -l "${SELECTOR}" -o jsonpath='{.items[*].status.containerStatuses[*].state.waiting.reason}' | grep -qE "CrashLoopBackOff|ImagePullBackOff|ErrImagePull"; then
  echo "Detected pod failure states in qwen-epp" >&2
  exit 1
fi
Or use a defensive jq approach:
if kubectl get pods -n ${NS} -l "${SELECTOR}" -o json | jq -e '
  .items[] | 
  select(.status.containerStatuses != null) | 
  .status.containerStatuses[] | 
  select(.state.waiting != null and (.state.waiting.reason == "CrashLoopBackOff" or .state.waiting.reason == "ImagePullBackOff" or .state.waiting.reason == "ErrImagePull"))
' >/dev/null 2>&1; then
  echo "Detected pod crash or image pull failure" >&2
  exit 1
fi
230-236: Git clone lacks retry logic for network resilience.

Line 234 clones the GAIE repository without any retry mechanism. If the network is temporarily unavailable or the remote is slow, the entire workflow fails.

Add retry logic:
GAIE_CLONE_DIR="${{ github.workspace }}/external/gateway-api-inference-extension"
rm -rf "${GAIE_CLONE_DIR}"
mkdir -p "$(dirname "${GAIE_CLONE_DIR}")"

MAX_RETRIES=3
for i in $(seq 1 ${MAX_RETRIES}); do
  if git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension.git "${GAIE_CLONE_DIR}"; then
    break
  fi
  if [[ $i -lt ${MAX_RETRIES} ]]; then
    echo "Clone attempt $i failed, retrying..."
    sleep $((2 ** i))
  else
    echo "Failed to clone after ${MAX_RETRIES} attempts" >&2
    exit 1
  fi
done

cd "${GAIE_CLONE_DIR}"
git checkout ${INFERENCE_EXTENSION_VERSION}
336-350: Commented-out email notification should be removed or completed.

Lines 336–350 contain a commented-out email notification step with a TODO marker. This creates maintenance debt and suggests incomplete work.

Either:

Remove the commented code if email notifications are not yet needed.

Implement the feature properly if needed for failure alerting.

Open a tracked issue if this is a known limitation (reference it in a comment).

For now, remove the block:
-      # - name: Email on failure
-      # TODO
-      #   if: failure()
-      #   continue-on-error: true
-      #   uses: dawidd6/action-send-mail@v3
-      #   with:
-      #     server_address: ${{ secrets.SMTP_SERVER }}
-      #     server_port: ${{ secrets.SMTP_PORT }}
-      #     username: ${{ secrets.SMTP_USER }}
-      #     password: ${{ secrets.SMTP_PASS }}
-      #     subject: GAIE scheduled run failed
-      #     to: [email protected]
-      #     from: ${{ secrets.SMTP_FROM }}
-      #     secure: true
-      #     html_body: |
-      #       <p>GAIE scheduled run failed.</p>
-      #       <p>Run: <a href="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}">details</a></p>
143-150: String sanitization inconsistency: consider using yq for all YAML-related operations.

Line 143 uses bash string substitution (PROFILE_SANITIZED="${PROFILE//_/-}") to sanitize the profile name for use in Kubernetes namespace naming. While this works, the workflow uses yq for YAML parsing elsewhere (lines 52–53, 186–187), suggesting a preference for dedicated tools. For consistency and robustness, consider standardizing the approach.

This is a minor point, but if additional sanitization is needed in the future, a dedicated tool or function would be clearer. For now, the current approach is acceptable; just document the intent:
# Sanitize profile name for Kubernetes namespace (replace underscores with hyphens)
PROFILE_SANITIZED="${PROFILE//_/-}"
7-13: Commented-out workflow triggers reduce flexibility.

Lines 7, 11–13 contain commented-out workflow_dispatch and pull_request triggers. This limits the ability to manually trigger the workflow or test it in pull requests.

Consider enabling these triggers to improve developer experience:
on:
  workflow_dispatch:  # Allow manual trigger
  push:
    branches:
      - dep-423-test
  pull_request:       # Allow testing in PRs
    paths:
      - '.github/workflows/gaie.yml'
      - 'deploy/cloud/**'
      - 'deploy/inference-gateway/**'
  schedule:
    - cron: "0 3 1,15 * *"
If there are reasons to keep these commented out (e.g., preventing accidental runs), document them clearly.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8de469c and a0aabf0.

📒 Files selected for processing (1)

.github/workflows/gaie.yml (1 hunks)

🧰 Additional context used

🪛 actionlint (1.7.8)

.github/workflows/gaie.yml

27-27: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

95-95: name is required in action metadata "/home/jailuser/git/.github/actions/docker-tag-push/action.yml"

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Build and Test - dynamo
GitHub Check: Mirror Repository to GitLab

🔇 Additional comments (2)

.github/workflows/gaie.yml (2)
27-27: Verify runner label is configured for self-hosted runners.

The custom runner label cpu-amd-m5-2xlarge is not a standard GitHub runner label. If this label is not configured in the repository's self-hosted runner setup, the workflow will hang indefinitely waiting for a matching runner.

Confirm that this runner is properly configured by checking the repository's runner settings, or document the setup requirement clearly in the repository's CI/CD documentation. Alternatively, if this should work with standard runners, use one from the GitHub runner list.

95-95: Ensure custom action has proper action metadata.

The custom action ./.github/actions/docker-tag-push is missing required metadata per the actionlint hint. Custom actions must define an action.yml (or action.yaml) file with at minimum a name field.

Verify that .github/actions/docker-tag-push/action.yml exists and includes:
name: 'Docker Tag and Push'
description: 'Tag and push Docker image to registries'
# ... inputs, outputs, runs, etc.
If the file is missing, the workflow will fail at runtime.

.github/workflows/gaie.yml

Signed-off-by: Anna Tchernych <[email protected]>

atchernych added 2 commits October 23, 2025 07:33

more

25dc5d6

Signed-off-by: Anna Tchernych <[email protected]>

test EPP build

00c094a

Signed-off-by: Anna Tchernych <[email protected]>

atchernych requested a review from a team as a code owner October 23, 2025 20:13

pull-request-size bot added the size/L label Oct 23, 2025

github-actions bot added the feat label Oct 23, 2025

enable on this branch

a0aabf0

Signed-off-by: Anna Tchernych <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 20:16 Inactive

coderabbitai bot reviewed Oct 23, 2025

View reviewed changes

.github/workflows/gaie.yml Outdated Show resolved Hide resolved

.github/workflows/gaie.yml Outdated Show resolved Hide resolved

.github/workflows/gaie.yml Outdated Show resolved Hide resolved

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 20:24 Inactive

cleanup

cdf1c16

Signed-off-by: Anna Tchernych <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 20:30 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 20:31 Inactive

atchernych added 2 commits October 23, 2025 16:38

fix namespace

1b1eb84

Signed-off-by: Anna Tchernych <[email protected]>

force login

d41fe3d

Signed-off-by: Anna Tchernych <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 00:01 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 00:02 Inactive

login to nvcro optionally

c9a50a3

Signed-off-by: Anna Tchernych <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 00:05 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 00:07 Inactive

rm re-login to nvcro

e5be572

Signed-off-by: Anna Tchernych <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 17:07 Inactive

change login to nvcro

4f5645d

Signed-off-by: Anna Tchernych <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 17:25 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 17:29 Inactive

ru after permissions update

c2f2f65

Signed-off-by: Anna Tchernych <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 17:41 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 24, 2025 17:42 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Create GAIE EPP Integration Test #3852

feat: Create GAIE EPP Integration Test #3852

atchernych commented Oct 23, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 23, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

feat: Create GAIE EPP Integration Test #3852

Are you sure you want to change the base?

feat: Create GAIE EPP Integration Test #3852

Conversation

atchernych commented Oct 23, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

atchernych commented Oct 23, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 23, 2025 •

edited

Loading