Conversation

@indrajit96
Contributor
@indrajit96 indrajit96 commented Oct 23, 2025

Overview:

Adds weekly CI tests for the full fault tolerance test suite.
Runs once a week on Sunday evening PST.

Details:

  • Runs every Sunday at 5:00 PM PST via scheduled cron job
  • Tests 200+ fault tolerance scenarios across all backends (vLLM, TensorRT-LLM, SGLang)

Where should the reviewer start?

.github/workflows/weekly-fault-tolerance.yml

Summary by CodeRabbit

  • Chores
    • Implemented weekly fault tolerance test automation with multi-environment deployment support and failure notifications.

Signed-off-by: Indrajit Bhosale <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Oct 23, 2025

Walkthrough

A new GitHub Actions workflow for weekly fault-tolerance testing has been added. It orchestrates parallel container image builds for multiple inference engines, deploys them to Kubernetes via Helm, and executes comprehensive fault-tolerance test matrices across various configurations with automated cleanup.

Changes

Cohort / File(s) Summary
Weekly Fault-Tolerance Testing Workflow
.github/workflows/weekly-fault-tolerance.yml
New workflow configuration implementing scheduled weekly fault-tolerance tests with manual trigger capability. Includes parallel build jobs (operator, vllm, trtllm, sglang), container image registry pushes to Azure/AWS, Kubernetes namespace provisioning, Helm deployments, multi-variant test matrices (sglang, vLLM, TRT-LLM), pytest execution against fault-tolerance scenarios, resource cleanup, and aggregated status reporting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The workflow contains substantial orchestration logic with multiple interdependent jobs, parallel execution paths, cloud registry integrations, Kubernetes operations, extensive test matrices, and intricate error handling flows. Each job phase and matrix combination requires careful verification of correctness and resource management.

Poem

🐰 A workflow most splendid takes flight,
Testing fault-tolerance through the night,
Containers build in parallel dance,
Kubernetes clusters get their chance,
Helming the tests with utmost care,
Weekly assurance beyond compare!

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | The pull request description covers the primary required sections from the template. It includes a clear Overview explaining the purpose and schedule, a Details section with concrete information about the cron schedule and test scope (200+ scenarios across backends), and a "Where should the reviewer start" section pointing to the specific workflow file. The description is mostly complete, though the "Related Issues" section is missing. However, this omission appears acceptable since this PR adds a new feature rather than resolving a specific issue, making the Related Issues section non-critical. |
| Title Check | ✅ Passed | The pull request title "ci: Add Weekly CI tests for full FT test suite" directly and accurately captures the primary change in the changeset. The summary confirms that the PR introduces a new GitHub Actions workflow for running a full fault tolerance test suite on a weekly schedule, which aligns with the title's description. The title is concise, uses clear CI/CD terminology, avoids vague language or noise, and provides sufficient clarity that a team member reviewing the git history would immediately understand the purpose of this changeset. |

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
.github/workflows/weekly-fault-tolerance.yml (5)

38-56: Consider a more robust approach to commit-based scheduling.

The git log --since="24 hours ago" check is timezone-dependent and may behave unpredictably across different runner environments. Additionally, the test_scenarios input defined in workflow_dispatch (lines 16-19) is never consumed by the workflow, so manual runs cannot selectively run specific test scenarios.

Consider:

  • Using a fixed, UTC-based timestamp instead of relative time
  • Storing the last run timestamp and comparing against it
  • Consuming the test_scenarios input to filter the matrix at runtime (if selective testing is needed)
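
A rough sketch of the second option (assumes the gh CLI is available on the runner and GH_TOKEN is exported; the workflow name matches this file's name: field):

```bash
# Gate on commits since the last successful scheduled run instead of a
# relative 24-hour window, which avoids timezone ambiguity entirely.
LAST_SHA=$(gh run list --workflow "Weekly Fault Tolerance Tests" \
  --event schedule --status success --limit 1 --json headSha -q '.[0].headSha')
if [ -z "$LAST_SHA" ] || [ "$(git rev-list --count "${LAST_SHA}..HEAD")" -gt 0 ]; then
  echo "run_tests=true" >> "$GITHUB_OUTPUT"
else
  echo "run_tests=false" >> "$GITHUB_OUTPUT"
fi
```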

75-86: Remove unused AWS ECR login from operator job.

The operator job installs AWS CLI and logs into ECR (lines 75-86) but only pushes to Azure ACR (line 105: azure_push: 'true', aws_push: 'false'). This setup is redundant.

Remove the unused AWS setup:

```diff
-      - name: Install awscli
-        shell: bash
-        run: |
-          curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
-          unzip awscliv2.zip
-          sudo ./aws/install
-      - name: Login to ECR
-        shell: bash
-        env:
-          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
-        run: |
-          aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME}
```

192-225: Align sglang build job structure with vllm and trtllm.

The sglang job lacks a strategy.matrix definition present in vllm and trtllm jobs. For consistency and future extensibility, consider refactoring sglang to use the same pattern.

Apply this diff to align the structure:

```diff
   sglang:
     needs: should-run
     if: needs.should-run.outputs.run_tests == 'true'
+    strategy:
+      fail-fast: false
+      matrix:
+        platform:
+          - { arch: amd64, runner: gpu-l40-amd64 }
-    runs-on: gpu-l40-amd64
+    runs-on: ${{ matrix.platform.runner }}
     name: sglang (amd64)
     steps:
       - name: Checkout repository
         uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955  # v4.3.0
       - name: Build Container
         id: build-image
         uses: ./.github/actions/docker-build
         with:
           framework: sglang
           target: runtime
-          platform: 'linux/amd64'
+          platform: 'linux/${{ matrix.platform.arch }}'
           ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
           ...
           push_tag: ai-dynamo/dynamo:${{ github.sha }}-sglang-amd64
           ...
```

599-608: Add context to failure notifications.

The status check job aggregates results but provides minimal context:

  • The jq command checks all tests passed but doesn't identify which specific tests failed
  • Notification logic is a placeholder with no actual implementation

When implementing the notification logic, include:

  • List of failed test scenarios (parse matrix and check which jobs failed)
  • Link to the failed job logs
  • Summary of failure types (if available)

Example improvement (note: needs.deploy-test-fault-tolerance.result is a single aggregated string for the whole matrix, so per-scenario detail has to come from the Actions API; a minimal sketch using the gh CLI):

```yaml
- name: Identify failed tests
  if: failure()
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    echo "Failed jobs in this run:"
    gh run view ${{ github.run_id }} --repo ${{ github.repository }} \
      --json jobs -q '.jobs[] | select(.conclusion == "failure") | .name'
```

227-431: Plan for operational scalability and resource management.

With ~200 concurrent test jobs in the matrix (lines 236-431), consider:

  1. Kubernetes resource quotas: Namespaces created per job lack explicit CPU/memory quotas. Add limits to prevent resource starvation:

     kubectl create resourcequota job-quota --hard=requests.cpu=4,requests.memory=8Gi -n $NAMESPACE

  2. GitHub Actions concurrency: 200+ concurrent jobs may exceed your account's runner capacity. Verify against your GitHub plan; a max-parallel cap can stagger execution (see the sketch after this list).

  3. Test artifact retention: No mechanism exists for capturing test logs, reports, or debugging information. Consider:

     • Uploading failed test logs to artifact storage
     • Generating a summary report per test scenario

  4. Cluster capacity: Ensure the Azure AKS cluster is sized to handle 200+ pods spawning simultaneously, or implement staggered job execution.
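
A minimal sketch of staggering with GitHub Actions' built-in max-parallel (the cap of 20 is an arbitrary illustration, not a recommendation):

```yaml
strategy:
  fail-fast: false
  max-parallel: 20  # hypothetical cap; tune to runner and cluster capacity
  matrix:
    test_scenario:
      - ...
```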

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8de469c and d722a6a.

📒 Files selected for processing (1)
  • .github/workflows/weekly-fault-tolerance.yml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/weekly-fault-tolerance.yml

100-100: name is required in action metadata "/home/jailuser/git/.github/actions/docker-tag-push/action.yml"

(action)


195-195: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


228-228: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
.github/workflows/weekly-fault-tolerance.yml (5)

8-11: Verify cron schedule and documentation accuracy.

The cron expression 0 1 * * 1 fires at 1:00 AM UTC on Monday, which is Sunday 5:00 PM PST (UTC-8), consistent with the first comment line. However, cron schedules are fixed in UTC and do not shift with daylight saving: during PDT (UTC-7) the run lands at Sunday 6:00 PM local time, not 5:00 PM PDT as the in-file comment states. Confirm this is intentional.
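
If the Sunday-evening anchor matters, the in-file comment could be corrected along these lines (schedule itself unchanged):

```yaml
schedule:
  # 01:00 UTC Monday == 5:00 PM Sunday PST (UTC-8); 6:00 PM Sunday during PDT (UTC-7)
  - cron: '0 1 * * 1'
```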


453-483: Harden dependency installation and add validation for extracted variables.

Several robustness concerns in the setup phase:

  1. Tool installations (yq, Helm, kubectl) lack error checking and could silently fail
  2. apt-get update has no timeout and could hang indefinitely
  3. FRAMEWORK extraction via cut -d'-' -f1 assumes all scenario names follow {framework}-... format but has no validation
  4. The kubeconfig file (.kubeconfig) persists after the job and could accumulate across multiple runs

Consider adding:

  • Error checks after each tool installation (e.g., set -e or explicit checks)
  • Timeout for apt-get update
  • Validation that FRAMEWORK is non-empty and matches expected values (sglang, trtllm, vllm)
  • Explicit cleanup of .kubeconfig file in the cleanup phase or use /tmp directory
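
A hedged sketch of those hardening steps (TEST_SCENARIO stands in for ${{ matrix.test_scenario }}; the framework whitelist is inferred from the matrix in this file):

```bash
set -euo pipefail

# Bound apt-get so a hung mirror fails the job fast instead of stalling it
timeout 120 sudo apt-get update

# Validate the framework prefix extracted from the scenario name
FRAMEWORK="${TEST_SCENARIO%%-*}"
case "$FRAMEWORK" in
  sglang|trtllm|vllm) ;;
  *) echo "Unexpected framework '$FRAMEWORK' in scenario '$TEST_SCENARIO'" >&2; exit 1 ;;
esac

# Keep the kubeconfig in the runner's temp dir so it never outlives the job
export KUBECONFIG="${RUNNER_TEMP}/kubeconfig"
```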

485-535: Verify deployment prerequisites and add error handling.

Several concerns with the operator deployment phase:

  1. IMAGE_TAG is sourced from build.env (line 506) but this file is not created by any prior step—is it committed to the repository?
  2. Istio validation (line 501) has no error handling; the job will fail if Istio is not installed
  3. Namespace cleanup uses || true, which suppresses errors and could leave orphaned resources
  4. timeout 300s for kubectl rollout (line 531) may be insufficient for large deployments or heavily loaded clusters
  5. nscleanup/ttl=7200 sets a 2-hour cleanup window, but if a job runs longer, resources may persist

Please verify:

  • Whether build.env is committed to the repository or needs to be created by a prior step
  • Whether Istio is a hard requirement; if so, add explicit error handling
  • Whether 300 seconds is adequate for rollout in your cluster's typical load conditions
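
For the Istio prerequisite, an explicit guard is a one-liner (sketch, assuming Istio is in fact a hard requirement):

```bash
if ! kubectl get namespace istio-system >/dev/null 2>&1; then
  echo "Istio is required but the istio-system namespace was not found" >&2
  exit 1
fi
```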

537-568: Validate test scenario naming and add safety checks for test execution.

Several concerns:

  1. The pytest parametrization assumes test scenario names match exactly: test_fault_scenario[${{ matrix.test_scenario }}]. If a scenario name is malformed, pytest will fail with "not found" instead of a clear error.
  2. Using -s (no output capture) in pytest may leak secrets to logs if tests print environment variables.
  3. PYTHONPATH includes $(pwd)/components/src, but this directory's existence is not validated.
  4. No explicit error handling for venv creation failures.

Consider:

  • Adding validation to ensure scenario names conform to an expected pattern
  • Using output redaction in pytest or ensuring tests don't print sensitive data
  • Validating PYTHONPATH directories exist before test execution
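
A sketch of the scenario-name guard (pattern inferred from the matrix entries in this workflow):

```bash
if [[ ! "$TEST_SCENARIO" =~ ^(sglang|trtllm|vllm)-(agg|disagg)- ]]; then
  echo "Malformed test scenario name: $TEST_SCENARIO" >&2
  exit 1
fi
```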

570-591: Improve cleanup robustness and error visibility.

Cleanup phase concerns:

  1. 5-minute timeout may be too aggressive for large namespace teardown (200+ test scenarios' resources). Kubernetes pod termination grace period + Helm chart uninstall can exceed this window.
  2. Error suppression with || true hides legitimate failures (failed helm uninstall, namespace deletion errors). Consider logging failures instead.
  3. No polling for namespace deletion. The job may exit before cleanup completes, leaving resources orphaned until the 2-hour TTL cleanup kicks in.
  4. .kubeconfig file not explicitly cleaned up after the job, creating potential for accumulation or leaks across runs.

Consider:

  • Increasing cleanup timeout to 10-15 minutes or making it proportional to the number of resources
  • Adding explicit error logging: helm uninstall ... || echo "Helm uninstall failed: $?"
  • Polling for namespace deletion: kubectl wait --for=delete namespace/$NAMESPACE --timeout=10m (kubectl wait has no "terminating" condition; --for=delete is the supported form)
  • Explicitly cleaning up .kubeconfig: rm -f .kubeconfig
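
Putting those together, a hedged cleanup sketch:

```bash
# Log failures instead of swallowing them, then wait for actual deletion
helm uninstall dynamo-platform -n "$NAMESPACE" || echo "helm uninstall failed: $?"
kubectl delete namespace "$NAMESPACE" --wait=false || echo "namespace delete failed: $?"
kubectl wait --for=delete "namespace/$NAMESPACE" --timeout=10m \
  || echo "namespace $NAMESPACE still terminating; the 2-hour TTL cleanup will reap it"
rm -f .kubeconfig
```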

Comment on lines 1 to 609
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

name: Weekly Fault Tolerance Tests

on:
  schedule:
    # Run every Sunday at 5:00 PM PST (1:00 AM UTC Monday)
    # Cron syntax: minute hour day-of-month month day-of-week
    # Note: During PDT (daylight saving), this will run at 5:00 PM PDT (12:00 AM UTC Monday)
    - cron: '0 1 * * 1'

  # Allow manual triggering for testing
  workflow_dispatch:
    inputs:
      test_scenarios:
        description: 'Test scenarios to run (comma-separated or "all")'
        required: false
        default: 'all'
        type: string

concurrency:
  group: ${{ github.workflow }}-weekly-${{ github.ref_name || github.run_id }}
  cancel-in-progress: false

jobs:
  # Check if we should run (skip if no changes in last 24h for scheduled runs)
  should-run:
    runs-on: ubuntu-latest
    outputs:
      run_tests: ${{ steps.check.outputs.run_tests }}
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
        with:
          fetch-depth: 0

      - name: Check for recent activity
        id: check
        run: |
          # Always run if manually triggered
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            echo "run_tests=true" >> $GITHUB_OUTPUT
            echo "Manual trigger - running tests"
            exit 0
          fi

          # For scheduled runs, check if there were commits in last 24 hours
          COMMITS_LAST_24H=$(git log --since="24 hours ago" --oneline | wc -l)
          if [ "$COMMITS_LAST_24H" -gt 0 ]; then
            echo "run_tests=true" >> $GITHUB_OUTPUT
            echo "Found $COMMITS_LAST_24H commits in last 24 hours - running tests"
          else
            echo "run_tests=false" >> $GITHUB_OUTPUT
            echo "No commits in last 24 hours - skipping tests"
          fi

  operator:
    needs: should-run
    if: needs.should-run.outputs.run_tests == 'true'
    strategy:
      fail-fast: false
      matrix:
        platform:
          - { arch: amd64, runner: cpu-amd-m5-2xlarge }
    name: operator (${{ matrix.platform.arch }})
    runs-on: ${{ matrix.platform.runner }}
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          driver: docker
      - name: Install awscli
        shell: bash
        run: |
          curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
          unzip awscliv2.zip
          sudo ./aws/install
      - name: Login to ECR
        shell: bash
        env:
          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
        run: |
          aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME}
      - name: Build Container
        id: build-image
        shell: bash
        env:
          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
        run: |
          cd deploy/cloud/operator
          docker buildx build --load \
            --platform linux/${{ matrix.platform.arch }} \
            --build-arg DOCKER_PROXY=${ECR_HOSTNAME}/dockerhub/ \
            -f Dockerfile \
            -t dynamo-operator:latest .
      - name: Docker Tag and Push
        uses: ./.github/actions/docker-tag-push
        with:
          local_image: dynamo-operator:latest
          push_tag: ai-dynamo/dynamo:${{ github.sha }}-operator-${{ matrix.platform.arch }}
          aws_push: 'false'
          azure_push: 'true'
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }}
          azure_acr_user: ${{ secrets.AZURE_ACR_USER }}
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  vllm:
    needs: should-run
    if: needs.should-run.outputs.run_tests == 'true'
    strategy:
      fail-fast: false
      matrix:
        platform:
          - { arch: amd64, runner: gpu-l40-amd64 }
    name: vllm (${{ matrix.platform.arch }})
    runs-on: ${{ matrix.platform.runner }}
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
      - name: Build Container
        id: build-image
        uses: ./.github/actions/docker-build
        with:
          framework: vllm
          target: runtime
          platform: 'linux/${{ matrix.platform.arch }}'
          ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
          ci_token: ${{ secrets.CI_TOKEN }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }}
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Docker Tag and Push
        uses: ./.github/actions/docker-tag-push
        with:
          local_image: ${{ steps.build-image.outputs.image_tag }}
          push_tag: ai-dynamo/dynamo:${{ github.sha }}-vllm-${{ matrix.platform.arch }}
          aws_push: 'false'
          azure_push: 'true'
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }}
          azure_acr_user: ${{ secrets.AZURE_ACR_USER }}
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  trtllm:
    needs: should-run
    if: needs.should-run.outputs.run_tests == 'true'
    strategy:
      fail-fast: false
      matrix:
        platform:
          - { arch: amd64, runner: gpu-l40-amd64 }
    name: trtllm (${{ matrix.platform.arch }})
    runs-on: ${{ matrix.platform.runner }}
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
      - name: Build Container
        id: build-image
        uses: ./.github/actions/docker-build
        with:
          framework: trtllm
          target: runtime
          platform: 'linux/${{ matrix.platform.arch }}'
          ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
          ci_token: ${{ secrets.CI_TOKEN }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }}
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Docker Tag and Push
        uses: ./.github/actions/docker-tag-push
        with:
          local_image: ${{ steps.build-image.outputs.image_tag }}
          push_tag: ai-dynamo/dynamo:${{ github.sha }}-trtllm-${{ matrix.platform.arch }}
          aws_push: 'false'
          azure_push: 'true'
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }}
          azure_acr_user: ${{ secrets.AZURE_ACR_USER }}
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  sglang:
    needs: should-run
    if: needs.should-run.outputs.run_tests == 'true'
    runs-on: gpu-l40-amd64
    name: sglang (amd64)
    steps:
      - name: Checkout repository
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
      - name: Build Container
        id: build-image
        uses: ./.github/actions/docker-build
        with:
          framework: sglang
          target: runtime
          platform: 'linux/amd64'
          ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
          ci_token: ${{ secrets.CI_TOKEN }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }}
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Docker Tag and Push
        uses: ./.github/actions/docker-tag-push
        with:
          local_image: ${{ steps.build-image.outputs.image_tag }}
          push_tag: ai-dynamo/dynamo:${{ github.sha }}-sglang-amd64
          aws_push: 'false'
          azure_push: 'true'
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }}
          azure_acr_user: ${{ secrets.AZURE_ACR_USER }}
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  deploy-test-fault-tolerance:
    runs-on: cpu-amd-m5-2xlarge
    needs: [should-run, operator, vllm, trtllm, sglang]
    if: needs.should-run.outputs.run_tests == 'true'
    permissions:
      contents: read
    strategy:
      fail-fast: false
      matrix:
        test_scenario:
          # SGLang scenarios
          - sglang-agg-tp-1-dp-1-decode_worker
          - sglang-agg-tp-1-dp-1-decode_worker_pod
          - sglang-agg-tp-1-dp-1-frontend
          - sglang-agg-tp-1-dp-1-frontend_pod
          - sglang-agg-tp-1-dp-1-none
          - sglang-agg-tp-1-dp-1-sglang_decode_detokenizer
          - sglang-agg-tp-1-dp-1-sglang_decode_scheduler
          - sglang-agg-tp-1-dp-2-decode_worker
          - sglang-agg-tp-1-dp-2-decode_worker_pod
          - sglang-agg-tp-1-dp-2-frontend
          - sglang-agg-tp-1-dp-2-frontend_pod
          - sglang-agg-tp-1-dp-2-none
          - sglang-agg-tp-1-dp-2-sglang_decode_detokenizer
          - sglang-agg-tp-1-dp-2-sglang_decode_scheduler
          - sglang-agg-tp-2-dp-1-decode_worker
          - sglang-agg-tp-2-dp-1-decode_worker_pod
          - sglang-agg-tp-2-dp-1-frontend
          - sglang-agg-tp-2-dp-1-frontend_pod
          - sglang-agg-tp-2-dp-1-none
          - sglang-agg-tp-2-dp-1-sglang_decode_detokenizer
          - sglang-agg-tp-2-dp-1-sglang_decode_scheduler
          - sglang-agg-tp-4-dp-1-decode_worker
          - sglang-agg-tp-4-dp-1-decode_worker_pod
          - sglang-agg-tp-4-dp-1-frontend
          - sglang-agg-tp-4-dp-1-frontend_pod
          - sglang-agg-tp-4-dp-1-none
          - sglang-agg-tp-4-dp-1-sglang_decode_detokenizer
          - sglang-agg-tp-4-dp-1-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-none
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_decode_detokenizer
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_prefill_detokenizer
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_prefill_scheduler
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-none
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_decode_detokenizer
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_prefill_detokenizer
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_prefill_scheduler
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-none
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_decode_detokenizer
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_prefill_detokenizer
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_prefill_scheduler
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-none
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_decode_detokenizer
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_prefill_detokenizer
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_prefill_scheduler
          # TensorRT-LLM scenarios
          - trtllm-agg-tp-1-dp-1-decode_worker
          - trtllm-agg-tp-1-dp-1-decode_worker_pod
          - trtllm-agg-tp-1-dp-1-frontend
          - trtllm-agg-tp-1-dp-1-frontend_pod
          - trtllm-agg-tp-1-dp-1-none
          - trtllm-agg-tp-1-dp-1-trtllm_decode_engine_core
          - trtllm-agg-tp-1-dp-2-decode_worker
          - trtllm-agg-tp-1-dp-2-decode_worker_pod
          - trtllm-agg-tp-1-dp-2-frontend
          - trtllm-agg-tp-1-dp-2-frontend_pod
          - trtllm-agg-tp-1-dp-2-none
          - trtllm-agg-tp-1-dp-2-trtllm_decode_engine_core
          - trtllm-agg-tp-2-dp-1-decode_worker
          - trtllm-agg-tp-2-dp-1-decode_worker_pod
          - trtllm-agg-tp-2-dp-1-frontend
          - trtllm-agg-tp-2-dp-1-frontend_pod
          - trtllm-agg-tp-2-dp-1-none
          - trtllm-agg-tp-2-dp-1-trtllm_decode_engine_core
          - trtllm-agg-tp-4-dp-1-decode_worker
          - trtllm-agg-tp-4-dp-1-decode_worker_pod
          - trtllm-agg-tp-4-dp-1-frontend
          - trtllm-agg-tp-4-dp-1-frontend_pod
          - trtllm-agg-tp-4-dp-1-none
          - trtllm-agg-tp-4-dp-1-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-none
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-trtllm_prefill_engine_core
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-none
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-trtllm_prefill_engine_core
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-none
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-trtllm_prefill_engine_core
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-none
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-trtllm_prefill_engine_core
          # vLLM scenarios
          - vllm-agg-tp-1-dp-1-decode_worker
          - vllm-agg-tp-1-dp-1-decode_worker_pod
          - vllm-agg-tp-1-dp-1-frontend
          - vllm-agg-tp-1-dp-1-frontend_pod
          - vllm-agg-tp-1-dp-1-none
          - vllm-agg-tp-1-dp-1-vllm_decode_engine_core
          - vllm-agg-tp-1-dp-2-decode_worker
          - vllm-agg-tp-1-dp-2-decode_worker_pod
          - vllm-agg-tp-1-dp-2-frontend
          - vllm-agg-tp-1-dp-2-frontend_pod
          - vllm-agg-tp-1-dp-2-none
          - vllm-agg-tp-1-dp-2-vllm_decode_engine_core
          - vllm-agg-tp-2-dp-1-decode_worker
          - vllm-agg-tp-2-dp-1-decode_worker_pod
          - vllm-agg-tp-2-dp-1-frontend
          - vllm-agg-tp-2-dp-1-frontend_pod
          - vllm-agg-tp-2-dp-1-none
          - vllm-agg-tp-2-dp-1-vllm_decode_engine_core
          - vllm-agg-tp-4-dp-1-decode_worker
          - vllm-agg-tp-4-dp-1-decode_worker_pod
          - vllm-agg-tp-4-dp-1-frontend
          - vllm-agg-tp-4-dp-1-frontend_pod
          - vllm-agg-tp-4-dp-1-none
          - vllm-agg-tp-4-dp-1-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-none
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-vllm_prefill_engine_core
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-none
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-vllm_prefill_engine_core
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-none
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-vllm_prefill_engine_core
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-none
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-vllm_prefill_engine_core
    name: deploy-test-fault-tolerance (${{ matrix.test_scenario }})
    env:
      DYNAMO_INGRESS_SUFFIX: dev.aire.nvidia.com
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0

      - name: Install awscli
        shell: bash
        run: |
          curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
          unzip awscliv2.zip
          sudo ./aws/install

      - name: Login to ECR
        shell: bash
        env:
          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
        run: |
          aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME}

      - name: Set namespace and install dependencies
        run: |
          # Extract framework from test scenario for unique namespace
          FRAMEWORK=$(echo "${{ matrix.test_scenario }}" | cut -d'-' -f1)
          # Create unique namespace per matrix job with weekly prefix
          echo "NAMESPACE=gh-weekly-${{ github.run_id }}-ft-${FRAMEWORK}" >> $GITHUB_ENV
          set -x
          # Install dependencies
          sudo apt-get update && sudo apt-get install -y curl bash openssl gettext git jq python3 python3-pip python3-venv

          # Install yq
          echo "Installing yq..."
          curl -L https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -o yq
          sudo chmod 755 yq
          sudo mv yq /usr/local/bin/
          # Install Helm
          echo "Installing Helm..."
          curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
          sudo chmod 700 get_helm.sh
          sudo ./get_helm.sh
          # Install kubectl
          curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
          sudo chmod 755 kubectl
          sudo mv kubectl /usr/local/bin/

          # Setup kubeconfig
          echo "${{ secrets.AZURE_AKS_CI_KUBECONFIG_B64 }}" | base64 -d > .kubeconfig
          chmod 600 .kubeconfig
          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE --kubeconfig "${KUBECONFIG}"
          kubectl config current-context

      - name: Deploy Operator
        run: |
          set -x
          export KUBECONFIG=$(pwd)/.kubeconfig

          # Create a namespace for this job
          echo "Creating an ephemeral namespace..."
          kubectl delete namespace $NAMESPACE || true
          kubectl create namespace $NAMESPACE || true
          echo "Attaching the labels for secrets and cleanup"
          kubectl label namespaces ${NAMESPACE} nscleanup/enabled=true nscleanup/ttl=7200 gitlab-imagepull=enabled ngc-api=enabled nvcr-imagepull=enabled --overwrite=true

          # Set the namespace as default
          kubectl config set-context --current --namespace=$NAMESPACE

          # Check if Istio is installed
          kubectl get pods -n istio-system
          # Check if default storage class exists
          kubectl get storageclass

          # Install Helm chart
          export IMAGE_TAG=$(cat build.env)
          echo $IMAGE_TAG
          export VIRTUAL_ENV=/opt/dynamo/venv
          export KUBE_NS=$NAMESPACE
          export ISTIO_ENABLED=true
          export ISTIO_GATEWAY=istio-system/ingress-alb
          export VIRTUAL_SERVICE_SUPPORTS_HTTPS=true
          export DYNAMO_CLOUD=https://${NAMESPACE}.${DYNAMO_INGRESS_SUFFIX}

          # Install dynamo env secrets
          kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${{ secrets.HF_TOKEN }} -n $KUBE_NS || true
          # Create docker pull secret for operator image
          kubectl create secret docker-registry docker-imagepullsecret --docker-server=${{ secrets.AZURE_ACR_HOSTNAME }} --docker-username=${{ secrets.AZURE_ACR_USER }} --docker-password=${{ secrets.AZURE_ACR_PASSWORD }} --namespace=${NAMESPACE}
          # Install helm dependencies
          helm repo add bitnami https://charts.bitnami.com/bitnami
          cd deploy/cloud/helm/platform/
          helm dep build .
          # Install platform with namespace restriction for single profile testing
          helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \
            --set dynamo-operator.namespaceRestriction.enabled=true \
            --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \
            --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \
            --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \
            --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret
          # Wait for all deployments to be ready
          timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch
          cd -

          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE

      - name: Run Fault Tolerance Tests
        run: |
          set -x
          export KUBECONFIG=$(pwd)/.kubeconfig
          export NAMESPACE=$NAMESPACE

          # Extract framework from test scenario
          FRAMEWORK=$(echo "${{ matrix.test_scenario }}" | cut -d'-' -f1)
          export IMAGE="${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo:${{ github.sha }}-${FRAMEWORK}-amd64"

          # Set up Python virtual environment and install dependencies
          python3 -m venv venv
          source venv/bin/activate
          pip install --upgrade pip

          # Install core dependencies needed for tests (without full project install)
          pip install -r container/deps/requirements.test.txt
          pip install kubernetes==32.0.1 kubernetes_asyncio kr8s pyyaml requests tabulate pydantic

          # Add project source to PYTHONPATH for test imports
          export PYTHONPATH=$(pwd):$(pwd)/components/src:$PYTHONPATH

          echo "Running weekly fault tolerance test: ${{ matrix.test_scenario }}"
          echo "Using namespace: $NAMESPACE"
          echo "Using image: $IMAGE"

          # Run the pytest command
          pytest tests/fault_tolerance/deploy/test_deployment.py::test_fault_scenario[${{ matrix.test_scenario }}] \
            -s -v \
            --namespace ${NAMESPACE} \
            --image ${IMAGE} \
            --client-type legacy

      - name: Cleanup
        if: always()
        timeout-minutes: 5
        run: |
          echo "${{ secrets.AZURE_AKS_CI_KUBECONFIG_B64 }}" | base64 -d > .kubeconfig
          chmod 600 .kubeconfig
          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE --kubeconfig "${KUBECONFIG}"

          # For debugging purposes, list all the resources before we uninstall
          kubectl get all

          echo "Deleting all DynamoGraphDeployments in namespace $NAMESPACE..."
          kubectl delete dynamographdeployments --all -n $NAMESPACE || true

          # Uninstall the helm chart
          helm ls
          helm uninstall dynamo-platform || true

          echo "Namespace $NAMESPACE deletion initiated, proceeding with cleanup..."
          kubectl delete namespace $NAMESPACE || true
          echo "Namespace $NAMESPACE cleanup completed."

  # Status check job to verify all tests passed
  weekly-status-check:
    runs-on: ubuntu-latest
    needs: [deploy-test-fault-tolerance]
    if: always()
    steps:
      - name: Check all tests passed
        run: |
          echo '${{ toJson(needs) }}' | jq -e 'to_entries | map(.value.result) | all(. as $result | ["success", "skipped"] | any($result == .))'

      - name: Send notification on failure
        if: failure()
        run: |
          echo "Weekly fault tolerance tests failed!"
          echo "Workflow run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          # Add notification logic here (e.g., Slack, email, etc.)
```
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Address static analysis warnings: runner labels and action metadata.

Actionlint reports two issues:

  1. Custom runner labels not recognized (lines 67, 119, 159, 195, 228): gpu-l40-amd64, cpu-amd-m5-2xlarge

    • These are self-hosted runner labels and should be configured in your actionlint.yaml file to suppress false positives.
  2. Missing name field in .github/actions/docker-tag-push/action.yml (line 100)

    • The custom action lacks a required name field in its metadata. Add it to the action's metadata file.

🌐 Web query:

What is the required format for the "name" field in GitHub Actions custom action metadata (action.yml)?

💡 Result:

The name field must be a string (required). Example: name: 'My Action'. (docs.github.com)


🏁 Script executed:

```bash
#!/bin/bash
# Check if docker-tag-push action exists and inspect its content
find . -name "action.yml" -o -name "action.yaml" | grep docker-tag-push
if [ -f ".github/actions/docker-tag-push/action.yml" ]; then
  echo "=== docker-tag-push/action.yml ==="
  head -20 .github/actions/docker-tag-push/action.yml
fi

# Check for actionlint config
echo ""
echo "=== Looking for actionlint configuration ==="
find . -name "actionlint.yaml" -o -name ".actionlintrc*" 2>/dev/null
```

Length of output: 601


🏁 Script executed:

```bash
#!/bin/bash
# Search for runner labels used in the workflow to verify they're defined
echo "=== Custom runner labels used in workflow ==="
grep -E "runs-on:|runner:" .github/workflows/weekly-fault-tolerance.yml | sort | uniq

# Check for self-hosted runner definitions in repo
echo ""
echo "=== Searching for runner configuration files ==="
find . -name "*.yml" -o -name "*.yaml" | xargs grep -l "self-hosted" 2>/dev/null | head -10
```

Length of output: 331


Add required "name" field to .github/actions/docker-tag-push/action.yml and configure actionlint for custom runner labels.

The docker-tag-push action metadata is missing the required name field. Add this to the top of .github/actions/docker-tag-push/action.yml:

name: 'Tag and Push Docker Images'
description: 'Tag and Push Docker Images'

Additionally, custom runner labels (gpu-l40-amd64, cpu-amd-m5-2xlarge) trigger actionlint warnings because they are not recognized as built-in GitHub runners. Create .github/actionlint.yaml to configure these self-hosted runner labels:

self-hosted-runner:
  - cpu-amd-m5-2xlarge
  - gpu-l40-amd64

🤖 Prompt for AI Agents
In .github/workflows/weekly-fault-tolerance.yml lines 1-609: the workflow
references a local action .github/actions/docker-tag-push whose action.yml is
missing the required top-level name field (causing actionlint/metadata issues)
and the workflow uses custom self-hosted runner labels that actionlint flags as
unknown; fix by adding a name and description to
.github/actions/docker-tag-push/action.yml (e.g., "Tag and Push Docker Images"
and a short description) and add a new file .github/actionlint.yaml listing the
custom self-hosted runner labels (cpu-amd-m5-2xlarge, gpu-l40-amd64) under a
self-hosted-runner key so actionlint recognizes them.

@indrajit96 indrajit96 changed the title ci : Add Weekly CI tests for full FT test suite ci: Add Weekly CI tests for full FT test suite Oct 23, 2025
@github-actions github-actions bot added the ci Issues/PRs that reference CI build/test label Oct 23, 2025
```yaml
    strategy:
      fail-fast: false
      matrix:
        test_scenario:
```
Contributor

Is there an automatic way to capture all the test cases? Otherwise we have to remember to change this file every time we add a test case.
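
One possibility (a hedged sketch; job and step names are illustrative, and it assumes pytest can collect these tests without the full runtime dependencies) is to generate the matrix from pytest collection:

```yaml
collect-scenarios:
  runs-on: ubuntu-latest
  outputs:
    scenarios: ${{ steps.collect.outputs.scenarios }}
  steps:
    - uses: actions/checkout@v4
    - id: collect
      run: |
        pip install pytest
        # Collect test IDs and strip them down to the bracketed scenario names
        SCENARIOS=$(pytest tests/fault_tolerance/deploy/test_deployment.py \
          --collect-only -q | sed -n 's/.*\[\(.*\)\]$/\1/p' | jq -R . | jq -cs .)
        echo "scenarios=$SCENARIOS" >> "$GITHUB_OUTPUT"

deploy-test-fault-tolerance:
  needs: collect-scenarios
  strategy:
    fail-fast: false
    matrix:
      test_scenario: ${{ fromJson(needs.collect-scenarios.outputs.scenarios) }}
```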

```yaml
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  deploy-test-fault-tolerance:
    runs-on: cpu-amd-m5-2xlarge
```
Contributor

How many GPUs will be running these FT tests?
Will they be running in parallel?
As we add more and more tests, will we need to change runs-on: cpu-amd-m5-2xlarge?

```bash
# Install dynamo env secrets
kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${{ secrets.HF_TOKEN }} -n $KUBE_NS || true
# Create docker pull secret for operator image
kubectl create secret docker-registry docker-imagepullsecret --docker-server=${{ secrets.AZURE_ACR_HOSTNAME }} --docker-username=${{ secrets.AZURE_ACR_USER }} --docker-password=${{ secrets.AZURE_ACR_PASSWORD }} --namespace=${NAMESPACE}
```
Contributor

FYI, later for MoE and elastic EP we might need to run on an environment other than Azure.
