Conversation

@indrajit96
Contributor
@indrajit96 indrajit96 commented Oct 23, 2025

Overview:

Adds weekly CI tests for the full fault tolerance test suite.
Runs once a week on Sunday evening PST.

Details:

  • Runs every Sunday at 5:00 PM PST via scheduled cron job
  • Tests 200+ fault tolerance scenarios across all backends (vLLM, TensorRT-LLM, SGLang)

Where should the reviewer start?

.github/workflows/weekly-fault-tolerance.yml

Summary by CodeRabbit

  • Chores
    • Implemented weekly fault tolerance test automation with multi-environment deployment support and failure notifications.

Signed-off-by: Indrajit Bhosale <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Oct 23, 2025

Walkthrough

A new GitHub Actions workflow for weekly fault-tolerance testing has been added. It orchestrates parallel container image builds for multiple inference engines, deploys them to Kubernetes via Helm, and executes comprehensive fault-tolerance test matrices across various configurations with automated cleanup.

Changes

Cohort / File(s) Summary
Weekly Fault-Tolerance Testing Workflow
.github/workflows/weekly-fault-tolerance.yml
New workflow configuration implementing scheduled weekly fault-tolerance tests with manual trigger capability. Includes parallel build jobs (operator, vllm, trtllm, sglang), container image registry pushes to Azure/AWS, Kubernetes namespace provisioning, Helm deployments, multi-variant test matrices (sglang, vLLM, TRT-LLM), pytest execution against fault-tolerance scenarios, resource cleanup, and aggregated status reporting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The workflow contains substantial orchestration logic with multiple interdependent jobs, parallel execution paths, cloud registry integrations, Kubernetes operations, extensive test matrices, and intricate error handling flows. Each job phase and matrix combination requires careful verification of correctness and resource management.

Poem

🐰 A workflow most splendid takes flight,
Testing fault-tolerance through the night,
Containers build in parallel dance,
Kubernetes clusters get their chance,
Helming the tests with utmost care,
Weekly assurance beyond compare!

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | The pull request description covers the primary required sections from the template. It includes a clear Overview explaining the purpose and schedule, a Details section with concrete information about the cron schedule and test scope (200+ scenarios across backends), and a "Where should the reviewer start" section pointing to the specific workflow file. The description is mostly complete, though the "Related Issues" section is missing. However, this omission appears acceptable since this PR adds a new feature rather than resolving a specific issue, making the Related Issues section non-critical. |
| Title Check | ✅ Passed | The pull request title "ci: Add Weekly CI tests for full FT test suite" directly and accurately captures the primary change in the changeset. The summary confirms that the PR introduces a new GitHub Actions workflow for running a full fault tolerance test suite on a weekly schedule, which aligns with the title's description. The title is concise, uses clear CI/CD terminology, avoids vague language or noise, and provides sufficient clarity that a team member reviewing the git history would immediately understand the purpose of this changeset. |

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
.github/workflows/weekly-fault-tolerance.yml (5)

38-56: Consider a more robust approach to commit-based scheduling.

The git log --since="24 hours ago" check is timezone-dependent and may behave unpredictably across different runner environments. Additionally, the test_scenarios input defined in workflow_dispatch (lines 16-19) is never consumed by the workflow, so manual runs cannot selectively run specific test scenarios.

Consider:

  • Using a fixed, UTC-based timestamp instead of relative time
  • Storing the last run timestamp and comparing against it
  • Consuming the test_scenarios input to filter the matrix at runtime (if selective testing is needed)
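
A rough sketch of the second option (assumes the gh CLI is available on the runner and GH_TOKEN is exported; the workflow name matches this file's name: field):

```bash
# Gate on commits since the last successful scheduled run instead of a
# relative 24-hour window, which avoids timezone ambiguity entirely.
LAST_SHA=$(gh run list --workflow "Weekly Fault Tolerance Tests" \
  --event schedule --status success --limit 1 --json headSha -q '.[0].headSha')
if [ -z "$LAST_SHA" ] || [ "$(git rev-list --count "${LAST_SHA}..HEAD")" -gt 0 ]; then
  echo "run_tests=true" >> "$GITHUB_OUTPUT"
else
  echo "run_tests=false" >> "$GITHUB_OUTPUT"
fi
```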

75-86: Remove unused AWS ECR login from operator job.

The operator job installs AWS CLI and logs into ECR (lines 75-86) but only pushes to Azure ACR (line 105: azure_push: 'true', aws_push: 'false'). This setup is redundant.

Remove the unused AWS setup:

```diff
-      - name: Install awscli
-        shell: bash
-        run: |
-          curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
-          unzip awscliv2.zip
-          sudo ./aws/install
-      - name: Login to ECR
-        shell: bash
-        env:
-          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
-        run: |
-          aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME}
```

192-225: Align sglang build job structure with vllm and trtllm.

The sglang job lacks a strategy.matrix definition present in vllm and trtllm jobs. For consistency and future extensibility, consider refactoring sglang to use the same pattern.

Apply this diff to align the structure:

```diff
   sglang:
     needs: should-run
     if: needs.should-run.outputs.run_tests == 'true'
+    strategy:
+      fail-fast: false
+      matrix:
+        platform:
+          - { arch: amd64, runner: gpu-l40-amd64 }
-    runs-on: gpu-l40-amd64
+    runs-on: ${{ matrix.platform.runner }}
     name: sglang (amd64)
     steps:
       - name: Checkout repository
         uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955  # v4.3.0
       - name: Build Container
         id: build-image
         uses: ./.github/actions/docker-build
         with:
           framework: sglang
           target: runtime
-          platform: 'linux/amd64'
+          platform: 'linux/${{ matrix.platform.arch }}'
           ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
           ...
           push_tag: ai-dynamo/dynamo:${{ github.sha }}-sglang-amd64
           ...
```

599-608: Add context to failure notifications.

The status check job aggregates results but provides minimal context:

  • The jq command checks all tests passed but doesn't identify which specific tests failed
  • Notification logic is a placeholder with no actual implementation

When implementing the notification logic, include:

  • List of failed test scenarios (parse matrix and check which jobs failed)
  • Link to the failed job logs
  • Summary of failure types (if available)

Example improvement (note: needs.deploy-test-fault-tolerance.result is a single aggregated string for the whole matrix, so per-scenario detail has to come from the Actions API; a minimal sketch using the gh CLI):

```yaml
- name: Identify failed tests
  if: failure()
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    echo "Failed jobs in this run:"
    gh run view ${{ github.run_id }} --repo ${{ github.repository }} \
      --json jobs -q '.jobs[] | select(.conclusion == "failure") | .name'
```

227-431: Plan for operational scalability and resource management.

With ~200 concurrent test jobs in the matrix (lines 236-431), consider:

  1. Kubernetes resource quotas: Namespaces created per job lack explicit CPU/memory quotas. Add limits to prevent resource starvation:

     kubectl create resourcequota job-quota --hard=requests.cpu=4,requests.memory=8Gi -n $NAMESPACE

  2. GitHub Actions concurrency: 200+ concurrent jobs may exceed your account's runner capacity. Verify against your GitHub plan; a max-parallel cap can stagger execution (see the sketch after this list).

  3. Test artifact retention: No mechanism exists for capturing test logs, reports, or debugging information. Consider:

     • Uploading failed test logs to artifact storage
     • Generating a summary report per test scenario

  4. Cluster capacity: Ensure the Azure AKS cluster is sized to handle 200+ pods spawning simultaneously, or implement staggered job execution.
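
A minimal sketch of staggering with GitHub Actions' built-in max-parallel (the cap of 20 is an arbitrary illustration, not a recommendation):

```yaml
strategy:
  fail-fast: false
  max-parallel: 20  # hypothetical cap; tune to runner and cluster capacity
  matrix:
    test_scenario:
      - ...
```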

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8de469c and d722a6a.

📒 Files selected for processing (1)
  • .github/workflows/weekly-fault-tolerance.yml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/weekly-fault-tolerance.yml

100-100: name is required in action metadata "/home/jailuser/git/.github/actions/docker-tag-push/action.yml"

(action)


195-195: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


228-228: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
.github/workflows/weekly-fault-tolerance.yml (5)

8-11: Verify cron schedule and documentation accuracy.

The cron expression 0 1 * * 1 fires at 1:00 AM UTC on Monday, which is Sunday 5:00 PM PST (UTC-8), consistent with the first comment line. However, cron schedules are fixed in UTC and do not shift with daylight saving: during PDT (UTC-7) the run lands at Sunday 6:00 PM local time, not 5:00 PM PDT as the in-file comment states. Confirm this is intentional.
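
If the Sunday-evening anchor matters, the in-file comment could be corrected along these lines (schedule itself unchanged):

```yaml
schedule:
  # 01:00 UTC Monday == 5:00 PM Sunday PST (UTC-8); 6:00 PM Sunday during PDT (UTC-7)
  - cron: '0 1 * * 1'
```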


453-483: Harden dependency installation and add validation for extracted variables.

Several robustness concerns in the setup phase:

  1. Tool installations (yq, Helm, kubectl) lack error checking and could silently fail
  2. apt-get update has no timeout and could hang indefinitely
  3. FRAMEWORK extraction via cut -d'-' -f1 assumes all scenario names follow {framework}-... format but has no validation
  4. The kubeconfig file (.kubeconfig) persists after the job and could accumulate across multiple runs

Consider adding:

  • Error checks after each tool installation (e.g., set -e or explicit checks)
  • Timeout for apt-get update
  • Validation that FRAMEWORK is non-empty and matches expected values (sglang, trtllm, vllm)
  • Explicit cleanup of .kubeconfig file in the cleanup phase or use /tmp directory
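
A hedged sketch of those hardening steps (TEST_SCENARIO stands in for ${{ matrix.test_scenario }}; the framework whitelist is inferred from the matrix in this file):

```bash
set -euo pipefail

# Bound apt-get so a hung mirror fails the job fast instead of stalling it
timeout 120 sudo apt-get update

# Validate the framework prefix extracted from the scenario name
FRAMEWORK="${TEST_SCENARIO%%-*}"
case "$FRAMEWORK" in
  sglang|trtllm|vllm) ;;
  *) echo "Unexpected framework '$FRAMEWORK' in scenario '$TEST_SCENARIO'" >&2; exit 1 ;;
esac

# Keep the kubeconfig in the runner's temp dir so it never outlives the job
export KUBECONFIG="${RUNNER_TEMP}/kubeconfig"
```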

485-535: Verify deployment prerequisites and add error handling.

Several concerns with the operator deployment phase:

  1. IMAGE_TAG is sourced from build.env (line 506) but this file is not created by any prior step—is it committed to the repository?
  2. Istio validation (line 501) has no error handling; the job will fail if Istio is not installed
  3. Namespace cleanup uses || true, which suppresses errors and could leave orphaned resources
  4. timeout 300s for kubectl rollout (line 531) may be insufficient for large deployments or heavily loaded clusters
  5. nscleanup/ttl=7200 sets a 2-hour cleanup window, but if a job runs longer, resources may persist

Please verify:

  • Whether build.env is committed to the repository or needs to be created by a prior step
  • Whether Istio is a hard requirement; if so, add explicit error handling
  • Whether 300 seconds is adequate for rollout in your cluster's typical load conditions
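
For the Istio prerequisite, an explicit guard is a one-liner (sketch, assuming Istio is in fact a hard requirement):

```bash
if ! kubectl get namespace istio-system >/dev/null 2>&1; then
  echo "Istio is required but the istio-system namespace was not found" >&2
  exit 1
fi
```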

537-568: Validate test scenario naming and add safety checks for test execution.

Several concerns:

  1. The pytest parametrization assumes test scenario names match exactly: test_fault_scenario[${{ matrix.test_scenario }}]. If a scenario name is malformed, pytest will fail with "not found" instead of a clear error.
  2. Using -s (no output capture) in pytest may leak secrets to logs if tests print environment variables.
  3. PYTHONPATH includes $(pwd)/components/src, but this directory's existence is not validated.
  4. No explicit error handling for venv creation failures.

Consider:

  • Adding validation to ensure scenario names conform to an expected pattern
  • Using output redaction in pytest or ensuring tests don't print sensitive data
  • Validating PYTHONPATH directories exist before test execution
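
A sketch of the scenario-name guard (pattern inferred from the matrix entries in this workflow):

```bash
if [[ ! "$TEST_SCENARIO" =~ ^(sglang|trtllm|vllm)-(agg|disagg)- ]]; then
  echo "Malformed test scenario name: $TEST_SCENARIO" >&2
  exit 1
fi
```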

570-591: Improve cleanup robustness and error visibility.

Cleanup phase concerns:

  1. 5-minute timeout may be too aggressive for large namespace teardown (200+ test scenarios' resources). Kubernetes pod termination grace period + Helm chart uninstall can exceed this window.
  2. Error suppression with || true hides legitimate failures (failed helm uninstall, namespace deletion errors). Consider logging failures instead.
  3. No polling for namespace deletion. The job may exit before cleanup completes, leaving resources orphaned until the 2-hour TTL cleanup kicks in.
  4. .kubeconfig file not explicitly cleaned up after the job, creating potential for accumulation or leaks across runs.

Consider:

  • Increasing cleanup timeout to 10-15 minutes or making it proportional to the number of resources
  • Adding explicit error logging: helm uninstall ... || echo "Helm uninstall failed: $?"
  • Polling for namespace deletion: kubectl wait --for=delete namespace/$NAMESPACE --timeout=10m (kubectl wait has no "terminating" condition; --for=delete is the supported form)
  • Explicitly cleaning up .kubeconfig: rm -f .kubeconfig
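
Putting those together, a hedged cleanup sketch:

```bash
# Log failures instead of swallowing them, then wait for actual deletion
helm uninstall dynamo-platform -n "$NAMESPACE" || echo "helm uninstall failed: $?"
kubectl delete namespace "$NAMESPACE" --wait=false || echo "namespace delete failed: $?"
kubectl wait --for=delete "namespace/$NAMESPACE" --timeout=10m \
  || echo "namespace $NAMESPACE still terminating; the 2-hour TTL cleanup will reap it"
rm -f .kubeconfig
```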

Comment on lines 1 to 609
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

name: Weekly Fault Tolerance Tests

on:
  schedule:
    # Run every Sunday at 5:00 PM PST (1:00 AM UTC Monday)
    # Cron syntax: minute hour day-of-month month day-of-week
    # Note: During PDT (daylight saving), this will run at 5:00 PM PDT (12:00 AM UTC Monday)
    - cron: '0 1 * * 1'

  # Allow manual triggering for testing
  workflow_dispatch:
    inputs:
      test_scenarios:
        description: 'Test scenarios to run (comma-separated or "all")'
        required: false
        default: 'all'
        type: string

concurrency:
  group: ${{ github.workflow }}-weekly-${{ github.ref_name || github.run_id }}
  cancel-in-progress: false

jobs:
  # Check if we should run (skip if no changes in last 24h for scheduled runs)
  should-run:
    runs-on: ubuntu-latest
    outputs:
      run_tests: ${{ steps.check.outputs.run_tests }}
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
        with:
          fetch-depth: 0

      - name: Check for recent activity
        id: check
        run: |
          # Always run if manually triggered
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            echo "run_tests=true" >> $GITHUB_OUTPUT
            echo "Manual trigger - running tests"
            exit 0
          fi

          # For scheduled runs, check if there were commits in last 24 hours
          COMMITS_LAST_24H=$(git log --since="24 hours ago" --oneline | wc -l)
          if [ "$COMMITS_LAST_24H" -gt 0 ]; then
            echo "run_tests=true" >> $GITHUB_OUTPUT
            echo "Found $COMMITS_LAST_24H commits in last 24 hours - running tests"
          else
            echo "run_tests=false" >> $GITHUB_OUTPUT
            echo "No commits in last 24 hours - skipping tests"
          fi

  operator:
    needs: should-run
    if: needs.should-run.outputs.run_tests == 'true'
    strategy:
      fail-fast: false
      matrix:
        platform:
          - { arch: amd64, runner: cpu-amd-m5-2xlarge }
    name: operator (${{ matrix.platform.arch }})
    runs-on: ${{ matrix.platform.runner }}
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          driver: docker
      - name: Install awscli
        shell: bash
        run: |
          curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
          unzip awscliv2.zip
          sudo ./aws/install
      - name: Login to ECR
        shell: bash
        env:
          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
        run: |
          aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME}
      - name: Build Container
        id: build-image
        shell: bash
        env:
          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
        run: |
          cd deploy/cloud/operator
          docker buildx build --load \
            --platform linux/${{ matrix.platform.arch }} \
            --build-arg DOCKER_PROXY=${ECR_HOSTNAME}/dockerhub/ \
            -f Dockerfile \
            -t dynamo-operator:latest .
      - name: Docker Tag and Push
        uses: ./.github/actions/docker-tag-push
        with:
          local_image: dynamo-operator:latest
          push_tag: ai-dynamo/dynamo:${{ github.sha }}-operator-${{ matrix.platform.arch }}
          aws_push: 'false'
          azure_push: 'true'
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }}
          azure_acr_user: ${{ secrets.AZURE_ACR_USER }}
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  vllm:
    needs: should-run
    if: needs.should-run.outputs.run_tests == 'true'
    strategy:
      fail-fast: false
      matrix:
        platform:
          - { arch: amd64, runner: gpu-l40-amd64 }
    name: vllm (${{ matrix.platform.arch }})
    runs-on: ${{ matrix.platform.runner }}
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
      - name: Build Container
        id: build-image
        uses: ./.github/actions/docker-build
        with:
          framework: vllm
          target: runtime
          platform: 'linux/${{ matrix.platform.arch }}'
          ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
          ci_token: ${{ secrets.CI_TOKEN }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }}
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Docker Tag and Push
        uses: ./.github/actions/docker-tag-push
        with:
          local_image: ${{ steps.build-image.outputs.image_tag }}
          push_tag: ai-dynamo/dynamo:${{ github.sha }}-vllm-${{ matrix.platform.arch }}
          aws_push: 'false'
          azure_push: 'true'
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }}
          azure_acr_user: ${{ secrets.AZURE_ACR_USER }}
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  trtllm:
    needs: should-run
    if: needs.should-run.outputs.run_tests == 'true'
    strategy:
      fail-fast: false
      matrix:
        platform:
          - { arch: amd64, runner: gpu-l40-amd64 }
    name: trtllm (${{ matrix.platform.arch }})
    runs-on: ${{ matrix.platform.runner }}
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
      - name: Build Container
        id: build-image
        uses: ./.github/actions/docker-build
        with:
          framework: trtllm
          target: runtime
          platform: 'linux/${{ matrix.platform.arch }}'
          ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
          ci_token: ${{ secrets.CI_TOKEN }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }}
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Docker Tag and Push
        uses: ./.github/actions/docker-tag-push
        with:
          local_image: ${{ steps.build-image.outputs.image_tag }}
          push_tag: ai-dynamo/dynamo:${{ github.sha }}-trtllm-${{ matrix.platform.arch }}
          aws_push: 'false'
          azure_push: 'true'
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }}
          azure_acr_user: ${{ secrets.AZURE_ACR_USER }}
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  sglang:
    needs: should-run
    if: needs.should-run.outputs.run_tests == 'true'
    runs-on: gpu-l40-amd64
    name: sglang (amd64)
    steps:
      - name: Checkout repository
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
      - name: Build Container
        id: build-image
        uses: ./.github/actions/docker-build
        with:
          framework: sglang
          target: runtime
          platform: 'linux/amd64'
          ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
          ci_token: ${{ secrets.CI_TOKEN }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }}
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Docker Tag and Push
        uses: ./.github/actions/docker-tag-push
        with:
          local_image: ${{ steps.build-image.outputs.image_tag }}
          push_tag: ai-dynamo/dynamo:${{ github.sha }}-sglang-amd64
          aws_push: 'false'
          azure_push: 'true'
          aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }}
          aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }}
          azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }}
          azure_acr_user: ${{ secrets.AZURE_ACR_USER }}
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  deploy-test-fault-tolerance:
    runs-on: cpu-amd-m5-2xlarge
    needs: [should-run, operator, vllm, trtllm, sglang]
    if: needs.should-run.outputs.run_tests == 'true'
    permissions:
      contents: read
    strategy:
      fail-fast: false
      matrix:
        test_scenario:
          # SGLang scenarios
          - sglang-agg-tp-1-dp-1-decode_worker
          - sglang-agg-tp-1-dp-1-decode_worker_pod
          - sglang-agg-tp-1-dp-1-frontend
          - sglang-agg-tp-1-dp-1-frontend_pod
          - sglang-agg-tp-1-dp-1-none
          - sglang-agg-tp-1-dp-1-sglang_decode_detokenizer
          - sglang-agg-tp-1-dp-1-sglang_decode_scheduler
          - sglang-agg-tp-1-dp-2-decode_worker
          - sglang-agg-tp-1-dp-2-decode_worker_pod
          - sglang-agg-tp-1-dp-2-frontend
          - sglang-agg-tp-1-dp-2-frontend_pod
          - sglang-agg-tp-1-dp-2-none
          - sglang-agg-tp-1-dp-2-sglang_decode_detokenizer
          - sglang-agg-tp-1-dp-2-sglang_decode_scheduler
          - sglang-agg-tp-2-dp-1-decode_worker
          - sglang-agg-tp-2-dp-1-decode_worker_pod
          - sglang-agg-tp-2-dp-1-frontend
          - sglang-agg-tp-2-dp-1-frontend_pod
          - sglang-agg-tp-2-dp-1-none
          - sglang-agg-tp-2-dp-1-sglang_decode_detokenizer
          - sglang-agg-tp-2-dp-1-sglang_decode_scheduler
          - sglang-agg-tp-4-dp-1-decode_worker
          - sglang-agg-tp-4-dp-1-decode_worker_pod
          - sglang-agg-tp-4-dp-1-frontend
          - sglang-agg-tp-4-dp-1-frontend_pod
          - sglang-agg-tp-4-dp-1-none
          - sglang-agg-tp-4-dp-1-sglang_decode_detokenizer
          - sglang-agg-tp-4-dp-1-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-none
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_decode_detokenizer
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_prefill_detokenizer
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_prefill_scheduler
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-none
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_decode_detokenizer
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_prefill_detokenizer
          - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_prefill_scheduler
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-none
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_decode_detokenizer
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_prefill_detokenizer
          - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_prefill_scheduler
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-none
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_decode_detokenizer
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_decode_scheduler
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_prefill_detokenizer
          - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_prefill_scheduler
          # TensorRT-LLM scenarios
          - trtllm-agg-tp-1-dp-1-decode_worker
          - trtllm-agg-tp-1-dp-1-decode_worker_pod
          - trtllm-agg-tp-1-dp-1-frontend
          - trtllm-agg-tp-1-dp-1-frontend_pod
          - trtllm-agg-tp-1-dp-1-none
          - trtllm-agg-tp-1-dp-1-trtllm_decode_engine_core
          - trtllm-agg-tp-1-dp-2-decode_worker
          - trtllm-agg-tp-1-dp-2-decode_worker_pod
          - trtllm-agg-tp-1-dp-2-frontend
          - trtllm-agg-tp-1-dp-2-frontend_pod
          - trtllm-agg-tp-1-dp-2-none
          - trtllm-agg-tp-1-dp-2-trtllm_decode_engine_core
          - trtllm-agg-tp-2-dp-1-decode_worker
          - trtllm-agg-tp-2-dp-1-decode_worker_pod
          - trtllm-agg-tp-2-dp-1-frontend
          - trtllm-agg-tp-2-dp-1-frontend_pod
          - trtllm-agg-tp-2-dp-1-none
          - trtllm-agg-tp-2-dp-1-trtllm_decode_engine_core
          - trtllm-agg-tp-4-dp-1-decode_worker
          - trtllm-agg-tp-4-dp-1-decode_worker_pod
          - trtllm-agg-tp-4-dp-1-frontend
          - trtllm-agg-tp-4-dp-1-frontend_pod
          - trtllm-agg-tp-4-dp-1-none
          - trtllm-agg-tp-4-dp-1-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-none
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-trtllm_prefill_engine_core
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-none
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-trtllm_prefill_engine_core
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-none
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-trtllm_prefill_engine_core
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-none
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-trtllm_decode_engine_core
          - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-trtllm_prefill_engine_core
          # vLLM scenarios
          - vllm-agg-tp-1-dp-1-decode_worker
          - vllm-agg-tp-1-dp-1-decode_worker_pod
          - vllm-agg-tp-1-dp-1-frontend
          - vllm-agg-tp-1-dp-1-frontend_pod
          - vllm-agg-tp-1-dp-1-none
          - vllm-agg-tp-1-dp-1-vllm_decode_engine_core
          - vllm-agg-tp-1-dp-2-decode_worker
          - vllm-agg-tp-1-dp-2-decode_worker_pod
          - vllm-agg-tp-1-dp-2-frontend
          - vllm-agg-tp-1-dp-2-frontend_pod
          - vllm-agg-tp-1-dp-2-none
          - vllm-agg-tp-1-dp-2-vllm_decode_engine_core
          - vllm-agg-tp-2-dp-1-decode_worker
          - vllm-agg-tp-2-dp-1-decode_worker_pod
          - vllm-agg-tp-2-dp-1-frontend
          - vllm-agg-tp-2-dp-1-frontend_pod
          - vllm-agg-tp-2-dp-1-none
          - vllm-agg-tp-2-dp-1-vllm_decode_engine_core
          - vllm-agg-tp-4-dp-1-decode_worker
          - vllm-agg-tp-4-dp-1-decode_worker_pod
          - vllm-agg-tp-4-dp-1-frontend
          - vllm-agg-tp-4-dp-1-frontend_pod
          - vllm-agg-tp-4-dp-1-none
          - vllm-agg-tp-4-dp-1-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-none
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-vllm_prefill_engine_core
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-none
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-vllm_prefill_engine_core
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-none
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-vllm_prefill_engine_core
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-none
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-vllm_decode_engine_core
          - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-vllm_prefill_engine_core
    name: deploy-test-fault-tolerance (${{ matrix.test_scenario }})
    env:
      DYNAMO_INGRESS_SUFFIX: dev.aire.nvidia.com
    steps:
      - name: Checkout code
        uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0

      - name: Install awscli
        shell: bash
        run: |
          curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
          unzip awscliv2.zip
          sudo ./aws/install

      - name: Login to ECR
        shell: bash
        env:
          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
        run: |
          aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME}

      - name: Set namespace and install dependencies
        run: |
          # Extract framework from test scenario for unique namespace
          FRAMEWORK=$(echo "${{ matrix.test_scenario }}" | cut -d'-' -f1)
          # Create unique namespace per matrix job with weekly prefix
          echo "NAMESPACE=gh-weekly-${{ github.run_id }}-ft-${FRAMEWORK}" >> $GITHUB_ENV
          set -x
          # Install dependencies
          sudo apt-get update && sudo apt-get install -y curl bash openssl gettext git jq python3 python3-pip python3-venv

          # Install yq
          echo "Installing yq..."
          curl -L https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -o yq
          sudo chmod 755 yq
          sudo mv yq /usr/local/bin/
          # Install Helm
          echo "Installing Helm..."
          curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
          sudo chmod 700 get_helm.sh
          sudo ./get_helm.sh
          # Install kubectl
          curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
          sudo chmod 755 kubectl
          sudo mv kubectl /usr/local/bin/

          # Setup kubeconfig
          echo "${{ secrets.AZURE_AKS_CI_KUBECONFIG_B64 }}" | base64 -d > .kubeconfig
          chmod 600 .kubeconfig
          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE --kubeconfig "${KUBECONFIG}"
          kubectl config current-context

      - name: Deploy Operator
        run: |
          set -x
          export KUBECONFIG=$(pwd)/.kubeconfig

          # Create a namespace for this job
          echo "Creating an ephemeral namespace..."
          kubectl delete namespace $NAMESPACE || true
          kubectl create namespace $NAMESPACE || true
          echo "Attaching the labels for secrets and cleanup"
          kubectl label namespaces ${NAMESPACE} nscleanup/enabled=true nscleanup/ttl=7200 gitlab-imagepull=enabled ngc-api=enabled nvcr-imagepull=enabled --overwrite=true

          # Set the namespace as default
          kubectl config set-context --current --namespace=$NAMESPACE

          # Check if Istio is installed
          kubectl get pods -n istio-system
          # Check if default storage class exists
          kubectl get storageclass

          # Install Helm chart
          export IMAGE_TAG=$(cat build.env)
          echo $IMAGE_TAG
          export VIRTUAL_ENV=/opt/dynamo/venv
          export KUBE_NS=$NAMESPACE
          export ISTIO_ENABLED=true
          export ISTIO_GATEWAY=istio-system/ingress-alb
          export VIRTUAL_SERVICE_SUPPORTS_HTTPS=true
          export DYNAMO_CLOUD=https://${NAMESPACE}.${DYNAMO_INGRESS_SUFFIX}

          # Install dynamo env secrets
          kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${{ secrets.HF_TOKEN }} -n $KUBE_NS || true
          # Create docker pull secret for operator image
          kubectl create secret docker-registry docker-imagepullsecret --docker-server=${{ secrets.AZURE_ACR_HOSTNAME }} --docker-username=${{ secrets.AZURE_ACR_USER }} --docker-password=${{ secrets.AZURE_ACR_PASSWORD }} --namespace=${NAMESPACE}
          # Install helm dependencies
          helm repo add bitnami https://charts.bitnami.com/bitnami
          cd deploy/cloud/helm/platform/
          helm dep build .
          # Install platform with namespace restriction for single profile testing
          helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \
            --set dynamo-operator.namespaceRestriction.enabled=true \
            --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \
            --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \
            --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \
            --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret
          # Wait for all deployments to be ready
          timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch
          cd -

          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE

      - name: Run Fault Tolerance Tests
        run: |
          set -x
          export KUBECONFIG=$(pwd)/.kubeconfig
          export NAMESPACE=$NAMESPACE

          # Extract framework from test scenario
          FRAMEWORK=$(echo "${{ matrix.test_scenario }}" | cut -d'-' -f1)
          export IMAGE="${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo:${{ github.sha }}-${FRAMEWORK}-amd64"

          # Set up Python virtual environment and install dependencies
          python3 -m venv venv
          source venv/bin/activate
          pip install --upgrade pip

          # Install core dependencies needed for tests (without full project install)
          pip install -r container/deps/requirements.test.txt
          pip install kubernetes==32.0.1 kubernetes_asyncio kr8s pyyaml requests tabulate pydantic

          # Add project source to PYTHONPATH for test imports
          export PYTHONPATH=$(pwd):$(pwd)/components/src:$PYTHONPATH

          echo "Running weekly fault tolerance test: ${{ matrix.test_scenario }}"
          echo "Using namespace: $NAMESPACE"
          echo "Using image: $IMAGE"

          # Run the pytest command
          pytest tests/fault_tolerance/deploy/test_deployment.py::test_fault_scenario[${{ matrix.test_scenario }}] \
            -s -v \
            --namespace ${NAMESPACE} \
            --image ${IMAGE} \
            --client-type legacy

      - name: Cleanup
        if: always()
        timeout-minutes: 5
        run: |
          echo "${{ secrets.AZURE_AKS_CI_KUBECONFIG_B64 }}" | base64 -d > .kubeconfig
          chmod 600 .kubeconfig
          export KUBECONFIG=$(pwd)/.kubeconfig
          kubectl config set-context --current --namespace=$NAMESPACE --kubeconfig "${KUBECONFIG}"

          # For debugging purposes, list all the resources before we uninstall
          kubectl get all

          echo "Deleting all DynamoGraphDeployments in namespace $NAMESPACE..."
          kubectl delete dynamographdeployments --all -n $NAMESPACE || true

          # Uninstall the helm chart
          helm ls
          helm uninstall dynamo-platform || true

          echo "Namespace $NAMESPACE deletion initiated, proceeding with cleanup..."
          kubectl delete namespace $NAMESPACE || true
          echo "Namespace $NAMESPACE cleanup completed."

  # Status check job to verify all tests passed
  weekly-status-check:
    runs-on: ubuntu-latest
    needs: [deploy-test-fault-tolerance]
    if: always()
    steps:
      - name: Check all tests passed
        run: |
          echo '${{ toJson(needs) }}' | jq -e 'to_entries | map(.value.result) | all(. as $result | ["success", "skipped"] | any($result == .))'

      - name: Send notification on failure
        if: failure()
        run: |
          echo "Weekly fault tolerance tests failed!"
          echo "Workflow run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          # Add notification logic here (e.g., Slack, email, etc.)
```
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Address static analysis warnings: runner labels and action metadata.

Actionlint reports two issues:

  1. Custom runner labels not recognized (lines 67, 119, 159, 195, 228): gpu-l40-amd64, cpu-amd-m5-2xlarge

    • These are self-hosted runner labels and should be configured in your actionlint.yaml file to suppress false positives.
  2. Missing name field in .github/actions/docker-tag-push/action.yml (line 100)

    • The custom action lacks a required name field in its metadata. Add it to the action's metadata file.

🌐 Web query:

What is the required format for the "name" field in GitHub Actions custom action metadata (action.yml)?

💡 Result:

The name field must be a string (required). Example: name: 'My Action'. (docs.github.com)


🏁 Script executed:

```bash
#!/bin/bash
# Check if docker-tag-push action exists and inspect its content
find . -name "action.yml" -o -name "action.yaml" | grep docker-tag-push
if [ -f ".github/actions/docker-tag-push/action.yml" ]; then
  echo "=== docker-tag-push/action.yml ==="
  head -20 .github/actions/docker-tag-push/action.yml
fi

# Check for actionlint config
echo ""
echo "=== Looking for actionlint configuration ==="
find . -name "actionlint.yaml" -o -name ".actionlintrc*" 2>/dev/null
```

Length of output: 601


🏁 Script executed:

```bash
#!/bin/bash
# Search for runner labels used in the workflow to verify they're defined
echo "=== Custom runner labels used in workflow ==="
grep -E "runs-on:|runner:" .github/workflows/weekly-fault-tolerance.yml | sort | uniq

# Check for self-hosted runner definitions in repo
echo ""
echo "=== Searching for runner configuration files ==="
find . -name "*.yml" -o -name "*.yaml" | xargs grep -l "self-hosted" 2>/dev/null | head -10
```

Length of output: 331


Add required "name" field to .github/actions/docker-tag-push/action.yml and configure actionlint for custom runner labels.

The docker-tag-push action metadata is missing the required name field. Add this to the top of .github/actions/docker-tag-push/action.yml:

name: 'Tag and Push Docker Images'
description: 'Tag and Push Docker Images'

Additionally, custom runner labels (gpu-l40-amd64, cpu-amd-m5-2xlarge) trigger actionlint warnings because they are not recognized as built-in GitHub runners. Create .github/actionlint.yaml to configure these self-hosted runner labels:

self-hosted-runner:
  - cpu-amd-m5-2xlarge
  - gpu-l40-amd64

🤖 Prompt for AI Agents
In .github/workflows/weekly-fault-tolerance.yml lines 1-609: the workflow
references a local action .github/actions/docker-tag-push whose action.yml is
missing the required top-level name field (causing actionlint/metadata issues)
and the workflow uses custom self-hosted runner labels that actionlint flags as
unknown; fix by adding a name and description to
.github/actions/docker-tag-push/action.yml (e.g., "Tag and Push Docker Images"
and a short description) and add a new file .github/actionlint.yaml listing the
custom self-hosted runner labels (cpu-amd-m5-2xlarge, gpu-l40-amd64) under a
self-hosted-runner key so actionlint recognizes them.

@indrajit96 indrajit96 changed the title ci : Add Weekly CI tests for full FT test suite ci: Add Weekly CI tests for full FT test suite Oct 23, 2025
@github-actions github-actions bot added the ci Issues/PRs that reference CI build/test label Oct 23, 2025
```yaml
    strategy:
      fail-fast: false
      matrix:
        test_scenario:
```
Contributor

Is there an automatic way to capture all the test cases? Otherwise we have to remember to change this file every time we add a test case.
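
One possibility (a hedged sketch; job and step names are illustrative, and it assumes pytest can collect these tests without the full runtime dependencies) is to generate the matrix from pytest collection:

```yaml
collect-scenarios:
  runs-on: ubuntu-latest
  outputs:
    scenarios: ${{ steps.collect.outputs.scenarios }}
  steps:
    - uses: actions/checkout@v4
    - id: collect
      run: |
        pip install pytest
        # Collect test IDs and strip them down to the bracketed scenario names
        SCENARIOS=$(pytest tests/fault_tolerance/deploy/test_deployment.py \
          --collect-only -q | sed -n 's/.*\[\(.*\)\]$/\1/p' | jq -R . | jq -cs .)
        echo "scenarios=$SCENARIOS" >> "$GITHUB_OUTPUT"

deploy-test-fault-tolerance:
  needs: collect-scenarios
  strategy:
    fail-fast: false
    matrix:
      test_scenario: ${{ fromJson(needs.collect-scenarios.outputs.scenarios) }}
```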

```yaml
          azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }}

  deploy-test-fault-tolerance:
    runs-on: cpu-amd-m5-2xlarge
```
Contributor

How many GPUs will be running these FT tests?
Will they be running in parallel?
As we add more and more tests, will we need to change runs-on: cpu-amd-m5-2xlarge?

```bash
# Install dynamo env secrets
kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${{ secrets.HF_TOKEN }} -n $KUBE_NS || true
# Create docker pull secret for operator image
kubectl create secret docker-registry docker-imagepullsecret --docker-server=${{ secrets.AZURE_ACR_HOSTNAME }} --docker-username=${{ secrets.AZURE_ACR_USER }} --docker-password=${{ secrets.AZURE_ACR_PASSWORD }} --namespace=${NAMESPACE}
```
Contributor

FYI, later for MoE and elastic EP we might need to run on an environment other than Azure.
