ci: Add Weekly CI tests for full FT test suite #3853
base: main
Conversation
Signed-off-by: Indrajit Bhosale <[email protected]>
Walkthrough: A new GitHub Actions workflow for weekly fault-tolerance testing has been added. It orchestrates parallel container image builds for multiple inference engines, deploys them to Kubernetes via Helm, and executes comprehensive fault-tolerance test matrices across various configurations with automated cleanup.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes. The workflow contains substantial orchestration logic with multiple interdependent jobs, parallel execution paths, cloud registry integrations, Kubernetes operations, extensive test matrices, and intricate error handling flows. Each job phase and matrix combination requires careful verification of correctness and resource management.
Pre-merge checks: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (5)
.github/workflows/weekly-fault-tolerance.yml (5)
38-56: Consider a more robust approach for commit-based scheduling.
The `git log --since="24 hours ago"` check is timezone-dependent and may behave unpredictably across different runner environments. Additionally, the `test_scenarios` input defined in `workflow_dispatch` (lines 16-19) is never consumed by the workflow, so manual runs cannot selectively run specific test scenarios. Consider:
- Using a fixed, UTC-based timestamp instead of relative time (see the sketch after this list)
- Storing the last run timestamp and comparing against it
- Consuming the `test_scenarios` input to filter the matrix at runtime (if selective testing is needed)
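A minimal sketch of the UTC-anchored variant, assuming GNU `date` on an Ubuntu runner; the output names mirror the existing step and the 24-hour window is kept from the current workflow:

```bash
# Sketch: decide whether to run based on commits since an explicit UTC cutoff.
set -euo pipefail

if [ "${GITHUB_EVENT_NAME}" = "workflow_dispatch" ]; then
  echo "run_tests=true" >> "$GITHUB_OUTPUT"
  echo "Manual trigger - running tests"
  exit 0
fi

# Compute the cutoff explicitly in UTC so the check does not depend on the runner's timezone.
CUTOFF="$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)"
COMMITS=$(git log --since="$CUTOFF" --oneline | wc -l)
echo "Found $COMMITS commits since $CUTOFF (UTC)"

if [ "$COMMITS" -gt 0 ]; then
  echo "run_tests=true" >> "$GITHUB_OUTPUT"
else
  echo "run_tests=false" >> "$GITHUB_OUTPUT"
fi
```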
75-86: Remove unused AWS ECR login from the operator job.
The operator job installs the AWS CLI and logs into ECR (lines 75-86) but only pushes to Azure ACR (line 105: `azure_push: 'true'`, `aws_push: 'false'`). This setup is redundant. Remove the unused AWS setup:
```diff
-      - name: Install awscli
-        shell: bash
-        run: |
-          curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
-          unzip awscliv2.zip
-          sudo ./aws/install
-      - name: Login to ECR
-        shell: bash
-        env:
-          ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com
-        run: |
-          aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME}
```
192-225: Align sglang build job structure with vllm and trtllm.
The sglang job lacks a `strategy.matrix` definition present in the vllm and trtllm jobs. For consistency and future extensibility, consider refactoring sglang to use the same pattern. Apply this diff to align the structure:
```diff
   sglang:
     needs: should-run
     if: needs.should-run.outputs.run_tests == 'true'
+    strategy:
+      fail-fast: false
+      matrix:
+        platform:
+          - { arch: amd64, runner: gpu-l40-amd64 }
-    runs-on: gpu-l40-amd64
+    runs-on: ${{ matrix.platform.runner }}
     name: sglang (amd64)
     steps:
       - name: Checkout repository
         uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
       - name: Build Container
         id: build-image
         uses: ./.github/actions/docker-build
         with:
           framework: sglang
           target: runtime
-          platform: 'linux/amd64'
+          platform: 'linux/${{ matrix.platform.arch }}'
           ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }}
           ...
           push_tag: ai-dynamo/dynamo:${{ github.sha }}-sglang-amd64
           ...
```
599-608: Add context to failure notifications.
The status check job aggregates results but provides minimal context:
- The jq command checks all tests passed but doesn't identify which specific tests failed
- Notification logic is a placeholder with no actual implementation
When implementing the notification logic, include:
- List of failed test scenarios (parse matrix and check which jobs failed)
- Link to the failed job logs
- Summary of failure types (if available)
Example improvement:
```yaml
      - name: Identify failed tests
        if: failure()
        run: |
          echo "Failed jobs:"
          echo '${{ toJson(needs) }}' | jq -r 'to_entries[] | select(.value.result != "success") | "\(.key): \(.value.result)"'
```
227-431: Plan for operational scalability and resource management.
With ~200 concurrent test jobs in the matrix (lines 236-431), consider:
- Kubernetes resource quotas: namespaces created per job lack explicit CPU/memory quotas. Add limits to prevent resource starvation, e.g. `kubectl create resourcequota job-quota --hard=requests.cpu=4,requests.memory=8Gi -n $NAMESPACE` (a fuller sketch follows this list).
- GitHub Actions concurrency: 200+ concurrent jobs may exceed your account's runner capacity. Verify with your GitHub plan.
- Test artifact retention: no mechanism for capturing test logs, reports, or debugging information. Consider:
  - Uploading failed test logs to artifact storage
  - Generating a summary report per test scenario
- Cluster capacity: ensure the Azure AKS cluster is sized to handle 200+ pods spawning simultaneously, or implement staggered job execution.
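A fuller sketch of per-namespace guardrails; the quota values and default container limits below are placeholders to tune against the actual pod sizes, not figures from the PR:

```bash
# Sketch: quota plus default container limits for each ephemeral test namespace.
kubectl create resourcequota ft-job-quota -n "$NAMESPACE" \
  --hard=requests.cpu=4,requests.memory=8Gi,limits.cpu=8,limits.memory=16Gi,pods=20

# Default requests/limits so pods without explicit resources still count against the quota sensibly.
cat <<'EOF' | kubectl apply -n "$NAMESPACE" -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: ft-job-defaults
spec:
  limits:
    - type: Container
      default:
        cpu: "1"
        memory: 2Gi
      defaultRequest:
        cpu: 250m
        memory: 512Mi
EOF
```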
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/weekly-fault-tolerance.yml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/weekly-fault-tolerance.yml
100-100: name is required in action metadata "/home/jailuser/git/.github/actions/docker-tag-push/action.yml"
(action)
195-195: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
228-228: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang
- GitHub Check: operator (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
.github/workflows/weekly-fault-tolerance.yml (5)
8-11: Verify cron schedule and documentation accuracy.
The cron expression `0 1 * * 1` fires at 1:00 AM UTC on Monday, which matches the documented Sunday 5:00 PM PST; during daylight saving time, however, it corresponds to Sunday 6:00 PM PDT, not 5:00 PM PDT as the inline comment states. Confirm this is intentional.
453-483: Harden dependency installation and add validation for extracted variables.
Several robustness concerns in the setup phase:
- Tool installations (yq, Helm, kubectl) lack error checking and could silently fail
- `apt-get update` has no timeout and could hang indefinitely
- FRAMEWORK extraction via `cut -d'-' -f1` assumes all scenario names follow the `{framework}-...` format but has no validation
- The kubeconfig file (`.kubeconfig`) persists after the job and could accumulate across multiple runs
Consider adding (a sketch follows the list):
- Error checks after each tool installation (e.g., `set -e` or explicit checks)
- A timeout for `apt-get update`
- Validation that FRAMEWORK is non-empty and matches expected values (sglang, trtllm, vllm)
- Explicit cleanup of the `.kubeconfig` file in the cleanup phase, or use of a `/tmp` directory
485-535: Verify deployment prerequisites and add error handling.
Several concerns with the operator deployment phase:
- `IMAGE_TAG` is sourced from `build.env` (line 506), but this file is not created by any prior step; is it committed to the repository?
- Istio validation (line 501) has no error handling; the job will fail if Istio is not installed
- Namespace cleanup uses `|| true`, which suppresses errors and could leave orphaned resources
- `timeout 300s` for the kubectl rollout (line 531) may be insufficient for large deployments or heavily loaded clusters
- `nscleanup/ttl=7200` sets a 2-hour cleanup window, but if a job runs longer, resources may persist
Please verify (a sketch of explicit checks follows):
- Whether `build.env` is committed to the repository or needs to be created by a prior step
- Whether Istio is a hard requirement; if so, add explicit error handling
- Whether 300 seconds is adequate for rollout in your cluster's typical load conditions
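A sketch of explicit prerequisite checks for this phase, assuming `build.env` is expected to exist at the repository root and Istio is a hard requirement:

```bash
# Sketch: surface missing prerequisites with clear errors instead of failing in later steps.
set -euo pipefail
export KUBECONFIG="$(pwd)/.kubeconfig"

if ! kubectl get namespace istio-system >/dev/null 2>&1; then
  echo "ERROR: istio-system namespace not found; Istio is required for ingress" >&2
  exit 1
fi

if [ ! -f build.env ]; then
  echo "ERROR: build.env not found; commit it or generate it in an earlier step" >&2
  exit 1
fi
IMAGE_TAG="$(cat build.env)"
[ -n "$IMAGE_TAG" ] || { echo "ERROR: IMAGE_TAG is empty" >&2; exit 1; }

# Wait on each deployment individually with a more generous budget, dumping pod state on failure.
for d in $(kubectl get deployments -n "$NAMESPACE" -o name); do
  if ! kubectl rollout status "$d" -n "$NAMESPACE" --timeout=600s; then
    kubectl get pods -n "$NAMESPACE" -o wide
    exit 1
  fi
done
```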
537-568: Validate test scenario naming and add safety checks for test execution.
Several concerns:
- The pytest parametrization assumes test scenario names match exactly: `test_fault_scenario[${{ matrix.test_scenario }}]`. If a scenario name is malformed, pytest will fail with "not found" instead of a clear error.
- Using `-s` (no output capture) in pytest may leak secrets to logs if tests print environment variables.
- PYTHONPATH includes `$(pwd)/components/src`, but this directory's existence is not validated.
- There is no explicit error handling for venv creation failures.
Consider (a sketch follows):
- Adding validation to ensure scenario names conform to an expected pattern
- Using output redaction in pytest or ensuring tests don't print sensitive data
- Validating that PYTHONPATH directories exist before test execution
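A sketch of pre-flight validation before the pytest invocation; the scenario-name regex is an assumption derived from the names in this matrix, and the snippet is meant to live inside the existing `run:` step:

```bash
# Sketch: validate the matrix-provided scenario name and PYTHONPATH entries before running pytest.
set -euo pipefail
SCENARIO="${{ matrix.test_scenario }}"  # GitHub Actions templating, e.g. vllm-agg-tp-1-dp-1-frontend

if ! echo "$SCENARIO" | grep -Eq '^(sglang|trtllm|vllm)-(agg|disagg)-'; then
  echo "ERROR: unexpected test scenario name: $SCENARIO" >&2
  exit 1
fi
FRAMEWORK="${SCENARIO%%-*}"

# Confirm the directories added to PYTHONPATH actually exist.
for dir in "$(pwd)" "$(pwd)/components/src"; do
  [ -d "$dir" ] || { echo "ERROR: missing PYTHONPATH entry: $dir" >&2; exit 1; }
done
export PYTHONPATH="$(pwd):$(pwd)/components/src:${PYTHONPATH:-}"

# Drop -s so pytest captures output, reducing the chance of secrets leaking into logs.
pytest "tests/fault_tolerance/deploy/test_deployment.py::test_fault_scenario[${SCENARIO}]" \
  -v --namespace "${NAMESPACE}" --image "${IMAGE}" --client-type legacy
```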
570-591: Improve cleanup robustness and error visibility.
Cleanup phase concerns:
- The 5-minute timeout may be too aggressive for large namespace teardown (200+ test scenarios' resources). Kubernetes pod termination grace periods plus Helm chart uninstall can exceed this window.
- Error suppression with `|| true` hides legitimate failures (failed helm uninstall, namespace deletion errors). Consider logging failures instead.
- There is no polling for namespace deletion. The job may exit before cleanup completes, leaving resources orphaned until the 2-hour TTL cleanup kicks in.
- The `.kubeconfig` file is not explicitly cleaned up after the job, creating potential for accumulation or leaks across runs.
Consider (a sketch follows):
- Increasing the cleanup timeout to 10-15 minutes, or making it proportional to the number of resources
- Adding explicit error logging: `helm uninstall ... || echo "Helm uninstall failed: $?"`
- Polling for namespace deletion: `kubectl wait --for=delete namespace/$NAMESPACE --timeout=10m` or similar
- Explicitly cleaning up `.kubeconfig`: `rm -f .kubeconfig`
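A sketch of a more explicit cleanup along these lines, assuming a 10-minute wait budget is acceptable for this cluster:

```bash
# Sketch: cleanup that logs failures, waits for namespace deletion, and removes the kubeconfig.
export KUBECONFIG="$(pwd)/.kubeconfig"

kubectl delete dynamographdeployments --all -n "$NAMESPACE" --timeout=5m \
  || echo "WARNING: failed to delete DynamoGraphDeployments in $NAMESPACE"

helm uninstall dynamo-platform -n "$NAMESPACE" \
  || echo "WARNING: helm uninstall failed with exit code $?"

kubectl delete namespace "$NAMESPACE" --wait=false \
  || echo "WARNING: namespace deletion request failed"

# Poll until the namespace is actually gone (bounded), instead of exiting immediately.
kubectl wait --for=delete "namespace/$NAMESPACE" --timeout=10m \
  || echo "WARNING: namespace $NAMESPACE still terminating after 10m; TTL cleanup will reclaim it"

# Avoid leaving credentials on the runner between jobs.
rm -f .kubeconfig
```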
| # SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| name: Weekly Fault Tolerance Tests | ||
|
|
||
| on: | ||
| schedule: | ||
| # Run every Sunday at 5:00 PM PST (1:00 AM UTC Monday) | ||
| # Cron syntax: minute hour day-of-month month day-of-week | ||
| # Note: During PDT (daylight saving), this will run at 5:00 PM PDT (12:00 AM UTC Monday) | ||
| - cron: '0 1 * * 1' | ||
|
|
||
| # Allow manual triggering for testing | ||
| workflow_dispatch: | ||
| inputs: | ||
| test_scenarios: | ||
| description: 'Test scenarios to run (comma-separated or "all")' | ||
| required: false | ||
| default: 'all' | ||
| type: string | ||
|
|
||
| concurrency: | ||
| group: ${{ github.workflow }}-weekly-${{ github.ref_name || github.run_id }} | ||
| cancel-in-progress: false | ||
|
|
||
| jobs: | ||
| # Check if we should run (skip if no changes in last 24h for scheduled runs) | ||
| should-run: | ||
| runs-on: ubuntu-latest | ||
| outputs: | ||
| run_tests: ${{ steps.check.outputs.run_tests }} | ||
| steps: | ||
| - name: Checkout code | ||
| uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0 | ||
| with: | ||
| fetch-depth: 0 | ||
|
|
||
| - name: Check for recent activity | ||
| id: check | ||
| run: | | ||
| # Always run if manually triggered | ||
| if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then | ||
| echo "run_tests=true" >> $GITHUB_OUTPUT | ||
| echo "Manual trigger - running tests" | ||
| exit 0 | ||
| fi | ||
|
|
||
| # For scheduled runs, check if there were commits in last 24 hours | ||
| COMMITS_LAST_24H=$(git log --since="24 hours ago" --oneline | wc -l) | ||
| if [ "$COMMITS_LAST_24H" -gt 0 ]; then | ||
| echo "run_tests=true" >> $GITHUB_OUTPUT | ||
| echo "Found $COMMITS_LAST_24H commits in last 24 hours - running tests" | ||
| else | ||
| echo "run_tests=false" >> $GITHUB_OUTPUT | ||
| echo "No commits in last 24 hours - skipping tests" | ||
| fi | ||
|
|
||
| operator: | ||
| needs: should-run | ||
| if: needs.should-run.outputs.run_tests == 'true' | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| platform: | ||
| - { arch: amd64, runner: cpu-amd-m5-2xlarge } | ||
| name: operator (${{ matrix.platform.arch }}) | ||
| runs-on: ${{ matrix.platform.runner }} | ||
| steps: | ||
| - name: Checkout code | ||
| uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0 | ||
| - name: Set up Docker Buildx | ||
| uses: docker/setup-buildx-action@v3 | ||
| with: | ||
| driver: docker | ||
| - name: Install awscli | ||
| shell: bash | ||
| run: | | ||
| curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip" | ||
| unzip awscliv2.zip | ||
| sudo ./aws/install | ||
| - name: Login to ECR | ||
| shell: bash | ||
| env: | ||
| ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com | ||
| run: | | ||
| aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME} | ||
| - name: Build Container | ||
| id: build-image | ||
| shell: bash | ||
| env: | ||
| ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com | ||
| run: | | ||
| cd deploy/cloud/operator | ||
| docker buildx build --load \ | ||
| --platform linux/${{ matrix.platform.arch }} \ | ||
| --build-arg DOCKER_PROXY=${ECR_HOSTNAME}/dockerhub/ \ | ||
| -f Dockerfile \ | ||
| -t dynamo-operator:latest . | ||
| - name: Docker Tag and Push | ||
| uses: ./.github/actions/docker-tag-push | ||
| with: | ||
| local_image: dynamo-operator:latest | ||
| push_tag: ai-dynamo/dynamo:${{ github.sha }}-operator-${{ matrix.platform.arch }} | ||
| aws_push: 'false' | ||
| azure_push: 'true' | ||
| aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }} | ||
| aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }} | ||
| azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }} | ||
| azure_acr_user: ${{ secrets.AZURE_ACR_USER }} | ||
| azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }} | ||
|
|
||
| vllm: | ||
| needs: should-run | ||
| if: needs.should-run.outputs.run_tests == 'true' | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| platform: | ||
| - { arch: amd64, runner: gpu-l40-amd64 } | ||
| name: vllm (${{ matrix.platform.arch }}) | ||
| runs-on: ${{ matrix.platform.runner }} | ||
| steps: | ||
| - name: Checkout code | ||
| uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0 | ||
| - name: Build Container | ||
| id: build-image | ||
| uses: ./.github/actions/docker-build | ||
| with: | ||
| framework: vllm | ||
| target: runtime | ||
| platform: 'linux/${{ matrix.platform.arch }}' | ||
| ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }} | ||
| ci_token: ${{ secrets.CI_TOKEN }} | ||
| aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }} | ||
| sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }} | ||
| aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }} | ||
| aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
| aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
| - name: Docker Tag and Push | ||
| uses: ./.github/actions/docker-tag-push | ||
| with: | ||
| local_image: ${{ steps.build-image.outputs.image_tag }} | ||
| push_tag: ai-dynamo/dynamo:${{ github.sha }}-vllm-${{ matrix.platform.arch }} | ||
| aws_push: 'false' | ||
| azure_push: 'true' | ||
| aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }} | ||
| aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }} | ||
| azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }} | ||
| azure_acr_user: ${{ secrets.AZURE_ACR_USER }} | ||
| azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }} | ||
|
|
||
| trtllm: | ||
| needs: should-run | ||
| if: needs.should-run.outputs.run_tests == 'true' | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| platform: | ||
| - { arch: amd64, runner: gpu-l40-amd64 } | ||
| name: trtllm (${{ matrix.platform.arch }}) | ||
| runs-on: ${{ matrix.platform.runner }} | ||
| steps: | ||
| - name: Checkout code | ||
| uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0 | ||
| - name: Build Container | ||
| id: build-image | ||
| uses: ./.github/actions/docker-build | ||
| with: | ||
| framework: trtllm | ||
| target: runtime | ||
| platform: 'linux/${{ matrix.platform.arch }}' | ||
| ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }} | ||
| ci_token: ${{ secrets.CI_TOKEN }} | ||
| aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }} | ||
| sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }} | ||
| aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }} | ||
| aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
| aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
| - name: Docker Tag and Push | ||
| uses: ./.github/actions/docker-tag-push | ||
| with: | ||
| local_image: ${{ steps.build-image.outputs.image_tag }} | ||
| push_tag: ai-dynamo/dynamo:${{ github.sha }}-trtllm-${{ matrix.platform.arch }} | ||
| aws_push: 'false' | ||
| azure_push: 'true' | ||
| aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }} | ||
| aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }} | ||
| azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }} | ||
| azure_acr_user: ${{ secrets.AZURE_ACR_USER }} | ||
| azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }} | ||
|
|
||
| sglang: | ||
| needs: should-run | ||
| if: needs.should-run.outputs.run_tests == 'true' | ||
| runs-on: gpu-l40-amd64 | ||
| name: sglang (amd64) | ||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0 | ||
| - name: Build Container | ||
| id: build-image | ||
| uses: ./.github/actions/docker-build | ||
| with: | ||
| framework: sglang | ||
| target: runtime | ||
| platform: 'linux/amd64' | ||
| ngc_ci_access_token: ${{ secrets.NGC_CI_ACCESS_TOKEN }} | ||
| ci_token: ${{ secrets.CI_TOKEN }} | ||
| aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }} | ||
| sccache_s3_bucket: ${{ secrets.SCCACHE_S3_BUCKET }} | ||
| aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }} | ||
| aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
| aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
| - name: Docker Tag and Push | ||
| uses: ./.github/actions/docker-tag-push | ||
| with: | ||
| local_image: ${{ steps.build-image.outputs.image_tag }} | ||
| push_tag: ai-dynamo/dynamo:${{ github.sha }}-sglang-amd64 | ||
| aws_push: 'false' | ||
| azure_push: 'true' | ||
| aws_account_id: ${{ secrets.AWS_ACCOUNT_ID }} | ||
| aws_default_region: ${{ secrets.AWS_DEFAULT_REGION }} | ||
| azure_acr_hostname: ${{ secrets.AZURE_ACR_HOSTNAME }} | ||
| azure_acr_user: ${{ secrets.AZURE_ACR_USER }} | ||
| azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }} | ||
|
|
||
| deploy-test-fault-tolerance: | ||
| runs-on: cpu-amd-m5-2xlarge | ||
| needs: [should-run, operator, vllm, trtllm, sglang] | ||
| if: needs.should-run.outputs.run_tests == 'true' | ||
| permissions: | ||
| contents: read | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| test_scenario: | ||
| # SGLang scenarios | ||
| - sglang-agg-tp-1-dp-1-decode_worker | ||
| - sglang-agg-tp-1-dp-1-decode_worker_pod | ||
| - sglang-agg-tp-1-dp-1-frontend | ||
| - sglang-agg-tp-1-dp-1-frontend_pod | ||
| - sglang-agg-tp-1-dp-1-none | ||
| - sglang-agg-tp-1-dp-1-sglang_decode_detokenizer | ||
| - sglang-agg-tp-1-dp-1-sglang_decode_scheduler | ||
| - sglang-agg-tp-1-dp-2-decode_worker | ||
| - sglang-agg-tp-1-dp-2-decode_worker_pod | ||
| - sglang-agg-tp-1-dp-2-frontend | ||
| - sglang-agg-tp-1-dp-2-frontend_pod | ||
| - sglang-agg-tp-1-dp-2-none | ||
| - sglang-agg-tp-1-dp-2-sglang_decode_detokenizer | ||
| - sglang-agg-tp-1-dp-2-sglang_decode_scheduler | ||
| - sglang-agg-tp-2-dp-1-decode_worker | ||
| - sglang-agg-tp-2-dp-1-decode_worker_pod | ||
| - sglang-agg-tp-2-dp-1-frontend | ||
| - sglang-agg-tp-2-dp-1-frontend_pod | ||
| - sglang-agg-tp-2-dp-1-none | ||
| - sglang-agg-tp-2-dp-1-sglang_decode_detokenizer | ||
| - sglang-agg-tp-2-dp-1-sglang_decode_scheduler | ||
| - sglang-agg-tp-4-dp-1-decode_worker | ||
| - sglang-agg-tp-4-dp-1-decode_worker_pod | ||
| - sglang-agg-tp-4-dp-1-frontend | ||
| - sglang-agg-tp-4-dp-1-frontend_pod | ||
| - sglang-agg-tp-4-dp-1-none | ||
| - sglang-agg-tp-4-dp-1-sglang_decode_detokenizer | ||
| - sglang-agg-tp-4-dp-1-sglang_decode_scheduler | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-none | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_decode_detokenizer | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_decode_scheduler | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_prefill_detokenizer | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-1-sglang_prefill_scheduler | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-none | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_decode_detokenizer | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_decode_scheduler | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_prefill_detokenizer | ||
| - sglang-disagg-prefill-tp-1-decode-tp-1-dp-2-sglang_prefill_scheduler | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-none | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_decode_detokenizer | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_decode_scheduler | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_prefill_detokenizer | ||
| - sglang-disagg-prefill-tp-2-decode-tp-2-dp-1-sglang_prefill_scheduler | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-none | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_decode_detokenizer | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_decode_scheduler | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_prefill_detokenizer | ||
| - sglang-disagg-prefill-tp-4-decode-tp-4-dp-1-sglang_prefill_scheduler | ||
| # TensorRT-LLM scenarios | ||
| - trtllm-agg-tp-1-dp-1-decode_worker | ||
| - trtllm-agg-tp-1-dp-1-decode_worker_pod | ||
| - trtllm-agg-tp-1-dp-1-frontend | ||
| - trtllm-agg-tp-1-dp-1-frontend_pod | ||
| - trtllm-agg-tp-1-dp-1-none | ||
| - trtllm-agg-tp-1-dp-1-trtllm_decode_engine_core | ||
| - trtllm-agg-tp-1-dp-2-decode_worker | ||
| - trtllm-agg-tp-1-dp-2-decode_worker_pod | ||
| - trtllm-agg-tp-1-dp-2-frontend | ||
| - trtllm-agg-tp-1-dp-2-frontend_pod | ||
| - trtllm-agg-tp-1-dp-2-none | ||
| - trtllm-agg-tp-1-dp-2-trtllm_decode_engine_core | ||
| - trtllm-agg-tp-2-dp-1-decode_worker | ||
| - trtllm-agg-tp-2-dp-1-decode_worker_pod | ||
| - trtllm-agg-tp-2-dp-1-frontend | ||
| - trtllm-agg-tp-2-dp-1-frontend_pod | ||
| - trtllm-agg-tp-2-dp-1-none | ||
| - trtllm-agg-tp-2-dp-1-trtllm_decode_engine_core | ||
| - trtllm-agg-tp-4-dp-1-decode_worker | ||
| - trtllm-agg-tp-4-dp-1-decode_worker_pod | ||
| - trtllm-agg-tp-4-dp-1-frontend | ||
| - trtllm-agg-tp-4-dp-1-frontend_pod | ||
| - trtllm-agg-tp-4-dp-1-none | ||
| - trtllm-agg-tp-4-dp-1-trtllm_decode_engine_core | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-none | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-trtllm_decode_engine_core | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-1-trtllm_prefill_engine_core | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-none | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-trtllm_decode_engine_core | ||
| - trtllm-disagg-prefill-tp-1-decode-tp-1-dp-2-trtllm_prefill_engine_core | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-none | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-trtllm_decode_engine_core | ||
| - trtllm-disagg-prefill-tp-2-decode-tp-2-dp-1-trtllm_prefill_engine_core | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-none | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-trtllm_decode_engine_core | ||
| - trtllm-disagg-prefill-tp-4-decode-tp-4-dp-1-trtllm_prefill_engine_core | ||
| # vLLM scenarios | ||
| - vllm-agg-tp-1-dp-1-decode_worker | ||
| - vllm-agg-tp-1-dp-1-decode_worker_pod | ||
| - vllm-agg-tp-1-dp-1-frontend | ||
| - vllm-agg-tp-1-dp-1-frontend_pod | ||
| - vllm-agg-tp-1-dp-1-none | ||
| - vllm-agg-tp-1-dp-1-vllm_decode_engine_core | ||
| - vllm-agg-tp-1-dp-2-decode_worker | ||
| - vllm-agg-tp-1-dp-2-decode_worker_pod | ||
| - vllm-agg-tp-1-dp-2-frontend | ||
| - vllm-agg-tp-1-dp-2-frontend_pod | ||
| - vllm-agg-tp-1-dp-2-none | ||
| - vllm-agg-tp-1-dp-2-vllm_decode_engine_core | ||
| - vllm-agg-tp-2-dp-1-decode_worker | ||
| - vllm-agg-tp-2-dp-1-decode_worker_pod | ||
| - vllm-agg-tp-2-dp-1-frontend | ||
| - vllm-agg-tp-2-dp-1-frontend_pod | ||
| - vllm-agg-tp-2-dp-1-none | ||
| - vllm-agg-tp-2-dp-1-vllm_decode_engine_core | ||
| - vllm-agg-tp-4-dp-1-decode_worker | ||
| - vllm-agg-tp-4-dp-1-decode_worker_pod | ||
| - vllm-agg-tp-4-dp-1-frontend | ||
| - vllm-agg-tp-4-dp-1-frontend_pod | ||
| - vllm-agg-tp-4-dp-1-none | ||
| - vllm-agg-tp-4-dp-1-vllm_decode_engine_core | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-decode_worker_pod | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-frontend_pod | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-none | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-prefill_worker_pod | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-vllm_decode_engine_core | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-1-vllm_prefill_engine_core | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-decode_worker_pod | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-frontend_pod | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-none | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-prefill_worker_pod | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-vllm_decode_engine_core | ||
| - vllm-disagg-prefill-tp-1-decode-tp-1-dp-2-vllm_prefill_engine_core | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-decode_worker_pod | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-frontend_pod | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-none | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-prefill_worker_pod | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-vllm_decode_engine_core | ||
| - vllm-disagg-prefill-tp-2-decode-tp-2-dp-1-vllm_prefill_engine_core | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-decode_worker_pod | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-frontend_pod | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-none | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-prefill_worker_pod | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-vllm_decode_engine_core | ||
| - vllm-disagg-prefill-tp-4-decode-tp-4-dp-1-vllm_prefill_engine_core | ||
| name: deploy-test-fault-tolerance (${{ matrix.test_scenario }}) | ||
| env: | ||
| DYNAMO_INGRESS_SUFFIX: dev.aire.nvidia.com | ||
| steps: | ||
| - name: Checkout code | ||
| uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0 | ||
|
|
||
| - name: Install awscli | ||
| shell: bash | ||
| run: | | ||
| curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip" | ||
| unzip awscliv2.zip | ||
| sudo ./aws/install | ||
|
|
||
| - name: Login to ECR | ||
| shell: bash | ||
| env: | ||
| ECR_HOSTNAME: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_DEFAULT_REGION }}.amazonaws.com | ||
| run: | | ||
| aws ecr get-login-password --region ${{ secrets.AWS_DEFAULT_REGION }} | docker login --username AWS --password-stdin ${ECR_HOSTNAME} | ||
|
|
||
| - name: Set namespace and install dependencies | ||
| run: | | ||
| # Extract framework from test scenario for unique namespace | ||
| FRAMEWORK=$(echo "${{ matrix.test_scenario }}" | cut -d'-' -f1) | ||
| # Create unique namespace per matrix job with weekly prefix | ||
| echo "NAMESPACE=gh-weekly-${{ github.run_id }}-ft-${FRAMEWORK}" >> $GITHUB_ENV | ||
| set -x | ||
| # Install dependencies | ||
| sudo apt-get update && sudo apt-get install -y curl bash openssl gettext git jq python3 python3-pip python3-venv | ||
|
|
||
| # Install yq | ||
| echo "Installing yq..." | ||
| curl -L https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -o yq | ||
| sudo chmod 755 yq | ||
| sudo mv yq /usr/local/bin/ | ||
| # Install Helm | ||
| echo "Installing Helm..." | ||
| curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | ||
| sudo chmod 700 get_helm.sh | ||
| sudo ./get_helm.sh | ||
| # Install kubectl | ||
| curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" | ||
| sudo chmod 755 kubectl | ||
| sudo mv kubectl /usr/local/bin/ | ||
|
|
||
| # Setup kubeconfig | ||
| echo "${{ secrets.AZURE_AKS_CI_KUBECONFIG_B64 }}" | base64 -d > .kubeconfig | ||
| chmod 600 .kubeconfig | ||
| export KUBECONFIG=$(pwd)/.kubeconfig | ||
| kubectl config set-context --current --namespace=$NAMESPACE --kubeconfig "${KUBECONFIG}" | ||
| kubectl config current-context | ||
|
|
||
| - name: Deploy Operator | ||
| run: | | ||
| set -x | ||
| export KUBECONFIG=$(pwd)/.kubeconfig | ||
|
|
||
| # Create a namespace for this job | ||
| echo "Creating an ephemeral namespace..." | ||
| kubectl delete namespace $NAMESPACE || true | ||
| kubectl create namespace $NAMESPACE || true | ||
| echo "Attaching the labels for secrets and cleanup" | ||
| kubectl label namespaces ${NAMESPACE} nscleanup/enabled=true nscleanup/ttl=7200 gitlab-imagepull=enabled ngc-api=enabled nvcr-imagepull=enabled --overwrite=true | ||
|
|
||
| # Set the namespace as default | ||
| kubectl config set-context --current --namespace=$NAMESPACE | ||
|
|
||
| # Check if Istio is installed | ||
| kubectl get pods -n istio-system | ||
| # Check if default storage class exists | ||
| kubectl get storageclass | ||
|
|
||
| # Install Helm chart | ||
| export IMAGE_TAG=$(cat build.env) | ||
| echo $IMAGE_TAG | ||
| export VIRTUAL_ENV=/opt/dynamo/venv | ||
| export KUBE_NS=$NAMESPACE | ||
| export ISTIO_ENABLED=true | ||
| export ISTIO_GATEWAY=istio-system/ingress-alb | ||
| export VIRTUAL_SERVICE_SUPPORTS_HTTPS=true | ||
| export DYNAMO_CLOUD=https://${NAMESPACE}.${DYNAMO_INGRESS_SUFFIX} | ||
|
|
||
| # Install dynamo env secrets | ||
| kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${{ secrets.HF_TOKEN }} -n $KUBE_NS || true | ||
| # Create docker pull secret for operator image | ||
| kubectl create secret docker-registry docker-imagepullsecret --docker-server=${{ secrets.AZURE_ACR_HOSTNAME }} --docker-username=${{ secrets.AZURE_ACR_USER }} --docker-password=${{ secrets.AZURE_ACR_PASSWORD }} --namespace=${NAMESPACE} | ||
| # Install helm dependencies | ||
| helm repo add bitnami https://charts.bitnami.com/bitnami | ||
| cd deploy/cloud/helm/platform/ | ||
| helm dep build . | ||
| # Install platform with namespace restriction for single profile testing | ||
| helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \ | ||
| --set dynamo-operator.namespaceRestriction.enabled=true \ | ||
| --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \ | ||
| --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \ | ||
| --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \ | ||
| --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret | ||
| # Wait for all deployments to be ready | ||
| timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch | ||
| cd - | ||
|
|
||
| export KUBECONFIG=$(pwd)/.kubeconfig | ||
| kubectl config set-context --current --namespace=$NAMESPACE | ||
|
|
||
| - name: Run Fault Tolerance Tests | ||
| run: | | ||
| set -x | ||
| export KUBECONFIG=$(pwd)/.kubeconfig | ||
| export NAMESPACE=$NAMESPACE | ||
|
|
||
| # Extract framework from test scenario | ||
| FRAMEWORK=$(echo "${{ matrix.test_scenario }}" | cut -d'-' -f1) | ||
| export IMAGE="${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo:${{ github.sha }}-${FRAMEWORK}-amd64" | ||
|
|
||
| # Set up Python virtual environment and install dependencies | ||
| python3 -m venv venv | ||
| source venv/bin/activate | ||
| pip install --upgrade pip | ||
|
|
||
| # Install core dependencies needed for tests (without full project install) | ||
| pip install -r container/deps/requirements.test.txt | ||
| pip install kubernetes==32.0.1 kubernetes_asyncio kr8s pyyaml requests tabulate pydantic | ||
|
|
||
| # Add project source to PYTHONPATH for test imports | ||
| export PYTHONPATH=$(pwd):$(pwd)/components/src:$PYTHONPATH | ||
|
|
||
| echo "Running weekly fault tolerance test: ${{ matrix.test_scenario }}" | ||
| echo "Using namespace: $NAMESPACE" | ||
| echo "Using image: $IMAGE" | ||
|
|
||
| # Run the pytest command | ||
| pytest tests/fault_tolerance/deploy/test_deployment.py::test_fault_scenario[${{ matrix.test_scenario }}] \ | ||
| -s -v \ | ||
| --namespace ${NAMESPACE} \ | ||
| --image ${IMAGE} \ | ||
| --client-type legacy | ||
|
|
||
| - name: Cleanup | ||
| if: always() | ||
| timeout-minutes: 5 | ||
| run: | | ||
| echo "${{ secrets.AZURE_AKS_CI_KUBECONFIG_B64 }}" | base64 -d > .kubeconfig | ||
| chmod 600 .kubeconfig | ||
| export KUBECONFIG=$(pwd)/.kubeconfig | ||
| kubectl config set-context --current --namespace=$NAMESPACE --kubeconfig "${KUBECONFIG}" | ||
|
|
||
| # For debugging purposes, list all the resources before we uninstall | ||
| kubectl get all | ||
|
|
||
| echo "Deleting all DynamoGraphDeployments in namespace $NAMESPACE..." | ||
| kubectl delete dynamographdeployments --all -n $NAMESPACE || true | ||
|
|
||
| # Uninstall the helm chart | ||
| helm ls | ||
| helm uninstall dynamo-platform || true | ||
|
|
||
| echo "Namespace $NAMESPACE deletion initiated, proceeding with cleanup..." | ||
| kubectl delete namespace $NAMESPACE || true | ||
| echo "Namespace $NAMESPACE completed." | ||
|
|
||
| # Status check job to verify all tests passed | ||
| weekly-status-check: | ||
| runs-on: ubuntu-latest | ||
| needs: [deploy-test-fault-tolerance] | ||
| if: always() | ||
| steps: | ||
| - name: Check all tests passed | ||
| run: | | ||
| echo '${{ toJson(needs) }}' | jq -e 'to_entries | map(.value.result) | all(. as $result | ["success", "skipped"] | any($result == .))' | ||
|
|
||
| - name: Send notification on failure | ||
| if: failure() | ||
| run: | | ||
| echo "Weekly fault tolerance tests failed!" | ||
| echo "Workflow run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" | ||
| # Add notification logic here (e.g., Slack, email, etc.) | ||
|
|
🧩 Analysis chain
Address static analysis warnings: runner labels and action metadata.
Actionlint reports two issues:
- Custom runner labels not recognized (lines 67, 119, 159, 195, 228): `gpu-l40-amd64`, `cpu-amd-m5-2xlarge`. These are self-hosted runner labels and should be configured in your `actionlint.yaml` file to suppress false positives.
- Missing `name` field in `.github/actions/docker-tag-push/action.yml` (line 100). The custom action lacks a required `name` field in its metadata. Add it to the action's metadata file.
🌐 Web query:
What is the required format for the "name" field in GitHub Actions custom action metadata (action.yml)?
💡 Result:
The `name` field must be a string (required). Example: `name: 'My Action'`. (docs.github.com)
🏁 Script executed:
```bash
#!/bin/bash
# Check if docker-tag-push action exists and inspect its content
find . -name "action.yml" -o -name "action.yaml" | grep docker-tag-push
if [ -f ".github/actions/docker-tag-push/action.yml" ]; then
  echo "=== docker-tag-push/action.yml ==="
  head -20 .github/actions/docker-tag-push/action.yml
fi
# Check for actionlint config
echo ""
echo "=== Looking for actionlint configuration ==="
find . -name "actionlint.yaml" -o -name ".actionlintrc*" 2>/dev/null
```
Length of output: 601
🏁 Script executed:
```bash
#!/bin/bash
# Search for runner labels used in the workflow to verify they're defined
echo "=== Custom runner labels used in workflow ==="
grep -E "runs-on:|runner:" .github/workflows/weekly-fault-tolerance.yml | sort | uniq
# Check for self-hosted runner definitions in repo
echo ""
echo "=== Searching for runner configuration files ==="
find . -name "*.yml" -o -name "*.yaml" | xargs grep -l "self-hosted" 2>/dev/null | head -10
```
Length of output: 331
Add required "name" field to .github/actions/docker-tag-push/action.yml and configure actionlint for custom runner labels.
The docker-tag-push action metadata is missing the required name field. Add this to the top of .github/actions/docker-tag-push/action.yml:
```yaml
name: 'Tag and Push Docker Images'
description: 'Tag and Push Docker Images'
```
Additionally, custom runner labels (`gpu-l40-amd64`, `cpu-amd-m5-2xlarge`) trigger actionlint warnings because they are not recognized as built-in GitHub runners. Create `.github/actionlint.yaml` to configure these self-hosted runner labels:
```yaml
self-hosted-runner:
  labels:
    - cpu-amd-m5-2xlarge
    - gpu-l40-amd64
```
🤖 Prompt for AI Agents
In .github/workflows/weekly-fault-tolerance.yml lines 1-609: the workflow
references a local action .github/actions/docker-tag-push whose action.yml is
missing the required top-level name field (causing actionlint/metadata issues)
and the workflow uses custom self-hosted runner labels that actionlint flags as
unknown; fix by adding a name and description to
.github/actions/docker-tag-push/action.yml (e.g., "Tag and Push Docker Images"
and a short description) and add a new file .github/actionlint.yaml listing the
custom self-hosted runner labels (cpu-amd-m5-2xlarge, gpu-l40-amd64) under a
self-hosted-runner key so actionlint recognizes them.
Signed-off-by: Indrajit Bhosale <[email protected]>
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| test_scenario: |
Is there an automatic way to capture all the test cases? Otherwise we have to remember to change this file every time we add a test case.
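One possible direction (a sketch, not part of this PR): generate the scenario list from pytest's own collection in an earlier job and feed it to the matrix via `fromJson()`. The parametrized id format is assumed from this workflow, and collection is assumed to work without a full GPU environment:

```bash
# Sketch: derive the scenario list from pytest collection instead of hard-coding it.
set -euo pipefail
python3 -m pytest tests/fault_tolerance/deploy/test_deployment.py --collect-only -q \
  | sed -n 's/.*test_fault_scenario\[\(.*\)\].*/\1/p' \
  | jq -R -s -c 'split("\n") | map(select(length > 0))' \
  > scenarios.json
# Expose the JSON array as a job output so a later job can use
# matrix: { test_scenario: ${{ fromJson(needs.<job>.outputs.scenarios) }} }
echo "scenarios=$(cat scenarios.json)" >> "$GITHUB_OUTPUT"
```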
| azure_acr_password: ${{ secrets.AZURE_ACR_PASSWORD }} | ||
|
|
||
| deploy-test-fault-tolerance: | ||
| runs-on: cpu-amd-m5-2xlarge |
How many GPUs will be running these FT tests?
Will they be running in parallel?
As we add more and more tests, will we need to change `runs-on: cpu-amd-m5-2xlarge`?
| # Install dynamo env secrets | ||
| kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${{ secrets.HF_TOKEN }} -n $KUBE_NS || true | ||
| # Create docker pull secret for operator image | ||
| kubectl create secret docker-registry docker-imagepullsecret --docker-server=${{ secrets.AZURE_ACR_HOSTNAME }} --docker-username=${{ secrets.AZURE_ACR_USER }} --docker-password=${{ secrets.AZURE_ACR_PASSWORD }} --namespace=${NAMESPACE} |
FYI, later for MoE and elastic EP we might need to run in an environment other than Azure.
Overview:
Add Weekly CI tests for full fault tolerance test suite.
Runs once a week on Sunday evening PST
Details:
Where should the reviewer start?
.github/workflows/weekly-fault-tolerance.yml