Require launch-ec2-runner-with-fallback use for all ec2 runners #201

Open: 1 commit to merge into base: main
28 changes: 28 additions & 0 deletions docs/ci/ec2-runners.md
@@ -0,0 +1,28 @@
# CI/CD with EC2 Runners

## Problem

Projects run E2E tests on EC2 runners for small, medium, and large jobs. (Some
projects use other names for such jobs, such as `smoke` in `training`.)

These runners provide access to accelerated hardware (e.g., GPUs) needed to
run compute-intensive processes.

Access to instances with such hardware is limited and depends on the current
demand among all EC2 users in a particular zone. When demand is high, the
requested instance type may be unavailable, and jobs that rely on it
fail.

## Solution

Availability varies by availability zone (AZ). If the requested instance type
is unavailable in one AZ, we can try another.

To support this, a new
[`launch-ec2-runner-with-fallback`](https://github.com/instructlab/ci-actions/tree/main/actions/launch-ec2-runner-with-fallback)
action was implemented in the `ci-actions` repository. The action walks through
AZs and requests an instance in each one until a request succeeds.
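
The fallback idea can be sketched roughly as follows. This is an illustration
only, not the action's actual implementation: `launch_with_fallback`,
`CapacityError`, and `fake_launch` are hypothetical names, and the zone names
are placeholders.

```python
from typing import Callable, Iterable, Optional


class CapacityError(Exception):
    """Hypothetical error: the zone cannot supply the requested instance type."""


def launch_with_fallback(
    zones: Iterable[str],
    launch: Callable[[str], str],
) -> Optional[str]:
    """Try each zone in order; return the first instance ID, or None."""
    for zone in zones:
        try:
            return launch(zone)
        except CapacityError:
            # No capacity in this zone; move on to the next one.
            continue
    return None


# Stand-in launcher for illustration: only the last zone has capacity.
def fake_launch(zone: str) -> str:
    if zone != "us-east-2c":
        raise CapacityError(zone)
    return f"i-demo-{zone}"


instance_id = launch_with_fallback(
    ["us-east-2a", "us-east-2b", "us-east-2c"], fake_launch
)
```

If no zone has capacity, the sketch returns `None`, so the caller can still
fail the job explicitly.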

All projects that rely on AWS EC2 runners should adopt the
`launch-ec2-runner-with-fallback` action in every job that launches such a
runner, to avoid spurious test failures caused by unavailable instances.