Replace runs-on.com with ec2-github-runner for GPU CI #7

Closed
jack-champagne wants to merge 16 commits into main from jc/gpu-runner-ec2

@jack-champagne

Member

Summary

  • Delete .runs-on.yml (no longer using runs-on.com service)
  • Rewrite gpu-test as three-job pattern: start-gpu-runner → gpu-test → stop-gpu-runner
  • GPU tests now run on PRs (non-fork) and main pushes, not just main
  • Add gpu-benchmark.yml for on-demand benchmarks (T4, A10G, A100, H100)
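
For readers unfamiliar with the three-job pattern, here is a minimal sketch of how it might look. It is not the workflow from this PR: it assumes the `machulav/ec2-github-runner` action, OIDC role assumption through `aws-actions/configure-aws-credentials`, a `g4dn.xlarge` instance, the `us-east-1` region, and the standard `julia-actions` steps. Only the job names and secret names come from the description above, and the mapping of each secret to an action input is a guess.

```yaml
# Sketch only: action versions, region, instance type, and test steps are assumptions.
name: CI
on:
  push:
    branches: [main]
  pull_request:

permissions:
  id-token: write   # assumed: OIDC is used to assume AWS_GPU_RUNNER_ROLE_ARN
  contents: read

jobs:
  start-gpu-runner:
    # Fork PR protection: skip unless the PR head branch lives in this repository.
    if: github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start.outputs.label }}
      ec2-instance-id: ${{ steps.start.outputs.ec2-instance-id }}
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_GPU_RUNNER_ROLE_ARN }}
          aws-region: us-east-1   # assumed region
      - id: start
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_RUNNER_PAT }}
          ec2-image-id: ${{ secrets.AWS_GPU_RUNNER_AMI_ID }}
          ec2-instance-type: g4dn.xlarge   # assumed: T4 instance
          subnet-id: ${{ secrets.AWS_GPU_RUNNER_SUBNET_ID }}
          security-group-id: ${{ secrets.AWS_GPU_RUNNER_SG_ID }}
          iam-role-name: ${{ secrets.AWS_GPU_RUNNER_INSTANCE_PROFILE }}

  gpu-test:
    needs: start-gpu-runner
    # Run on the freshly registered self-hosted runner.
    runs-on: ${{ needs.start-gpu-runner.outputs.label }}
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v2
      - uses: julia-actions/julia-buildpkg@v1
      - uses: julia-actions/julia-runtest@v1

  stop-gpu-runner:
    needs: [start-gpu-runner, gpu-test]
    # always() keeps cleanup running when gpu-test fails or is cancelled;
    # the result check avoids stopping an instance that never started.
    if: always() && needs.start-gpu-runner.result == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_GPU_RUNNER_ROLE_ARN }}
          aws-region: us-east-1   # assumed region
      - uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_RUNNER_PAT }}
          label: ${{ needs.start-gpu-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-gpu-runner.outputs.ec2-instance-id }}
```

When `start-gpu-runner` is skipped for a fork PR, `gpu-test` is skipped as well because it depends on it, and the `stop-gpu-runner` condition evaluates to false, so no instance is created or left running.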

Dependencies

  • harmoniqs/aws-infra#23 must be deployed first (creates IAM roles + security group)
  • GitHub secrets must be configured: GH_RUNNER_PAT, AWS_GPU_RUNNER_ROLE_ARN, AWS_GPU_RUNNER_SUBNET_ID, AWS_GPU_RUNNER_SG_ID, AWS_GPU_RUNNER_AMI_ID, AWS_GPU_RUNNER_INSTANCE_PROFILE

Test plan

  • aws-infra#23 merged and applied to staging/prod
  • GitHub secrets configured from terraform outputs
  • Classic PAT created with repo scope
  • Open test PR to verify GPU runner spins up, tests pass, instance terminates
  • Verify that the if: always() cleanup terminates the instance when tests fail

jack-champagne and others added 16 commits April 3, 2026 01:43
…line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Aqua and JET to [extras]/[targets], create test/aqua.jl with all
checks enabled (ambiguities disabled for CUDA.jl noise), add compat
entries for all extras, and suppress known stale-dep false positives.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…m 0 to 1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace anonymous NamedTuple refs with a concrete mutable CallbackRef type
that registers a finalizer to automatically unregister both forward and
gradient callbacks from the global registry when GC'd or explicitly closed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/catch finalizer safety

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace Any-typed GC anchor fields in ElementaryOperator, MatrixOperator,
OperatorTerm, Operator, and WorkStream with concrete typed fields. Move
callbacks.jl include before operators.jl so CallbackRef is in scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…meterize dtype/batch_size tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ase 6 gap

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sion

Added docstring for CUDENSITYMATError and all destroy_* functions that
lacked them. Added export statements for the utility functions in state.jl,
the batch/append/prepare/compute functions in operators.jl, and the
prepare/compute functions in expectation.jl and spectrum.jl. Removed
warnonly = [:missing_docs] from docs/make.jl now that all exported symbols
are documented.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… sync explicitly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add .runs-on.yml defining a 'gpu' runner profile (g4dn.xlarge, T4,
ubuntu22-gpu-x64, spot, 45 min timeout) and update the gpu-test CI job to
use it via the runs-on label syntax.  Add JULIA_CUDA_MEMORY_POOL=none and
CUDA_VISIBLE_DEVICES='0' env vars.  Job remains gated to main/tag pushes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Delete .runs-on.yml (no longer using runs-on.com service)
- Rewrite gpu-test as three-job pattern: start-gpu-runner, gpu-test, stop-gpu-runner
- GPU tests now run on PRs (non-fork) and main pushes, not just main
- Add gpu-benchmark.yml for on-demand benchmarks (T4 through H100)
- Fork PR protection via repo name check on start-gpu-runner
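
As an illustration of the on-demand benchmark workflow named in the commit above, the GPU selection could be exposed as a `workflow_dispatch` choice input. This is a hedged sketch, not the contents of `gpu-benchmark.yml`: the input name `gpu`, the job name `resolve-instance`, and the specific instance sizes are assumptions, chosen because the g4dn, g5, p4d, and p5 families carry T4, A10G, A100, and H100 GPUs respectively.

```yaml
# Sketch only: input name, job name, and instance sizes are hypothetical.
name: GPU benchmark
on:
  workflow_dispatch:
    inputs:
      gpu:
        description: GPU model to benchmark on
        type: choice
        options: [T4, A10G, A100, H100]
        default: T4

jobs:
  resolve-instance:
    runs-on: ubuntu-latest
    outputs:
      instance-type: ${{ steps.pick.outputs.instance-type }}
    steps:
      - id: pick
        env:
          GPU: ${{ inputs.gpu }}   # pass via env rather than inline interpolation
        run: |
          # Map the requested GPU model to an EC2 instance type.
          case "$GPU" in
            T4)   type=g4dn.xlarge ;;
            A10G) type=g5.xlarge ;;
            A100) type=p4d.24xlarge ;;
            H100) type=p5.48xlarge ;;
            *)    echo "unknown GPU: $GPU" >&2; exit 1 ;;
          esac
          echo "instance-type=$type" >> "$GITHUB_OUTPUT"
```

A start job following the same start, run, stop pattern could then read `needs.resolve-instance.outputs.instance-type` as its `ec2-instance-type`.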
@jack-champagne

Member Author

Superseded by a direct merge from a clean branch off main.
