Cluster tooling: GPU heartbeat + in-container rebuild helper #397
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,48 @@ | ||||||||||||
| """GPU heartbeat: keeps utilization above threshold to prevent job reclamation. | ||||||||||||
|
|
||||||||||||
| Monitors GPU utilization via nvidia-smi and performs matrix multiplications | ||||||||||||
| when utilization drops below THRESHOLD. Steps aside when real training is active. | ||||||||||||
| """ | ||||||||||||
|
|
||||||||||||
| import subprocess | ||||||||||||
| import time | ||||||||||||
| import torch | ||||||||||||
|
|
||||||||||||
| THRESHOLD = 65 # percent GPU utilization to maintain | ||||||||||||
| CHECK_INTERVAL = 0.05 # seconds between checks | ||||||||||||
| N = 6144 # matrix size for dummy work | ||||||||||||
| BURST_ITERATIONS = 60 # number of matmuls per burst | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| def get_gpu_utilization(): | ||||||||||||
| try: | ||||||||||||
| result = subprocess.run( | ||||||||||||
| ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"], | ||||||||||||
| capture_output=True, | ||||||||||||
| text=True, | ||||||||||||
| timeout=5, | ||||||||||||
| ) | ||||||||||||
| return int(result.stdout.strip().split("\n")[0]) | ||||||||||||
| except Exception: | ||||||||||||
| return 100 # assume busy if query fails | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| def main(): | ||||||||||||
|
||||||||||||
| def main(): | |
| def main(): | |
| if not torch.cuda.is_available(): | |
| print("GPU heartbeat skipped: CUDA is not available.") | |
| return |
Copilot
AI
Apr 11, 2026
`get_gpu_utilization()` only reads the first line of `nvidia-smi` output, and `main()` always uses `torch.device('cuda')` (i.e., the first visible GPU). For multi-GPU jobs, other GPUs can still sit at 0% utilization and may still trigger reclamation depending on policy. Consider querying/monitoring all visible GPUs and either keeping each above the threshold or using the minimum utilization across them to decide when to generate load.
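A minimal sketch of the "minimum across GPUs" option, assuming the same `nvidia-smi` query as the diff. The helper names `parse_utilizations` and `get_min_gpu_utilization` are hypothetical and not part of the PR. Note that `nvidia-smi` reports every GPU it can see, which may differ from the set PyTorch sees when `CUDA_VISIBLE_DEVICES` is set, so this is only a starting point.

```python
import subprocess


def parse_utilizations(nvidia_smi_stdout):
    """Parse one integer utilization per non-empty line of the CSV output."""
    return [int(line.strip()) for line in nvidia_smi_stdout.splitlines() if line.strip()]


def get_min_gpu_utilization():
    """Return the lowest utilization across all reported GPUs, or 100 on failure."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
            capture_output=True,
            text=True,
            timeout=5,
        )
        values = parse_utilizations(result.stdout)
        # An empty result is treated like a failed query.
        return min(values) if values else 100
    except Exception:
        return 100  # same fallback as the original: assume busy if the query fails
```

Using the minimum means the heartbeat would generate load whenever any one GPU is idle. Generating that load on each idle device, rather than only on the default `cuda` device, would still need to be handled in `main()`.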
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,155 @@ | ||||||||
| """Submit a SLURM job to rebuild the PufferDrive C extension inside the Singularity container. | ||||||||
|
|
||||||||
| Avoids the nested quoting hell of sbatch --wrap by writing a standalone bash script | ||||||||
| to a temp location and sbatch-ing that file. The script runs `setup.py build_ext` | ||||||||
| inside the container overlay where torch is installed. | ||||||||
|
|
||||||||
| Example: | ||||||||
| python scripts/rebuild_on_cluster.py | ||||||||
| python scripts/rebuild_on_cluster.py --account torch_pr_924_general | ||||||||
| python scripts/rebuild_on_cluster.py --project-root /scratch/$USER/code/PufferDrive --wait | ||||||||
| """ | ||||||||
|
|
||||||||
| import argparse | ||||||||
| import os | ||||||||
| import subprocess | ||||||||
| import sys | ||||||||
| import time | ||||||||
|
|
||||||||
|
|
||||||||
| DEFAULT_IMAGE = "/share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif" | ||||||||
|
|
||||||||
|
|
||||||||
| def parse_args(): | ||||||||
| user = os.environ.get("USER", "") | ||||||||
| parser = argparse.ArgumentParser(description="Rebuild PufferDrive C extension on SLURM cluster") | ||||||||
| parser.add_argument("--account", default="torch_pr_924_general", help="SLURM account") | ||||||||
| parser.add_argument("--user", default=user, help="Cluster username (default: $USER)") | ||||||||
| parser.add_argument( | ||||||||
| "--project-root", | ||||||||
| default=None, | ||||||||
| help="Path to PufferDrive on the cluster (default: /scratch/<user>/code/PufferDrive)", | ||||||||
| ) | ||||||||
| parser.add_argument( | ||||||||
| "--overlay", | ||||||||
| default=None, | ||||||||
| help="Singularity overlay path (default: /scratch/<user>/images/PufferDrive/overlay-15GB-500K.ext3)", | ||||||||
| ) | ||||||||
| parser.add_argument("--image", default=DEFAULT_IMAGE, help="Singularity image path") | ||||||||
| parser.add_argument("--time", default="15", help="SLURM time limit in minutes") | ||||||||
| parser.add_argument("--mem", default="16gb", help="SLURM memory") | ||||||||
| parser.add_argument("--cpus", default="8", help="SLURM cpus-per-task") | ||||||||
| parser.add_argument("--wait", action="store_true", help="Poll until the job finishes and print its log") | ||||||||
| parser.add_argument("--dry", action="store_true", help="Print the script and sbatch command without submitting") | ||||||||
|
||||||||
| parser.add_argument("--dry", action="store_true", help="Print the script and sbatch command without submitting") | |
| parser.add_argument("--dry", action="store_true", help="Print the script, destination, and log paths without submitting") |
Copilot
AI
Apr 11, 2026
`build_rebuild_script()` interpolates `project_root`, `overlay`, and `image` directly into shell commands without quoting, and later `run_ssh(...)`/`sbatch_cmd` also embed user-controlled strings unquoted. This can break on paths with spaces and is also a shell-injection risk when args come from the CLI. Use `shlex.quote` (or otherwise safely quote) all interpolated shell values, including the remote `cat > ...` command and `sbatch` arguments.
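The quoting suggestion could look roughly like the sketch below. The body of `build_rebuild_script()` is not visible in this diff, so the generated script here is an assumed stand-in (the `singularity exec` line and the `set -euo pipefail` header are illustrative guesses). Only the use of `shlex.quote` on each interpolated value is the point.

```python
import shlex


def build_rebuild_script(project_root, overlay, image):
    """Hypothetical stand-in: quote every interpolated value before it reaches bash."""
    q_root = shlex.quote(project_root)
    q_overlay = shlex.quote(overlay)
    q_image = shlex.quote(image)
    # The inner command is itself a shell string, so it is quoted a second time
    # when passed as the single argument of `bash -c`.
    inner = f"cd {q_root} && python setup.py build_ext"
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        f"singularity exec --nv --overlay {q_overlay} {q_image} /bin/bash -c {shlex.quote(inner)}",
    ]
    return "\n".join(lines) + "\n"
```

With this shape, a path such as `/scratch/u/my code` reaches `cd` as a single argument, and characters like `;` in a CLI value are passed through literally instead of being interpreted by the shell.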
Copilot
AI
Apr 11, 2026
The SSH destination host is hardcoded as `torch` in both `run_ssh()` and the script upload call. This makes the helper unusable in environments where that host alias doesn’t exist. Consider adding a `--host` (or `--ssh-target`) argument with default `torch`, and using it consistently.
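One way the `--host` suggestion might be wired in, as a sketch only. The helpers `ssh_command` and `upload_command` are hypothetical names, and the real `run_ssh()` signature is not shown in this diff. The default keeps the current `torch` alias so existing invocations behave the same.

```python
import argparse
import shlex


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Rebuild helper (sketch)")
    # Default preserves the alias that is currently hardcoded.
    parser.add_argument("--host", default="torch", help="SSH destination (alias or user@host)")
    return parser.parse_args(argv)


def ssh_command(host, remote_cmd):
    """Build the argv list for running remote_cmd on host over ssh."""
    return ["ssh", host, remote_cmd]


def upload_command(host, remote_path):
    """Build the argv list for writing stdin to remote_path on host."""
    return ["ssh", host, f"cat > {shlex.quote(remote_path)}"]
```

Both helpers take the host as a parameter, so the upload and the later `sbatch` call cannot drift apart.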
Copilot
AI
Apr 11, 2026
The sacct call uses -P, which outputs pipe-delimited fields and typically includes a trailing | (e.g., COMPLETED|). Since state is compared against strings like "COMPLETED", the loop may never break and --wait can hang indefinitely. Consider dropping -P or normalizing with state = state.split('|', 1)[0] before comparing.
| continue | |
| continue | |
| state = state.split("|", 1)[0] |
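A slightly more defensive normalization than the one-line suggestion, as a sketch. As far as the Slurm documentation describes it, `-p`/`--parsable` adds a trailing `|` while `-P`/`--parsable2` omits it, so it is worth checking which flag the script actually passes; the normalization below is harmless either way. The `TERMINAL_STATES` set and both function names are assumptions, not taken from the PR.

```python
# Assumed set of states that should end the --wait loop.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}


def normalize_state(raw):
    """Reduce one line of sacct State output to a bare state name."""
    field = raw.strip().split("|", 1)[0]  # drop a pipe delimiter and anything after it
    words = field.split()  # "CANCELLED by 1234" -> ["CANCELLED", "by", "1234"]
    # A trailing "+" can appear when sacct truncates a field.
    return words[0].rstrip("+") if words else ""


def is_terminal(raw):
    """True when the normalized state is one the wait loop should stop on."""
    return normalize_state(raw) in TERMINAL_STATES
```

An empty line normalizes to an empty string, which is not terminal, so the polling loop keeps waiting until `sacct` has a record for the job.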
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -83,6 +83,13 @@ def parse_args(): | |||||||||||
| "--args", type=str, nargs="+", default=None, help="Args to override/sweep (e.g., learning_rate=1e-4:3e-4)" | ||||||||||||
| ) | ||||||||||||
|
|
||||||||||||
| # GPU heartbeat: keeps utilization above threshold to prevent job reclamation on NYU cluster | ||||||||||||
| parser.add_argument( | ||||||||||||
| "--heartbeat", | ||||||||||||
| action="store_true", | ||||||||||||
| help="Run scripts/gpu_heartbeat.py in background alongside training", | ||||||||||||
| ) | ||||||||||||
|
|
||||||||||||
| # Container settings | ||||||||||||
| parser.add_argument("--container", action="store_true", help="Run inside Singularity container") | ||||||||||||
| parser.add_argument( | ||||||||||||
|
|
@@ -94,7 +101,7 @@ def parse_args(): | |||||||||||
| parser.add_argument( | ||||||||||||
| "--container_overlay", | ||||||||||||
| type=str, | ||||||||||||
| default="/scratch/ev2237/containers/pufferdrive/overlay.ext3", | ||||||||||||
| default=f"/scratch/{os.environ.get('USER', '')}/images/PufferDrive/overlay-15GB-500K.ext3", | ||||||||||||
| help="Singularity overlay path", | ||||||||||||
| ) | ||||||||||||
|
Comment on lines 101 to 106
|
||||||||||||
|
|
||||||||||||
|
|
@@ -355,6 +362,17 @@ def launch_training(args, from_config, cmd, save_dir, project_root, container_co | |||||||||||
| # Add save_dir to command | ||||||||||||
| full_cmd = base_cmd + cmd + ["--train.data-dir", save_dir] | ||||||||||||
|
|
||||||||||||
| # If heartbeat is enabled, wrap the training command in a brace group that: | ||||||||||||
| # 1. backgrounds python scripts/gpu_heartbeat.py | ||||||||||||
| # 2. runs training in the foreground | ||||||||||||
| # 3. kills the heartbeat on training exit, preserving training's exit code | ||||||||||||
| # Brace groups `{ ... ; }` run in the current shell (unlike parens) so the | ||||||||||||
| # preceding `cd` and env exports still apply to the training command. The `&` | ||||||||||||
| # backgrounds only the python call, not the whole compound statement. | ||||||||||||
| def wrap_with_heartbeat(train_cmd_str): | ||||||||||||
| hb = "python scripts/gpu_heartbeat.py > /tmp/gpu_heartbeat.log 2>&1 & HEARTBEAT_PID=$!" | ||||||||||||
|
||||||||||||
| hb = "python scripts/gpu_heartbeat.py > /tmp/gpu_heartbeat.log 2>&1 & HEARTBEAT_PID=$!" | |
| hb = ( | |
| 'HEARTBEAT_LOG="/tmp/gpu_heartbeat.${SLURM_JOB_ID:-$$}.log"; ' | |
| 'python scripts/gpu_heartbeat.py > "$HEARTBEAT_LOG" 2>&1 & HEARTBEAT_PID=$!' | |
| ) |
Copilot
AI
Apr 11, 2026
The training command is converted to a shell string via `' '.join(full_cmd)` and then embedded into `bash -c` (and potentially into a brace group). This will break if any argument contains spaces or shell-special characters, and it also makes quoting/escaping fragile. Prefer building the shell string with `shlex.join(full_cmd)` (Python 3.8+) and applying `shlex.quote` to interpolated paths like `project_root`/`save_dir`.
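A sketch of how the brace-group wrapper could be built from argument lists with `shlex.join`, assuming the wrapper is the last thing in the `bash -c` string. This version of `wrap_with_heartbeat` takes a list rather than the pre-joined string used in the diff, and the `TRAIN_EXIT` variable and final `exit` are illustrative choices rather than the PR's actual implementation.

```python
import shlex


def wrap_with_heartbeat(train_argv, heartbeat_argv=("python", "scripts/gpu_heartbeat.py")):
    """Return a bash brace group that runs training with a background heartbeat."""
    train = shlex.join(train_argv)  # each argument is quoted individually
    hb = shlex.join(heartbeat_argv)
    return (
        "{ "
        'HEARTBEAT_LOG="/tmp/gpu_heartbeat.${SLURM_JOB_ID:-$$}.log"; '
        # `&` backgrounds only the heartbeat; $! captures its PID.
        f'{hb} > "$HEARTBEAT_LOG" 2>&1 & HEARTBEAT_PID=$!; '
        # Training runs in the foreground; its exit code is saved before cleanup.
        f"{train}; TRAIN_EXIT=$?; "
        'kill "$HEARTBEAT_PID" 2>/dev/null; '
        "exit $TRAIN_EXIT; }"
    )
```

For example, `wrap_with_heartbeat(["python", "train.py", "--train.data-dir", "/tmp/my runs"])` keeps `/tmp/my runs` as one argument, where `' '.join(...)` would split it in two.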
The default heartbeat workload allocates two 6144x6144 FP32 matrices (plus outputs), which can consume ~450MB+ of GPU memory and may be too large for smaller GPUs or memory-constrained jobs. Consider making `N`, `THRESHOLD`, and `BURST_ITERATIONS` configurable via CLI flags/env vars (with a safer default), so users can tune the heartbeat to their hardware and policies.
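A possible shape for the configurable version, as a sketch. The flag names and the `HEARTBEAT_*` environment variables are hypothetical. The defaults below simply mirror the constants in the diff, and choosing a smaller default is left open. `estimate_bytes` reproduces the arithmetic behind the ~450MB figure: three 6144x6144 FP32 matrices (two inputs plus one output) at 4 bytes per element.

```python
import argparse
import os


def estimate_bytes(n, bytes_per_element=4, num_matrices=3):
    """Approximate GPU memory for two n-by-n inputs plus one n-by-n output."""
    return num_matrices * n * n * bytes_per_element


def parse_heartbeat_args(argv=None, env=None):
    """CLI flags take precedence; environment variables supply the defaults."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="GPU heartbeat (sketch)")
    parser.add_argument("--matrix-size", type=int, default=int(env.get("HEARTBEAT_N", 6144)))
    parser.add_argument("--threshold", type=int, default=int(env.get("HEARTBEAT_THRESHOLD", 65)))
    parser.add_argument(
        "--burst-iterations", type=int, default=int(env.get("HEARTBEAT_BURST_ITERATIONS", 60))
    )
    return parser.parse_args(argv)
```

At the current default, `estimate_bytes(6144)` comes to 452,984,832 bytes. Halving the matrix size to 3072 cuts that by a factor of four, at the cost of less GPU work per matmul.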