Description
At the moment, the command runner checks if a cluster node has come up by running uptime on it. This works fine in most cases, but it causes problems when uptime is not the best indicator that the node is actually set up.
It would be nice to be able to configure custom "wait_for_node" checks for the node in the cluster-config.yml. This way the user could wait for arbitrary startup actions to finish, instead of just the uptime check that is currently hard-coded into the command runner.
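For illustration, a hypothetical option along these lines (the wait_for_node_commands key and its retry semantics are my own sketch, not an existing Ray setting) would let the command runner retry user-supplied readiness checks instead of the hard-coded uptime:

# Hypothetical sketch -- "wait_for_node_commands" does not exist in Ray today.
# The idea: the command runner keeps retrying these commands (tolerating SSH
# disconnects) and only considers the node up once they all exit 0.
available_node_types:
  gpu:
    wait_for_node_commands:
      - uptime                                  # current behavior, kept as a default
      - test -f /var/lib/my-image-setup.done    # illustrative marker file, not a real path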
Below is an example that causes trouble for me.
The Deep Learning VM images on GCP (e.g. projects/deeplearning-platform-release/global/images/family/common-cu123-ubuntu-2204-py310) contain a set of startup scripts:
(base) ubuntu@ray-test-cluster-head-a2213017-compute:~$ ls -la /opt/c2d/scripts/
total 76
drwxr-xr-x 2 root root 4096 Sep 23 07:22 .
drwxr-xr-x 3 root root 4096 Sep 23 07:22 ..
-rwxr-xr-x 1 root root 7175 Sep 23 07:22 00-setup-api-services.sh
-rwxr-xr-x 1 root root 924 Sep 23 07:22 01-attempt-register-vm-on-proxy.sh
-rwxr-xr-x 1 root root 1092 Sep 23 07:22 01-install-gpu-driver.sh
-rwxr-xr-x 1 root root 998 Sep 23 07:22 02-swap-binaries.sh
-rwxr-xr-x 1 root root 3703 Sep 23 07:22 03-enable-jupyter.sh
-rwxr-xr-x 1 root root 2017 Sep 23 07:22 04-install-jupyter-extensions.sh
-rwxr-xr-x 1 root root 2789 Sep 23 07:22 06-set-metadata.sh
-rwxr-xr-x 1 root root 1208 Sep 23 07:22 07-enable-single-user-login.sh
-rwxr-xr-x 1 root root 856 Sep 23 07:22 92-notebook-security.sh
-rwxr-xr-x 1 root root 1612 Sep 23 07:22 93-configure-conda.sh
-rwxr-xr-x 1 root root 902 Sep 23 07:22 94-report-to-api-server.sh
-rwxr-xr-x 1 root root 1275 Sep 23 07:22 95-enable-monitoring.sh
-rwxr-xr-x 1 root root 1277 Sep 23 07:22 96-run-post-startup-script.sh
-rwxr-xr-x 1 root root 1554 Sep 23 07:22 97-initialize-executor-service.sh
-rwxr-xr-x 1 root root 873 Sep 23 07:22 98-enable-updates.sh
-rwxr-xr-x 1 root root 753 Sep 23 07:22 99-enable-ssh.sh
The last one, 99-enable-ssh.sh, forcefully kills SSH sessions and restarts ssh.service:
(base) ubuntu@ray-test-cluster-head-a2213017-compute:~$ tail -n3 /opt/c2d/scripts/99-enable-ssh.sh
( sudo pgrep -a "sshd" && sudo pkill -f "sshd" ) || echo "No leftover SSH sessions found."
sudo systemctl enable ssh.service
sudo systemctl restart ssh.service
It takes around 2 minutes between when the uptime command first succeeds and when 99-enable-ssh.sh starts.
This is a problem for Ray Clusters: cluster initialization fails if any of the initialization/setup commands are still running when 99-enable-ssh.sh is executed (automatically by the VM's main startup script), because the Ray cluster initialization cannot recover from SSH failures/disconnects. So far I haven't found a way to make this work with Ray without disabling the startup script manually (see the workaround below).
Use case
I'd like to be able to run a longer-running setup_command on my cluster with the following cluster-config.yml:
cluster_name: test-cluster-setup
provider:
  type: gcp
  region: us-east1
  availability_zone: us-east1-b
  project_id: [project-id]
auth:
  ssh_user: ubuntu
available_node_types:
  gpu:
    resources: {"CPU": 8, "GPU": 1}
    node_config:
      machineType: g2-standard-8
      guestAccelerators:
        - acceleratorType: nvidia-l4
          acceleratorCount: 1
      metadata:
        items:
          - key: install-nvidia-driver
            value: "True"
      scheduling:
        - onHostMaintenance: TERMINATE
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu123-ubuntu-2204-py310
head_node_type: gpu
initialization_commands: []
setup_commands:
  - echo "Start. Sleeping for 5 minutes, which will fail!"
  - for i in {1..100}; do echo "i=$i/100; sleeping for 5s." && sleep 5; done
  - echo "Didn't fail."
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
  - ray stop
  - >-
    ray start
    --head
    --port=6379
    --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
  - ray stop
  - >-
    ray start
    --address=$RAY_HEAD_IP:6379
    --object-manager-port=8076
And then start the cluster with:
ray up -y --no-config-cache ./cluster-config.yml
Currently, this fails with:
$ ray up -y --no-config-cache ./cluster-config-2.yml
Cluster: test-cluster-setup
[...]
[6/7] Running setup commands
(0/3) echo "Start. Sleeping for 5 mi...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
Start. Sleeping for 5 minutes, which will fail!
Shared connection to 34.148.164.38 closed.
(1/3) for i in {1..100}; do echo "i=...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
i=1/100; sleeping for 5s.
i=2/100; sleeping for 5s.
i=3/100; sleeping for 5s.
i=4/100; sleeping for 5s.
i=5/100; sleeping for 5s.
i=6/100; sleeping for 5s.
i=7/100; sleeping for 5s.
i=8/100; sleeping for 5s.
i=9/100; sleeping for 5s.
i=10/100; sleeping for 5s.
i=11/100; sleeping for 5s.
i=12/100; sleeping for 5s.
i=13/100; sleeping for 5s.
i=14/100; sleeping for 5s.
i=15/100; sleeping for 5s.
i=16/100; sleeping for 5s.
i=17/100; sleeping for 5s.
i=18/100; sleeping for 5s.
i=19/100; sleeping for 5s.
i=20/100; sleeping for 5s.
i=21/100; sleeping for 5s.
i=22/100; sleeping for 5s.
Shared connection to 34.148.164.38 closed.
2024-10-20 15:56:54,943 INFO node.py:348 -- wait_for_compute_zone_operation: Waiting for operation operation-1729454214738-624edf01ff250-54f463fb-cc02b264 to finish...
2024-10-20 15:57:00,287 INFO node.py:367 -- wait_for_compute_zone_operation: Operation operation-1729454214738-624edf01ff250-54f463fb-cc02b264 finished.
New status: update-failed
!!!
Full traceback: Traceback (most recent call last):
File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
self.do_update()
File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/updater.py", line 490, in do_update
self.cmd_runner.run(cmd, run_env="auto")
File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.
Error message: SSH command failed.
!!!
Failed to setup head node.
So far I've worked around this with the following initialization command, which disables the problematic startup script before it can run:
initialization_commands:
  - sudo mv /opt/c2d/scripts/99-enable-ssh.sh /opt/c2d/scripts/99-enable-ssh.sh.disabled
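With an option like the one sketched above, this image could instead be handled declaratively, without disabling the script. Purely as an illustration (the option is hypothetical, and the pgrep condition assumes the /opt/c2d scripts are visible as running processes, so a real check would likely need a more robust marker):

# Hypothetical sketch, reusing the proposed (non-existent) wait_for_node_commands option.
available_node_types:
  gpu:
    wait_for_node_commands:
      # Treat the node as up only once no /opt/c2d startup script is still running.
      - "! sudo pgrep -f '/opt/c2d/scripts/' > /dev/null"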