Description
At the moment, the command runner checks if a cluster node has come up by running uptime on it. This works fine in most cases, but it causes problems when uptime is not the best indicator that the node is actually set up.
It would be nice to be able to configure custom "wait_for_node" checks for the node in the cluster-config.yml. This way the user could wait for arbitrary startup actions to finish, instead of just the uptime check that is currently hard-coded into the command runner.
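For illustration, a hypothetical option along these lines (the wait_for_node_commands key and its retry semantics are my own sketch, not an existing Ray setting) would let the command runner retry user-supplied readiness checks instead of the hard-coded uptime:

# Hypothetical sketch -- "wait_for_node_commands" does not exist in Ray today.
# The idea: the command runner keeps retrying these commands (tolerating SSH
# disconnects) and only considers the node up once they all exit 0.
available_node_types:
  gpu:
    wait_for_node_commands:
      - uptime                                  # current behavior, kept as a default
      - test -f /var/lib/my-image-setup.done    # illustrative marker file, not a real path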
Below is an example that causes trouble for me.
The Deep Learning VM images on GCP (e.g. projects/deeplearning-platform-release/global/images/family/common-cu123-ubuntu-2204-py310) contain a set of startup scripts:
(base) ubuntu@ray-test-cluster-head-a2213017-compute:~$ ls -la /opt/c2d/scripts/
total 76
drwxr-xr-x 2 root root 4096 Sep 23 07:22 .
drwxr-xr-x 3 root root 4096 Sep 23 07:22 ..
-rwxr-xr-x 1 root root 7175 Sep 23 07:22 00-setup-api-services.sh
-rwxr-xr-x 1 root root 924 Sep 23 07:22 01-attempt-register-vm-on-proxy.sh
-rwxr-xr-x 1 root root 1092 Sep 23 07:22 01-install-gpu-driver.sh
-rwxr-xr-x 1 root root 998 Sep 23 07:22 02-swap-binaries.sh
-rwxr-xr-x 1 root root 3703 Sep 23 07:22 03-enable-jupyter.sh
-rwxr-xr-x 1 root root 2017 Sep 23 07:22 04-install-jupyter-extensions.sh
-rwxr-xr-x 1 root root 2789 Sep 23 07:22 06-set-metadata.sh
-rwxr-xr-x 1 root root 1208 Sep 23 07:22 07-enable-single-user-login.sh
-rwxr-xr-x 1 root root 856 Sep 23 07:22 92-notebook-security.sh
-rwxr-xr-x 1 root root 1612 Sep 23 07:22 93-configure-conda.sh
-rwxr-xr-x 1 root root 902 Sep 23 07:22 94-report-to-api-server.sh
-rwxr-xr-x 1 root root 1275 Sep 23 07:22 95-enable-monitoring.sh
-rwxr-xr-x 1 root root 1277 Sep 23 07:22 96-run-post-startup-script.sh
-rwxr-xr-x 1 root root 1554 Sep 23 07:22 97-initialize-executor-service.sh
-rwxr-xr-x 1 root root 873 Sep 23 07:22 98-enable-updates.sh
-rwxr-xr-x 1 root root 753 Sep 23 07:22 99-enable-ssh.sh
The last one, 99-enable-ssh.sh, forcefully kills SSH sessions and restarts ssh.service:
(base) ubuntu@ray-test-cluster-head-a2213017-compute:~$ tail -n3 /opt/c2d/scripts/99-enable-ssh.sh
( sudo pgrep -a "sshd" && sudo pkill -f "sshd" ) || echo "No leftover SSH sessions found."
sudo systemctl enable ssh.service
sudo systemctl restart ssh.service
It takes around 2 minutes between when the uptime command first succeeds and when 99-enable-ssh.sh starts.
This is a problem for Ray Clusters: cluster initialization fails if any of the initialization/setup commands are still running when 99-enable-ssh.sh is executed (automatically by the VM's main startup script), because the Ray cluster initialization cannot recover from SSH failures/disconnects. So far I haven't found a way to make this work with Ray without disabling the startup script manually (see the workaround below).
Use case
I'd like to be able to run a longer-running setup_command on my cluster with the following cluster-config.yml:
cluster_name: test-cluster-setup
provider:
  type: gcp
  region: us-east1
  availability_zone: us-east1-b
  project_id: [project-id]
auth:
  ssh_user: ubuntu
available_node_types:
  gpu:
    resources: {"CPU": 8, "GPU": 1}
    node_config:
      machineType: g2-standard-8
      guestAccelerators:
        - acceleratorType: nvidia-l4
          acceleratorCount: 1
      metadata:
        items:
          - key: install-nvidia-driver
            value: "True"
      scheduling:
        - onHostMaintenance: TERMINATE
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu123-ubuntu-2204-py310
head_node_type: gpu
initialization_commands: []
setup_commands:
  - echo "Start. Sleeping for 5 minutes, which will fail!"
  - for i in {1..100}; do echo "i=$i/100; sleeping for 5s." && sleep 5; done
  - echo "Didn't fail."
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
  - ray stop
  - >-
    ray start
    --head
    --port=6379
    --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
  - ray stop
  - >-
    ray start
    --address=$RAY_HEAD_IP:6379
    --object-manager-port=8076
And then start the cluster with:
ray up -y --no-config-cache ./cluster-config.yml
Currently, this fails with:
$ ray up -y --no-config-cache ./cluster-config-2.yml
Cluster: test-cluster-setup
[...]
[6/7] Running setup commands
(0/3) echo "Start. Sleeping for 5 mi...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
Start. Sleeping for 5 minutes, which will fail!
Shared connection to 34.148.164.38 closed.
(1/3) for i in {1..100}; do echo "i=...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
i=1/100; sleeping for 5s.
i=2/100; sleeping for 5s.
i=3/100; sleeping for 5s.
i=4/100; sleeping for 5s.
i=5/100; sleeping for 5s.
i=6/100; sleeping for 5s.
i=7/100; sleeping for 5s.
i=8/100; sleeping for 5s.
i=9/100; sleeping for 5s.
i=10/100; sleeping for 5s.
i=11/100; sleeping for 5s.
i=12/100; sleeping for 5s.
i=13/100; sleeping for 5s.
i=14/100; sleeping for 5s.
i=15/100; sleeping for 5s.
i=16/100; sleeping for 5s.
i=17/100; sleeping for 5s.
i=18/100; sleeping for 5s.
i=19/100; sleeping for 5s.
i=20/100; sleeping for 5s.
i=21/100; sleeping for 5s.
i=22/100; sleeping for 5s.
Shared connection to 34.148.164.38 closed.
2024-10-20 15:56:54,943 INFO node.py:348 -- wait_for_compute_zone_operation: Waiting for operation operation-1729454214738-624edf01ff250-54f463fb-cc02b264 to finish...
2024-10-20 15:57:00,287 INFO node.py:367 -- wait_for_compute_zone_operation: Operation operation-1729454214738-624edf01ff250-54f463fb-cc02b264 finished.
New status: update-failed
!!!
Full traceback: Traceback (most recent call last):
File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
self.do_update()
File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/updater.py", line 490, in do_update
self.cmd_runner.run(cmd, run_env="auto")
File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.
Error message: SSH command failed.
!!!
Failed to setup head node.
So far I've worked around this with the following initialization command, which disables the problematic startup script before it can run:
initialization_commands:
  - sudo mv /opt/c2d/scripts/99-enable-ssh.sh /opt/c2d/scripts/99-enable-ssh.sh.disabled
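With an option like the one sketched above, this image could instead be handled declaratively, without disabling the script. Purely as an illustration (the option is hypothetical, and the pgrep condition assumes the /opt/c2d scripts are visible as running processes, so a real check would likely need a more robust marker):

# Hypothetical sketch, reusing the proposed (non-existent) wait_for_node_commands option.
available_node_types:
  gpu:
    wait_for_node_commands:
      # Treat the node as up only once no /opt/c2d startup script is still running.
      - "! sudo pgrep -f '/opt/c2d/scripts/' > /dev/null"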