[Ray Clusters] Allow cluster node start check to be customized through cluster config #48110

Open
@hartikainen

Description

At the moment, the command runner checks if a cluster node has come up by running uptime on it. This works fine in most cases but causes problems when uptime is not a good indicator of the node being set up.

It would be nice to be able to configure custom "wait_for_node" checks for the nodes in the cluster-config.yml. This way the user could wait for arbitrary startup actions to finish, rather than just the uptime check that is currently hard-coded into the command runner.
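
Purely as an illustration, such an option could look something like the sketch below. The key name wait_for_node_command is made up here and is not an existing Ray option:

# Hypothetical option, sketched only to illustrate the request. The command
# runner would poll this command over SSH instead of the hard-coded uptime,
# and consider the node ready once it exits 0.
wait_for_node_command: test -f /var/run/node-setup-done  # made-up marker file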

Below is an example that has been causing trouble for me.

The Deep Learning VM images on GCP (e.g. projects/deeplearning-platform-release/global/images/family/common-cu123-ubuntu-2204-py310) contain a set of startup scripts as follows:

(base) ubuntu@ray-test-cluster-head-a2213017-compute:~$ ls -la /opt/c2d/scripts/
total 76
drwxr-xr-x 2 root root 4096 Sep 23 07:22 .
drwxr-xr-x 3 root root 4096 Sep 23 07:22 ..
-rwxr-xr-x 1 root root 7175 Sep 23 07:22 00-setup-api-services.sh
-rwxr-xr-x 1 root root  924 Sep 23 07:22 01-attempt-register-vm-on-proxy.sh
-rwxr-xr-x 1 root root 1092 Sep 23 07:22 01-install-gpu-driver.sh
-rwxr-xr-x 1 root root  998 Sep 23 07:22 02-swap-binaries.sh
-rwxr-xr-x 1 root root 3703 Sep 23 07:22 03-enable-jupyter.sh
-rwxr-xr-x 1 root root 2017 Sep 23 07:22 04-install-jupyter-extensions.sh
-rwxr-xr-x 1 root root 2789 Sep 23 07:22 06-set-metadata.sh
-rwxr-xr-x 1 root root 1208 Sep 23 07:22 07-enable-single-user-login.sh
-rwxr-xr-x 1 root root  856 Sep 23 07:22 92-notebook-security.sh
-rwxr-xr-x 1 root root 1612 Sep 23 07:22 93-configure-conda.sh
-rwxr-xr-x 1 root root  902 Sep 23 07:22 94-report-to-api-server.sh
-rwxr-xr-x 1 root root 1275 Sep 23 07:22 95-enable-monitoring.sh
-rwxr-xr-x 1 root root 1277 Sep 23 07:22 96-run-post-startup-script.sh
-rwxr-xr-x 1 root root 1554 Sep 23 07:22 97-initialize-executor-service.sh
-rwxr-xr-x 1 root root  873 Sep 23 07:22 98-enable-updates.sh
-rwxr-xr-x 1 root root  753 Sep 23 07:22 99-enable-ssh.sh

The last one, 99-enable-ssh.sh, forcefully kills SSH sessions and restarts ssh.service:

(base) ubuntu@ray-test-cluster-head-a2213017-compute:~$ tail -n3 /opt/c2d/scripts/99-enable-ssh.sh
( sudo pgrep -a "sshd" && sudo pkill -f "sshd" ) || echo "No leftover SSH sessions found."
sudo systemctl enable ssh.service
sudo systemctl restart ssh.service

It takes around 2 minutes between when the uptime command first succeeds and when 99-enable-ssh.sh starts.

This is a problem for Ray Clusters because cluster initialization fails if any of the initialization/setup commands are still running when 99-enable-ssh.sh is executed (automatically by the VM's main startup script): the Ray cluster initialization can't recover from SSH failures/disconnects. So far I haven't found a way to make this work with Ray without manually disabling the startup script (see the workaround below).
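
If something like the hypothetical wait_for_node_command sketched above existed, the check for this image could perhaps be pointed at the startup scripts themselves. This is only a sketch: it assumes the scripts keep /opt/c2d/scripts in their command lines while they run, and that ssh.service is the unit name (as in the script snippet above):

# Sketch only: succeed once none of the c2d startup scripts are still
# running and ssh.service is active again. The [s] keeps pgrep from
# matching the check's own command line.
wait_for_node_command: >-
  ! pgrep -f "/opt/c2d/script[s]/" > /dev/null
  && systemctl is-active --quiet ssh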

Use case

I'd like to be able to run longer-running setup_commands on my cluster with the following cluster-config.yml:

cluster_name: test-cluster-setup

provider:
  type: gcp
  region: us-east1
  availability_zone: us-east1-b
  project_id: [project-id]

auth:
  ssh_user: ubuntu

available_node_types:
  gpu:
    resources: {"CPU": 8, "GPU": 1}
    node_config:
      machineType: g2-standard-8
      guestAccelerators:
        - acceleratorType: nvidia-l4
          acceleratorCount: 1
      metadata:
        items:
          - key: install-nvidia-driver
            value: "True"
      scheduling:
        - onHostMaintenance: TERMINATE
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu123-ubuntu-2204-py310

head_node_type: gpu

initialization_commands: []

setup_commands:
  - echo "Start. Sleeping for 5 minutes, which will fail!"
  - for i in {1..100}; do echo "i=$i/100; sleeping for 5s." && sleep 5; done
  - echo "Didn't fail."

head_setup_commands: []
worker_setup_commands: []

head_start_ray_commands:
  - ray stop
  - >-
    ray start
    --head
    --port=6379
    --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
  - ray stop
  - >-
    ray start
    --address=$RAY_HEAD_IP:6379
    --object-manager-port=8076

And then start the cluster with:

ray up -y  --no-config-cache ./cluster-config.yml

Currently, this fails with:

$ ray up -y  --no-config-cache ./cluster-config-2.yml
Cluster: test-cluster-setup

[...]

  [6/7] Running setup commands
    (0/3) echo "Start. Sleeping for 5 mi...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Start. Sleeping for 5 minutes, which will fail!
Shared connection to 34.148.164.38 closed.
    (1/3) for i in {1..100}; do echo "i=...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

i=1/100; sleeping for 5s.
i=2/100; sleeping for 5s.
i=3/100; sleeping for 5s.
i=4/100; sleeping for 5s.
i=5/100; sleeping for 5s.
i=6/100; sleeping for 5s.
i=7/100; sleeping for 5s.
i=8/100; sleeping for 5s.
i=9/100; sleeping for 5s.
i=10/100; sleeping for 5s.
i=11/100; sleeping for 5s.
i=12/100; sleeping for 5s.
i=13/100; sleeping for 5s.
i=14/100; sleeping for 5s.
i=15/100; sleeping for 5s.
i=16/100; sleeping for 5s.
i=17/100; sleeping for 5s.
i=18/100; sleeping for 5s.
i=19/100; sleeping for 5s.
i=20/100; sleeping for 5s.
i=21/100; sleeping for 5s.
i=22/100; sleeping for 5s.
Shared connection to 34.148.164.38 closed.
2024-10-20 15:56:54,943 INFO node.py:348 -- wait_for_compute_zone_operation: Waiting for operation operation-1729454214738-624edf01ff250-54f463fb-cc02b264 to finish...
2024-10-20 15:57:00,287 INFO node.py:367 -- wait_for_compute_zone_operation: Operation operation-1729454214738-624edf01ff250-54f463fb-cc02b264 finished.
  New status: update-failed
  !!!
  Full traceback: Traceback (most recent call last):
  File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
    self.do_update()
  File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/updater.py", line 490, in do_update
    self.cmd_runner.run(cmd, run_env="auto")
  File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
    return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/conda/envs/ray-exec-stop-tmux/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

  Error message: SSH command failed.
  !!!

  Failed to setup head node.

So far I've worked around this with the following initialization command:

initialization_commands:
  - sudo mv /opt/c2d/scripts/99-enable-ssh.sh /opt/c2d/scripts/99-enable-ssh.sh.disabled
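
A possibly gentler variant of the same idea (untested, and relying on the same timing assumption, i.e. that the initialization command runs before 99-enable-ssh.sh does) would be to keep the script but strip only the line that kills existing SSH sessions:

initialization_commands:
  # Untested sketch: drop only the pkill line so that ssh.service still
  # gets enabled and restarted by the image's startup scripts.
  - sudo sed -i '/pkill -f "sshd"/d' /opt/c2d/scripts/99-enable-ssh.sh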

Labels: P1 (Issue that should be fixed within a few weeks), core (Issues that should be addressed in Ray Core), core-clusters (For launching and managing Ray clusters/jobs/kubernetes), enhancement (Request for new feature and/or capability)
