Description
I recently spent some time trying to get SkyRL-tx working on a Ray cluster. I initially started with an approach similar to #955, but I don't think it's quite what we want because it still requires running the uv run command on Ray nodes with GPUs / TPUs. Instead, it should be possible to run an entrypoint script on the head node that will then distribute Ray actors, similar to how skyrl-train works today.
This work may require some larger refactoring in skyrl/tinker/api.py, for two reasons: the engine parameters are currently extracted from the parent uv command used to start the tinker API server, and the tinker API server / engine need to run on the same Ray node as worker 0. We may need to decouple how the arguments for the API server / engine are set, and add some logic to ensure the tinker API server / engine and worker 0 are co-located. I think the end result should look similar to the existing main entrypoint for skyrl-train, with added logic for scheduling the tinker API server / engine.
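To make the co-location idea concrete, here is a rough sketch of one way it could work using a Ray placement group. It is not based on the current skyrl code: `EngineConfig`, `build_bundles`, `launch`, and the two placeholder actors are hypothetical names, the config is passed explicitly instead of being recovered from the parent uv command, and the resource numbers are made up. It assumes GPU workers; a TPU setup would need different resource keys.

```python
from dataclasses import dataclass


@dataclass
class EngineConfig:
    # Hypothetical settings, set explicitly by the entrypoint rather than
    # extracted from the parent uv command line.
    num_workers: int = 2
    gpus_per_worker: int = 1


def build_bundles(cfg: EngineConfig) -> list[dict[str, float]]:
    """One bundle per worker; bundle 0 reserves an extra CPU for the API server / engine."""
    bundles = [{"CPU": 1, "GPU": cfg.gpus_per_worker} for _ in range(cfg.num_workers)]
    bundles[0]["CPU"] += 1
    return bundles


def launch(cfg: EngineConfig):
    """Run from the head node; schedules actors onto the cluster."""
    # Ray is imported lazily so build_bundles can be used without it.
    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(address="auto")  # connect to the already-running cluster
    pg = placement_group(build_bundles(cfg), strategy="PACK")
    ray.get(pg.ready())  # block until every bundle has been reserved

    def in_bundle(index: int) -> PlacementGroupSchedulingStrategy:
        return PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=index
        )

    @ray.remote
    class Worker:  # placeholder for the real worker actor
        def node_id(self) -> str:
            return ray.get_runtime_context().get_node_id()

    @ray.remote
    class ApiServerAndEngine:  # placeholder for the tinker API server / engine
        def node_id(self) -> str:
            return ray.get_runtime_context().get_node_id()

    workers = [
        Worker.options(
            num_cpus=1,
            num_gpus=cfg.gpus_per_worker,
            scheduling_strategy=in_bundle(i),
        ).remote()
        for i in range(cfg.num_workers)
    ]
    # A bundle always fits on a single node, so sharing bundle 0 with
    # worker 0 places the server on worker 0's node.
    server = ApiServerAndEngine.options(
        num_cpus=1, num_gpus=0, scheduling_strategy=in_bundle(0)
    ).remote()
    return server, workers
```

Calling `launch(EngineConfig(num_workers=4))` from the head node would then be the whole entrypoint, and comparing `node_id()` on the server and on `workers[0]` is a quick way to confirm they landed together.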
I'm happy to drive this, but I'm looking for some guidance on the overall approach.