[tinker] Support using Ray to manage tinker API server, engine and Jax workers

I recently spent some time trying to get SkyRL-tx working on a Ray cluster. I started with an approach similar to https://github.com/NovaSky-AI/SkyRL/pull/955, but I don't think it's quite what we want, because it still requires running the `uv run` command on the Ray nodes with GPUs / TPUs. Instead, it should be possible to run an entrypoint script on the head node that then distributes Ray actors across the cluster, similar to how skyrl-train works today.

This work may require some larger refactoring in `skyrl/tinker/api.py`, for two reasons: the engine parameters are currently extracted from the parent uv command used to launch the tinker API server, and the tinker API server / engine need to run on the same Ray node as worker 0. We may need to decouple how the arguments for the API server / engine are set, and add some logic to ensure the tinker API server / engine and worker 0 are co-located. I think the end result should be something similar to the existing [main entrypoint for skyrl-train](https://github.com/NovaSky-AI/SkyRL/blob/main/skyrl/train/entrypoints/main_base.py), but with some added logic for scheduling the tinker API server / engine.
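
To make the co-location idea concrete, here is a rough sketch of one way it could be done with a Ray placement group. Every name in it (`LaunchPlan`, `build_bundles`, `worker_options`, `Service`, `launch`) is hypothetical and not taken from the SkyRL codebase, and the CPU counts are placeholder assumptions. The idea is that a placement-group bundle always lives on a single node, so giving bundle 0 enough CPUs for worker 0, the API server, and the engine, and scheduling all three onto that bundle, keeps them together. Settings are passed in explicitly rather than parsed out of a parent command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchPlan:
    """Hypothetical launch settings, passed explicitly to the entrypoint."""
    num_workers: int = 2
    accelerators_per_worker: int = 1
    accelerator: str = "GPU"  # assumed resource name; "TPU" on TPU nodes


def build_bundles(plan: LaunchPlan) -> list[dict[str, float]]:
    """One bundle per worker; bundle 0 also reserves CPUs for the API server and engine."""
    bundles = []
    for rank in range(plan.num_workers):
        cpus = 3.0 if rank == 0 else 1.0  # worker + API server + engine on bundle 0
        bundles.append({"CPU": cpus, plan.accelerator: float(plan.accelerators_per_worker)})
    return bundles


def worker_options(plan: LaunchPlan) -> dict:
    """Ray takes GPUs via num_gpus; other accelerators go in the custom resources dict."""
    if plan.accelerator == "GPU":
        return {"num_cpus": 1, "num_gpus": plan.accelerators_per_worker}
    return {"num_cpus": 1, "resources": {plan.accelerator: plan.accelerators_per_worker}}


def launch(plan: LaunchPlan):
    """Run on the head node; needs a running Ray cluster with the requested resources."""
    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(address="auto")
    pg = placement_group(build_bundles(plan))
    ray.get(pg.ready())  # block until every bundle has been reserved

    def on_bundle(index: int) -> PlacementGroupSchedulingStrategy:
        return PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=index
        )

    @ray.remote
    class Service:
        """Stand-in for the API server, engine, or a JAX worker."""

        def __init__(self, role: str):
            self.role = role

        def node_id(self) -> str:
            return ray.get_runtime_context().get_node_id()

    # The API server and engine share bundle 0 with worker 0, so they land on one node.
    api_server = Service.options(num_cpus=1, scheduling_strategy=on_bundle(0)).remote("api")
    engine = Service.options(num_cpus=1, scheduling_strategy=on_bundle(0)).remote("engine")
    workers = [
        Service.options(scheduling_strategy=on_bundle(rank), **worker_options(plan)).remote(
            f"worker{rank}"
        )
        for rank in range(plan.num_workers)
    ]

    head_node = ray.get(workers[0].node_id.remote())
    assert ray.get(api_server.node_id.remote()) == head_node
    assert ray.get(engine.node_id.remote()) == head_node
    return api_server, engine, workers
```

An alternative would be to start worker 0 first, read its node ID, and pin the API server / engine to that node with `NodeAffinitySchedulingStrategy`. The placement-group version reserves everything up front, which seems closer to how a single head-node entrypoint would want to behave, but either could work.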

I'm happy to drive this, but I'm looking for some guidance on the overall approach.

[tinker] Support using Ray to manage tinker API server, engine and Jax workers #1393
