[tinker] Support using Ray to manage tinker API server, engine and Jax workers #1393

@andrewsykim

I recently spent some time trying to get SkyRL-tx working on a Ray cluster. I started with an approach similar to #955, but I don't think it's quite what we want, because it still requires running the uv run command on the Ray nodes that have GPUs / TPUs. Instead, it should be possible to run an entrypoint script on the head node that then distributes Ray actors across the cluster, similar to how skyrl-train works today.

This work may require some larger refactoring in skyrl/tinker/api.py, for two reasons: the engine parameters are currently extracted from the parent uv command used for the tinker API server, and the tinker API server / engine need to run on the same Ray node as worker 0. We may need to decouple how the arguments for the API server / engine are set, and add some logic to ensure that the tinker API server / engine and worker 0 are co-located. I think the end result should look similar to the existing main entrypoint for skyrl-train, with some added logic for scheduling the tinker API server / engine.

I'm happy to drive this but looking for some guidance on the overall approach.
