
[Core] Rewrite the Get Job API implementation #40829

Open
@rkooo567

Description

What happened + What you expected to happen

Currently, the Get Job API implementation works this way:

  1. Obtain all job info from Redis.
  2. To determine which drivers are currently doing work, ping every driver and wait until all of them reply.
  3. Combine all the info and reply.
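To illustrate why step 2 is the weak point, here is a minimal sketch (not Ray's actual code; the names `ping_driver` and `get_all_jobs` are hypothetical) of a fan-out protocol that gates the reply on every driver responding:

```python
import asyncio
import time

# Hypothetical stand-in for the per-driver ping RPC; the delay simulates
# a driver's round-trip time (a busy or dying driver replies slowly).
async def ping_driver(driver_id: int, delay: float) -> dict:
    await asyncio.sleep(delay)
    return {"driver_id": driver_id, "is_running": True}

async def get_all_jobs(delays):
    # The combined reply is gated on the SLOWEST driver: a single busy or
    # unreachable driver stalls the whole API call for every caller.
    return await asyncio.gather(
        *(ping_driver(i, d) for i, d in enumerate(delays))
    )

start = time.monotonic()
replies = asyncio.run(get_all_jobs([0.01, 0.01, 0.5]))  # one slow driver
elapsed = time.monotonic() - start
print(f"{len(replies)} replies in {elapsed:.2f}s")  # latency tracks the slowest driver
```

With a per-RPC timeout the call eventually returns, but as noted below, the end-to-end latency still swings between "fast path" and "timeout path" depending on driver health.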

Step 2 is very fragile because a driver may fail to reply to RPCs when it is busy or has been killed abruptly. This happens especially often in HA clusters (when the driver on a head node dies unexpectedly, it takes at least 5~10 minutes to detect the failure via keepalive). See #40431 for more details. The problem is that I have also observed this in non-HA clusters.

I think this style of protocol is inherently fragile. A driver may reply very slowly (in which case this RPC becomes extremely slow) or never reply at all (in which case the API hangs). Although we have a timeout on driver RPCs, the latency remains very unstable. We should avoid this pattern. Instead:

  1. Driver activity should be sent to the GCS periodically via the existing plumbing path.
  2. The GCS should reply immediately from the state it already has.
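The two steps above amount to a push model. A minimal sketch, assuming a periodic activity report and a staleness cutoff (the class `Gcs`, its methods, and the 30-second threshold are all illustrative, not Ray's actual implementation):

```python
import time

# Hypothetical sketch of the proposed push model; not Ray's actual code.
class Gcs:
    def __init__(self):
        # driver_id -> (is_running, timestamp of last report)
        self._activity = {}

    def report_activity(self, driver_id, is_running):
        # Drivers push this periodically via the existing plumbing path,
        # so the GCS never has to initiate an RPC to a driver.
        self._activity[driver_id] = (is_running, time.monotonic())

    def get_all_jobs(self, staleness_s=30.0):
        # Replies immediately from locally cached state. A driver whose
        # last report is older than the cutoff is treated as not running,
        # which bounds how long a dead driver can look alive.
        now = time.monotonic()
        return {
            driver_id: running and (now - ts) < staleness_s
            for driver_id, (running, ts) in self._activity.items()
        }

gcs = Gcs()
gcs.report_activity("driver-1", True)
gcs.report_activity("driver-2", False)
print(gcs.get_all_jobs())
```

The trade-off is that the reported activity is only as fresh as the last push, which is exactly the correctness question flagged in the note below.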

NOTE: We need to discuss the details further to make sure we can guarantee some level of correctness for job activity.

Versions / Dependencies

master

Reproduction script

n/a

Issue Severity

None

Metadata


Labels

  - P2: Important issue, but not time-critical
  - bug: Something that is supposed to be working; but isn't
  - core: Issues that should be addressed in Ray Core
  - core-api
  - core-gcs: Ray core global control service
  - stability
