Description
What happened + What you expected to happen
Currently, the get job implementation works this way:
1. Obtain all job info from Redis.
2. Ping every driver to find out which ones are currently doing work, and wait until all drivers reply.
3. Combine all the info and reply.
Step 2 is very fragile because a driver may not reply to RPCs when it is busy or has been killed abruptly. This happens especially often in HA clusters (when the driver on a head node dies unexpectedly, it takes at least 5~10 minutes to detect the failure via keepalive). See #40431 for more details. The problem is that I have also observed this in non-HA clusters.
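To make the failure mode concrete, here is a minimal sketch of the fan-out pattern described above (all names, delays, and the `asyncio`-based shape are illustrative, not the actual Ray/GCS code). One slow or dead driver dominates the latency of the whole call:

```python
import asyncio

async def ping_driver(driver_id: str, delay: float) -> dict:
    # Simulates a driver RPC: a busy driver replies slowly,
    # a dead driver effectively never replies.
    await asyncio.sleep(delay)
    return {"driver_id": driver_id, "is_running": True}

async def get_all_jobs(drivers: dict, timeout: float) -> list:
    # Step 2 of the current design: ping every driver and wait for all
    # of them. The overall latency equals the slowest (or timed-out) driver.
    tasks = [asyncio.wait_for(ping_driver(d, delay), timeout)
             for d, delay in drivers.items()]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Step 3: combine whatever we got; timed-out drivers become "unknown".
    return [r if isinstance(r, dict) else {"is_running": None}
            for r in results]

# "dead" never answers within the timeout, so the call blocks for the
# full timeout and still returns incomplete information.
drivers = {"fast": 0.0, "slow": 0.2, "dead": 10.0}
jobs = asyncio.run(get_all_jobs(drivers, timeout=0.5))
print(jobs)
```

Even with the timeout, the reply latency is gated on the worst driver, which matches the instability described above.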
I think this style of protocol is fundamentally fragile: a driver may reply very slowly (making this RPC extremely slow) or never reply at all (making the API hang). Although driver RPCs have a timeout, the latency will still be very unstable. We should avoid this pattern. Instead:
- Driver activities should be sent to the GCS periodically via the existing plumbing path.
- The GCS should reply immediately from the information it already has, without pinging drivers.
NOTE: We need to discuss the details further to make sure we can guarantee some level of correctness for job activity.
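A minimal sketch of the proposed push model, assuming a hypothetical GCS-side cache (the class, method names, and staleness heuristic are illustrative, not an actual Ray API). Drivers report on the periodic path, and queries are answered immediately from local state, with correctness bounded by the report period:

```python
class GcsJobCache:
    # Hypothetical GCS-side cache of driver activity reports.
    def __init__(self, staleness_s: float):
        self.staleness_s = staleness_s
        self._last_report = {}  # driver_id -> timestamp of last report

    def report_activity(self, driver_id: str, now: float) -> None:
        # Called from each driver on the existing periodic plumbing path.
        self._last_report[driver_id] = now

    def get_all_jobs(self, now: float) -> dict:
        # Replies immediately, no driver round-trip: a driver is
        # considered active iff it has reported within staleness_s.
        return {d: (now - t) <= self.staleness_s
                for d, t in self._last_report.items()}

cache = GcsJobCache(staleness_s=1.0)
cache.report_activity("driver-a", now=0.0)
cache.report_activity("driver-b", now=0.0)
cache.report_activity("driver-a", now=5.0)  # driver-a keeps reporting
jobs = cache.get_all_jobs(now=5.5)          # driver-b went silent
print(jobs)
```

The trade-off is that activity info can be stale by up to one report period, which is exactly the correctness question flagged in the NOTE above.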
Versions / Dependencies
master
Reproduction script
n/a
Issue Severity
None