Description
What happened + What you expected to happen
Currently, the get job implementation works this way:
1. Obtain all job info from Redis.
2. Ping every driver to find out which ones are currently doing work, and wait until all drivers reply.
3. Combine all the info and reply.
Step 2 is very fragile because a driver may not reply to RPCs when it is busy or has been killed abruptly. This happens especially often in HA clusters (when the driver on a head node dies unexpectedly, it takes at least 5~10 minutes to detect the failure via keepalive). See #40431 for more details. The problem is that I have also observed this in non-HA clusters.
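To make the failure mode concrete, here is a minimal sketch of the fan-out pattern described above (all names, delays, and the `asyncio`-based shape are illustrative, not the actual Ray/GCS code). One slow or dead driver dominates the latency of the whole call:

```python
import asyncio

async def ping_driver(driver_id: str, delay: float) -> dict:
    # Simulates a driver RPC: a busy driver replies slowly,
    # a dead driver effectively never replies.
    await asyncio.sleep(delay)
    return {"driver_id": driver_id, "is_running": True}

async def get_all_jobs(drivers: dict, timeout: float) -> list:
    # Step 2 of the current design: ping every driver and wait for all
    # of them. The overall latency equals the slowest (or timed-out) driver.
    tasks = [asyncio.wait_for(ping_driver(d, delay), timeout)
             for d, delay in drivers.items()]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Step 3: combine whatever we got; timed-out drivers become "unknown".
    return [r if isinstance(r, dict) else {"is_running": None}
            for r in results]

# "dead" never answers within the timeout, so the call blocks for the
# full timeout and still returns incomplete information.
drivers = {"fast": 0.0, "slow": 0.2, "dead": 10.0}
jobs = asyncio.run(get_all_jobs(drivers, timeout=0.5))
print(jobs)
```

Even with the timeout, the reply latency is gated on the worst driver, which matches the instability described above.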
I think this style of protocol is fundamentally fragile: a driver may reply very slowly (making this RPC extremely slow) or never reply at all (making the API hang). Although driver RPCs have a timeout, the latency will still be very unstable. We should avoid this pattern. Instead:
- Driver activities should be sent to the GCS periodically via the existing plumbing path.
- The GCS should reply immediately from the information it already has, without pinging drivers.
NOTE: We need to discuss the details further to make sure we can guarantee some level of correctness for job activity.
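A minimal sketch of the proposed push model, assuming a hypothetical GCS-side cache (the class, method names, and staleness heuristic are illustrative, not an actual Ray API). Drivers report on the periodic path, and queries are answered immediately from local state, with correctness bounded by the report period:

```python
class GcsJobCache:
    # Hypothetical GCS-side cache of driver activity reports.
    def __init__(self, staleness_s: float):
        self.staleness_s = staleness_s
        self._last_report = {}  # driver_id -> timestamp of last report

    def report_activity(self, driver_id: str, now: float) -> None:
        # Called from each driver on the existing periodic plumbing path.
        self._last_report[driver_id] = now

    def get_all_jobs(self, now: float) -> dict:
        # Replies immediately, no driver round-trip: a driver is
        # considered active iff it has reported within staleness_s.
        return {d: (now - t) <= self.staleness_s
                for d, t in self._last_report.items()}

cache = GcsJobCache(staleness_s=1.0)
cache.report_activity("driver-a", now=0.0)
cache.report_activity("driver-b", now=0.0)
cache.report_activity("driver-a", now=5.0)  # driver-a keeps reporting
jobs = cache.get_all_jobs(now=5.5)          # driver-b went silent
print(jobs)
```

The trade-off is that activity info can be stale by up to one report period, which is exactly the correctness question flagged in the NOTE above.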
Versions / Dependencies
master
Reproduction script
n/a
Issue Severity
None