vine: put files to worker may fail #4061
Comments
The symptom is the same as #4038 though they are different issues.
Hmm, it sounds pretty similar to me. When the manager
Oh, that's interesting. Hmm. See here for the manager setting the replica state. For items of type Now, I'm not understanding why the worker is generating an But I think you are on to something: if the manager sends the worker a file, then it's fair for anything else on that connection to assume that the file has arrived. But, if there is a lot of stuff queued up on that pipe, and the manager sends Perhaps we need to require all file types to return a
@dthain You are right! Once I enforced the
Ok... but also the worker needs to send cache-update messages for files/buffers received
It seems this issue was caused by fd accumulation on the transfer server (see #4076): the manager puts a file and assumes the file is there, but because the fd limit has been hit, the worker cannot transfer files to other workers. But I still wonder if we should delay setting the replica's
Yes, I think it would be wise to make that change, although it is not urgent.
Got it, I will close this issue and instead open another one with this specific task.
In a DV5 run this morning I disabled worker transfers, which means files are sent back to the manager for permanent storage. At the end of the run, however, I found the workflow stuck in a cycle of tasks being forsaken and resubmitted.
Then I enabled the worker debug logs. All indications from the logs are that `put` requests from the manager to workers may fail, but the failure message never reaches the manager, which simply assumes the `put` succeeded. This causes transfer failures when other workers try to fetch files from the worker with `put` failures.
Some evidence:
From the worker debug log:
It says file `file-rnd-mqqhtoxkijnbuid` was put to the worker at `2025/02/11 12:06:19.14`, but failed at `2025/02/11 12:06:19.95`.
From the manager's point of view:
It puts file `file-rnd-mqqhtoxkijnbuid` at exactly the same time, but it doesn't seem to process the error message `tx: error file-rnd-mqqhtoxkijnbuid 2` from the worker.
Therefore, the manager assumes that the file was successfully put on the worker and will try `will get 6bb6170e-3f37-44bb-b695-efdd1a852f7a.p from url workerip://10.32.82.29:1025/file-rnd-mqqhtoxkijnbuid` when a later task needs that file as input.
This happens only sporadically when I repeat the run, though I am not sure whether it was caused by the `put` failure or by some other kind of failure in the worker cache.
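To make the idea from the discussion concrete, here is a minimal Python sketch of a manager-side replica table that does not treat a `put` as complete until the worker confirms it. This is only an illustration, not TaskVine's actual code: `ReplicaState`, `ReplicaTable`, and all method names are hypothetical, and it assumes the worker confirms a received file with a cache-update message and reports a failed one with some error message.

```python
from enum import Enum


class ReplicaState(Enum):
    PENDING = "pending"  # put was sent, nothing confirmed yet
    READY = "ready"      # worker confirmed the file is in its cache
    FAILED = "failed"    # worker reported an error for this file


class ReplicaTable:
    """Hypothetical manager-side record of which worker holds which file."""

    def __init__(self):
        self._state = {}  # (worker, cachename) -> ReplicaState

    def on_put_sent(self, worker, cachename):
        # Writing the file to the connection does not prove it arrived.
        self._state[(worker, cachename)] = ReplicaState.PENDING

    def on_cache_update(self, worker, cachename):
        # Assumed confirmation message from the worker.
        self._state[(worker, cachename)] = ReplicaState.READY

    def on_worker_error(self, worker, cachename):
        # Assumed failure report from the worker.
        self._state[(worker, cachename)] = ReplicaState.FAILED

    def sources_for(self, cachename):
        # Only confirmed replicas may serve as peer-transfer sources.
        return sorted(
            worker
            for (worker, name), state in self._state.items()
            if name == cachename and state is ReplicaState.READY
        )


table = ReplicaTable()
name = "file-rnd-mqqhtoxkijnbuid"
table.on_put_sent("worker-a", name)
table.on_put_sent("worker-b", name)
table.on_worker_error("worker-a", name)   # the put failed on worker-a
table.on_cache_update("worker-b", name)   # worker-b confirmed the file
print(table.sources_for(name))
```

With this bookkeeping, a later task that needs the file would be pointed only at `worker-b`, and never at a worker whose `put` failed or is still unconfirmed.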