Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vine: put files to worker may fail #4061

Closed
JinZhou5042 opened this issue Feb 11, 2025 · 9 comments
Closed

vine: put files to worker may fail #4061

JinZhou5042 opened this issue Feb 11, 2025 · 9 comments
Labels

Comments

@JinZhou5042
Copy link
Member

JinZhou5042 commented Feb 11, 2025

In a DV5 run this morning I disabled the worker transfers which means files are sent back to the manager for permanent storage, however in the end of the run I found the workflow stuck due to a circle of tasks being forsaken and being resubmitted.

Then I enabled the worker debug logs, all the indications from the logs being that the put requests from the manager to workers may fail, however that failure message doesn't come to the manager and it just simply assumes it to be succeeded, this causes transfer failures when other workers want to fetch files from the worker with put failures.

Some evidence:

From the worker debug log:

(coffea2024.4.0) [jzhou24@condorfe vine_scratch]$ grep "file-rnd-mqqhtoxkijnbuid" worker.8.log 
2025/02/11 12:06:19.14 vine_worker[2252195]vine: rx: put file-rnd-mqqhtoxkijnbuid 1 38000902
2025/02/11 12:06:19.14 vine_worker[2252195]vine: rx: file file-rnd-mqqhtoxkijnbuid 38000902 0644 1739293572
2025/02/11 12:06:19.95 vine_worker[2253526]vine: tx: error file-rnd-mqqhtoxkijnbuid 2
2025/02/11 12:06:19.95 vine_worker[2253527]vine: tx: error file-rnd-mqqhtoxkijnbuid 2
2025/02/11 12:06:19.97 vine_worker[2252195]vine: rx: infile file-rnd-mqqhtoxkijnbuid 16baa852-8508-4c5f-9cd2-518858f3d348.p 0
2025/02/11 12:06:19.97 vine_worker[2252195]vine: input: link /var/condor/execute/dir_2251572/worker-241627-2252195/cache/file-rnd-mqqhtoxkijnbuid -> /var/condor/execute/dir_2251572/worker-241627-2252195/task.3881/16baa852-8508-4c5f-9cd2-518858f3d348.p
2025/02/11 12:06:51.23 vine_worker[2252195]vine: rx: infile file-rnd-mqqhtoxkijnbuid 6bb6170e-3f37-44bb-b695-efdd1a852f7a.p 0
2025/02/11 12:06:51.23 vine_worker[2252195]vine: input: link /var/condor/execute/dir_2251572/worker-241627-2252195/cache/file-rnd-mqqhtoxkijnbuid -> /var/condor/execute/dir_2251572/worker-241627-2252195/task.3880/6bb6170e-3f37-44bb-b695-efdd1a852f7a.p
2025/02/11 12:06:51.26 vine_worker[2254061]vine: tx: file file-rnd-mqqhtoxkijnbuid 38000902 0644 1739293579
2025/02/11 12:11:07.15 vine_worker[2252195]debug: trashing file /var/condor/execute/dir_2251572/worker-241627-2252195/cache/file-rnd-mqqhtoxkijnbuid
2025/02/11 12:11:07.15 vine_worker[2252195]debug: trashing file /var/condor/execute/dir_2251572/worker-241627-2252195/cache/file-rnd-mqqhtoxkijnbuid.meta

It says file file-rnd-mqqhtoxkijnbuid was put to the worker at 2025/02/11 12:06:19.14, but failed at 2025/02/11 12:06:19.95

From the manager's point of view:

2025/02/11 12:06:19.14 vine_manager[539746]vine: tx to d12chas080.crc.nd.edu (10.32.82.29:38486): put file-rnd-mqqhtoxkijnbuid 1 38000902
2025/02/11 12:06:19.14 vine_manager[539746]vine: tx to d12chas080.crc.nd.edu (10.32.82.29:38486): file file-rnd-mqqhtoxkijnbuid 38000902 0644 1739293572
2025/02/11 12:06:19.94 vine_manager[539746]vine: tx to d12chas080.crc.nd.edu (10.32.82.29:38486): infile file-rnd-mqqhtoxkijnbuid 16baa852-8508-4c5f-9cd2-518858f3d348.p 0
2025/02/11 12:06:19.94 vine_manager[539746]vine: d32cepyc129.crc.nd.edu (10.32.91.25:51780) will get 6bb6170e-3f37-44bb-b695-efdd1a852f7a.p from url workerip://10.32.82.29:1025/file-rnd-mqqhtoxkijnbuid
2025/02/11 12:06:19.94 vine_manager[539746]vine: tx to d32cepyc129.crc.nd.edu (10.32.91.25:51780): puturl workerip://10.32.82.29:1025/file-rnd-mqqhtoxkijnbuid file-rnd-mqqhtoxkijnbuid 0 38000902 0755 7A32ABC2-72CD-42B7-8989-1A4D055EFCCC
2025/02/11 12:06:19.94 vine_manager[539746]vine: tx to d32cepyc129.crc.nd.edu (10.32.91.25:51780): infile file-rnd-mqqhtoxkijnbuid 6bb6170e-3f37-44bb-b695-efdd1a852f7a.p 0
2025/02/11 12:06:19.94 vine_manager[539746]vine: d32cepyc179.crc.nd.edu (10.32.91.125:48574) will get 7a398408-d73f-4e9b-b9fc-b35b378f8a80.p from url workerip://10.32.82.29:1025/file-rnd-mqqhtoxkijnbuid
2025/02/11 12:06:19.94 vine_manager[539746]vine: tx to d32cepyc179.crc.nd.edu (10.32.91.125:48574): puturl workerip://10.32.82.29:1025/file-rnd-mqqhtoxkijnbuid file-rnd-mqqhtoxkijnbuid 0 38000902 0755 36058395-DFDC-42E6-8B5E-2981992F90E2
2025/02/11 12:06:19.94 vine_manager[539746]vine: tx to d32cepyc179.crc.nd.edu (10.32.91.125:48574): infile file-rnd-mqqhtoxkijnbuid 7a398408-d73f-4e9b-b9fc-b35b378f8a80.p 0
2025/02/11 12:06:19.94 vine_manager[539746]vine: tx to d12chas059.crc.nd.edu (10.32.81.243:46324): infile file-rnd-mqqhtoxkijnbuid 6300b7e0-8bc5-4ea9-a7de-ddfd84401cde.p 0
2025/02/11 12:06:19.95 vine_manager[539746]vine: rx from d32cepyc129.crc.nd.edu (10.32.91.25:51780): cache-invalid file-rnd-mqqhtoxkijnbuid 113 7A32ABC2-72CD-42B7-8989-1A4D055EFCCC
2025/02/11 12:06:19.95 vine_manager[539746]vine: d32cepyc129.crc.nd.edu (10.32.91.25:51780) invalidated file-rnd-mqqhtoxkijnbuid with error: error creating file at worker: Could not transfer file from workerip://10.32.82.29:1025/file-rnd-mqqhtoxkijnbuid
2025/02/11 12:06:19.95 vine_manager[539746]vine: rx from d32cepyc179.crc.nd.edu (10.32.91.125:48574): cache-invalid file-rnd-mqqhtoxkijnbuid 113 36058395-DFDC-42E6-8B5E-2981992F90E2
2025/02/11 12:06:19.95 vine_manager[539746]vine: d32cepyc179.crc.nd.edu (10.32.91.125:48574) invalidated file-rnd-mqqhtoxkijnbuid with error: error creating file at worker: Could not transfer file from workerip://10.32.82.29:1025/file-rnd-mqqhtoxkijnbuid

It puts file file-rnd-mqqhtoxkijnbuid at exactly the same time, however it doesn't seem to process the error message tx: error file-rnd-mqqhtoxkijnbuid 2 from the worker.

Therefore, the manager would assume that that file had been successfully put on the worker and will try will get 6bb6170e-3f37-44bb-b695-efdd1a852f7a.p from url workerip://10.32.82.29:1025/file-rnd-mqqhtoxkijnbuid when a later task needs that file as input.

This only sporadically happens when I repeat. Though I am not very sure if it was because the put failed or other kinds of failures in the worker cache?

@JinZhou5042
Copy link
Member Author

The symptom is the same as #4038 though they are different issues.

@dthain
Copy link
Member

dthain commented Feb 11, 2025

Hmm, it sounds pretty similar to me. When the manager puts a replica to a worker, it should record it as being in the PENDING state. When the worker sends back cache-update then the replica should be put in the READY state. Only files in the READY state should be the sources of transfers.

@JinZhou5042
Copy link
Member Author

JinZhou5042 commented Feb 11, 2025

VINE_FILE_REPLICA_STATE_PENDING is set in two places:

1 (here) When the manager puts a temp file as input
2 (here) When file_replica_create is called

Seems like putting a vine_file doesn't set the state to READY?

@dthain
Copy link
Member

dthain commented Feb 11, 2025

Oh, that's interesting. Hmm.

See here for the manager setting the replica state. For items of type VINE_FILE it does directly to READY because anything following that message can assume that the file has arrived. (And we are not expecting a cache-update back from the worker.)

Now, I'm not understanding why the worker is generating an error message. That would normally only happen as a result of a get failure. I'm curious what precedes that in the log.

But I think you are on to something: if the manager sends the worker a file, then it's fair for anything else on that connection to assume that the file has arrived. But, if there is a lot of stuff queued up on that pipe, and the manager sends puturl to another worker, it's possible that the get could be attempted before the file has fully arrived.

Perhaps we need to require all file types to return a cache-update message and set the replica state appropriately?

@JinZhou5042
Copy link
Member Author

@dthain You are right! Once I enforced the READY state to be set only in the cache-update handling function then I no longer saw tasks being forsaken and the workflow ended normally.

@dthain
Copy link
Member

dthain commented Feb 14, 2025

Ok... but also the worker needs to send cache-update messages for files/buffers received vine_cache_add_file, yes? Otherwise, they won't become available for purposes of P2P transfer.

@JinZhou5042
Copy link
Member Author

JinZhou5042 commented Feb 26, 2025

It seems this issue was caused by the fd accumulation on the transfer server (see #4076 ), the manager puts a file and assumes the file is there, but since the fd limit is hit the worker cannot transfer files to other workers.

But I still wonder if we should delay setting the reaplica's READY status until the corresponding cache-update message arrives?

@dthain
Copy link
Member

dthain commented Feb 26, 2025

Yes, I think it would be wise to make that change, although it is not urgent.

@JinZhou5042
Copy link
Member Author

Got it, I will close this issue and instead open another one with this specific task

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants