Vine: Worker transfer limit must be larger than manager limit. #4050
The point here is that VINE_WORKER_TRANSFER_LIMIT cannot be smaller than q->worker_source_max_transfers. (Otherwise transfers fail and tasks get forsaken even when the manager is doing the right thing.)
I think that this value should be defined in
I like the idea of connecting the two together clearly in the code.
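One way to picture "connecting the two together" is a minimal C sketch like the one below. It is not the code in this PR. The `sketch_` names, the struct, and both numeric values are hypothetical and only illustrate the rule that the worker limit must be at least as large as the manager's per-source limit.

```c
#include <assert.h>

/* Hypothetical value for illustration only; the real limit lives in the worker source. */
#define SKETCH_WORKER_TRANSFER_LIMIT 10

/* Hypothetical default for the manager-side setting, also for illustration only. */
#define SKETCH_DEFAULT_SOURCE_MAX_TRANSFERS 3

/* Tie the two values together at compile time so they cannot drift apart silently. */
_Static_assert(SKETCH_WORKER_TRANSFER_LIMIT >= SKETCH_DEFAULT_SOURCE_MAX_TRANSFERS,
	"worker transfer limit must not be smaller than the manager default");

/* Hypothetical stand-in for the manager structure: only the field named in this thread. */
struct sketch_manager {
	int worker_source_max_transfers;
};

/* Returns 1 when a worker can serve every transfer the manager may schedule from it. */
int sketch_limits_consistent(const struct sketch_manager *q)
{
	return SKETCH_WORKER_TRANSFER_LIMIT >= q->worker_source_max_transfers;
}

/* Apply a requested tuning value, clamped to [1, worker limit]. Returns the value stored. */
int sketch_tune_source_max_transfers(struct sketch_manager *q, int requested)
{
	if (requested < 1) {
		requested = 1;
	}
	if (requested > SKETCH_WORKER_TRANSFER_LIMIT) {
		requested = SKETCH_WORKER_TRANSFER_LIMIT;
	}
	q->worker_source_max_transfers = requested;

	/* After clamping, the invariant discussed above must hold. */
	assert(sketch_limits_consistent(q));
	return requested;
}
```

Clamping at the point where the value is tuned is just one option. Rejecting an out-of-range value with an error would keep the same invariant.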
Update: per our discussion, we found that the process limit is capable of causing connection timeouts, so it is a potential problem.
I do not fully understand the problem at hand, but I do not think this define would interact badly with the other parameters controlling worker transfers. This define limits the maximum number of forked transfer processes at a given moment. If the manager schedules more transfers than this define allows, the excess transfers wait in a queue but still eventually complete. The problem this solved arose when a task has tens or hundreds of input files that need to be transferred. The manager does not respect transfer limits within a single task, so it queues tens or hundreds of transfers to occur in parallel, which caused socket timeouts. I think 128 is probably too high a number.
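The queueing behavior described in this comment can be sketched in C roughly as follows. This is only a model with counters, not the worker's actual transfer server. Nothing is forked, and every `sketch_` name is hypothetical.

```c
#include <assert.h>

/* Hypothetical bookkeeping for a worker's transfer server; no real forking happens here. */
struct sketch_transfer_queue {
	int limit;     /* maximum transfer processes allowed at once */
	int pending;   /* requested but not yet started */
	int running;   /* currently occupying a process slot */
	int completed; /* finished transfers */
};

void sketch_queue_init(struct sketch_transfer_queue *t, int limit)
{
	assert(limit > 0);
	t->limit = limit;
	t->pending = 0;
	t->running = 0;
	t->completed = 0;
}

/* Any number of transfers may be requested; they only accumulate in the queue. */
void sketch_queue_submit(struct sketch_transfer_queue *t, int count)
{
	t->pending += count;
}

/* Start queued transfers until the limit is reached. Returns how many were started. */
int sketch_queue_start_ready(struct sketch_transfer_queue *t)
{
	int started = 0;
	while (t->pending > 0 && t->running < t->limit) {
		t->pending--;
		t->running++;
		started++;
	}
	return started;
}

/* Mark one running transfer as done, which frees a slot for a queued one. */
void sketch_queue_finish_one(struct sketch_transfer_queue *t)
{
	if (t->running > 0) {
		t->running--;
		t->completed++;
	}
}
```

With a limit of 4 and 10 submitted transfers, only 4 run at once and the other 6 wait, but all 10 complete as slots free up. The sketch does not model how long a queued transfer waits, which is where the connection timeouts mentioned above would come from.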
Sorry if I am repeating myself, but I wanted to make sure this was reflected in the issue.
1 - The primary concurrency control is in the manager, which knows where all of the files are and where all of the transfers are happening. It uses
2 - A secondary concurrency control is in the worker when it initiates transfers. The worker's cache manager can accumulate any number of transfers, but will not initiate more than
3 - The tertiary control is in the worker when it receives transfers. The transfer server will only fork
When a large number of transfers are needed, our goal is not to run them all concurrently, because they would all get bad performance simultaneously. Instead, we are relying on the manager to ensure that enough are scheduled to keep busy. So, the essential requirement is:
But we also want some flexibility so that the user can manipulate the total amount of transfers going on, from a single point of control at the manager. Hence,
I think this is the right change to solve the immediate problem. We can contemplate ways to make this more configurable from the manager down the road.
Proposed Changes
For part of #4038
Merge Checklist
The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.
make test
Run local tests prior to pushing.
make format
Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python).
make lint
Run lint on source code prior to pushing.