Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vine: Worker transfer limit must be larger than manager limit. #4050

Conversation

JinZhou5042
Copy link
Member

@JinZhou5042 JinZhou5042 commented Jan 30, 2025

Proposed Changes

For part of #4038

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

@JinZhou5042 JinZhou5042 marked this pull request as ready for review January 30, 2025 22:11
@JinZhou5042 JinZhou5042 self-assigned this Jan 30, 2025
@dthain dthain changed the title vine: increase VINE_TRANSFER_PROC_MAX_CHILD to be very large Vine: Worker transfer limit must be larger than manager limit. Jan 30, 2025
@dthain
Copy link
Member

dthain commented Jan 30, 2025

The point here is that VINE_WORKER_TRANSFER_LIMIT cannot be smaller than q->worker_source_max_transfers. (Otherwise transfers fail and tasks get forsaken even when the manager is doing the right thing.)

@btovar
Copy link
Member

btovar commented Jan 31, 2025

I think that this value should be defined in vine_manager.h instead. In that way, it will be available to ensure a correct bound when using vine_manager_tune().

@dthain
Copy link
Member

dthain commented Jan 31, 2025

I like the idea of connecting the two together clearly in the code.

@colinthomas-z80
Copy link
Contributor

colinthomas-z80 commented Jan 31, 2025

update: per our discussion we found the process limit is capable of causing connection timeouts, therefore a potential problem

I do not fully understand the problem at hand, but I do not think this define would have any conflicting interactions with the other parameters controlling worker transfers.

This define limits the maximum number of forked transfer processes at a given moment. If the manager schedules a number of transfers that exceeds this define, the transfers will wait in a queue, but still eventually complete.

The problem this solved was when a task has tens or hundreds of input files which need to be transferred. The manager does not respect transfer limits when it pertains to a single task. So it will queue tens or hundreds of transfers to occur in parallel, which caused socket timeouts. I think 128 is probably too high of a number.

@dthain
Copy link
Member

dthain commented Feb 3, 2025

Sorry if repeating myself, but wanted to make sure this was reflected in the issue.
There are three different concurrency controls:

1 - The primary concurrency control is in the manager, who knows where all of the files are, and where all of the transfers are happening. It uses m->current_transfer_table to ensure that no more than m->worker_source_max_transfers are targeting any one worker at a time.

2 - A secondary concurrency control is in the worker when it initiates transfers. The worker's cache manager can accumulate any number of transfers, but will not initiate more than options->max_transfer_procs (default 10) at once. This addresses your concern about a single task with 100 input files. It will only fetch 10 at once.

3 - The tertiary control is in the worker when it receives transfers. The transfer server will only fork VINE_TRANSFER_PROC_MAX_CHILD processes at once. After that, it will delay accepting connections, and those connections will time out.

When a large number of transfers are needed, our goal is to not to run them all concurrently, because they would all get bad performance simultaneously. Instead, we are relying on the manager to ensure that enough are scheduled to keep busy.

So, the essential requirement is:

manager->worker_source_max_transfers
<= worker->options->max_transfer_procs
<= VINE_TRANSFER_PROC_MAX_CHILD

But we also want some flexibility so that the user can manipulate the total amount of transfers going on, from a single point of control at the manager. Hence, VINE_TRANSFER_PROC_MAX_CHILD should be finite, but unreasonably large, so that we have the flexibility to dial things up and down from the manager side.

Copy link
Member

@dthain dthain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right change to solve the immediate problem. We can contemplate ways to make this more configurable from the manager down the road.

@btovar btovar merged commit 1875bd2 into cooperative-computing-lab:master Feb 5, 2025
10 checks passed
btovar pushed a commit that referenced this pull request Feb 6, 2025
@JinZhou5042 JinZhou5042 deleted the increase_VINE_TRANSFER_PROC_MAX_CHILD branch February 6, 2025 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants