Vine: Fix Crash After Idle Disconnect #4065

Merged

@dthain (Member) commented Feb 14, 2025

Proposed Changes

@btovar observed a problem in which the manager crashes after processing an idle-disconnect message and disconnecting the worker. The problem is that disconnecting invalidates the vine_worker_info object and its link object in the middle of handle_worker. If the idle-disconnect is followed by any other asynchronous message (which is likely), the manager then processes messages from a worker that no longer exists.

The solution here is to eliminate the special case: handle_idle_disconnect now sends an exit message to the worker but does not disconnect it immediately. This allows the manager to process any asynchronous messages that follow, and lets the worker perform whatever cleanup and communication it needs before disconnecting. The disconnect is then handled in the single, normal place: when the connection is dropped.
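A minimal C sketch of the fixed control flow, for illustration only. The types and functions here (`struct worker`, `send_message`, `handle_message`, `remove_worker`, and this simplified `handle_worker`) are hypothetical stand-ins, not the actual TaskVine data structures or API, and the message name `cache-update` is just an example. A `NULL` entry in the message batch stands in for the worker dropping its connection.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for vine_worker_info and its link. */
struct worker {
    const char *name;
    int connected;     /* link is still open */
    int exit_sent;     /* manager has asked the worker to exit */
    int messages_seen; /* messages handled while the worker was valid */
};

/* Hypothetical transport: "sending" just logs the message in this sketch. */
static void send_message(struct worker *w, const char *msg)
{
    printf("to %s: %s\n", w->name, msg);
}

/* Fixed behavior: ask the worker to exit, but keep the worker object valid. */
static void handle_idle_disconnect(struct worker *w)
{
    send_message(w, "exit");
    w->exit_sent = 1;
    /* No disconnect here: later messages in this batch may still refer to w. */
}

/* Process one incoming message; w must still be valid (connected). */
static void handle_message(struct worker *w, const char *msg)
{
    assert(w->connected);
    w->messages_seen++;
    if (strcmp(msg, "idle-disconnect") == 0) {
        handle_idle_disconnect(w);
    }
    /* ... other asynchronous messages would be handled here ... */
}

/* The single place where a worker is torn down: when its connection drops. */
static void remove_worker(struct worker *w)
{
    w->connected = 0;
    printf("removed %s\n", w->name);
}

/* Drain a batch of messages; a NULL entry means "connection closed". */
static void handle_worker(struct worker *w, const char **batch, int n)
{
    for (int i = 0; i < n; i++) {
        if (batch[i] == NULL) {
            remove_worker(w);
            return;
        }
        handle_message(w, batch[i]);
    }
}
```

For example, feeding the batch `{"idle-disconnect", "cache-update", NULL}` sends "exit", still handles the second message against a valid worker, and only then removes the worker when the connection closes. Keeping teardown in one place means no handler can leave a dangling worker pointer behind for the next message.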

@colinthomas-z80 the crash occurred in the recently modified link_poll_active_workers, but the real problem was deeper: we shouldn't be invalidating the worker object in multiple places.

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test: Run local tests prior to pushing.
  • make format: Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python).
  • make lint: Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a GitHub label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a GitHub label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

- Modify idle-disconnect to send "exit" message but not disconnect immediately.  (let the worker do it)
@dthain added the bug, critical, and TaskVine labels Feb 14, 2025
@dthain merged commit f38ba02 into cooperative-computing-lab:master Feb 14, 2025
10 checks passed
btovar pushed a commit that referenced this pull request Feb 26, 2025
* - Commentary on poll_active_workers to clarify purpose.
- Modify idle-disconnect to send "exit" message but not disconnect immediately.  (let the worker do it)

* format