Skip to content

Conversation

@djmb
Copy link
Contributor

@djmb djmb commented Apr 4, 2024

If there's an error in one thread, we wait for earlier hosts to complete but not for later ones.

Ensure consistent behaviour across all hosts by saving the exception and joining each thread in turn before returning.

If there's an error in one thread, we wait for earlier hosts to
complete but not for later ones.

Ensure consistent behaviour across all hosts by saving the exception and
joining each thread in turn before returning.
private

def wait_for_threads(threads)
exception = nil
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps storing all of the exceptions would be more informative?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a nice improvement - if something fails on multiple hosts it can be tempting to assume the problem is specific just to the host that the error is reported on.

We'd need a way to do it without breaking the current API though which is an ExecuteError with a single cause. We could store multiple causes I guess and have ExecuteError#cause return the first one?

But I think that would be something for another PR.

@mattbrictson
Copy link
Member

Hi @djmb , could you give me a little more backstory on this proposal?

My interpretation is that SSHKit's parallel runner, in its current form, is effectively "fail-fast". In other words, as soon a host fails, an exception is raised right away. If the exception is not rescued, then the Ruby process exits and any subsequent hosts are not allowed to run to completion.

The change being proposed in this PR is to remove "fail-fast" for in: :parallel execution and instead explicitly wait for all hosts to run to completion. Once all hosts are allowed to complete, then an exception is raised corresponding to the first failure. (However, if in: :groups is used, the first group that fails would still short-circuit subsequent groups and prevent them from running.)

I hesitate to accept this PR as-is because this seems like a significant change that could be considered breaking. Although I was not present for the original creation of SSHKit, I have to assume that the fail-fast behavior was an intentional design decision.

Are there other solutions that wouldn't involve changing the default behavior? Could you register a custom runner for your particular use case? For example, you should be able to set the default using:

SSHKit.config.default_runner = MyRunner

Or as a one-off:

on(hosts, in: MyRunner) do
  # ...
end

Would either of those work?

@djmb
Copy link
Contributor Author

djmb commented Apr 15, 2024

HI @mattbrictson,

My reasoning for the change is that the current behaviour is not really fail fast - the parallel runner joins the threads in turn so if the second host thread has an exception, we wait for the first host thread, but not the third.

Since we sometimes wait and sometimes do not, it seems more consistent to always wait.

However always waiting is also the behaviour that I'm looking for, so maybe there's some motivated reasoning going on here 😅.

Using the default runner doesn't work completely in my case, as the group runner uses the parallel runner internally so I need create a custom parallel runner and a custom group runner. But I can certainly work around this if we want to keep the current behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants