-
Notifications
You must be signed in to change notification settings - Fork 697
Fix autoscaling deadlock caused by terminating processes #1636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 5.x
Are you sure you want to change the base?
Conversation
|
Hey there - why the removal of the Countable interface? 🤔 |
|
As the content of However, I might have overlooked something making |
|
Hey @nckrtl - do you know if there is a way to fix this without changing the behavior of |
|
I'm also a bit torn on if this is actually a bug, primarily because of this comment:
Your configuration specifies you don't want more than 5 processes (your max process count). It sounds like Horizon is indeed respecting that and not initiating another process until the terminating one finishes its job and is terminated. If we change this I worry we're going to have situations where people are going to have more processes running than are specified in their max process count, possibility causing out of memory exceptions, etc. |
|
Hey @taylorotwell I did some additional testing and you're right. My changes do indeed result in exceeding the maximum configured processes. Yet the scaling down issue was fixed, with no processes sitting idle. I think we can have best of both worlds. Assume all my changes are undone and we have a pool that needs to scale down for the second time:
Currently, when determining the total process count we always include the terminating processes. But when scaling down only the desired worker count and the (non-terminating) processes matter. If the desired worker count is lower than the processes count, we should keep marking them for termination up until the desired count is reached. Yet when we have terminating processes in a pool they block further scaling down for as long as they are processing work (current issue) as they influence the scale down calculations. When scaling up the terminating processes should be included in the calculations to not exceed the worker/process limit. When scaling down we don't have this issue. I'm now proposing just this change in scalePool() in $totalProcessCount = $pool->processes()->count(); // <-- ignore terminatingProcesses initially
if($desiredProcessCount > $totalProcessCount) {
$totalProcessCount = $pool->totalProcessCount(); // <-- include the terminatingProcesses to prevent exceeding the desired worker count
}
...I tested this manually and everything works as described. I then tried to create tests to verify specific scenarios but the FakePool needs quite some changes to this properly. I'm more than happy to do so but wanted to get your opinion first and see if you agree with the proposed fix. |
This PR addresses #1539: Processes not auto scaled when long running jobs are being processed.
I created a reproducible scenario in https://github.com/nckrtl/horizon-i1539, confirming the described issue. By following the README instructions the following scenario is executed:
Root cause
When a pool scales down, processes that should be terminated are marked and moved to
$terminatingProcesses, see scaleDown(). However, totalProcessCount() inProcessPool.phpincludes these terminating processes:This is used in scalePool() and is used to determine the worker count to scale to. This is passed to scale() in the ProcessPool, and this is where the mismatch happens
A count is performed on the processes, but not including the terminating processes. Leading to this flow happening after the first scale down:
long_runningqueue it should scale down to 3:test_shortcant scale up.long_runningqueue making processes available for theshort_queueto scale up.The solution
Exclude terminating processes from totalProcessCount() so AutoScaler and ProcessPool use the same count.
Test
Added a test to proof the described issue, could be extended with additional tests to cover specific scaling scenarios.
Cleanup
scale()in ProcessPool also usetotalProcessCountfor consistency