Conversation

@nckrtl (Contributor) commented Oct 20, 2025

This PR addresses #1539: Processes not auto scaled when long running jobs are being processed.

I created a reproducible scenario in https://github.com/nckrtl/horizon-i1539, confirming the described issue. Following the README instructions executes the following scenario:

  1. Two queues are configured: one for long-running jobs and one for short-running jobs.
  2. On the first queue, 2 long-running jobs of 5 minutes each are dispatched; shortly after, 100 additional shorter-running jobs are dispatched, purely for demonstration purposes (see the sketch after this list).
  3. After several scaling cycles, the long-running queue scales up nicely to 4 workers.
  4. Then the short-running queue gets filled with jobs.
  5. The long-running queue scales down by 1 worker, but the short-running queue's worker count isn't increased.
  6. In the following scaling cycles the long-running queue stays "stuck" at 3 workers and the short-running queue stays at 1 worker.
  7. Only after the long-running jobs in the long-running queue finish is the scaling corrected.
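
For illustration, dispatching the jobs from step 2 looks roughly like this. The job class and queue names here are placeholders, not necessarily the ones used in the repro repository:

// Two long-running jobs (~5 minutes each) on the first queue.
LongRunningJob::dispatch()->onQueue('long_running');
LongRunningJob::dispatch()->onQueue('long_running');

// Shortly after, 100 shorter jobs are dispatched to force the pool to scale up.
for ($i = 0; $i < 100; $i++) {
    ShortJob::dispatch()->onQueue('long_running');
}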

Root cause

When a pool scales down, processes that should be terminated are marked and moved to $terminatingProcesses (see scaleDown()). However, totalProcessCount() in ProcessPool.php includes these terminating processes:

public function totalProcessCount()
{
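    // Note: this also counts processes that have already been marked for termination.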
    return count($this->processes()) + count($this->terminatingProcesses);
}

This value is used in scalePool() to determine the worker count to scale to. That count is then passed to scale() on the ProcessPool, and this is where the mismatch happens:

public function scale($processes)
{
    $processes = max(0, (int) $processes);

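    // Note: count($this->processes) excludes $terminatingProcesses, unlike totalProcessCount().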
    if ($processes === count($this->processes)) {
        return;
    }

    if ($processes > count($this->processes)) {
        $this->scaleUp($processes);
    } else {
        $this->scaleDown($processes);
    }
}

Here the count is taken over $this->processes only, excluding the terminating processes. This leads to the following flow after the first scale-down:

  1. AutoScaler sees: totalProcessCount() = 4 (3 active + 1 terminating)
  2. It calculates the worker count to scale to; in the case of the long_running queue it should scale down to 3:
$pool->scale(
    max(
        $totalProcessCount - $maxDownShift, // 4 - 1 = 3
        $supervisor->options->minProcesses, // 1
        $desiredProcessCount // 2
    ) // 3 is being passed to scale()
);
  3. The scale() method receives 3, but count($this->processes) is also 3, so no scale-down happens and the pool stays stuck at 3 workers.
  4. Because no additional workers are freed up, test_short can't scale up.
  5. Only after the first process marked for termination finishes its job and terminates can the scale-down continue for the long_running queue, making processes available for the short_queue to scale up.

The solution

Exclude terminating processes from totalProcessCount() so AutoScaler and ProcessPool use the same count.
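
A minimal sketch of that change, assuming totalProcessCount() simply stops adding the terminating processes:

public function totalProcessCount()
{
    // Only count processes that are still available for work; processes that
    // are winding down no longer contribute to the pool size.
    return count($this->processes());
}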

Test

Added a test to prove the described issue; it could be extended with additional tests to cover specific scaling scenarios.

Cleanup

  • Removed count() from ProcessPool, as totalProcessCount() does the same thing. As far as I could find, it was not being used, but it's worth double-checking.
  • Made scale() in ProcessPool also use totalProcessCount() for consistency (see the sketch below).
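
For reference, a minimal sketch of what scale() could look like after this cleanup, based on the original method quoted above rather than the exact committed diff:

public function scale($processes)
{
    $processes = max(0, (int) $processes);

    // Compare against the same count the AutoScaler uses, so both sides agree.
    if ($processes === $this->totalProcessCount()) {
        return;
    }

    if ($processes > $this->totalProcessCount()) {
        $this->scaleUp($processes);
    } else {
        $this->scaleDown($processes);
    }
}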

@taylorotwell (Member) commented

Hey there - why the removal of the Countable interface? 🤔

@nckrtl (Contributor, Author) commented Oct 20, 2025

As the body of count() ended up being the same as totalProcessCount(), it felt like it would be confusing which one should be used, or which one actually is being used. I also couldn't find where count() was being used, so I thought it made sense to remove it. With count() removed, there is no need for the Countable interface anymore.

However, I might have overlooked something that makes count() still necessary.

@taylorotwell (Member) commented

Hey @nckrtl - do you know if there is a way to fix this without changing the behavior of totalProcessCount? Just to minimize the possibility of breaking changes elsewhere that actually do depend on it returning the count + terminating process count.

@taylorotwell (Member) commented

I'm also a bit torn on whether this is actually a bug, primarily because of this comment:

Only after the first process marked for termination finishes its job and terminates can the scale-down continue for the long_running queue, making processes available for the short_queue to scale up.

Your configuration specifies that you don't want more than 5 processes (your max process count). It sounds like Horizon is indeed respecting that and not initiating another process until the terminating one finishes its job and is terminated. If we change this, I worry we're going to have situations where people have more processes running than are specified in their max process count, possibly causing out-of-memory exceptions, etc.

@taylorotwell marked this pull request as draft on October 24, 2025 at 14:38
@nckrtl (Contributor, Author) commented Oct 25, 2025

Hey @taylorotwell, I did some additional testing and you're right: my changes do indeed result in exceeding the maximum configured process count. Still, the scale-down issue was fixed, with no processes sitting idle. I think we can have the best of both worlds.

Assume all my changes are undone and we have a pool that needs to scale down for the second time:

  • Desired worker count: 1
  • Processes: 3
  • Terminating processes: 1

Currently, when determining the total process count, we always include the terminating processes. But when scaling down, only the desired worker count and the (non-terminating) processes matter: if the desired worker count is lower than the process count, we should keep marking processes for termination until the desired count is reached. Yet terminating processes in a pool block further scaling down for as long as they are still processing work (the current issue), because they influence the scale-down calculation.

When scaling up, the terminating processes should be included in the calculation so the worker/process limit isn't exceeded. When scaling down, we don't have this issue.

I'm now proposing just this change in scalePool() in AutoScaler.php:

$totalProcessCount = $pool->processes()->count(); // <-- ignore terminatingProcesses initially

if ($desiredProcessCount > $totalProcessCount) {
    $totalProcessCount = $pool->totalProcessCount(); // <-- include the terminatingProcesses to prevent exceeding the desired worker count
}
...
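
Walking the earlier example (desired worker count 1, 3 processes, 1 terminating) through this proposal, purely as an illustration with placeholder variables:

// Pool state from the example above (placeholder variables, not Horizon code).
$activeProcesses     = 3; // $pool->processes()->count()
$terminating         = 1; // still finishing its long-running job
$desiredProcessCount = 1;

$totalProcessCount = $activeProcesses; // 3: the terminating process is ignored

if ($desiredProcessCount > $totalProcessCount) {          // 1 > 3 is false
    $totalProcessCount = $activeProcesses + $terminating; // 4: only matters when scaling up
}

// Assuming a maxDownShift of 1 and minProcesses of 1, as in the earlier example,
// scale() then receives max(3 - 1, 1, 1) = 2, which is below count($this->processes) = 3,
// so scaleDown() actually runs instead of stalling.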

I tested this manually and everything works as described. I then tried to create tests to verify specific scenarios, but the FakePool needs quite some changes to do this properly. I'm more than happy to make those changes, but I wanted to get your opinion first and see if you agree with the proposed fix.
