Conversation

@nckrtl (Contributor) commented Oct 20, 2025

This PR addresses #1539: Processes not auto scaled when long running jobs are being processed.

I created a reproducible scenario in https://github.com/nckrtl/horizon-i1539, confirming the described issue. Following the README instructions executes the following scenario:

  1. Two queues are configured: one for long-running jobs and one for short-running jobs.
  2. On the first queue, 2 long-running jobs of 5 minutes each are dispatched; shortly after, 100 additional shorter-running jobs are dispatched, purely for demonstration purposes (see the sketch after this list).
  3. After several scaling cycles, the long-running queue scales up nicely to 4 workers.
  4. Then the short-running queue gets filled with jobs.
  5. The long-running queue scales down by 1 worker, but the short-running queue's worker count isn't increased.
  6. In the following scaling cycles the long-running queue stays "stuck" at 3 workers and the short-running queue stays at 1 worker.
  7. Only after the long-running jobs in the long-running queue finish is the scaling corrected.
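
For illustration, dispatching the jobs from step 2 looks roughly like this. The job class and queue names here are placeholders, not necessarily the ones used in the repro repository:

// Two long-running jobs (~5 minutes each) on the first queue.
LongRunningJob::dispatch()->onQueue('long_running');
LongRunningJob::dispatch()->onQueue('long_running');

// Shortly after, 100 shorter jobs are dispatched to force the pool to scale up.
for ($i = 0; $i < 100; $i++) {
    ShortJob::dispatch()->onQueue('long_running');
}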

Root cause

When a pool scales down, processes that should be terminated are marked and moved to $terminatingProcesses (see scaleDown()). However, totalProcessCount() in ProcessPool.php includes these terminating processes:

public function totalProcessCount()
{
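    // Note: this also counts processes that have already been marked for termination.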
    return count($this->processes()) + count($this->terminatingProcesses);
}

This value is used in scalePool() to determine the worker count to scale to. That count is then passed to scale() on the ProcessPool, and this is where the mismatch happens:

public function scale($processes)
{
    $processes = max(0, (int) $processes);

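    // Note: count($this->processes) excludes $terminatingProcesses, unlike totalProcessCount().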
    if ($processes === count($this->processes)) {
        return;
    }

    if ($processes > count($this->processes)) {
        $this->scaleUp($processes);
    } else {
        $this->scaleDown($processes);
    }
}

Here the count is taken over $this->processes only, excluding the terminating processes. This leads to the following flow after the first scale-down:

  1. AutoScaler sees: totalProcessCount() = 4 (3 active + 1 terminating)
  2. It calculates the worker count to scale to; in the case of the long_running queue it should scale down to 3:
$pool->scale(
    max(
        $totalProcessCount - $maxDownShift, // 4 - 1 = 3
        $supervisor->options->minProcesses, // 1
        $desiredProcessCount // 2
    ) // 3 is being passed to scale()
);
  3. The scale() method receives 3, but count($this->processes) is also 3, so no scale-down happens and the pool stays stuck at 3 workers.
  4. Because no additional workers are freed up, test_short can't scale up.
  5. Only after the first process marked for termination finishes its job and terminates can the scale-down continue for the long_running queue, making processes available for the short_queue to scale up.

The solution

Exclude terminating processes from totalProcessCount() so AutoScaler and ProcessPool use the same count.
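
A minimal sketch of that change, assuming totalProcessCount() simply stops adding the terminating processes:

public function totalProcessCount()
{
    // Only count processes that are still available for work; processes that
    // are winding down no longer contribute to the pool size.
    return count($this->processes());
}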

Test

Added a test to prove the described issue; it could be extended with additional tests to cover specific scaling scenarios.

Cleanup

  • Removed count() from ProcessPool, as totalProcessCount() does the same thing. As far as I could find, it was not being used, but it's worth double-checking.
  • Made scale() in ProcessPool also use totalProcessCount() for consistency (see the sketch below).
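
For reference, a minimal sketch of what scale() could look like after this cleanup, based on the original method quoted above rather than the exact committed diff:

public function scale($processes)
{
    $processes = max(0, (int) $processes);

    // Compare against the same count the AutoScaler uses, so both sides agree.
    if ($processes === $this->totalProcessCount()) {
        return;
    }

    if ($processes > $this->totalProcessCount()) {
        $this->scaleUp($processes);
    } else {
        $this->scaleDown($processes);
    }
}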

@taylorotwell (Member) commented

Hey there - why the removal of the Countable interface? 🤔

@nckrtl (Contributor, Author) commented Oct 20, 2025

As the body of count() ended up being the same as totalProcessCount(), it felt like it would be confusing which one should be used, or which one actually is being used. I also couldn't find where count() was being used, so I thought it made sense to remove it. With count() removed, there is no need for the Countable interface anymore.

However, I might have overlooked something that makes count() still necessary.

@taylorotwell (Member) commented

Hey @nckrtl - do you know if there is a way to fix this without changing the behavior of totalProcessCount? Just to minimize the possibility of breaking changes elsewhere that actually do depend on it returning the count + terminating process count.

@taylorotwell (Member) commented

I'm also a bit torn on whether this is actually a bug, primarily because of this comment:

Only after the first process marked for termination finishes its job and terminates can the scale-down continue for the long_running queue, making processes available for the short_queue to scale up.

Your configuration specifies that you don't want more than 5 processes (your max process count). It sounds like Horizon is indeed respecting that and not initiating another process until the terminating one finishes its job and is terminated. If we change this, I worry we're going to have situations where people have more processes running than are specified in their max process count, possibly causing out-of-memory exceptions, etc.

@taylorotwell marked this pull request as draft on October 24, 2025 at 14:38
@nckrtl (Contributor, Author) commented Oct 25, 2025

Hey @taylorotwell, I did some additional testing and you're right: my changes do indeed result in exceeding the maximum configured process count. Still, the scale-down issue was fixed, with no processes sitting idle. I think we can have the best of both worlds.

Assume all my changes are undone and we have a pool that needs to scale down for the second time:

  • Desired worker count: 1
  • Processes: 3
  • Terminating processes: 1

Currently, when determining the total process count, we always include the terminating processes. But when scaling down, only the desired worker count and the (non-terminating) processes matter: if the desired worker count is lower than the process count, we should keep marking processes for termination until the desired count is reached. Yet terminating processes in a pool block further scaling down for as long as they are still processing work (the current issue), because they influence the scale-down calculation.

When scaling up, the terminating processes should be included in the calculation so the worker/process limit isn't exceeded. When scaling down, we don't have this issue.

I'm now proposing just this change in scalePool() in AutoScaler.php:

$totalProcessCount = $pool->processes()->count(); // <-- ignore terminatingProcesses initially

if ($desiredProcessCount > $totalProcessCount) {
    $totalProcessCount = $pool->totalProcessCount(); // <-- include the terminatingProcesses to prevent exceeding the desired worker count
}
...
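
Walking the earlier example (desired worker count 1, 3 processes, 1 terminating) through this proposal, purely as an illustration with placeholder variables:

// Pool state from the example above (placeholder variables, not Horizon code).
$activeProcesses     = 3; // $pool->processes()->count()
$terminating         = 1; // still finishing its long-running job
$desiredProcessCount = 1;

$totalProcessCount = $activeProcesses; // 3: the terminating process is ignored

if ($desiredProcessCount > $totalProcessCount) {          // 1 > 3 is false
    $totalProcessCount = $activeProcesses + $terminating; // 4: only matters when scaling up
}

// Assuming a maxDownShift of 1 and minProcesses of 1, as in the earlier example,
// scale() then receives max(3 - 1, 1, 1) = 2, which is below count($this->processes) = 3,
// so scaleDown() actually runs instead of stalling.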

I tested this manually and everything works as described. I then tried to create tests to verify specific scenarios, but the FakePool needs quite some changes to do this properly. I'm more than happy to make those changes, but I wanted to get your opinion first and see if you agree with the proposed fix.
