
Dynamically adjust fetch debounce with PID controller #23313

Open · wants to merge 2 commits into base: dev
Conversation

ballard26 (Contributor)

Often, when Redpanda is at CPU saturation, the fetch scheduling group can
starve other operations. In these situations, increasing the time a fetch
request waits on the server before starting allows Redpanda to apply
backpressure to the clients and increase batching for fetch responses.
This increased batching frees up CPU resources for other operations and
tends to decrease end-to-end latency.

From testing, it has been found empirically that when Redpanda is at
saturation, restricting the fetch scheduling group to 20% of overall
reactor utilization improves end-to-end latency for a variety of
workloads.

This commit implements a PID controller that dynamically adjusts fetch
debounce to ensure that the fetch scheduling group consumes only 20% of
overall reactor utilization when Redpanda is at saturation.

The test results for the controller can be found here.
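
For readers unfamiliar with the technique, the sketch below shows the general shape of such a controller: each sampling interval, the error between the fetch group's measured share of reactor utilization and the 20% target is fed through proportional, integral, and derivative terms to produce a debounce delay. This is an illustrative sketch only; the class name, gains, and limits are placeholders and do not reflect this PR's actual implementation.

#include <algorithm>
#include <chrono>
#include <cstdint>

class fetch_pid_controller {
public:
    explicit fetch_pid_controller(double target_share)
      : _target(target_share) {}

    // Called once per sampling interval with the fetch scheduling group's
    // measured share of reactor utilization (0.0 to 1.0); returns the
    // debounce delay to apply to new fetch requests.
    std::chrono::milliseconds update(double measured_share) {
        const double error = measured_share - _target;
        _integral = std::clamp(_integral + error, 0.0, _integral_limit);
        const double derivative = error - _prev_error;
        _prev_error = error;

        const double output = _kp * error + _ki * _integral + _kd * derivative;
        const double delay_ms = std::clamp(output, 0.0, _max_delay_ms);
        return std::chrono::milliseconds(static_cast<int64_t>(delay_ms));
    }

private:
    double _target;
    double _prev_error{0.0};
    double _integral{0.0};
    // Placeholder gains and limits; real values would be tuned empirically
    // against benchmark workloads.
    double _kp{500.0};
    double _ki{50.0};
    double _kd{0.0};
    double _integral_limit{20.0};
    double _max_delay_ms{500.0};
};

Clamping the integral term and the output keeps the delay from winding up while the system is below saturation, so the controller only applies meaningful backpressure once the fetch group actually exceeds its target share.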

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@StephanDollberg (Member) left a comment:

A few mechanical comments.

Will review the actual function later.

"The target resource utilization of the fetch scheduling group between 1 "
"and 10,000",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
2000,
Member

What unit is this?


+1 We always include the units in docs

Contributor Author (ballard26)

I guess that it's technically unit-less. Let me change it to be a percentage between 0.0 and 1.0 and then convert to the int representation internally.

Contributor Author (ballard26)

Changed this property to be a percentage between 0.0 and 1.0. Hopefully that'll be more intuitive.
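
As a rough illustration of that conversion, assuming the internal representation stays on the 1-to-10,000 scale from the original docstring (the function name here is made up and is not part of the PR):

#include <algorithm>
#include <cstdint>

// Map the user-facing fraction (0.0 to 1.0) onto the internal 1..10,000
// fixed-point scale, clamping out-of-range inputs.
constexpr int32_t to_internal_utilization(double fraction) {
    return static_cast<int32_t>(
      std::clamp(fraction * 10'000.0, 1.0, 10'000.0));
}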


// Ensure both fetch debounce and the fetch scheduling group are enabled
// before trying to apply any delay.
if (_debounce && ss::current_scheduling_group() == fetch_sg) {
Member

I don't think the check for the fetch sg will work?

fetch_scheduling_group will just return the main/current scheduling group, and the check will just pass as well?

Contributor Author (ballard26)

Nice catch; for some reason I was assuming it returned the fetch scheduling group regardless. Changed it to use the config property.
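
For illustration, the corrected guard could be shaped like the snippet below. It assumes a boolean cluster property along the lines of use_fetch_scheduler_group toggles the dedicated fetch scheduling group; the exact property name used in the PR may differ.

// Gate on the configuration property rather than comparing against
// ss::current_scheduling_group(), which also matches when the dedicated
// fetch group is disabled and fetches run in the main group.
if (_debounce && config::shard_local_cfg().use_fetch_scheduler_group()) {
    // apply the PID-controlled delay before executing the fetch plan
}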

@@ -1018,10 +1164,15 @@ class nonpolling_fetch_plan_executor final : public fetch_plan_executor::impl {
* Executes the supplied `plan` until `octx.should_stop_fetch` returns true.
*/
ss::future<> execute_plan(op_context& octx, fetch_plan plan) final {
if (_debounce) {
co_await ss::sleep(std::min(
config::shard_local_cfg().fetch_reads_debounce_timeout(),
Member

So fetch_reads_debounce_timeout is dead now?

Contributor Author (ballard26)

It's an option; I mainly removed it in this PR to start a discussion on whether we should kill it off or not.

Contributor Author (ballard26)

Added another value to the fetch_read_strategy enum for the pid controller.
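
For context, the extension might look roughly like the following. The polling and non_polling values reflect the existing strategies; the new entry's name and value are assumptions for this sketch.

#include <cstdint>

// Hypothetical extension of the fetch read strategy enum; exact naming in
// the PR may differ.
enum class fetch_read_strategy : uint8_t {
    polling = 0,
    non_polling = 1,
    // New: non-polling fetches whose debounce delay is set by the PID
    // controller instead of a fixed timeout.
    non_polling_with_debounce = 2,
};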

Feediver1 previously approved these changes Sep 16, 2024

@Feediver1 left a comment:

please add units in src/v/config/configuration.cc L687-8

@StephanDollberg (Member)

Just restating what we discussed in person here:

I think we should just go ahead and merge the simplest form (PID controller on the coordinator shard) behind a feature flag. At the same time, add some metrics that help us judge how well it works. Then we can selectively enable it in cloud.
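
As a rough sketch of the metrics idea, a probe exposing the controller's current delay through Seastar's metrics API could look like this; the struct, group name, and metric name are invented for the example.

#include <seastar/core/metrics.hh>

namespace sm = seastar::metrics;

struct fetch_debounce_probe {
    // Updated by the controller each time it recomputes the delay.
    double current_delay_ms{0};
    sm::metric_groups metrics;

    void setup_metrics() {
        metrics.add_group(
          "fetch_debounce",
          {sm::make_gauge(
            "delay_ms",
            [this] { return current_delay_ms; },
            sm::description(
              "Current PID-controlled fetch debounce delay in milliseconds"))});
    }
};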
