
Dynamically adjust fetch debounce with PID controller #23313

Open · wants to merge 2 commits into base: dev
Conversation

ballard26 (Contributor)

Often, when Redpanda is at CPU saturation, the fetch scheduling group can
starve other operations. In these situations, increasing the time a fetch
request waits on the server before starting allows Redpanda to apply
backpressure to the clients and increase batching for fetch responses.
This increased batching frees up CPU resources for other operations and
tends to decrease end-to-end latency.

From testing, it has been found empirically that when Redpanda is at
saturation, restricting the fetch scheduling group to 20% of overall
reactor utilization improves end-to-end latency for a variety of
workloads.

This commit implements a PID controller that dynamically adjusts fetch
debounce to ensure that the fetch scheduling group consumes only 20% of
overall reactor utilization when Redpanda is at saturation.

The test results for the controller can be found here.
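
For readers unfamiliar with the technique, the sketch below shows the general shape of such a controller: each sampling interval, the error between the fetch group's measured share of reactor utilization and the 20% target is fed through proportional, integral, and derivative terms to produce a debounce delay. This is an illustrative sketch only; the class name, gains, and limits are placeholders and do not reflect this PR's actual implementation.

#include <algorithm>
#include <chrono>
#include <cstdint>

class fetch_pid_controller {
public:
    explicit fetch_pid_controller(double target_share)
      : _target(target_share) {}

    // Called once per sampling interval with the fetch scheduling group's
    // measured share of reactor utilization (0.0 to 1.0); returns the
    // debounce delay to apply to new fetch requests.
    std::chrono::milliseconds update(double measured_share) {
        const double error = measured_share - _target;
        _integral = std::clamp(_integral + error, 0.0, _integral_limit);
        const double derivative = error - _prev_error;
        _prev_error = error;

        const double output = _kp * error + _ki * _integral + _kd * derivative;
        const double delay_ms = std::clamp(output, 0.0, _max_delay_ms);
        return std::chrono::milliseconds(static_cast<int64_t>(delay_ms));
    }

private:
    double _target;
    double _prev_error{0.0};
    double _integral{0.0};
    // Placeholder gains and limits; real values would be tuned empirically
    // against benchmark workloads.
    double _kp{500.0};
    double _ki{50.0};
    double _kd{0.0};
    double _integral_limit{20.0};
    double _max_delay_ms{500.0};
};

Clamping the integral term and the output keeps the delay from winding up while the system is below saturation, so the controller only applies meaningful backpressure once the fetch group actually exceeds its target share.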

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@StephanDollberg (Member) left a comment:

A few mechanical comments.

Will review the actual function later.

"The target resource utilization of the fetch scheduling group between 1 "
"and 10,000",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
2000,
Member

What unit is this?


+1 We always include the units in docs

Contributor Author (ballard26)

I guess that it's technically unit-less. Let me change it to be a percentage between 0.0 and 1.0 and then convert to the int representation internally.

Contributor Author (ballard26)

Changed this property to be a percentage between 0.0 and 1.0. Hopefully that'll be more intuitive.
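
As a rough illustration of that conversion, assuming the internal representation stays on the 1-to-10,000 scale from the original docstring (the function name here is made up and is not part of the PR):

#include <algorithm>
#include <cstdint>

// Map the user-facing fraction (0.0 to 1.0) onto the internal 1..10,000
// fixed-point scale, clamping out-of-range inputs.
constexpr int32_t to_internal_utilization(double fraction) {
    return static_cast<int32_t>(
      std::clamp(fraction * 10'000.0, 1.0, 10'000.0));
}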


// Ensure both fetch debounce and the fetch scheduling group are enabled
// before trying to apply any delay.
if (_debounce && ss::current_scheduling_group() == fetch_sg) {
Member

I don't think the check for the fetch sg will work?

fetch_scheduling_group will just return the main/current scheduling group, and the check will just pass as well?

Contributor Author (ballard26)

Nice catch; for some reason I was assuming it returned the fetch scheduling group regardless. Changed it to use the config property.
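
For illustration, the corrected guard could be shaped like the snippet below. It assumes a boolean cluster property along the lines of use_fetch_scheduler_group toggles the dedicated fetch scheduling group; the exact property name used in the PR may differ.

// Gate on the configuration property rather than comparing against
// ss::current_scheduling_group(), which also matches when the dedicated
// fetch group is disabled and fetches run in the main group.
if (_debounce && config::shard_local_cfg().use_fetch_scheduler_group()) {
    // apply the PID-controlled delay before executing the fetch plan
}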

@@ -1018,10 +1164,15 @@ class nonpolling_fetch_plan_executor final : public fetch_plan_executor::impl {
* Executes the supplied `plan` until `octx.should_stop_fetch` returns true.
*/
ss::future<> execute_plan(op_context& octx, fetch_plan plan) final {
if (_debounce) {
co_await ss::sleep(std::min(
config::shard_local_cfg().fetch_reads_debounce_timeout(),
Member

So fetch_reads_debounce_timeout is dead now?

Contributor Author (ballard26)

It's an option; I mainly removed it in this PR to start a discussion on whether we should kill it off or not.

Contributor Author (ballard26)

Added another value to the fetch_read_strategy enum for the pid controller.
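
For context, the extension might look roughly like the following. The polling and non_polling values reflect the existing strategies; the new entry's name and value are assumptions for this sketch.

#include <cstdint>

// Hypothetical extension of the fetch read strategy enum; exact naming in
// the PR may differ.
enum class fetch_read_strategy : uint8_t {
    polling = 0,
    non_polling = 1,
    // New: non-polling fetches whose debounce delay is set by the PID
    // controller instead of a fixed timeout.
    non_polling_with_debounce = 2,
};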

Feediver1 previously approved these changes Sep 16, 2024

@Feediver1 left a comment:

please add units in src/v/config/configuration.cc L687-8

@StephanDollberg (Member)

Just restating what we discussed in person here:

I think we should just go ahead and merge the simplest form (PID controller on the coordinator shard) behind a feature flag. At the same time, add some metrics that help us judge how well it works. Then we can selectively enable it in cloud.
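
As a rough sketch of the metrics idea, a probe exposing the controller's current delay through Seastar's metrics API could look like this; the struct, group name, and metric name are invented for the example.

#include <seastar/core/metrics.hh>

namespace sm = seastar::metrics;

struct fetch_debounce_probe {
    // Updated by the controller each time it recomputes the delay.
    double current_delay_ms{0};
    sm::metric_groups metrics;

    void setup_metrics() {
        metrics.add_group(
          "fetch_debounce",
          {sm::make_gauge(
            "delay_ms",
            [this] { return current_delay_ms; },
            sm::description(
              "Current PID-controlled fetch debounce delay in milliseconds"))});
    }
};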
