Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] design doc: proposal for off-cpu profiling #144

Merged
merged 8 commits into from
Oct 28, 2024

Conversation

florianl
Copy link
Contributor

This proposal builds on and follows the approach described in #114.

@florianl florianl requested review from a team August 31, 2024 10:55
@florianl florianl changed the title design doc: proposal for off-cpu profiling [RFC] design doc: proposal for off-cpu profiling Aug 31, 2024
Copy link
Member

@felixge felixge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very promising proposal, thanks for submitting it 🙇 .

I think only the capabilities in Option B make sense to me. Capturing Off-CPU stack trace samples without duration doesn't seem useful to me.

The other important question is sampling. I suspect that sampling based on Off-CPU duration is going to produce more useful results than giving each event an equal chance of being captured. JFR implements this with thresholds, but we could also explore something along the lines of Go's Allocation sampling strategy that sets randomized sampling points drawn from an exponential distribution with a desired mean.

As far as other implementation details are concerned, it might be worth considering the following prior art:

My approval indicates that I support the general proposal along the lines of option B, and that we should to hash out sampling strategies and implementation details during prototyping. The results of the prototype should inform the final decision about including the feature.

Oh and I guess another obvious thing to point out: The approach outlined here only seems applicable to programs that rely on blocking I/O using native OS threads. It's not going to work for light weight threads in Go (Goroutines) or Java (Project Loom) or async I/O approaches like Node.js. But I'd imagine it will work well for many C/C++/Ruby/Python/PHP/etc. applications. This should probably be clarified in the document.

design-docs/00001-off-cpu-profiling/README.md Show resolved Hide resolved
@tsint
Copy link
Contributor

tsint commented Sep 5, 2024

Considering that off-CPU profiling is more expensive, it is recommended to add a process filter so that only processes meeting certain criteria are analyzed.

@florianl
Copy link
Contributor Author

florianl commented Sep 5, 2024

[..], it is recommended to add a process filter so that only processes meeting certain criteria are analyzed.

Filtering for a specific process is, from my point of view, not a viable option, as

  • the complete architecture of the profiling agent is focused on complete system profiling, so the architecture would need to change.
  • filtering for a process would not mitigate the named risks
  • filtering for a process would also add significant additional (management) complexity, as process can fork, become zombies etc.

@tsint
Copy link
Contributor

tsint commented Sep 5, 2024

If off-CPU profiling consumes too many resources, it becomes unsuitable for high-load servers. In many cases, people use profiler to solve problems, and the servers experiencing issues often have insufficient resources. On servers with ample resources, the need for a profiler is not as urgent.

@florianl
Copy link
Contributor Author

florianl commented Sep 5, 2024

[..] it becomes unsuitable for high-load servers. In many cases, people use profiler to solve problems, and the servers experiencing issues often have insufficient resources.

The Risks section is exactly about this. And this is the reason, why off-CPU profiling should be added as optional functionality. Meaning, it can be turned on/off as required.
The whole system sampling approach of this agent does work well in production since February 2021 - iirc the date correctly. Also on very high load systems. Consequently one key mitigation strategy to minimize risks for off-CPU profiling is also sampling. In the provided concept scripts, only 3% of the scheduling events (after the dismissal of idle tasks), is considered for further stack unwinding. Depending on the use case and system 3% of scheduling events can be a high or low number. Ultimately the threshold for the sampling (here 3%) should not be hard coded but configurable.

Signed-off-by: Florian Lehner <[email protected]>
Copy link
Contributor

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this, and we have thought about implementing something like this in a very similar way.

Regarding overhead, I think since this is entirely opt-in, I see no issues regarding it being system-wide, as that's the entire design of the profiler and there's nothing stopping other projects from taking different approaches.

Copy link
Contributor

@iogbole iogbole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad to see this coming to fruition.

design-docs/00001-off-cpu-profiling/README.md Outdated Show resolved Hide resolved
Signed-off-by: Florian Lehner <[email protected]>
@florianl florianl requested review from a team as code owners October 3, 2024 12:24
@felixge
Copy link
Member

felixge commented Oct 13, 2024

Looks like we have enough consensus here? Unless there are any more concerns, I (or somebody else) can merge this in the next few days.

Comment on lines 96 to 98
it informs the user space component about this new process. The entry hook for off-CPU profiling
will also have to do this check, as information needs to be available to be able to unwind the stack,
but should not inform the user space component, if the process is not known yet.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not inform userspace? We have safeguards in place to avoid flooding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simplicity userspace should not be informed at this stage. While there are safeguards in place it is hard to predict the impact and name numbers. So maybe informing of userspace can be a future iteration when the design and idea is implemented and such iterations can be backed by benchmarks?

Copy link
Member

@christos68k christos68k Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be explored in the initial iteration. Our default rate limiting is fairly conservative and I don't anticipate major issues with typical Linux scheduler tick rates (e.g. 250Hz), but if that's not the case we need to know about it and maybe introduce an additional rate limiting option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The risks section was written in mind with the potential downsides of off-CPU profiling and how they can be mitigated. Overall, the mitigation strategies are just a suggestion and I would also consider them just an implementation detail. Otherwise these mitigation strategies should have been part of Option A and/or Option B, if they impact the design of the proposals.

#196 just reports every unknown PID to userspace and over the last four weeks I didn't run into issues ™️. When writing this design doc, I wanted to be open about potential risks and how they can be mitigates. Back then, I had a limiter of reporting unknown PIDs to userspace in place. So far, it seems I wasn't right with the risk.

if (tid == 0) {
// Skip the idle task
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the actual eBPF implementation, should we also skip kernel-only tasks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping kernel-only tasks makes sense. I would consider this an implementation detail, which does not have an effect on the overall design/idea of off-cpu profiling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add this to the design document for future reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do consider this an implementation detail as it does not have an effect on the overall designs of the proposal. Skipping certain profiles just reduces the work that is done and makes sure the component is not using resources, without providing value for the user.

design-docs/00001-off-cpu-profiling/README.md Outdated Show resolved Hide resolved
Comment on lines +121 to +124
if (rand % 100 > 3 ) {
// Overload prevention - make sure only 3% of scheduling events are profiled
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be configurable in the actual eBPF implementation.

Copy link
Contributor Author

@florianl florianl Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With system_config there is already a map that can be used to also make this configurable. To showcase the idea and for simplicity, rand % 100 > 3 was chosen. But I don't see a reason not to make it configurable.

design-docs/00001-off-cpu-profiling/README.md Outdated Show resolved Hide resolved
design-docs/00001-off-cpu-profiling/README.md Outdated Show resolved Hide resolved
Comment on lines +198 to +200
`tracepoint:sched:sched_switch`. The additional hook on `kprobe:finish_task_switch` for Option B
might also introduce some latency, as kprobes are less performant than tracepoints. But the latency
information along with the off-CPU stack trace justify these drawbacks from my point of view.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On supported kernels, we can use trampolines (fentry program) that are faster than kprobes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This initial design documents tries not to change the minimum supported kernel version and tries to bring the off-cpu functionality to the given circumstances. With newer kernel version things can be done in a simpler way in more places. Using these advanced capabilities for newer kernels should be a future iteration.

Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Oct 20, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
@florianl florianl mentioned this pull request Oct 20, 2024
@florianl
Copy link
Contributor Author

With #196 I have opened a draft, that backs this RFC. Once this RFC got merged, I will continue with #196.

@rockdaboot
Copy link
Contributor

Forgot to mention before: I'm voting for option B

design-docs/00001-off-cpu-profiling/README.md Outdated Show resolved Hide resolved
It is the schedulers responsibility in the Linux kernel, to manage tasks[^2] and provide tasks with
CPU resources. In this concept [__schedule()](https://github.com/torvalds/linux/blob/5be63fc19fcaa4c236b307420483578a56986a37/kernel/sched/core.c#L6398)
is the central function that takes and provides CPU resources to tasks and does the CPU context
switch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some context as to how often this function gets called:

Suggested change
switch.
switch.
Typical Linux scheduler tick rates are 100, 250, 500 or 1000Hz.

Copy link
Contributor Author

@florianl florianl Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming frequencies might suggest to readers, that the Linux scheduler is called in some constant frequency - which is not correct. Usually, the scheduler provides timeslots for processing to events. The size of the timeslots is configurable, via /proc/sys/kernel/sched_rr_timeslice_ms, and is by default 100 ms. Often it is a problem, that events don't make full use of these timeslots - e.g. by calling a syscall or running into a lock. So how often the Linux kernel scheduler is called and in which freuqnecy, is depending on various variables, that are configurable but also can change dynamically depending on the work of events.

## Sampling vs. Aggregation

Both proposed options leverage sampling techniques for off-CPU profiling. While aggregation in the
eBPF space can potentially reduce performance overhead by communicating only aggregated data to the
Copy link
Member

@christos68k christos68k Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what is meant by aggregation here (merging of events?), e.g. I'm assuming the sampling approach will also aggregate data in maps which will be periodically cleared by userspace.

Then, given that unwinding takes up the majority of eBPF processing time, how can aggregation in the eBPF space that is not sampling (e.g. unwinding every call) be more performant than the sampling approach which will only unwind a subset of these calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what is meant by aggregation here (merging of events?), e.g. I'm assuming the sampling approach will also aggregate data in maps which will be periodically cleared by userspace.

On the eBPF side no aggregation or merging of events for off-CPU data is intended or planed. Once off-CPU hooks are triggered the stack unwinding is using existing eBPF programs and reporting.
In stead of reusing the existing stack unwinding and reporting approach, it would be possible to aggregate off-CPU data and just report merged events to user space. My personal suggestion and the purpose of this document is, not to go this path.

Then, given that unwinding takes up the majority of eBPF processing time, how can aggregation in the eBPF space that is not sampling (e.g. unwinding every call) be more performant than the sampling approach which will only unwind a subset of these calls?

The overhead of the sampling approach is directly related to the sampling threshold. In the above shown concepts, it is 3% and in #196 it is configurable by the user via the CLI flag -off-cpu-threshold. The higher this sampling threshold, the more overhead is introduced by the sampling approach.
To follow the concepts as shown in this document, if 3 is replaced with 99, then nerarly every scheduling event triggers the off-CPU hooks. In such a case it would make sense, to aggregate off-CPU data in eBPF space for a summarized report to user space, instead of reporting every single stack unwinding.

Overall, sampling and aggregation can be done in various places and it depends on configuration and workload of the system. This paragraph should make the user aware of alternative concepts and focuses on the eBPF space for the sampling vs. aggregation approach. The proposed concepts should be fit most use cases - optimization for special cases is always an option.

Signed-off-by: Florian Lehner <[email protected]>
Copy link
Contributor

@fabled fabled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and I am voting also option B. Seems all agree on this, so we have consensus.

@florianl Could you update the Decision in the end, and potential Meta in the beginning too. We can merge this right after this. Thank you!

Signed-off-by: Florian Lehner <[email protected]>
@florianl
Copy link
Contributor Author

Added the decision to go forward with Option B.

@fabled fabled merged commit e3120e4 into open-telemetry:main Oct 28, 2024
23 checks passed
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Oct 29, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Nov 1, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Nov 8, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Nov 18, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Nov 26, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Dec 11, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Dec 16, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Dec 23, 2024
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
florianl added a commit to florianl/opentelemetry-ebpf-profiler that referenced this pull request Jan 6, 2025
This is the code that backs
open-telemetry#144.
It can be reused to add features like requested in
open-telemetry#33 and
therefore can be an alternative to
open-telemetry#192.

The idea that enables off CPU profiling is, that perf event and kprobe eBPF
programs are quite similar and can be converted. This allows, with the
dynamic rewrite of tail call maps, the reuse of existing eBPF programs and
concepts.

This proposal adds the new flag '-off-cpu-threshold' that enables off CPU
profiling and attaches the two additional hooks, as discussed in Option B
in open-telemetry#144.

Outstanding work:
- [ ] Handle off CPU traces in the reporter package
- [ ] Handle off CPU traces in the user space side

Signed-off-by: Florian Lehner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants