[RFC] design doc: proposal for off-cpu profiling #144

florianl · 2024-08-31T10:55:36Z

This proposal builds on and follows the approach described in #114.

Signed-off-by: Florian Lehner <[email protected]>

felixge

This is a very promising proposal, thanks for submitting it 🙇 .

I think only the capabilities in Option B make sense to me. Capturing Off-CPU stack trace samples without duration doesn't seem useful to me.

The other important question is sampling. I suspect that sampling based on Off-CPU duration is going to produce more useful results than giving each event an equal chance of being captured. JFR implements this with thresholds, but we could also explore something along the lines of Go's Allocation sampling strategy that sets randomized sampling points drawn from an exponential distribution with a desired mean.

As far as other implementation details are concerned, it might be worth considering the following prior art:

Off-CPU Analysis article from Brendan Gregg
bcc's offcputime.py: Which seems to only use a finish_task_switch kprobe without instrumenting a tracepoint.
Perfetto's CPU Scheduling Event Docs: Which discusses using sched_waking as an alternative to sched_switch.

My approval indicates that I support the general proposal along the lines of option B, and that we should to hash out sampling strategies and implementation details during prototyping. The results of the prototype should inform the final decision about including the feature.

Oh and I guess another obvious thing to point out: The approach outlined here only seems applicable to programs that rely on blocking I/O using native OS threads. It's not going to work for light weight threads in Go (Goroutines) or Java (Project Loom) or async I/O approaches like Node.js. But I'd imagine it will work well for many C/C++/Ruby/Python/PHP/etc. applications. This should probably be clarified in the document.

design-docs/00001-off-cpu-profiling/README.md

tsint · 2024-09-05T06:34:46Z

Considering that off-CPU profiling is more expensive, it is recommended to add a process filter so that only processes meeting certain criteria are analyzed.

florianl · 2024-09-05T07:00:01Z

[..], it is recommended to add a process filter so that only processes meeting certain criteria are analyzed.

Filtering for a specific process is, from my point of view, not a viable option, as

the complete architecture of the profiling agent is focused on complete system profiling, so the architecture would need to change.
filtering for a process would not mitigate the named risks
filtering for a process would also add significant additional (management) complexity, as process can fork, become zombies etc.

tsint · 2024-09-05T07:46:09Z

If off-CPU profiling consumes too many resources, it becomes unsuitable for high-load servers. In many cases, people use profiler to solve problems, and the servers experiencing issues often have insufficient resources. On servers with ample resources, the need for a profiler is not as urgent.

florianl · 2024-09-05T08:01:24Z

[..] it becomes unsuitable for high-load servers. In many cases, people use profiler to solve problems, and the servers experiencing issues often have insufficient resources.

The Risks section is exactly about this. And this is the reason, why off-CPU profiling should be added as optional functionality. Meaning, it can be turned on/off as required.
The whole system sampling approach of this agent does work well in production since February 2021 - iirc the date correctly. Also on very high load systems. Consequently one key mitigation strategy to minimize risks for off-CPU profiling is also sampling. In the provided concept scripts, only 3% of the scheduling events (after the dismissal of idle tasks), is considered for further stack unwinding. Depending on the use case and system 3% of scheduling events can be a high or low number. Ultimately the threshold for the sampling (here 3%) should not be hard coded but configurable.

Signed-off-by: Florian Lehner <[email protected]>

brancz

Love this, and we have thought about implementing something like this in a very similar way.

Regarding overhead, I think since this is entirely opt-in, I see no issues regarding it being system-wide, as that's the entire design of the profiler and there's nothing stopping other projects from taking different approaches.

design-docs/00001-off-cpu-profiling/README.md

iogbole

I'm glad to see this coming to fruition.

design-docs/00001-off-cpu-profiling/README.md

Signed-off-by: Florian Lehner <[email protected]>

felixge · 2024-10-13T12:42:24Z

Looks like we have enough consensus here? Unless there are any more concerns, I (or somebody else) can merge this in the next few days.

christos68k · 2024-10-15T07:45:14Z

design-docs/00001-off-cpu-profiling/README.md

+it informs the user space component about this new process. The entry hook for off-CPU profiling
+will also have to do this check, as information needs to be available to be able to unwind the stack,
+but should not inform the user space component, if the process is not known yet.


Why not inform userspace? We have safeguards in place to avoid flooding.

For simplicity userspace should not be informed at this stage. While there are safeguards in place it is hard to predict the impact and name numbers. So maybe informing of userspace can be a future iteration when the design and idea is implemented and such iterations can be backed by benchmarks?

This should be explored in the initial iteration. Our default rate limiting is fairly conservative and I don't anticipate major issues with typical Linux scheduler tick rates (e.g. 250Hz), but if that's not the case we need to know about it and maybe introduce an additional rate limiting option.

The risks section was written in mind with the potential downsides of off-CPU profiling and how they can be mitigated. Overall, the mitigation strategies are just a suggestion and I would also consider them just an implementation detail. Otherwise these mitigation strategies should have been part of Option A and/or Option B, if they impact the design of the proposals.

#196 just reports every unknown PID to userspace and over the last four weeks I didn't run into issues ™️. When writing this design doc, I wanted to be open about potential risks and how they can be mitigates. Back then, I had a limiter of reporting unknown PIDs to userspace in place. So far, it seems I wasn't right with the risk.

christos68k · 2024-10-15T07:46:54Z

design-docs/00001-off-cpu-profiling/README.md

+    if (tid == 0) {
+        // Skip the idle task
+        return
+    }


In the actual eBPF implementation, should we also skip kernel-only tasks?

Skipping kernel-only tasks makes sense. I would consider this an implementation detail, which does not have an effect on the overall design/idea of off-cpu profiling.

We should add this to the design document for future reference.

I do consider this an implementation detail as it does not have an effect on the overall designs of the proposal. Skipping certain profiles just reduces the work that is done and makes sure the component is not using resources, without providing value for the user.

design-docs/00001-off-cpu-profiling/README.md

christos68k · 2024-10-15T07:50:11Z

design-docs/00001-off-cpu-profiling/README.md

+    if (rand % 100 > 3 ) {
+        // Overload prevention - make sure only 3% of scheduling events are profiled
+        return
+    }


This should also be configurable in the actual eBPF implementation.

With system_config there is already a map that can be used to also make this configurable. To showcase the idea and for simplicity, rand % 100 > 3 was chosen. But I don't see a reason not to make it configurable.

design-docs/00001-off-cpu-profiling/README.md

christos68k · 2024-10-15T07:59:22Z

design-docs/00001-off-cpu-profiling/README.md

+`tracepoint:sched:sched_switch`. The additional hook on `kprobe:finish_task_switch` for Option B
+might also introduce some latency, as kprobes are less performant than tracepoints. But the latency
+information along with the off-CPU stack trace justify these drawbacks from my point of view.


On supported kernels, we can use trampolines (fentry program) that are faster than kprobes.

This initial design documents tries not to change the minimum supported kernel version and tries to bring the off-cpu functionality to the given circumstances. With newer kernel version things can be done in a simpler way in more places. Using these advanced capabilities for newer kernels should be a future iteration.

Signed-off-by: Florian Lehner <[email protected]>

This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>

florianl · 2024-10-20T17:23:14Z

With #196 I have opened a draft, that backs this RFC. Once this RFC got merged, I will continue with #196.

rockdaboot · 2024-10-24T16:30:01Z

Forgot to mention before: I'm voting for option B

design-docs/00001-off-cpu-profiling/README.md

christos68k · 2024-10-24T19:54:03Z

design-docs/00001-off-cpu-profiling/README.md

+It is the schedulers responsibility in the Linux kernel, to manage tasks[^2] and provide tasks with
+CPU resources. In this concept [__schedule()](https://github.com/torvalds/linux/blob/5be63fc19fcaa4c236b307420483578a56986a37/kernel/sched/core.c#L6398)
+is the central function that takes and provides CPU resources to tasks and does the CPU context
+switch.


Let's add some context as to how often this function gets called:

Suggested change

switch.

switch.

Typical Linux scheduler tick rates are 100, 250, 500 or 1000Hz.

Naming frequencies might suggest to readers, that the Linux scheduler is called in some constant frequency - which is not correct. Usually, the scheduler provides timeslots for processing to events. The size of the timeslots is configurable, via /proc/sys/kernel/sched_rr_timeslice_ms, and is by default 100 ms. Often it is a problem, that events don't make full use of these timeslots - e.g. by calling a syscall or running into a lock. So how often the Linux kernel scheduler is called and in which freuqnecy, is depending on various variables, that are configurable but also can change dynamically depending on the work of events.

christos68k · 2024-10-24T20:01:57Z

design-docs/00001-off-cpu-profiling/README.md

+## Sampling vs. Aggregation
+
+Both proposed options leverage sampling techniques for off-CPU profiling. While aggregation in the
+eBPF space can potentially reduce performance overhead by communicating only aggregated data to the


I am not sure what is meant by aggregation here (merging of events?), e.g. I'm assuming the sampling approach will also aggregate data in maps which will be periodically cleared by userspace.

Then, given that unwinding takes up the majority of eBPF processing time, how can aggregation in the eBPF space that is not sampling (e.g. unwinding every call) be more performant than the sampling approach which will only unwind a subset of these calls?

I am not sure what is meant by aggregation here (merging of events?), e.g. I'm assuming the sampling approach will also aggregate data in maps which will be periodically cleared by userspace.

On the eBPF side no aggregation or merging of events for off-CPU data is intended or planed. Once off-CPU hooks are triggered the stack unwinding is using existing eBPF programs and reporting.
In stead of reusing the existing stack unwinding and reporting approach, it would be possible to aggregate off-CPU data and just report merged events to user space. My personal suggestion and the purpose of this document is, not to go this path.

Then, given that unwinding takes up the majority of eBPF processing time, how can aggregation in the eBPF space that is not sampling (e.g. unwinding every call) be more performant than the sampling approach which will only unwind a subset of these calls?

The overhead of the sampling approach is directly related to the sampling threshold. In the above shown concepts, it is 3% and in #196 it is configurable by the user via the CLI flag -off-cpu-threshold. The higher this sampling threshold, the more overhead is introduced by the sampling approach.
To follow the concepts as shown in this document, if 3 is replaced with 99, then nerarly every scheduling event triggers the off-CPU hooks. In such a case it would make sense, to aggregate off-CPU data in eBPF space for a summarized report to user space, instead of reporting every single stack unwinding.

Overall, sampling and aggregation can be done in various places and it depends on configuration and workload of the system. This paragraph should make the user aware of alternative concepts and focuses on the eBPF space for the sampling vs. aggregation approach. The proposed concepts should be fit most use cases - optimization for special cases is always an option.

Signed-off-by: Florian Lehner <[email protected]>

fabled

LGTM, and I am voting also option B. Seems all agree on this, so we have consensus.

@florianl Could you update the Decision in the end, and potential Meta in the beginning too. We can merge this right after this. Thank you!

Signed-off-by: Florian Lehner <[email protected]>

florianl · 2024-10-28T09:50:54Z

Added the decision to go forward with Option B.

This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>

design doc: proposal for off-cpu profiling

8cb2b24

Signed-off-by: Florian Lehner <[email protected]>

florianl added the design-document label Aug 31, 2024

florianl requested review from a team August 31, 2024 10:55

florianl changed the title ~~design doc: proposal for off-cpu profiling~~ [RFC] design doc: proposal for off-cpu profiling Aug 31, 2024

felixge approved these changes Sep 3, 2024

View reviewed changes

design-docs/00001-off-cpu-profiling/README.md Show resolved Hide resolved

fixup: add Limitations section

a0a4527

Signed-off-by: Florian Lehner <[email protected]>

brancz approved these changes Sep 9, 2024

View reviewed changes

rockdaboot reviewed Sep 16, 2024

View reviewed changes

design-docs/00001-off-cpu-profiling/README.md Outdated Show resolved Hide resolved

rockdaboot reviewed Sep 16, 2024

View reviewed changes

design-docs/00001-off-cpu-profiling/README.md Outdated Show resolved Hide resolved

rockdaboot approved these changes Sep 16, 2024

View reviewed changes

iogbole reviewed Sep 18, 2024

View reviewed changes

design-docs/00001-off-cpu-profiling/README.md Outdated Show resolved Hide resolved

fixup: fix typos

bb6c478

Signed-off-by: Florian Lehner <[email protected]>

florianl requested review from a team as code owners October 3, 2024 12:24

christos68k reviewed Oct 15, 2024

View reviewed changes

fabled mentioned this pull request Oct 15, 2024

Proposal: support for third party OSS to use otel-ebpf-profiler #192

Merged

florianl added 3 commits October 16, 2024 08:47

fixup: apply feedback

40e6656

Signed-off-by: Florian Lehner <[email protected]>

fixup: apply feedback

bdbd5d6

Signed-off-by: Florian Lehner <[email protected]>

fixup: drop the word additional

b8f2593

Signed-off-by: Florian Lehner <[email protected]>

florianl mentioned this pull request Oct 20, 2024

Off CPU profiling #196

Open

christos68k approved these changes Oct 24, 2024

View reviewed changes

fixup: apply feedback

415eb40

Signed-off-by: Florian Lehner <[email protected]>

fabled approved these changes Oct 28, 2024

View reviewed changes

fixup: add decision

e0406b7

Signed-off-by: Florian Lehner <[email protected]>

fabled merged commit e3120e4 into open-telemetry:main Oct 28, 2024
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] design doc: proposal for off-cpu profiling #144

[RFC] design doc: proposal for off-cpu profiling #144

florianl commented Aug 31, 2024

felixge left a comment •

edited

Loading

tsint commented Sep 5, 2024

florianl commented Sep 5, 2024

tsint commented Sep 5, 2024

florianl commented Sep 5, 2024

brancz left a comment

iogbole left a comment

felixge commented Oct 13, 2024

christos68k Oct 15, 2024

florianl Oct 16, 2024

christos68k Oct 16, 2024 •

edited

Loading

florianl Oct 24, 2024

christos68k Oct 15, 2024

florianl Oct 16, 2024

christos68k Oct 16, 2024

florianl Oct 24, 2024

christos68k Oct 15, 2024

florianl Oct 16, 2024 •

edited

Loading

christos68k Oct 15, 2024

florianl Oct 16, 2024

florianl commented Oct 20, 2024

rockdaboot commented Oct 24, 2024

christos68k Oct 24, 2024

florianl Oct 25, 2024 •

edited

Loading

christos68k Oct 24, 2024 •

edited

Loading

florianl Oct 25, 2024

fabled left a comment

florianl commented Oct 28, 2024

-switch.
+switch.
+Typical Linux scheduler tick rates are 100, 250, 500 or 1000Hz.

[RFC] design doc: proposal for off-cpu profiling #144

[RFC] design doc: proposal for off-cpu profiling #144

Conversation

florianl commented Aug 31, 2024

felixge left a comment • edited Loading

Choose a reason for hiding this comment

tsint commented Sep 5, 2024

florianl commented Sep 5, 2024

tsint commented Sep 5, 2024

florianl commented Sep 5, 2024

brancz left a comment

Choose a reason for hiding this comment

iogbole left a comment

Choose a reason for hiding this comment

felixge commented Oct 13, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

christos68k Oct 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

florianl Oct 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

florianl commented Oct 20, 2024

rockdaboot commented Oct 24, 2024

Choose a reason for hiding this comment

florianl Oct 25, 2024 • edited Loading

Choose a reason for hiding this comment

christos68k Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fabled left a comment

Choose a reason for hiding this comment

florianl commented Oct 28, 2024

felixge left a comment •

edited

Loading

christos68k Oct 16, 2024 •

edited

Loading

florianl Oct 16, 2024 •

edited

Loading

florianl Oct 25, 2024 •

edited

Loading

christos68k Oct 24, 2024 •

edited

Loading