-
Notifications
You must be signed in to change notification settings - Fork 281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] design doc: proposal for off-cpu profiling #144
Conversation
Signed-off-by: Florian Lehner <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very promising proposal, thanks for submitting it 🙇 .
I think only the capabilities in Option B make sense to me. Capturing Off-CPU stack trace samples without duration doesn't seem useful to me.
The other important question is sampling. I suspect that sampling based on Off-CPU duration is going to produce more useful results than giving each event an equal chance of being captured. JFR implements this with thresholds, but we could also explore something along the lines of Go's Allocation sampling strategy that sets randomized sampling points drawn from an exponential distribution with a desired mean.
As far as other implementation details are concerned, it might be worth considering the following prior art:
- Off-CPU Analysis article from Brendan Gregg
- bcc's offcputime.py: Which seems to only use a
finish_task_switch
kprobe without instrumenting a tracepoint. - Perfetto's CPU Scheduling Event Docs: Which discusses using
sched_waking
as an alternative tosched_switch
.
My approval indicates that I support the general proposal along the lines of option B, and that we should to hash out sampling strategies and implementation details during prototyping. The results of the prototype should inform the final decision about including the feature.
Oh and I guess another obvious thing to point out: The approach outlined here only seems applicable to programs that rely on blocking I/O using native OS threads. It's not going to work for light weight threads in Go (Goroutines) or Java (Project Loom) or async I/O approaches like Node.js. But I'd imagine it will work well for many C/C++/Ruby/Python/PHP/etc. applications. This should probably be clarified in the document.
Considering that off-CPU profiling is more expensive, it is recommended to add a process filter so that only processes meeting certain criteria are analyzed. |
Filtering for a specific process is, from my point of view, not a viable option, as
|
If off-CPU profiling consumes too many resources, it becomes unsuitable for high-load servers. In many cases, people use profiler to solve problems, and the servers experiencing issues often have insufficient resources. On servers with ample resources, the need for a profiler is not as urgent. |
The |
Signed-off-by: Florian Lehner <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this, and we have thought about implementing something like this in a very similar way.
Regarding overhead, I think since this is entirely opt-in, I see no issues regarding it being system-wide, as that's the entire design of the profiler and there's nothing stopping other projects from taking different approaches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad to see this coming to fruition.
Signed-off-by: Florian Lehner <[email protected]>
Looks like we have enough consensus here? Unless there are any more concerns, I (or somebody else) can merge this in the next few days. |
it informs the user space component about this new process. The entry hook for off-CPU profiling | ||
will also have to do this check, as information needs to be available to be able to unwind the stack, | ||
but should not inform the user space component, if the process is not known yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not inform userspace? We have safeguards in place to avoid flooding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For simplicity userspace should not be informed at this stage. While there are safeguards in place it is hard to predict the impact and name numbers. So maybe informing of userspace can be a future iteration when the design and idea is implemented and such iterations can be backed by benchmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be explored in the initial iteration. Our default rate limiting is fairly conservative and I don't anticipate major issues with typical Linux scheduler tick rates (e.g. 250Hz), but if that's not the case we need to know about it and maybe introduce an additional rate limiting option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The risks section was written in mind with the potential downsides of off-CPU profiling and how they can be mitigated. Overall, the mitigation strategies are just a suggestion and I would also consider them just an implementation detail. Otherwise these mitigation strategies should have been part of Option A
and/or Option B
, if they impact the design of the proposals.
#196 just reports every unknown PID to userspace and over the last four weeks I didn't run into issues ™️. When writing this design doc, I wanted to be open about potential risks and how they can be mitigates. Back then, I had a limiter of reporting unknown PIDs to userspace in place. So far, it seems I wasn't right with the risk.
if (tid == 0) { | ||
// Skip the idle task | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the actual eBPF implementation, should we also skip kernel-only tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skipping kernel-only tasks makes sense. I would consider this an implementation detail, which does not have an effect on the overall design/idea of off-cpu profiling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add this to the design document for future reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do consider this an implementation detail as it does not have an effect on the overall designs of the proposal. Skipping certain profiles just reduces the work that is done and makes sure the component is not using resources, without providing value for the user.
if (rand % 100 > 3 ) { | ||
// Overload prevention - make sure only 3% of scheduling events are profiled | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be configurable in the actual eBPF implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With system_config
there is already a map that can be used to also make this configurable. To showcase the idea and for simplicity, rand % 100 > 3
was chosen. But I don't see a reason not to make it configurable.
`tracepoint:sched:sched_switch`. The additional hook on `kprobe:finish_task_switch` for Option B | ||
might also introduce some latency, as kprobes are less performant than tracepoints. But the latency | ||
information along with the off-CPU stack trace justify these drawbacks from my point of view. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On supported kernels, we can use trampolines (fentry program) that are faster than kprobes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This initial design documents tries not to change the minimum supported kernel version and tries to bring the off-cpu functionality to the given circumstances. With newer kernel version things can be done in a simpler way in more places. Using these advanced capabilities for newer kernels should be a future iteration.
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
Forgot to mention before: I'm voting for option B |
It is the schedulers responsibility in the Linux kernel, to manage tasks[^2] and provide tasks with | ||
CPU resources. In this concept [__schedule()](https://github.com/torvalds/linux/blob/5be63fc19fcaa4c236b307420483578a56986a37/kernel/sched/core.c#L6398) | ||
is the central function that takes and provides CPU resources to tasks and does the CPU context | ||
switch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some context as to how often this function gets called:
switch. | |
switch. | |
Typical Linux scheduler tick rates are 100, 250, 500 or 1000Hz. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming frequencies might suggest to readers, that the Linux scheduler is called in some constant frequency - which is not correct. Usually, the scheduler provides timeslots for processing to events. The size of the timeslots is configurable, via /proc/sys/kernel/sched_rr_timeslice_ms
, and is by default 100 ms. Often it is a problem, that events don't make full use of these timeslots - e.g. by calling a syscall or running into a lock. So how often the Linux kernel scheduler is called and in which freuqnecy, is depending on various variables, that are configurable but also can change dynamically depending on the work of events.
## Sampling vs. Aggregation | ||
|
||
Both proposed options leverage sampling techniques for off-CPU profiling. While aggregation in the | ||
eBPF space can potentially reduce performance overhead by communicating only aggregated data to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what is meant by aggregation here (merging of events?), e.g. I'm assuming the sampling approach will also aggregate data in maps which will be periodically cleared by userspace.
Then, given that unwinding takes up the majority of eBPF processing time, how can aggregation in the eBPF space that is not sampling (e.g. unwinding every call) be more performant than the sampling approach which will only unwind a subset of these calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what is meant by aggregation here (merging of events?), e.g. I'm assuming the sampling approach will also aggregate data in maps which will be periodically cleared by userspace.
On the eBPF side no aggregation or merging of events for off-CPU data is intended or planed. Once off-CPU hooks are triggered the stack unwinding is using existing eBPF programs and reporting.
In stead of reusing the existing stack unwinding and reporting approach, it would be possible to aggregate off-CPU data and just report merged events to user space. My personal suggestion and the purpose of this document is, not to go this path.
Then, given that unwinding takes up the majority of eBPF processing time, how can aggregation in the eBPF space that is not sampling (e.g. unwinding every call) be more performant than the sampling approach which will only unwind a subset of these calls?
The overhead of the sampling approach is directly related to the sampling threshold. In the above shown concepts, it is 3% and in #196 it is configurable by the user via the CLI flag -off-cpu-threshold
. The higher this sampling threshold, the more overhead is introduced by the sampling approach.
To follow the concepts as shown in this document, if 3 is replaced with 99, then nerarly every scheduling event triggers the off-CPU hooks. In such a case it would make sense, to aggregate off-CPU data in eBPF space for a summarized report to user space, instead of reporting every single stack unwinding.
Overall, sampling and aggregation can be done in various places and it depends on configuration and workload of the system. This paragraph should make the user aware of alternative concepts and focuses on the eBPF space for the sampling vs. aggregation approach. The proposed concepts should be fit most use cases - optimization for special cases is always an option.
Signed-off-by: Florian Lehner <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, and I am voting also option B. Seems all agree on this, so we have consensus.
@florianl Could you update the Decision in the end, and potential Meta in the beginning too. We can merge this right after this. Thank you!
Signed-off-by: Florian Lehner <[email protected]>
Added the decision to go forward with Option B. |
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This is the code that backs open-telemetry#144. It can be reused to add features like requested in open-telemetry#33 and therefore can be an alternative to open-telemetry#192. The idea that enables off CPU profiling is, that perf event and kprobe eBPF programs are quite similar and can be converted. This allows, with the dynamic rewrite of tail call maps, the reuse of existing eBPF programs and concepts. This proposal adds the new flag '-off-cpu-threshold' that enables off CPU profiling and attaches the two additional hooks, as discussed in Option B in open-telemetry#144. Outstanding work: - [ ] Handle off CPU traces in the reporter package - [ ] Handle off CPU traces in the user space side Signed-off-by: Florian Lehner <[email protected]>
This proposal builds on and follows the approach described in #114.