[RFC] design doc: proposal for off-cpu profiling #144
Let's add some context as to how often this function gets called:
Naming frequencies might suggest to readers that the Linux scheduler is called at some constant frequency, which is not correct. Usually, the scheduler hands out timeslots in which tasks get to do their processing. The size of these timeslots is configurable via /proc/sys/kernel/sched_rr_timeslice_ms and defaults to 100 ms. Often tasks don't make full use of their timeslot, e.g. because they call a syscall or run into a lock. So how often the Linux kernel scheduler is called, and at which frequency, depends on various variables that are configurable but can also change dynamically with the workload of the tasks.
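To check the configured timeslot size on a particular system, a minimal sketch in plain C (it only reads the sysctl named above and assumes the file is readable):

```c
#include <stdio.h>

int main(void)
{
    /* Read the SCHED_RR timeslice mentioned above; this only inspects
     * the value and does not change any scheduler behavior. */
    FILE *f = fopen("/proc/sys/kernel/sched_rr_timeslice_ms", "r");
    int timeslice_ms = 0;

    if (f != NULL && fscanf(f, "%d", &timeslice_ms) == 1)
        printf("SCHED_RR timeslice: %d ms\n", timeslice_ms);
    if (f != NULL)
        fclose(f);
    return 0;
}
```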
Why not inform userspace? We have safeguards in place to avoid flooding.
For simplicity, userspace should not be informed at this stage. While there are safeguards in place, it is hard to predict the impact and to name concrete numbers. So maybe informing userspace can be a future iteration once the design and idea are implemented, and such iterations can then be backed by benchmarks?
This should be explored in the initial iteration. Our default rate limiting is fairly conservative and I don't anticipate major issues with typical Linux scheduler tick rates (e.g. 250Hz), but if that's not the case we need to know about it and maybe introduce an additional rate limiting option.
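If an additional rate limiting option does turn out to be necessary, a rough sketch of what it could look like on the eBPF side (the map name, the one-second window, and the passed-in limit are assumptions for illustration, not part of the proposal):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Per-CPU state for a simple fixed-window limiter. */
struct rate_limit_state {
    u64 window_start_ns;
    u64 events_in_window;
};

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, struct rate_limit_state);
} off_cpu_rate_limit SEC(".maps");

/* Returns true if the current off-CPU event should be dropped. */
static __always_inline bool off_cpu_rate_limited(u64 max_per_sec)
{
    u32 key = 0;
    struct rate_limit_state *st = bpf_map_lookup_elem(&off_cpu_rate_limit, &key);
    if (!st)
        return true;

    u64 now = bpf_ktime_get_ns();
    if (now - st->window_start_ns > 1000000000ULL) {
        /* Start a new one-second window on this CPU. */
        st->window_start_ns = now;
        st->events_in_window = 0;
    }
    if (st->events_in_window >= max_per_sec)
        return true;

    st->events_in_window++;
    return false;
}
```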
The risks section was written with the potential downsides of off-CPU profiling in mind and how they can be mitigated. Overall, the mitigation strategies are just suggestions and I would also consider them an implementation detail. Otherwise these mitigation strategies should have been part of Option A and/or Option B, if they impacted the design of the proposals.
#196 just reports every unknown PID to userspace and over the last four weeks I didn't run into issues ™️. When writing this design doc, I wanted to be open about potential risks and how they can be mitigated. Back then, I had a limiter for reporting unknown PIDs to userspace in place. So far, it seems I overestimated that risk.
In the actual eBPF implementation, should we also skip kernel-only tasks?
Skipping kernel-only tasks makes sense. I would consider this an implementation detail, which does not have an effect on the overall design/idea of off-cpu profiling.
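For illustration, a minimal sketch of how such a skip could look in the eBPF hook (the kprobe attach point and program name are assumptions, not taken from the design doc):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/finish_task_switch")
int off_cpu_sched_switch(struct pt_regs *ctx)
{
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    struct mm_struct *mm = NULL;

    /* Kernel-only tasks have no userspace address space (task->mm == NULL),
     * so there is no userspace stack to unwind for them. */
    if (bpf_probe_read_kernel(&mm, sizeof(mm), &task->mm) != 0)
        return 0;
    if (mm == NULL)
        return 0;

    /* ... continue with the regular off-CPU stack unwinding ... */
    return 0;
}
```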
We should add this to the design document for future reference.
I do consider this an implementation detail as it does not have an effect on the overall designs of the proposal. Skipping certain profiles just reduces the work that is done and makes sure the component does not use resources without providing value to the user.
This should also be configurable in the actual eBPF implementation.
With system_config there is already a map that can be used to make this configurable as well. To showcase the idea and for simplicity, rand % 100 > 3 was chosen. But I don't see a reason not to make it configurable.
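A rough sketch of how that could look, assuming the existing system_config map gains an additional field; the struct layout and the field name off_cpu_threshold are made up here for illustration and are not the actual definition:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Hypothetical value layout: only the off_cpu_threshold field matters here. */
struct system_config_value {
    u32 off_cpu_threshold; /* sampling threshold in percent, 0..100 */
    /* ... existing configuration fields ... */
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, struct system_config_value);
} system_config SEC(".maps");

/* Replaces the hard-coded rand % 100 > 3 check with a configurable one. */
static __always_inline bool off_cpu_take_sample(void)
{
    u32 key = 0;
    struct system_config_value *cfg = bpf_map_lookup_elem(&system_config, &key);
    if (!cfg)
        return false;

    /* Sample roughly off_cpu_threshold percent of scheduling events. */
    return bpf_get_prandom_u32() % 100 < cfg->off_cpu_threshold;
}
```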
I am not sure what is meant by aggregation here (merging of events?), e.g. I'm assuming the sampling approach will also aggregate data in maps which will be periodically cleared by userspace.
Then, given that unwinding takes up the majority of eBPF processing time, how can aggregation in the eBPF space that is not sampling (e.g. unwinding every call) be more performant than the sampling approach which will only unwind a subset of these calls?
On the eBPF side no aggregation or merging of events for off-CPU data is intended or planned. Once the off-CPU hooks are triggered, the stack unwinding uses the existing eBPF programs and reporting.
Instead of reusing the existing stack unwinding and reporting approach, it would be possible to aggregate off-CPU data and only report merged events to user space. My personal suggestion, and the purpose of this document, is not to go down this path.
The overhead of the sampling approach is directly related to the sampling threshold. In the concepts shown above it is 3%, and in #196 it is configurable by the user via the CLI flag -off-cpu-threshold. The higher this sampling threshold, the more overhead is introduced by the sampling approach. To follow the concepts as shown in this document: if 3 is replaced with 99, then nearly every scheduling event triggers the off-CPU hooks. In such a case it would make sense to aggregate off-CPU data in eBPF space for a summarized report to user space, instead of reporting every single stack unwinding.
Overall, sampling and aggregation can be done in various places, and the right choice depends on the configuration and workload of the system. This paragraph should make the reader aware of alternative concepts and focuses on the eBPF space for the sampling vs. aggregation question. The proposed concepts should fit most use cases; optimization for special cases is always an option.
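For the high-threshold case described above, a sketch of what aggregation in eBPF space could look like; the key layout, map name, and sizes are assumptions for illustration, not part of the proposal:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Hypothetical aggregation key: one entry per process and stack. */
struct off_cpu_key {
    u64 stack_hash; /* assumed identifier produced by the existing unwinder */
    u32 pid;
    u32 pad;        /* explicit padding so the key has no uninitialized bytes */
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct off_cpu_key);
    __type(value, u64); /* accumulated off-CPU time in nanoseconds */
} off_cpu_times SEC(".maps");

/* Adds one measured off-CPU interval to the per-stack total; userspace
 * would periodically read and clear this map for a summarized report. */
static __always_inline void account_off_cpu(struct off_cpu_key *key, u64 delta_ns)
{
    u64 *total = bpf_map_lookup_elem(&off_cpu_times, key);
    if (total)
        __sync_fetch_and_add(total, delta_ns);
    else
        bpf_map_update_elem(&off_cpu_times, key, &delta_ns, BPF_NOEXIST);
}
```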
On supported kernels, we can use trampolines (fentry programs) that are faster than kprobes.
This initial design document tries not to change the minimum supported kernel version and to make the off-CPU functionality work within these constraints. With newer kernel versions, things can be done in a simpler way in more places. Using these advanced capabilities on newer kernels should be a future iteration.
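A hypothetical sketch of such a future iteration on newer kernels, attaching via an fentry trampoline instead of a kprobe; the attach point is shown as finish_task_switch for illustration, and whether that symbol is available for fentry depends on the kernel build and its BTF:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("fentry/finish_task_switch")
int BPF_PROG(off_cpu_fentry, struct task_struct *prev)
{
    /* The same off-CPU logic as in the kprobe variant would go here;
     * the benefit is only the lower dispatch overhead of the trampoline. */
    return 0;
}
```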