[RFC] design doc: proposal for off-cpu profiling #144 (merged Oct 28, 2024)

design-docs/00001-off-cpu-profiling/README.md

Off-CPU Profiling
=============================

# Meta

- **Author(s)**: Florian Lehner
- **Start Date**: 2024-06-01
- **Goal End Date**: to be defined
- **Primary Reviewers**: to be defined

# Abstract

The OTel Profiling Agent, while effective for on-CPU profiling, faces limitations in identifying
application blockages that introduce latency.

```mermaid
gantt
dateFormat SSS
axisFormat %L
title Database query of 100ms
section Thread Execution
On-CPU: on, 0, 20ms
Off-CPU: after on, 80ms
```
Latency impact example[^1].

To address this, the OTel Profiling Agent should extend its capabilities to include off-CPU
profiling. By combining on-CPU and off-CPU profiling, the OTel Profiling Agent can provide a more
comprehensive understanding of application and system performance. This enables identifying
bottlenecks and optimizing resource utilization, which leads to reduced energy consumption
and a smaller environmental footprint.

# Scope

This document focuses on the hook points and the additional value that off-CPU profiling can provide
to the OTel Profiling Agent.

## Success criteria

The OTel Profiling Agent should be extended so that existing profiling and stack unwinding
capabilities are reused to enable off-CPU profiling. Off-CPU profiling should be an optional
feature that can be enabled in addition to sampling-based on-CPU profiling.

## Non-success criteria

Off-CPU profiling is not a replacement for dedicated disk I/O, memory allocation, network I/O, lock
contention or other specialized performance analysis. It can only serve as an indicator that further
investigation into these dedicated areas is warranted.

Visualization and analysis of the off-CPU profiling information as well as correlating this data
with on-CPU profiling information is not within the scope of this proposal.

# Proposal

The OTel Profiling Agent is a sampling-based profiler that utilizes the perf subsystem as the entry
point for periodic stack unwinding. By default, a sampling frequency of [20Hz](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/cli_flags.go#L24)
is used.

The eBPF program [`perf_event/native_tracer_entry`](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L860C6-L860C36)
is the entry program that starts the stack unwinding. To do so, it collects information such as the
contents of the CPU registers before starting the stack unwinding routine via tail calls. The
tail call destinations for the stack unwinding, like [`perf_event/unwind_native`](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L751),
are generic eBPF programs that should be repurposed for off-CPU profiling.

The following proposal evaluates options for using additional hooks as entry points for stack
unwinding in order to enable off-CPU profiling capabilities.

With tracepoints and kprobes, the Linux kernel provides two instrumentation mechanisms that allow
monitoring and analyzing the behavior of the system. To keep the impact of profiling minimal,
tracepoints are preferred over kprobes, as the former are more performant and statically defined in
the Linux kernel code.

A list of all tracepoints in the scope of the Linux kernel scheduler can be retrieved with
`sudo bpftrace -l 'tracepoint:sched*'`. While most of these tracepoints are specific to a process,
kernel or other event, this proposal focuses on generic scheduler tracepoints.

## Technical background

In the Linux kernel, it is the scheduler's responsibility to manage tasks[^2] and provide tasks with
CPU resources. In this design, [`__schedule()`](https://github.com/torvalds/linux/blob/5be63fc19fcaa4c236b307420483578a56986a37/kernel/sched/core.c#L6398)
is the central function that takes CPU resources from tasks, provides them to other tasks and
performs the CPU context switch.

## Risks

All the following proposed options face a common challenge: it is possible to overload the system
by profiling every scheduling event. All proposed options mitigate this risk by

1. Ignoring the scheduler's idle task.
2. Using a sampling approach to reduce the number of profiled scheduling events. The exact
   sampling rate should be configurable.

The OTel Profiling Agent uses a technique that can be described as "lazy loading". Every time
the eBPF program of the OTel Profiling Agent [encounters a PID that is unknown](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L845-L846),
it informs the user space component about this new process. The entry hook for off-CPU profiling
will also have to perform this check, as the information needs to be available to unwind the stack,
but it should not inform the user space component if the process is not known yet.

## Option A

Attach stack unwinding functionality to the tracepoint `tracepoint:sched:sched_switch`. This
tracepoint is called every time the Linux kernel scheduler takes resources from a task before
assigning these resources to another task.

Similar to the eBPF program [`perf_event/native_tracer_entry`](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L860C6-L860C36),
a new eBPF program of type `tracepoint` needs to be written that can act as the entry point and
tail call into the generic stack unwinding routines.

### Concept
The following [bpftrace](https://github.com/bpftrace/bpftrace) script showcases Option A:
```bash
#!/usr/bin/env bpftrace

tracepoint:sched:sched_switch
{
    if (tid == 0) {
        // Skip the idle task
        return;
    }
    if (rand % 100 > 3) {
        // Overload prevention - profile only ~3% of scheduling events
        return;
    }
    printf("PID %d is taken off from CPU\n", pid);
    printf("%s", ustack());
    printf("\n");
}
```

## Option B

Use a two-step method to not only get stack information but also record how long tasks were
taken off CPU.

In a first step, use the tracepoint `tracepoint:sched:sched_switch` to record which task was taken
off CPU along with a timestamp. In a second hook at `kprobe:finish_task_switch.isra.0`, check if the
task was seen before. If the task was seen before in the tracepoint, calculate the time the task was
off CPU and unwind the stack. Only the second step should tail call into further stack unwinding
routines, similar to [`perf_event/native_tracer_entry`](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L860C6-L860C36).
To communicate tasks between the two hooks, a `BPF_MAP_TYPE_LRU_HASH` eBPF map should be used with
the return value of `bpf_get_current_pid_tgid()` as key and the timestamp in nanoseconds as value.

### Concept
The following [bpftrace](https://github.com/bpftrace/bpftrace) script showcases Option B:
```bash
#!/usr/bin/env bpftrace

tracepoint:sched:sched_switch
{
    if (tid == 0) {
        // Skip the idle task
        return;
    }
    if (rand % 100 > 3) {
        // Overload prevention - profile only ~3% of scheduling events
        return;
    }
    @task[tid] = nsecs;
}

kprobe:finish_task_switch.isra.0
/@task[tid]/
{
    $off_start = @task[tid];
    delete(@task[tid]);
    printf("PID %d was off CPU for %d nsecs\n", pid, nsecs - $off_start);
    printf("%s", ustack());
    printf("\n");
}
```

## Sampling vs. Aggregation

Both proposed options leverage sampling techniques for off-CPU profiling. While aggregation in the
eBPF space can potentially reduce performance overhead by communicating only aggregated data to the
user space component, it introduces additional complexity in managing the data. Additionally, it can
be more challenging to analyze the aggregated data effectively, as it requires careful consideration
of aggregation techniques.
As the architecture of the stack unwinding routines in the OTel Profiling Agent is focused on a
sampling approach, the proposed options follow this idea.

# Author's preference

My preference is Option B, as it provides latency information in addition to off-CPU stack traces,
which is crucial for latency analysis.

Option B might be a bit more complex, as it utilizes two additional hooks along with an additional
eBPF map for them to communicate, compared to Option A with a single hook on
`tracepoint:sched:sched_switch`. The additional hook on `kprobe:finish_task_switch` for Option B
might also introduce some latency, as kprobes are less performant than tracepoints. But the latency
information along with the off-CPU stack trace justifies these drawbacks from my point of view.

As both options attach to very frequently called scheduler events, they face the same risks.
Mitigating these risks with the [described approaches](#risks) is essential.

# Decision

to be defined

[^1]: Inspired by `Systems Performance` by Brendan Gregg, Figure 1.3 `Disk I/O latency example`.
[^2]: The scheduler does not know about the concept of processes and process groups and treats
everything as a task.