Skip to content

Improve startup time #502

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Improve startup time #502

wants to merge 10 commits into from

Conversation

atenart
Copy link
Contributor

@atenart atenart commented Mar 6, 2025

As of now Retis can take quite some time to handle various things before starting a collection. Investigating where time is spent boils down to three main points:

  • A good amount of time is spent reading /proc/kallsyms; not much we can do here.
  • It turns out removing noops when loading BPF programs is far from being optimized. This is fixed by btf/meta filter: introduce boolean expressions #496.
  • Loading BPF and attaching programs take a lot of time and are the major contributor to slowness. This is the goal of this PR (mainly reducing the number of programs loaded).

Combining this and #496 and using the generic profile, we can go from ~6.5s to ~2.5s starting time. In addition to this using kprobe multi helps when using lots of kprobes, especially at exit time (1 link vs 1 per probe). The raw tracepoints benefit from an optimization too, which helps a lot when using a lot of them — this was a major pain point in the past and one of the important reasons Retis could be slow to start.

Note that the last commit, grouping targeted probes with no hooks, might not show a lot of improvement atm but could be in the future. Nevertheless it is more logical, future proof and the changes are really not invasive.

This is based on multiple libbpf-rs changes, which have been merged upstream already but not released yet. This is the only reason this is marked as an RFC.

Some extra details in the commit logs.

@atenart atenart added the run-functional-tests Request functional tests to be run by CI label Mar 6, 2025
@atenart atenart force-pushed the at/startup-perfs branch 2 times, most recently from 7300217 to e315fee Compare March 6, 2025 17:35
@atenart
Copy link
Contributor Author

atenart commented Mar 7, 2025

I just realized support for getting the BPF cookie in raw tracepoints was only added a year ago[1]. For those kernels we could not use the cookie in raw tracepoints and keep the legacy one program per probe logic.

[1] 68ca5d4eebb8 ("bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs")

@atenart
Copy link
Contributor Author

atenart commented Mar 11, 2025

I just realized support for getting the BPF cookie in raw tracepoints was only added a year ago[1]. For those kernels we could not use the cookie in raw tracepoints and keep the legacy one program per probe logic.

[1] 68ca5d4eebb8 ("bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs")

Setting and getting cookies, as well as kprobe multi, are not supported on all kernel versions we want to support (e.g. rhel8). For this reason the old behavior was kept as a fallback when the wanted solution is not available. I also merged the kprobe and kretprobe support to ease such changes and maintenance.

@atenart
Copy link
Contributor Author

atenart commented Mar 11, 2025

TODO: regenerate both vmlinux.h files to latest kernel.

Ok(())
}

fn attach_raw_tracepoints_no_cookie(&mut self, probes: &[Probe]) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea looks nice. It looks like this goes in the direction of limiting (non-coalesceable/common) rodata.
I can't think of specific reasons why this could be limiting, but in general it seems the idea here is to.
For this to work, we need to adopt cookies usage for kernels supporting them, but still handling the "legacy" case (here I wonder if/when we may want to start considering removing the support for those kernels rather than special casing legacy scenarios).
Also, it's fair to say the more we squash, the more the gain, but it's also fair to say that start-up delay is generally more acceptable.

Furthermore, I'm thinking about freplace removal (that probably is not impactful, but potentially consuming .rodata gaining dead code removal), but I'm also thinking about another non-rodata scenario that could lead to handling the exception, and it's about per-probe filtering.
My point is of course raising potential (at least half envisioned) points that are worth a discussion WRT this change.

Do you have some numbers handy about the specific gain here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea looks nice. It looks like this goes in the direction of limiting (non-coalesceable/common) rodata. I can't think of specific reasons why this could be limiting, but in general it seems the idea here is to.

Yes, that's right. This was already the case in practice for kprobes, as the same loaded program was reused. Kprobe multi does not change this. The new part is about raw tracepoints, where not having per-probe rodata helps with loading time.

I think that's fine, the ideal scenario is using rodata for shared capabilities / configuration and the per-symbol map for per-probe configuration. I actually wanted to go further and provide nargs as part of this (allowing to load a single tp program) but that doesn't work with the verifier.

IMO it's actually not a design change, only tp were following this and only because of limitations.

For this to work, we need to adopt cookies usage for kernels supporting them, but still handling the "legacy" case (here I wonder if/when we may want to start considering removing the support for those kernels rather than special casing legacy scenarios).

Yes, 100% agree there will be a time when we'll drop support for this. For now only c8s kernel do not support multi kprobe and cookies. Raw tp cookies are less well supported as support was added ~a year ago (they are in recent c9s kernels).

Also, it's fair to say the more we squash, the more the gain, but it's also fair to say that start-up delay is generally more acceptable.

If we have a really good reason and increasing startup time is the only way, that might be ok. Otherwise > 6 secs seems quite long IMO.

Furthermore, I'm thinking about freplace removal (that probably is not impactful, but potentially consuming .rodata gaining dead code removal), but I'm also thinking about another non-rodata scenario that could lead to handling the exception, and it's about per-probe filtering. My point is of course raising potential (at least half envisioned) points that are worth a discussion WRT this change.

I think it comes down to the need to load a given program + its configuration for a set of probes that can share it. If a single program can, then that's fine. Otherwise that's better. Then it's about how we implement our features to maximize reuse of loaded programs, but it's not any different from the current situation with kprobes. That's a good point to keep in mind though.

Do you have some numbers handy about the specific gain here?

The load of raw tp are currently the most significant reason startup time is long. The change to share the same program for some of them (based on nargs) is thus the most significant gain in this PR. For example adding multi kprobe does not change things much. IIRC sharing raw tp programs alone was a gain of at least 1.5s, which is half the total gain here.

(For completeness, the above statement is when using the generic profile. When adding even more tracepoints the gain is even better, but also there is only a limited number of them. For kprobes, multi support really shows when adding 100s of probes, and mostly at exit time — one link vs one per-probe).

atenart added 10 commits June 2, 2025 12:06
Allowing to use the newly added program attach types (kprobe w/ opts,
raw tp w/ opts and kprobe multi).

Only required change is the way to access rodata, to take into account
that old kernels might not support memory-mapped global variable maps.
This should not change anything for us.

Signed-off-by: Antoine Tenart <[email protected]>
Generating a unified vmlinux.h file from BTF is tricky when more than a
one module is needed. As the module BTF are split BTFs, conflicts in
definitions and re-definitions of types can happen, which is hard to
detect and remove. We used a simple logic, using diff, but that turned
out not to work as expected.

Instead define a list of types we use in Retis and use pahole to
generate the minimum set of definitions required for compilation. This
can be used using multiple modules and avoid conflicts and
re-definitions.

The drawback is having to maintain a list of types we're interested of,
but that is probably the least bad approach.

Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: Antoine Tenart <[email protected]>
Most of the logic is shared, merge the two to improve maintenance.

Signed-off-by: Antoine Tenart <[email protected]>
Instead of attaching all probes one by one, add them to the builders
(transferring ownership, which is more logical) and only then attach all
probes at once. Attaching probes added to a builder does not consume the
builder, it can be reused for later probe(s).

This does not change the current behavior but opens the door for
optimizations.

Signed-off-by: Antoine Tenart <[email protected]>
The kernel symbol address is used as the primary probe configuration
key. It could be retrieved dynamically in k(ret)probes but not in
raw_tracepoints and was there set as rodata. In an effort to move away
from probe-specific rodata, set the symbol address as the BPF cookie
when attaching all probes and use it to get the symbol address.

When setting cookies is not supported, fallback to the old behavior.
This is mainly for c8s.

Signed-off-by: Antoine Tenart <[email protected]>
Instead of attaching kprobes one by one, use kprobe_multi to attach them
all at once, speeding up the attaching time a lot when using lots of
kprobes.

Signed-off-by: Antoine Tenart <[email protected]>
…me nargs

Raw tracepoints are special as the verifier checks at load time no out
of bound argument is accessed, which prevents us from loading a program
once and reusing it for all raw tracepoint probes. However since we
moved the symbol address to the BPF cookie the only probe-specific
variable in our tracepoints is the number of arguments. This one has to
be rodata for the verifier to work, but nothing prevents us from sharing
the same program for tracepoints have the same number of arguments.

Do this. This speed up loading time a lot when using tracepoints, e.g.
the generic profile startup time is divided by two as we have the
following and as loading a raw tracepoint program always take quite some
time:

 nargs 1: [
        "tp:net:napi_gro_receive_entry",
        "tp:net:napi_gro_frags_entry",
        "tp:net:netif_rx",
        "tp:net:net_dev_queue",
        "tp:net:netif_receive_skb"]
 nargs 2: ["tp:net:net_dev_start_xmit"]
 nargs 4: ["tp:skb:kfree_skb"]

Signed-off-by: Antoine Tenart <[email protected]>
Targeted probes use a single use builder. But when they have no hook
attached, they can instead use shared builders. This speeds up attaching
time for those (e.g. skb tracking probes).

Signed-off-by: Antoine Tenart <[email protected]>
@atenart atenart force-pushed the at/startup-perfs branch from 522286d to 39c839c Compare June 5, 2025 14:21
@atenart
Copy link
Contributor Author

atenart commented Jun 5, 2025

Libbpf-rs was released which allowed to complete this PR, however the new functionalities in the probe builders require an updated vmlinux.h header. It turned out generating one was again broken and this took some time as we hit multiple issues.

Things looks OKish now, but the updated gen_vmlinux.sh script relies on non-upstream pahole fixes. We might want to wait for those to be submitted to see if the approach (and the generated vmlinux.h file) is right.

@atenart atenart changed the title [RFC] Improve startup time Improve startup time Jun 5, 2025
@atenart atenart added this to the v1.6 milestone Jun 5, 2025
Copy link

codecov bot commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 0% with 332 lines in your changes missing coverage. Please review.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #502      +/-   ##
==========================================
- Coverage   31.67%   31.23%   -0.44%     
==========================================
  Files          95       94       -1     
  Lines       11026    11180     +154     
  Branches    11026    11180     +154     
==========================================
  Hits         3492     3492              
- Misses       7314     7468     +154     
  Partials      220      220              
Flag Coverage Δ
aarch64 31.23% <0.00%> (-0.44%) ⬇️
x86_64 31.23% <0.00%> (-0.44%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


fn attach(&mut self) -> Result<()> {
let tmp = std::mem::take(&mut self.probes);
self.attach_kprobes(&tmp)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why not draining self.probes inside attach_kprobes or just inlining it here?

const volatile u32 nargs = 0;

/* Only used in legacy mode where the raw tracepoint program can't be reused */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we are kind of messy with this but at least in this file we are adding full stops at the end of sentences...

@@ -38,6 +38,27 @@ impl Inspector {
}
}

/// Parses a struct and returns its variant names, trimed if asked to.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/a struct/an enum

// We're not supporting every variant in between (kprobe + cookie, kprobe
// multi - cookies) to keep things reasonable. It's also fine to fallback in
// case the kernel is not recent enough.
#[default]
KprobeNoCookie,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a two variant enum that needs such an explanation suggests to me that we could replace this with a simpler bool legacy, or `bool compat'.

#[allow(clippy::map_entry)] // Fixes double mutable refs.
if !self.skels.contains_key(&nargs) {
let new = self.init_skel(nargs, None)?;
self.skels.insert(nargs, new);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would self.skels contain the key? AFAICS, it is called once per distinct nargs.
Isn't it cleaner to always create the skel and insert it into skels at end?


/// Populates generic builders.
fn gen_generic_builders(&mut self) -> Result<()> {
// Already initialized? Bail out early.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the comment is rather redundant.

@@ -401,6 +402,7 @@ pub(crate) struct ProbeRuntimeManager {
#[cfg(not(test))]
counters_map: libbpf_rs::MapHandle,
generic_builders: HashMap<usize, Box<dyn ProbeBuilder>>,
targeted_nohook_builders: HashMap<usize, Box<dyn ProbeBuilder>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Unrelated but when trying to read this code after a while its not trivial what the key of this HashMap (and the others) is. Maybe we could type ProbeTypeIdx = usize

Ok(())
}

/// Populates targeted nohhok builders.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/nohhok/nohook/


/// Populates targeted nohhok builders.
fn gen_targeted_nohook_builders(&mut self) -> Result<()> {
// Already initialized? Bail out early.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -460,12 +462,7 @@ impl ProbeRuntimeManager {
}

/// Populates generic builders.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is now confusing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-functional-tests Request functional tests to be run by CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants