
Heap profiler with leak tracking #5763


Merged
merged 13 commits into main from custom-heap-pprof2 on May 14, 2025

Conversation

@rdettai (Collaborator) commented Apr 29, 2025

Description
This PR adds a very lightweight, feature-gated heap profiling endpoint. It supersedes the previous attempt in #5724.

Using jemalloc's native profiling is complicated: it's hard to assess its performance impact, and the resulting traces are hard to interpret.

This PR proposes a simple custom heap profiler:

  • written so that the profiler has almost no performance impact when it is not actively in use: every allocation/deallocation simply checks an atomic ENABLED flag with relaxed ordering and does no extra work if that flag is false (see the sketch after this list)
  • the profiler can be activated through a webservice; it builds allocation statistics by call site and prints a backtrace whenever a call site's cumulated allocations exceed their previous peak by a given threshold
  • it uses backtrace to identify allocation call sites and track the cumulated amount of allocated memory at any given time
  • it prints both a backtrace and a tokio trace when a call site allocates a lot of memory, giving a view of both the stack trace and the tokio future call chain
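
For illustration, here is a minimal sketch of that fast path as a GlobalAlloc wrapper. The names (ENABLED, ProfiledAllocator, the record_* hooks) are hypothetical, not the PR's actual identifiers:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical flag name; flipped by the profiling webservice.
static ENABLED: AtomicBool = AtomicBool::new(false);

struct ProfiledAllocator;

unsafe impl GlobalAlloc for ProfiledAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = System.alloc(layout);
        // Fast path: when profiling is off, this relaxed load is the only overhead.
        if ENABLED.load(Ordering::Relaxed) && !ptr.is_null() {
            // slow path: record the allocation (hypothetical hook)
            // record_allocation(ptr as usize, layout.size());
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        if ENABLED.load(Ordering::Relaxed) {
            // slow path: record the deallocation (hypothetical hook)
            // record_deallocation(ptr as usize, layout.size());
        }
        System.dealloc(ptr, layout);
    }
}
```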

When started, the profiler allocates ~30MB of symbol data.

How was this PR tested?
Python scripts, see results below.

@rdettai self-assigned this Apr 29, 2025
@rdettai requested a review from PSeitz April 29, 2025 15:04
@rdettai changed the base branch from custom-heap-prof to main April 30, 2025 10:18
@rdettai (Collaborator Author) commented May 6, 2025

I ran some tests both during indexing (ingest V2) and search (float aggregation), on a fairly large setup (10 indexes with 1 million docs ingested into each index). Durations are in seconds:

Without the feature flag:

  • indexing duration: 96.81
  • avg search duration: 0.0058

With the feature flag but without starting the profiling:

  • indexing duration: 96.80
  • avg search duration: 0.0060

With the profiling activated with min_alloc_size=64KiB

  • indexing duration: 96.77
  • avg search duration: 0.0059

With the profiling activated for search only and min_alloc_size=1KiB

  • indexing duration: 96.80
  • avg search duration: 0.014

With the profiling activated for indexing and min_alloc_size=1KiB

  • heap profiling stopped, memory_locations full

@rdettai force-pushed the custom-heap-pprof2 branch 2 times, most recently from 725eb82 to 010e717 on May 13, 2025 15:08
@rdettai (Collaborator Author) commented May 13, 2025

New measurements with the tokio tracing:

Without the feature flag:

  • indexing duration: 96.82606220245361
  • avg search duration: 0.004787445068359375

With the profiling activated with min_alloc_size=64KiB:

  • indexing duration: 96.7932481765747
  • avg search duration: 0.005725836753845215

@rdettai force-pushed the custom-heap-pprof2 branch 3 times, most recently from 2e4b375 to 13a7447 on May 14, 2025 08:31
@rdettai requested a review from fulmicoton-dd May 14, 2025 08:33
}
}

/// Updates the the memory location and size of an existing allocation. Only
Collaborator

Suggested change:
- /// Updates the the memory location and size of an existing allocation. Only
+ /// Updates the memory location and size of an existing allocation. Only

};
let (callsite_hash, old_size_bytes) = if old_ptr != new_ptr {
let Some(old_alloc) = self.memory_locations.remove(&(old_ptr as usize)) else {
return ReallocRecordingResponse::ThresholdNotExceeded;
Collaborator

is this supposed to never happen or something like that?

Collaborator Author

Allocations that exist before the profiler is started are not tracked in self.memory_locations. For those untracked allocations, we decided not to track reallocations either. It's a small blind spot in the profiler: for instance, we are not able to pick up the growth of a long-lived memory arena.
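
A minimal self-contained sketch of that behavior, with hypothetical Profiler, Allocation, and ReallocRecordingResponse types (only the remove-or-bail pattern mirrors the excerpt above):

```rust
use std::collections::HashMap;

enum ReallocRecordingResponse {
    ThresholdNotExceeded,
}

struct Allocation {
    callsite_hash: u64,
    size_bytes: usize,
}

struct Profiler {
    memory_locations: HashMap<usize, Allocation>,
}

impl Profiler {
    fn record_realloc(
        &mut self,
        old_ptr: usize,
        new_ptr: usize,
        new_size_bytes: usize,
    ) -> ReallocRecordingResponse {
        // Allocations made before profiling started were never inserted into
        // `memory_locations`, so the lookup fails and the realloc is skipped too.
        let Some(old_alloc) = self.memory_locations.remove(&old_ptr) else {
            return ReallocRecordingResponse::ThresholdNotExceeded;
        };
        // Re-key the tracked allocation under its new address and size. Thanks
        // to the remove just above, this insert does not grow the map.
        self.memory_locations.insert(
            new_ptr,
            Allocation {
                size_bytes: new_size_bytes,
                callsite_hash: old_alloc.callsite_hash,
            },
        );
        ReallocRecordingResponse::ThresholdNotExceeded
    }
}
```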

let Some(old_alloc) = self.memory_locations.remove(&(old_ptr as usize)) else {
return ReallocRecordingResponse::ThresholdNotExceeded;
};
self.memory_locations.insert(
Collaborator

how are we certain that this insert does not allocate?

Collaborator

ah I assume this is because you have done a remove before?

Collaborator Author

Indeed, this insert does not grow the map because of the remove right before. At a load factor of 0.5 I'm pretty confident no resize will be triggered.
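
For reference, a quick way to sanity-check the remove-then-insert invariant with a plain std HashMap (illustrative only; the profiler's actual map, hasher, and load factor may differ):

```rust
use std::collections::HashMap;

fn main() {
    let mut map: HashMap<usize, u64> = HashMap::with_capacity(1024);
    for i in 0..512 {
        map.insert(i, 0);
    }
    let capacity_before = map.capacity();
    // Each iteration removes one entry before inserting another, so len()
    // never grows and the insert never has to reallocate the table.
    for i in 0..512 {
        map.remove(&i);
        map.insert(i + 10_000, 0);
    }
    assert_eq!(map.capacity(), capacity_before);
    println!("capacity unchanged: {capacity_before}");
}
```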

@fulmicoton-dd (Collaborator) left a comment

cool job

@rdettai force-pushed the custom-heap-pprof2 branch from a16a059 to 1459304 on May 14, 2025 14:07
@rdettai enabled auto-merge (squash) May 14, 2025 14:10
@rdettai merged commit 6ec11c5 into main May 14, 2025
8 checks passed
@rdettai deleted the custom-heap-pprof2 branch May 14, 2025 14:20