Skip to content

Tracing tests, fixes, and optimizations #8223

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

rustyrussell
Copy link
Contributor

Recommended by @cdecker , tests were a great idea!

With an average runtime of 18.7674, this implies 1876ns
per trace, which is far in excess of the 370ns claimed in
doc/developers-guide/tracing-cln-performance.md.

We also add a tag in there, so we measure that!

Results on my laptop:
	real	0m18.524000-19.100000(18.7674+/-0.21)s
	user	0m16.171000-16.833000(16.424+/-0.26)s
	sys	0m2.259000-2.400000(2.337+/-0.059)s

Signed-off-by: Rusty Russell <[email protected]>
I added this debugging because the next test revealed a mismatch, so
I wanted to see where it was happening.

The comment in lightningd suggests it's possible, but I can't see any
code which suspends in the lightningd io_loop, so I cannot see how
this is triggered.

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell added this to the v25.05 milestone Apr 10, 2025
@rustyrussell rustyrussell force-pushed the guilt/tracing-fixes-and-optimization branch from 37ed16a to e88dec6 Compare April 10, 2025 06:52
Suggested-by: Christian Decker
Signed-off-by: Rusty Russell <[email protected]>
Testing parenting handling revealed several issues:

1. By calling "trace_span_start" when CLN_TRACEPARENT is set produces a bogus
   entry, for which the span_id is overwritten so we never end it.
2. We don't need to close the remote parent when we close the first child: in
   fact, this causes the remaining traces to be detached from the parent!
3. Suspension should return current to the parent, not to NULL.

Now the traces balance as we expect.

Signed-off-by: Rusty Russell <[email protected]>
1. trace_span_start() is always called with a string literal, so
   no copy needed (and we can use a macro to enforce this).
2. trace_span_tag() name and value are always longer-lived than
   the span, so no need to copy these either.

Before:
	real	0m18.524000-19.100000(18.7674+/-0.21)s
	user	0m16.171000-16.833000(16.424+/-0.26)s
	sys	0m2.259000-2.400000(2.337+/-0.059)s

After:
	real	0m16.421000-18.407000(17.8128+/-0.72)s
	user	0m14.242000-16.041000(15.5382+/-0.67)s
	sys	0m2.179000-2.363000(2.273+/-0.061)s

Signed-off-by: Rusty Russell <[email protected]>
Avoids allocations.  Also assume that name and value parameters
outlive the trace span, so don't copy.

Before:
	real	0m16.421000-18.407000(17.8128+/-0.72)s
	user	0m14.242000-16.041000(15.5382+/-0.67)s
	sys	0m2.179000-2.363000(2.273+/-0.061)s

After:
	real	0m13.441000-14.592000(14.2686+/-0.43)s
	user	0m11.265000-12.289000(11.9626+/-0.37)s
	sys	0m2.175000-2.381000(2.3048+/-0.072)s

Signed-off-by: Rusty Russell <[email protected]>
There's an EBPF limit anyway, so stick with a 512-byte buffer.

This brings us back to 621ns per trace:

Before:
	real	0m13.441000-14.592000(14.2686+/-0.43)s
	user	0m11.265000-12.289000(11.9626+/-0.37)s
	sys	0m2.175000-2.381000(2.3048+/-0.072)s

After:
	real	0m5.819000-6.472000(6.2064+/-0.26)s
	user	0m3.779000-4.101000(3.956+/-0.12)s
	sys	0m2.040000-2.431000(2.2496+/-0.15)s

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the guilt/tracing-fixes-and-optimization branch from e88dec6 to e60e873 Compare April 10, 2025 09:13
This is much faster to give 64 bits of data, and we don't need
cryptographic randomness.

This brings us back to 413ns per trace.

Before:
	real	0m5.819000-6.472000(6.2064+/-0.26)s
	user	0m3.779000-4.101000(3.956+/-0.12)s
	sys	0m2.040000-2.431000(2.2496+/-0.15)s

After:
	real	0m3.981000-4.247000(4.1276+/-0.11)s
	user	0m3.979000-4.245000(4.126+/-0.11)s
	sys	0m0.000000-0.002000(0.001+/-0.00063)s

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: lightingd: trimmed overhead of tracing infrastructure.
We don't ever actually close the remote span (we don't have its key,
after all), and we keep a pointer to the parent.

Signed-off-by: Rusty Russell <[email protected]>
This doesn't happen very often, but can with autoclean.

However, we rarely traverse to the end, since we always expect to find
what we're looking for, and we fill from the front.  So even a large
array (unless it's used) is fine.

Subtle: when we injected a parent, we used "active_spans" as the (arbitrary)
key.  That can now change with reallocation, and so if that memory were reused
we could have a key clash.  So we use "&active_spans" which doesn't change.

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the guilt/tracing-fixes-and-optimization branch from e60e873 to ab4d1be Compare April 14, 2025 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant