
nix: change default GC for cardano-node #6222

Draft · wants to merge 1 commit into master

Conversation

@mgmeier (Contributor) commented May 16, 2025

Description

This PR changes the default garbage collector (GC) settings for the cardano-node nix service config (SVC).
Sequential GC is replaced by parallel, load-balanced GC for all but the youngest generation.

Changing the default compiler to GHC9.6 prompted us to scrutinize the GC settings, as the ones most beneficial on GHC8.10 can't be assumed to automatically be the best pick for GHC9.6. The change in this PR showed a slight improvement in network performance and a solid reduction of GC pauses in our benchmarks (see below).

Please note that operating a cardano-node does not require these settings - however, they're recommended for efficient operation of block producers.

  • The Haskell package itself leaves GC unspecified (besides -I0, which is strongly advised to minimize GC pauses).
  • The nix SVC offers attrs rtsArgs to replace all default RTS arguments with custom settings, and rts_flags_override to override individual arguments from rtsArgs.
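For illustration, using those attributes might look like the following sketch. The attribute names rtsArgs and rts_flags_override are taken from this PR's description; the exact module path and the flag values shown are assumptions, not the authoritative defaults:

```nix
# Sketch only - the services.cardano-node path is an assumption;
# rtsArgs and rts_flags_override are the attributes described in this PR.
{
  services.cardano-node = {
    # Replace the entire default RTS argument set:
    rtsArgs = [ "-N2" "-I0" "-A16m" "-qg1" "-qb1" ];
    # ...or keep the defaults and override individual flags:
    rts_flags_override = [ "-A32m" ];
  };
}
```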

Checklist

  • Commit sequence broadly makes sense and commits have useful messages
  • CI passes. See note on CI. The following CI checks are required:
    • Code is linted with hlint. See .github/workflows/check-hlint.yml to get the hlint version
    • Code is formatted with stylish-haskell. See .github/workflows/stylish-haskell.yml to get the stylish-haskell version
    • Code builds on Linux, MacOS and Windows for ghc-9.6 and ghc-9.12
  • Self-reviewed the diff

@mgmeier (Contributor, Author) commented May 16, 2025

End-to-end propagation times (s) on benchmarking cluster - saturation workload
-> 24ms - 64ms (4% - 6%) improvement

[figure: qg1_saturation]

@mgmeier (Contributor, Author) commented May 16, 2025

End-to-end propagation times (s) on benchmarking cluster - light submission workload
-> 11ms - 27ms (3% - 5%) improvement

[figure: qg1_light]

@mgmeier (Contributor, Author) commented May 16, 2025

As far as resource usage goes, we found that:

  • under saturation, CPU usage was slightly higher and RAM footprint slightly lower
  • under light submission, CPU usage was slightly lower and RAM footprint slightly higher
  • CPU spikes were shorter in duration by at least 15% in either case

@mgmeier (Contributor, Author) commented May 16, 2025

This table illustrates the shorter average duration and greatly reduced frequency of GC pauses with parallel (par) vs. sequential (seq) GC settings. It lists a subset of individual benchmarking nodes as examples, as well as the cumulative values from all nodes; the count refers to the absolute number of pauses:

[figure: silences_short]

@TerminadaPool commented:

Do you have any advice regarding performance when using the non-moving GC?

@mgmeier (Contributor, Author) commented May 22, 2025

> Do you have any advice regarding performance when using the non-moving GC?

What we've observed, at least in a benchmarking setting (i.e. network under saturation for an extended period of time), is that the non-moving GC led to higher resource usage overall: moderately so for CPU, and significantly for RAM footprint. That buys you higher responsiveness, but unfortunately not much advantage as far as block production metrics are concerned.
This was measured on GHC9.6 - our currently recommended/best supported compiler version.

So internally, optimizing for the use of the non-moving GC hasn't been a priority. Hence, my advice:

  • If you're using the non-moving GC, do so only if you have good familiarity with other runtime (RTS) parameters. You might need to tune them as well to make the node purr.
  • Non-moving GC has received various improvements in recent GHC releases. Consider a build using GHC9.12. The codebase supports that version, but apart from that, again, you're on your own.
  • Beware of the extra RAM you'll need. I've heard of several SPOs testing this GC and crashing their nodes by hitting the RAM limit.
  • Know what you want to optimize for. If you've identified pain points or bottlenecks in your environment, another GC might help - but it might not. Don't jump on it because it's "fancy"; it's definitely not a silver bullet.
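If one does want to experiment despite the caveats above, a minimal sketch using the nix attributes mentioned in this PR (the option path is an assumption, and the flag choice is illustrative, not a recommendation):

```nix
# Illustrative only: swaps in the non-moving collector wholesale.
# Expect a noticeably larger RAM footprint (see the caveats above).
services.cardano-node.rtsArgs = [ "-N" "-I0" "--nonmoving-gc" ];
```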

@TerminadaPool commented:

I had been experimenting with the non-moving GC on a low-performance ARM machine I use as a backup over the past couple of years. I tried various RTS settings, but the only way I could keep that machine from missing slot leadership checks in BP mode with previous cardano-node versions was by using the non-moving GC; the "stop the world" garbage collector caused too much delay, resulting in missed slot checks whenever it ran, no matter what other RTS settings I tested. I compiled cardano-node using various versions of GHC 9.x over that time; most recently I was able to compile cardano-node version 10.3.1 with GHC 9.12.2.

This is what free shows after 12 days running cardano-node as a non-block-producer on this low-performance ARM machine:

               total        used        free      shared  buff/cache   available
Mem:        30752440    19703704     4572700         160     6815584    11048736
Swap:       25163764     4683472    20480292

cardano-node version 10.3.1, compiled with GHC 9.12.2, running with RTS settings +RTS -N --nonmoving-gc -RTS

@TerminadaPool commented:

Another question I have relates to your choice of -N2. I have been using simply -N, as I thought that would make use of all available processors, but then I am confused about whether the -AL setting should be set separately if I use -A16m.

Can you provide some guidance on whether -AL needs changing if the machine has more than 2 processors? Or am I safe using +RTS -N -I0 -A16m -qg1 -qb1 --disable-delayed-os-memory-return -RTS and allowing the Haskell runtime to figure out how many processors are available, with the default -AL being the same as -A (i.e. 16m)?

@mgmeier (Contributor, Author) commented May 23, 2025

Your case is a great example of optimizing against an environment-specific bottleneck: Increased responsiveness was obviously worth the higher RAM requirements.

As to -N: you're right, that distributes Haskell threads across all available cores, whereas -N<n> distributes across only n cores. As to why we chose 2 - see below. As the PR states, these settings are never imposed; they can be replaced or overridden, with or without nix.

As to -AL: Large objects (>3kB) are allocated differently by the runtime and have a global size limit independent of the number of cores used. This limit can be hit prematurely when many cores allocate many large objects - so it can be raised independently. To my knowledge, cardano-node does not allocate many large objects, so you should be fine omitting -AL and letting it default to the same value as -A.
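As a concrete sketch of that relationship (option path assumed, values illustrative):

```nix
# -AL defaults to the -A value, so this implies a 16 MB large-object limit too:
services.cardano-node.rts_flags_override = [ "-A16m" ];
# Raising -AL separately only makes sense when many cores allocate
# many large objects, e.g.:
#   rts_flags_override = [ "-A16m" "-AL64m" ];
```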

@mgmeier (Contributor, Author) commented May 23, 2025

For the number of cores used:
There's a compromise to be made between scaling workload across many cores and reaping the benefits of cache locality. In our benchmarks we found that distributing the concurrent computations of cardano-node across more than 4 cores generally does not outweigh the loss of locality - i.e. overall performance didn't improve, or was even slightly worse. At least for saturation workloads, 2 cores have so far shown the best results on average for block producers. Hence, the default.

However, there's a ton of things to consider:

  • We assume running on an i5/i7-like architecture or a comparable Ryzen family. ARM might be an entirely different story with respect to caching and cache synchronization.
  • We're not sure about the caching implications of the virtual CPUs found with many cloud providers... Trade-offs might differ compared with your own bare metal.
  • This always presupposes the level of concurrency implemented in the current cardano-node. If more tasks are parallelized by changing the implementation, using more cores might turn out more beneficial. There's experimental work on this going on.
  • By pinning certain threads to certain CPU cores, you can make sure that even when scaling to many cores, threads that profit most from locality end up running on the same core. Also, there's experimental work on this being done.

So the value we use here has been shown to cater to the widest array of environments, and across different node roles (block producer, relay, full node wallet), given the system specs that we recommend in the release notes.
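Putting the flags quoted throughout this thread together, the resulting block-producer default presumably looks along these lines (a sketch with an assumed option path; the nix service definition in this PR is the authoritative source):

```nix
# Presumed default: parallel, load-balanced GC for all but the youngest
# generation (-qg1 -qb1), idle GC off (-I0), 2 cores (-N2).
services.cardano-node.rtsArgs =
  [ "-N2" "-I0" "-A16m" "-qg1" "-qb1" "--disable-delayed-os-memory-return" ];
```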

Thanks for bearing with me; I hope that brought some clarity. Even though the whole RTS tuning business seems pretty complicated, you have to give it to Haskell that the same build can adapt to pretty much any environment - whereas in other languages, many performance characteristics can only be adapted by actually changing the implementation.

@TerminadaPool commented:

Thank you for your time, considered thoughts, and all the work you are doing. I am totally amazed at how robust cardano-node is and how you are able to rewrite Haskell functional code to be more optimised. 🥇
