
nix: change default GC for cardano-node #6222

Draft · wants to merge 1 commit into master

Conversation

@mgmeier (Contributor) commented May 16, 2025

Description

This PR changes the default garbage collector (GC) settings for the cardano-node nix service config (SVC).
Sequential GC is replaced by parallel, load-balanced GC for all but the youngest generation.

Changing the default compiler to GHC9.6 prompted us to scrutinize the GC settings, as the ones most beneficial on GHC8.10 can't be assumed to automatically be the best pick for GHC9.6. The change in this PR showed a slight improvement in network performance and a solid reduction of GC pauses in our benchmarks (see below).

Please note that operating a cardano-node does not require these settings - however, they're recommended for efficient operation of block producers.

  • The Haskell package itself leaves GC unspecified (besides -I0, which is strongly advised to minimize GC pauses).
  • The nix SVC offers attrs rtsArgs to replace all default RTS arguments with custom settings, and rts_flags_override to override individual arguments from rtsArgs.
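For illustration, using those attributes might look like the following sketch. The attribute names rtsArgs and rts_flags_override are taken from this PR's description; the exact module path and the flag values shown are assumptions, not the authoritative defaults:

```nix
# Sketch only - the services.cardano-node path is an assumption;
# rtsArgs and rts_flags_override are the attributes described in this PR.
{
  services.cardano-node = {
    # Replace the entire default RTS argument set:
    rtsArgs = [ "-N2" "-I0" "-A16m" "-qg1" "-qb1" ];
    # ...or keep the defaults and override individual flags:
    rts_flags_override = [ "-A32m" ];
  };
}
```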

Checklist

  • Commit sequence broadly makes sense and commits have useful messages
  • CI passes. See note on CI. The following CI checks are required:
    • Code is linted with hlint. See .github/workflows/check-hlint.yml to get the hlint version
    • Code is formatted with stylish-haskell. See .github/workflows/stylish-haskell.yml to get the stylish-haskell version
    • Code builds on Linux, MacOS and Windows for ghc-9.6 and ghc-9.12
  • Self-reviewed the diff

@mgmeier (Contributor, Author) commented May 16, 2025

End-to-end propagation times (s) on benchmarking cluster - saturation workload
-> 24ms - 64ms (4% - 6%) improvement

[figure: qg1_saturation]

@mgmeier (Contributor, Author) commented May 16, 2025

End-to-end propagation times (s) on benchmarking cluster - light submission workload
-> 11ms - 27ms (3% - 5%) improvement

[figure: qg1_light]

@mgmeier (Contributor, Author) commented May 16, 2025

As far as resource usage goes, we found that:

  • under saturation, CPU usage was slightly higher and RAM footprint slightly lower
  • under light submission, CPU usage was slightly lower and RAM footprint slightly higher
  • CPU spikes were shorter in duration by at least 15% in either case

@mgmeier (Contributor, Author) commented May 16, 2025

This table illustrates the shorter average duration and greatly reduced frequency of GC pauses with parallel (par) vs. sequential (seq) GC settings. It lists a subset of individual benchmarking nodes as examples, as well as the cumulative values from all nodes; the count refers to the absolute number of pauses:

[figure: silences_short]

@TerminadaPool commented:

Do you have any advice regarding performance when using the non-moving GC?

@mgmeier (Contributor, Author) commented May 22, 2025

> Do you have any advice regarding performance when using the non-moving GC?

What we've observed, at least in a benchmarking setting (i.e. network under saturation for an extended period of time), is that the non-moving GC led to higher resource usage overall: moderately so for CPU, and significantly for RAM footprint. That buys you higher responsiveness, but unfortunately not much advantage as far as block production metrics are concerned.
This was measured on GHC9.6 - our currently recommended/best supported compiler version.

So internally, optimizing for the use of the non-moving GC hasn't been a priority. Hence, my advice:

  • If you're using the non-moving GC, do so only if you have good familiarity with other runtime (RTS) parameters. You might need to tune them as well to make the node purr.
  • Non-moving GC has received various improvements in recent GHC releases. Consider a build using GHC9.12. The codebase supports that version, but apart from that, again, you're on your own.
  • Beware of the extra RAM you'll need. I've heard of several SPOs testing this GC and crashing their nodes by hitting the RAM limit.
  • Know what you want to optimize for. If you've identified pain points or bottlenecks in your environment, another GC might help - but it might not. Don't jump on it because it's "fancy"; it's definitely not a silver bullet.
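If one does want to experiment despite the caveats above, a minimal sketch using the nix attributes mentioned in this PR (the option path is an assumption, and the flag choice is illustrative, not a recommendation):

```nix
# Illustrative only: swaps in the non-moving collector wholesale.
# Expect a noticeably larger RAM footprint (see the caveats above).
services.cardano-node.rtsArgs = [ "-N" "-I0" "--nonmoving-gc" ];
```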

@TerminadaPool commented:

I had been experimenting with the non-moving GC on a low-performance ARM machine I use as a backup over the past couple of years. I tried various RTS settings, but the only way I could keep that machine from missing slot leadership checks in BP mode with previous cardano-node versions was by using the non-moving GC; the "stop the world" garbage collector caused too much delay, resulting in missed slot checks whenever it ran, no matter what other RTS settings I tested. I compiled cardano-node using various versions of GHC 9.x over that time; most recently I was able to compile cardano-node version 10.3.1 with GHC 9.12.2.

This is what free shows after 12 days running cardano-node as a non-block-producer on this low-performance ARM machine:

               total        used        free      shared  buff/cache   available
Mem:        30752440    19703704     4572700         160     6815584    11048736
Swap:       25163764     4683472    20480292

cardano-node version 10.3.1, compiled with GHC 9.12.2, running with RTS settings +RTS -N --nonmoving-gc -RTS

@TerminadaPool commented:

Another question I have relates to your choice of -N2. I have been using simply -N, as I thought that would make use of all available processors, but then I am confused about whether the -AL setting should be set separately if I use -A16m.

Can you provide some guidance on whether -AL needs changing if the machine has more than 2 processors? Or am I safe using +RTS -N -I0 -A16m -qg1 -qb1 --disable-delayed-os-memory-return -RTS and allowing the Haskell runtime to figure out how many processors are available, with the default -AL being the same as -A (i.e. 16m)?

@mgmeier (Contributor, Author) commented May 23, 2025

Your case is a great example of optimizing against an environment-specific bottleneck: Increased responsiveness was obviously worth the higher RAM requirements.

As to -N: you're right, that distributes Haskell threads across all available cores, whereas -N<n> distributes across only n cores. As to why we chose 2 - see below. As the PR states, these settings are never imposed; they can be replaced or overridden, with or without nix.

As to -AL: Large objects (>3kB) are allocated differently by the runtime and have a global size limit independent of the number of cores used. This limit can be hit prematurely when many cores allocate many large objects - so it can be raised independently. To my knowledge, cardano-node does not allocate many large objects, so you should be fine omitting -AL and letting it default to the same value as -A.
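As a concrete sketch of that relationship (option path assumed, values illustrative):

```nix
# -AL defaults to the -A value, so this implies a 16 MB large-object limit too:
services.cardano-node.rts_flags_override = [ "-A16m" ];
# Raising -AL separately only makes sense when many cores allocate
# many large objects, e.g.:
#   rts_flags_override = [ "-A16m" "-AL64m" ];
```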

@mgmeier (Contributor, Author) commented May 23, 2025

For the number of cores used:
There's a compromise to be made between scaling workload across many cores and reaping the benefits of cache locality. In our benchmarks we found that distributing the concurrent computations of cardano-node across more than 4 cores generally does not outweigh the loss of locality - i.e. overall performance didn't improve, or was even slightly worse. At least for saturation workloads, 2 cores have so far shown the best results on average for block producers. Hence, the default.

However, there's a ton of things to consider:

  • We assume running on an i5/i7-like architecture or a comparable Ryzen family. ARM might be an entirely different story with respect to caching and cache synchronization.
  • We're not sure about the caching implications of the virtual CPUs found with many cloud providers... Trade-offs might differ compared with your own bare metal.
  • This always presupposes the level of concurrency implemented in the current cardano-node. If more tasks are parallelized by changing the implementation, using more cores might turn out more beneficial. There's experimental work on this going on.
  • By pinning certain threads to certain CPU cores, you can make sure that even when scaling to many cores, threads that profit most from locality end up running on the same core. Also, there's experimental work on this being done.

So the value we use here has been shown to cater to the widest array of environments, and across different node roles (block producer, relay, full node wallet), given the system specs that we recommend in the release notes.
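Putting the flags quoted throughout this thread together, the resulting block-producer default presumably looks along these lines (a sketch with an assumed option path; the nix service definition in this PR is the authoritative source):

```nix
# Presumed default: parallel, load-balanced GC for all but the youngest
# generation (-qg1 -qb1), idle GC off (-I0), 2 cores (-N2).
services.cardano-node.rtsArgs =
  [ "-N2" "-I0" "-A16m" "-qg1" "-qb1" "--disable-delayed-os-memory-return" ];
```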

Thanks for bearing with me; I hope that brought some clarity. Even though the whole RTS tuning business seems pretty complicated, you have to give it to Haskell that the same build can adapt to pretty much any environment - whereas in other languages, many performance characteristics can only be adapted by actually changing the implementation.

@TerminadaPool commented:

Thank you for your time, considered thoughts, and all the work you are doing. I am totally amazed at how robust cardano-node is and how you are able to rewrite Haskell functional code to be more optimised. 🥇
