
Gc.ramp_{up,down} to avoid wasted collection work during program initialization #13861

Open · wants to merge 9 commits into trunk from gc-rampup-control
Conversation

@gasche (Member) commented Mar 8, 2025

This is another iteration on improving the performance of Coq/Rocq on OCaml 5.x: a proposal to fix the GC pacing issues around unmarshalling discussed in #13300 (the root cause was identified by @NickBarnes in #13300 (comment)), offered as an alternative to #13320.

In short:

  • #13320 (“Avoid excessive GC work during unmarshalling-heavy ramp-up phases”) uses a size-based heuristic to guess automatically whether a deserialization will be short-lived or long-lived; it gives good results on many OCaml programs (without changing their code), but it could result in (silent) memory blowup on some rarer programs.

  • The current PR instead provides a manual control in the Gc module for programmers to express this knowledge explicitly. It is more robust and more expressive, but requires users to change their code, so it will be used more rarely.

  • I have not been able to evaluate the impact of this approach on Rocq itself, but results on a synthetic benchmark suggest that it does improve program performance.

The problem

By default, when the program allocates 1Mio in the major heap (via large allocations or promotions from the minor heap), the major GC assumes that about 1Mio of memory has become dead at roughly the same time, because it assumes that the program is at a steady state of peak memory consumption. So it asks the mutator to traverse the major heap to find about 1Mio of dead memory to collect.

As discussed in #13300, this assumption is wrong during initialization phases where programs allocate a lot more memory than they collect -- we call this "ramp up" behavior. In particular, Coq/Rocq proof scripts typically start with a Require Import instruction, which unmarshalls a lot of files recursively, allocating a lot of memory at the start of the program that typically remains live until the end. (OCaml itself may behave similarly with the .cmi files of module dependencies, which are long-lived.) The GC tries to collect nonexistent dead memory at the same time, and slows down the program. As observed in #13320 (comment), the OCaml 4 GC was more robust in this sort of situation: for some obscure technical reasons, it behaves better when the flawed heuristic demands a lot of useless work -- so these "ramp up" situations become performance regressions of OCaml 5 compared to OCaml 4.

Past proposal

In #13320 I tried to fix the issue by saying bluntly that unmarshalling, above a certain payload size, should be handled as part of a ramp-up phase -- it should not hasten the GC. This is heuristically true for most OCaml programs that look like type-checkers or proof-checkers or modular analyzers, which deserialize information on program/proof dependencies that remain relevant until the end of processing.

But one could also imagine programs that use deserialization for message-passing, where each deserialized payload typically has a very short life: on each iteration of the event loop we deserialize input messages, do some work, and produce some output, at which point the input data is typically dead. On such programs the heuristic of #13320 would be very bad: it could result in a blowup in memory consumption.

Current proposal

The current PR implements a more manual approach where the programmer is in charge of marking explicitly which parts of the program are "ramp up" phases -- phases where most of the allocated memory does not replace pre-existing memory that just died, and will remain live for a long time. In the Gc module:

type suspended_collection_work

val ramp_up : (unit -> 'a) -> 'a * suspended_collection_work
val ramp_down : suspended_collection_work -> unit

During the execution of the ramp_up callback, allocations into the major heap do not hasten collection work. When the callback returns, the user also gets back the amount of collection work that was suspended in this way. This lets them optionally call ramp_down at some later time, once they suspect that a corresponding amount of memory has become unreachable -- typically when the memory allocated during ramp-up itself becomes dead.
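
For illustration, here is a minimal usage sketch against the proposed interface (load_dependencies is a stand-in for real initialization work, and of course this only compiles against this PR's branch):

(* A stand-in for unmarshalling-heavy, long-lived initialization. *)
let load_dependencies () = List.init 1_000_000 (fun i -> i)

let () =
  (* During [ramp_up], major-heap allocations do not hasten the GC. *)
  let deps, suspended = Gc.ramp_up load_dependencies in
  (* ... the bulk of the program runs here, using [deps] ... *)
  ignore (List.length deps);
  (* Later, once a comparable amount of memory is believed dead,
     optionally resume the suspended collection work. *)
  Gc.ramp_down suspended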

In the case of Rocq, ramp_up could be called exactly around the unmarshalling of .vo files, or possibly at the slightly larger granularity of Require Import work. (Note that using this PR requires a manual change to the source code, so programs do not magically get faster or slower, unlike with #13320. And using the new functions while supporting older OCaml versions will require configure-time hacks, meh.)

In the case of a message-passing event loop, one can call ramp_up when loading input messages/events at the beginning of each turn of the event loop, and then ramp_down at the end of the turn. This should behave roughly as today when the input size is approximately constant from one turn to the next, but it would behave better when the sizes are very heterogeneous: without ramp_up, the first turn with a "very large" input could make the GC waste collection time at the beginning of the turn, while with ramp_up and ramp_down the collection work is delayed to the end of the turn, where it is actually useful.
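
As a sketch of that pattern (receive_messages and handle are hypothetical stand-ins, not code from this PR):

(* Hypothetical stand-ins for real message I/O and processing. *)
let receive_messages () : string list = [ "msg1"; "msg2" ]
let handle (_msg : string) = ()

(* One turn of the event loop: deserialize inputs under ramp_up,
   process them, then ramp_down at the end of the turn, when the
   input data is typically dead. *)
let rec event_loop () =
  let messages, suspended = Gc.ramp_up receive_messages in
  List.iter handle messages;
  Gc.ramp_down suspended;
  event_loop ()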

Benchmarks

I have not yet tried to experiment with using this feature in the Rocq codebase. For now I am using a synthetic benchmark that unmarshalls 30 lists of 50_000 integers, and then does some light work over them (List.map succ). On this benchmark, wrapping the unmarshalling part in a Gc.ramp_up call (whose deferred work is thus ignored) results in a speedup that seems to be around 15-20%. (I tried adding a ramp_down call later; the speedup is still there.)
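
For reference, here is a minimal sketch of the shape of such a benchmark -- my reconstruction for illustration, not the exact programs, which are on the branch below:

(* Marshal 30 lists of 50_000 integers, then unmarshal them under
   Gc.ramp_up and do some light work (List.map succ) on the results. *)
let payloads =
  List.init 30 (fun _ ->
    Marshal.to_string (List.init 50_000 (fun i -> i)) [])

let () =
  let lists, _suspended =
    Gc.ramp_up (fun () ->
      List.map (fun s -> (Marshal.from_string s 0 : int list)) payloads)
  in
  ignore (List.map (List.map succ) lists)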

The benchmark programs are available on my branch
https://github.com/gasche/ocaml/tree/gc-rampup-control-benchmark

How to review

I have rewritten the history so that the PR is pleasant to read commit-by-commit, so I would recommend reviewing it that way. (The definition of ramp_up is a bit complex, and seeing it grow in complexity over time is better than reading it in one go.)

cc @NickBarnes, who suggested this broad approach, and @damiendoligez, with whom I discussed an interface yesterday. I also discussed the Rocq behavior with @OlivierNicole, @ejgallego and @gadmm.

Implementation notes

The general idea is to add more information to the domain state:

  • whether we are in a ramp-up phase or not (gc_policy)
  • (during ramp-up) how much collection work has been suspended

Whenever we allocate in the major heap, we check whether we are in a ramp-up phase, and in that case "do the right thing": track the allocation in a way that does not hasten the GC, but correctly counts the amount of suspended work. (A sketch of the resulting bookkeeping follows the numbered list below.)

I found it more difficult than I thought to get the implementation right, for three reasons:

  1. My first idea was to not increment domain_state->allocated_words during ramp-up phases (this is the counter used in the GC-pacing calculation), but allocated_words feeds a lot of other statistics that would become completely wrong during ramp-up. For example, the GC would gladly report that no words had been promoted to the major heap.

    Instead, I increment allocated_words as always, but also increment an allocated_suspended_words counter in parallel, which denotes the subset of major allocations since the last slice that should be considered "suspended". Then I need to change the GC-pacing computation in update_major_slice_work to use allocated_words - allocated_suspended_words, but all other usages of allocated_words can remain unchanged.

  2. allocated_words and other such counters are reset to 0 on each new slice of the major GC. (Major collections happen incrementally, one "slice" of the memory at a time.) But to count suspended allocations I want to track something over the full ramp-up phase, which (1) could start in the middle of a GC slice and (2) could last for several slices. I tried to simply not reset allocated_suspended_words on each slice, and it does not work. I ended up with two different variables, one holding per-slice information and one holding per-phase information. (See the documentation comments in domain_state.tbl.)

  3. Initially I thought that ramp_down would just subtract from the allocated_suspended_words counter. (That counter could then become negative, in which case it adds more GC collection work at the next slice instead of reducing it.) This turned out to be tricky, and I couldn't get the behavior I expected in the corner case where someone calls ramp_down on a large amount of suspended work in the middle of a short ramp_up phase. Now I track two separate unsigned counters: allocated_suspended_words, which grows on ramp_up, and allocated_resumed_words, which grows on ramp_down; the GC work calculation uses allocated_words - allocated_suspended_words + allocated_resumed_words.
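
To summarize the resulting bookkeeping, here is a rough OCaml model (an illustration under my reading of the implementation; the real code lives in the C runtime, and the field name rampup_suspended_total is approximate):

type gc_policy = Normal | Ramp_up

type state = {
  mutable gc_policy : gc_policy;
  (* per-slice counters, reset at the start of each major slice *)
  mutable allocated_words : int;
  mutable allocated_suspended_words : int;
  mutable allocated_resumed_words : int;
  (* per-phase counter, surviving across slices; used to compute the
     [suspended_collection_work] returned by [ramp_up] *)
  mutable rampup_suspended_total : int;
}

(* Major-heap allocation: always counted in [allocated_words], so the
   other statistics remain correct; additionally counted as suspended
   during a ramp-up phase. *)
let on_major_alloc st words =
  st.allocated_words <- st.allocated_words + words;
  if st.gc_policy = Ramp_up then begin
    st.allocated_suspended_words <- st.allocated_suspended_words + words;
    st.rampup_suspended_total <- st.rampup_suspended_total + words
  end

(* [ramp_down] resumes previously-suspended work. *)
let on_ramp_down st suspended_work =
  st.allocated_resumed_words <- st.allocated_resumed_words + suspended_work

(* The quantity fed to the pacing computation in
   [update_major_slice_work]: *)
let pacing_words st =
  st.allocated_words - st.allocated_suspended_words
  + st.allocated_resumed_words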

@stedolan (Contributor) commented
This sounds like a useful mechanism, but I haven't yet looked at any of the code.

I do have a quick comment about benchmarking, though:

On this benchmark, wrapping the unmarshalling part under a Gc.ramp_up call (whose deferred work is thus ignored) results in a speedup that seems to be around 15-20%.

Everything the GC does is a space/time tradeoff. So, you can't make a useful statement about a GC policy change by saying "this benchmark goes X% faster" without also talking about its memory usage. (If all we cared about was speed, then why bother collecting at all?)

So, to be meaningful, the numbers here need to be at least of the form "X% faster while not using any more memory", or, better yet, plot the whole curve of space-time tradeoffs attained by varying the space_overhead parameter, and show that the curve you get with your change is below the curve you get without it.

@gasche (Member, Author) commented Mar 10, 2025

The particular microbenchmark that I am using does not have any memory becoming free during its ramp-up phase, so measuring peak-memory variations would not be meaningful here. (I just looked: the peak memory consumption is exactly the same in both configurations, and shows sub-2% variation between o=80 and o=200.)
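
(For reference, o is the space_overhead parameter; it can be set without recompiling via OCAMLRUNPARAM, e.g. OCAMLRUNPARAM='o=120', or from OCaml code using the standard Gc API -- a minimal sketch:)

(* Set space_overhead (the "o" runtime parameter) for the rest of the run. *)
let () = Gc.set { (Gc.get ()) with Gc.space_overhead = 120 }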

I can try to gather more interesting results when looking at Coq, once I find where to (and the motivation to) tweak the code to use rampup.

@gasche force-pushed the gc-rampup-control branch 2 times, most recently from c0aea45 to a8ad059 (March 10, 2025, 20:56)
@gasche (Member, Author) commented Mar 10, 2025

I now have some performance numbers on Rocq usage, instead of a synthetic benchmark, and they also look good.

Command      Mean [ms]     Relative
rampup=yes   554.9 ± 10.1  1.00
rampup=no    702.1 ± 11.3  1.27 ± 0.03

This is the time required to run the following command from a Coq 8.20 repository after building it with make world:

_build/install/default/bin/coqc -I _build/install/default/lib/ -coqlib _build/install/default/lib/coq/ theories/setoid_ring/Field_tac.v

I observe a small memory increase when using rampup, from around 130Mio to 132Mio on that file, and this holds for values of the space_overhead parameter from 60 to 240:

o=60 rampup=yes
0.27user 0.15system 0:00.45elapsed 93%CPU (0avgtext+0avgdata 132800maxresident)k
0inputs+160outputs (0major+29546minor)pagefaults 0swaps
o=60 rampup=no
0.57user 0.15system 0:00.75elapsed 97%CPU (0avgtext+0avgdata 131160maxresident)k
0inputs+160outputs (0major+29108minor)pagefaults 0swaps

o=120 rampup=yes
0.25user 0.16system 0:00.43elapsed 98%CPU (0avgtext+0avgdata 132520maxresident)k
0inputs+160outputs (0major+29535minor)pagefaults 0swaps
o=120 rampup=no
0.41user 0.16system 0:00.59elapsed 98%CPU (0avgtext+0avgdata 130604maxresident)k
0inputs+160outputs (0major+29092minor)pagefaults 0swaps

o=180 rampup=yes
0.27user 0.15system 0:00.43elapsed 98%CPU (0avgtext+0avgdata 132552maxresident)k
0inputs+160outputs (0major+29491minor)pagefaults 0swaps
o=180 rampup=no
0.41user 0.15system 0:00.57elapsed 98%CPU (0avgtext+0avgdata 130916maxresident)k
0inputs+160outputs (0major+29089minor)pagefaults 0swaps

o=240 rampup=yes
0.29user 0.12system 0:00.42elapsed 98%CPU (0avgtext+0avgdata 132324maxresident)k
0inputs+160outputs (0major+29548minor)pagefaults 0swaps
o=240 rampup=no
0.35user 0.15system 0:00.51elapsed 98%CPU (0avgtext+0avgdata 130884maxresident)k
0inputs+160outputs (0major+29077minor)pagefaults 0swaps

The patch to Coq that enables rampup conditionally when loading dependencies is as follows. (This may not be the right place to tweak; I just tried something quickly.)

diff --git i/vernac/library.ml w/vernac/library.ml
index 0cf441a3e1..7666b77729 100644
--- i/vernac/library.ml
+++ w/vernac/library.ml
@@ -448,11 +448,27 @@ let warn_require_in_module =
     (fun () -> strbrk "Use of “Require” inside a module is fragile." ++ spc() ++
                strbrk "It is not recommended to use this functionality in finished proof scripts.")
 
+let ramp_up_config =
+  match Sys.getenv "RAMP_UP" with
+  | "yes" | "1" -> true
+  | "no" | "0" -> false
+  | exception Not_found -> false
+  | other ->
+    Printf.ksprintf failwith
+      "incorrect RAMP_UP value %S, expected [yes|no|1|0]." other
+
+let ramp_up f =
+  if not ramp_up_config
+  then f ()
+  else fst (Gc.ramp_up f)
+
 let require_library_from_dirpath needed =
   if Lib.is_module_or_modtype () then warn_require_in_module ();
+  ramp_up @@ fun () ->
   Lib.add_leaf (in_require needed)
 
 let require_library_syntax_from_dirpath ~intern modrefl =
+  ramp_up @@ fun () ->
   let needed, contents = List.fold_left (rec_intern_library ~intern) ([], DPmap.empty) modrefl in
   let needed = List.rev_map (fun (root, dir) -> root, DPmap.find dir contents) needed in
   Lib.add_leaf (in_require_syntax needed);

P.S.: I backported the PR to earlier OCaml releases for easier testing (the Coq tests above use 5.3): https://github.com/gasche/ocaml/tree/gc-rampup-control-5.3 , https://github.com/gasche/ocaml/tree/gc-rampup-control-5.2.

@gasche force-pushed the gc-rampup-control branch 2 times, most recently from 196d88a to 458b4c0 (March 10, 2025, 23:47)
gasche added 9 commits March 11, 2025 00:54
This is a buildup commit: currently there is no control in the GC to suspend or resume specific allocations, so these counters are always 0.

The intuition is to "suspend" allocations during ramp-up phases, and "resume" allocations during ramp-down.
During [ramp_up], the deallocation work coming from allocations is
"suspended". It can be "resumed" by calling [ramp_down].

[ramp_up] does not currently count the total number of suspended
allocations (this needs extra domain state that is not reset on each
major slice), so the user would not know which value to provide to
[ramp_down]. This will be added next.
@gasche force-pushed the gc-rampup-control branch from 458b4c0 to 3c211f2 (March 10, 2025, 23:55)
SkySkimmer added commits to SkySkimmer/coq referencing this pull request (Mar 12-13, 2025)
@gasche (Member, Author) commented Mar 13, 2025

@SkySkimmer ran the Coq/Rocq benchmark suite on top of this PR, with a patch to add Gc.ramp_up in the Rocq deserialization logic. (He added ramp_up in a more localized way than I do, so it covers deserialization but not the processing of the result to insert it into the environment. See the patch, a one-line change.)

The benchmark results are as follows:

┌─────────────────────────────────────┬──────────────────────────┬────────────────────────────────────────┬─────────────────────────┐
│                                     │      user time [s]       │            CPU instructions            │  max resident mem [KB]  │
│                                     │                          │                                        │                         │
│            package_name             │   NEW      OLD    PDIFF  │      NEW             OLD        PDIFF  │   NEW      OLD    PDIFF │
├─────────────────────────────────────┼──────────────────────────┼────────────────────────────────────────┼─────────────────────────┤
│                       coq-fiat-core │   61.87    70.81  -12.63 │   394450819517    457079272327  -13.70 │  477436   475660   0.37 │
│            coq-metacoq-translations │   15.61    17.48  -10.70 │   111895965460    126021766570  -11.21 │  758364   760076  -0.23 │
│                         rocq-stdlib │  221.94   239.73   -7.42 │  1434313333193   1561269593455   -8.13 │  588916   587988   0.16 │
│                           coq-color │  254.59   274.68   -7.31 │  1627435228193   1756472332676   -7.35 │ 1160296  1141876   1.61 │
│                   coq-metacoq-utils │   24.06    25.84   -6.89 │   161053729188    172342714159   -6.55 │  578336   576364   0.34 │
│                    coq-fiat-parsers │  284.79   304.59   -6.50 │  2244888309981   2376692017192   -5.55 │ 2261368  2248844   0.56 │
│                coq-metacoq-template │  144.19   153.55   -6.10 │   971825906783   1033734178078   -5.99 │ 1010352   992180   1.83 │
│                    coq-math-classes │   87.50    92.91   -5.82 │   543408215957    583429604735   -6.86 │  494800   495176  -0.08 │
│                         coq-coqutil │   47.56    50.38   -5.60 │   309008754774    328243731980   -5.86 │  583644   581152   0.43 │
│                        rocq-bignums │   30.75    32.46   -5.27 │   200492992443    211769177069   -5.32 │  474052   473908   0.03 │
│                           coq-verdi │   45.17    47.54   -4.99 │   305386823947    321109327841   -4.90 │  537992   535448   0.48 │
│                  coq-metacoq-common │   65.52    68.83   -4.81 │   427543145180    450663622958   -5.13 │  915584   918316  -0.30 │
│                        coq-coqprime │   54.12    56.47   -4.16 │   380735388805    396763321152   -4.04 │  831136   828384   0.33 │
│                   coq-metacoq-pcuic │  637.03   661.96   -3.77 │  4096899441005   4252111545597   -3.65 │ 2186680  2284664  -4.29 │
│                      coq-coquelicot │   37.90    39.32   -3.61 │   226664767441    239517123897   -5.37 │  826312   823272   0.37 │
│                            coq-hott │  166.08   171.84   -3.35 │  1194984061874   1229889462565   -2.84 │  468544   468996  -0.10 │
│               coq-mathcomp-analysis │  577.54   597.16   -3.29 │  4181002996150   4307837512098   -2.94 │ 1584860  1587700  -0.18 │
│                 coq-metacoq-erasure │  511.18   528.24   -3.23 │  3574024695362   3675744536761   -2.77 │ 1857912  1853476   0.24 │
│                            coq-corn │  682.80   703.73   -2.97 │  4706009514420   4888176991326   -3.73 │  682828   737388  -7.40 │
│                   coq-iris-examples │  393.47   404.95   -2.83 │  2636011796820   2719242386939   -3.06 │ 1150600  1149520   0.09 │
│               coq-mathcomp-solvable │   81.85    83.91   -2.46 │   557067322801    573427731589   -2.85 │  901004   898660   0.26 │
│               coq-engine-bench-lite │  119.10   121.58   -2.04 │   964881502719    984154545141   -1.96 │  987968  1048224  -5.75 │
│              coq-mathcomp-odd-order │  511.34   521.89   -2.02 │  3647040275366   3708519708989   -1.66 │ 1765964  1815304  -2.72 │
│                  coq-mathcomp-field │   96.46    98.40   -1.97 │   677371672022    698191732393   -2.98 │ 1553316  1524472   1.89 │
│                      coq-verdi-raft │  556.66   567.15   -1.85 │  3898236004813   3995881973059   -2.44 │  828272   819568   1.06 │
│ coq-neural-net-interp-computed-lite │  252.44   256.92   -1.74 │  2452829110791   2462864603255   -0.41 │  844816   854172  -1.10 │
│              coq-mathcomp-character │   72.13    73.38   -1.70 │   483220903833    497757235139   -2.92 │ 1191272  1183568   0.65 │
│                coq-mathcomp-algebra │  178.29   180.88   -1.43 │  1227107494643   1254494109491   -2.18 │ 1199080  1202164  -0.26 │
│        coq-fiat-crypto-with-bedrock │ 6258.38  6346.55   -1.39 │ 52063728563865  52561242970024   -0.95 │ 4130868  4131504  -0.02 │
│                 coq-category-theory │  624.99   633.61   -1.36 │  4738085087029   4779732812836   -0.87 │ 1033828  1048752  -1.42 │
│               coq-mathcomp-fingroup │   24.12    24.45   -1.35 │   155939466928    158565126306   -1.66 │  514716   514040   0.13 │
│                         coq-unimath │ 2391.33  2423.26   -1.32 │ 19414028286570  19523228327940   -0.56 │ 1158712  1212444  -4.43 │
│                        coq-rewriter │  346.82   351.10   -1.22 │  2660064102787   2683800630602   -0.88 │ 1315428  1329184  -1.03 │
│             coq-metacoq-safechecker │  355.35   359.31   -1.10 │  2754147402814   2771072606877   -0.61 │ 1780904  1768800   0.68 │
│          coq-performance-tests-lite │  927.32   937.50   -1.09 │  7664634539662   7771643498056   -1.38 │ 2350148  2271764   3.45 │
│                           rocq-elpi │   13.29    13.41   -0.89 │    97985182642     99745949722   -1.77 │  476720   478780  -0.43 │
│         coq-rewriter-perf-SuperFast │  479.19   482.64   -0.71 │  3925146862630   3947209908005   -0.56 │ 2001548  1985832   0.79 │
│                            coq-core │    2.87     2.89   -0.69 │    19541235678     19526668836    0.07 │   86276    87424  -1.31 │
│                           rocq-core │    6.88     6.92   -0.58 │    45133986059     45294273246   -0.35 │  437112   436500   0.14 │
│                       coq-fourcolor │ 1334.86  1334.75    0.01 │ 12378764204355  12431403377256   -0.42 │  829128   808984   2.49 │
│              coq-mathcomp-ssreflect │   91.29    91.09    0.22 │   602745456458    606807054888   -0.67 │ 1571760  1575328  -0.23 │
│                      rocq-equations │    8.92     8.87    0.56 │    64647410873     64931300740   -0.44 │  394228   389308   1.26 │
│                        rocq-runtime │   72.30    71.62    0.95 │   528050593671    527909306466    0.03 │  541480   541900  -0.08 │
└─────────────────────────────────────┴──────────────────────────┴────────────────────────────────────────┴─────────────────────────┘

In these results, OLD is some version of Rocq on vanilla 5.2, and NEW is the same version of Rocq built on top of my gc-rampup-control-5.2 branch, with a single line adding a ramp_up call in the deserialization logic. A negative relative difference, -x%, means that the patched version is x% faster or less memory-hungry than the without-ramp_up baseline; a positive difference, +x%, means that it is x% slower or more memory-hungry.

The results confirm that this approach is beneficial to Rocq, at least in batch mode: it always improves performance and never meaningfully increases peak memory consumption (the memory numbers are within noise). The speedups vary a lot between Rocq packages, from roughly 0% to 13%. (I hypothesize that packages with a lot of small/fast .v files benefit more, as Require is a constant-time cost. For example the coq-fourcolor proof is fairly long to check, and its .vo deserialization time is negligible.) There are speedups of 7% for two relatively big Rocq packages, the stdlib and the CoLoR library.

My conclusion is that ramp_up does seem to solve, or at least alleviate, the deserialization-at-initialization-time slowdowns observed in OCaml 5, with a one-line change to the Rocq codebase giving notable performance gains. I remain in favor of pursuing this approach -- hopefully with someone volunteering to review it!

@SkySkimmer commented
I hypothesize that packages with a lot of small/fast .v files benefit more, as Import is a constant-time cost.

Small correction: the changed command is Require, not Import.

SkySkimmer added a commit to SkySkimmer/coq referencing this pull request (Mar 14, 2025)
@damiendoligez self-assigned this Mar 19, 2025
SkySkimmer added a commit to SkySkimmer/coq referencing this pull request (Mar 20, 2025)