Gc.ramp_{up,down} to avoid wasted collection work during program initialization #13861
base: trunk
Conversation
This sounds like a useful mechanism, but I haven't yet looked at any of the code. I do have a quick comment about benchmarking, though:
Everything the GC does is a space/time tradeoff. So, you can't make a useful statement about a GC policy change by saying "this benchmark goes X% faster" without also talking about its memory usage. (If all we cared about was speed, then why bother collecting at all?) So, to be meaningful, the numbers here need to be at least of the form "X% faster while not using any more memory", or, better yet, plot the whole curve of space-time tradeoffs attained by varying the space_overhead parameter, and show that the curve you get with your change is below the curve you get without it.
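For reference, the space_overhead knob mentioned here can be varied either at run time through the standard Gc module or via OCAMLRUNPARAM. A minimal sketch, where 120 is just an arbitrary example point on the curve:

```ocaml
(* Adjust the GC's space/time tradeoff from within the program; the same
   knob is exposed as the "o=" field of OCAMLRUNPARAM. *)
let () = Gc.set { (Gc.get ()) with Gc.space_overhead = 120 }
```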
The particular microbenchmark that I am using does not have any memory that becomes free during its ramp-up phase, so it would not be appropriate to measure peak memory usage variations. (I just looked, and the peak memory consumption is exactly the same in both configurations, with sub-2% variation between o=80 and o=200.) I can try to gather more interesting results when looking at Coq, once I find where to (and the motivation to) tweak the code to use rampup.
(force-pushed from c0aea45 to a8ad059)
I now have some performance numbers on Rocq usage, instead of a synthetic benchmark, and they also look good.
This is the time required to run the following command from a Coq-8.20 repository after making […]
I observe a small memory increase when using rampup, from around 130Mio to 132Mio on that file, and this holds for values of the […]
The patch to Coq to enable rampup conditionally when loading dependencies is as follows. (This may not be the right place to tweak, I just tried something quickly.)

diff --git i/vernac/library.ml w/vernac/library.ml
index 0cf441a3e1..7666b77729 100644
--- i/vernac/library.ml
+++ w/vernac/library.ml
@@ -448,11 +448,27 @@ let warn_require_in_module =
(fun () -> strbrk "Use of “Require” inside a module is fragile." ++ spc() ++
strbrk "It is not recommended to use this functionality in finished proof scripts.")
+let ramp_up_config =
+ match Sys.getenv "RAMP_UP" with
+ | "yes" | "1" -> true
+ | "no" | "0" -> false
+ | exception Not_found -> false
+ | other ->
+ Printf.ksprintf failwith
+ "incorrect RAMP_UP value %S, expected [yes|no|1|0]." other
+
+let ramp_up f =
+ if not ramp_up_config
+ then f ()
+ else fst (Gc.ramp_up f)
+
let require_library_from_dirpath needed =
if Lib.is_module_or_modtype () then warn_require_in_module ();
+ ramp_up @@ fun () ->
Lib.add_leaf (in_require needed)
let require_library_syntax_from_dirpath ~intern modrefl =
+ ramp_up @@ fun () ->
let needed, contents = List.fold_left (rec_intern_library ~intern) ([], DPmap.empty) modrefl in
let needed = List.rev_map (fun (root, dir) -> root, DPmap.find dir contents) needed in
  Lib.add_leaf (in_require_syntax needed)

P.S.: I backported the PR to earlier OCaml releases for easier testing (the Coq tests above use 5.3): https://github.com/gasche/ocaml/tree/gc-rampup-control-5.3 , https://github.com/gasche/ocaml/tree/gc-rampup-control-5.2.
(force-pushed from 196d88a to 458b4c0)
This is a build-up commit: currently there is no control in the GC to suspend or resume specific allocations, so these counters are always 0. The intuition is to "suspend" allocations during ramp-up phases, and "resume" them during ramp-down.
During [ramp_up], the deallocation work coming from allocations is "suspended". It can be "resumed" by calling [ramp_down]. [ramp_up] does not currently count the total number of suspended allocations (this needs more domain state that is not reset on each major slice), so the user would not know which value to provide to [ramp_down]. This will be added next.
(force-pushed from 458b4c0 to 3c211f2)
@SkySkimmer ran the Coq/Rocq benchmark suite on top of this PR, with a patch to add a ramp_up call. The benchmark results are as follows:

In these results, OLD is some version of Rocq on vanilla 5.2, and NEW is the same version of Rocq built on top of my gc-rampup-5.2 branch, with a single line adding a ramp_up call. The results confirm that this approach is beneficial to Rocq, at least in batch mode: it always improves performance and never increases peak memory consumption (memory numbers are within noise). The speedups vary a lot between Rocq packages, from 0% to 12%. (I hypothesize that packages with a lot of small/fast .v files benefit more, as […].) My conclusion is that […]
Small correction: the changed command is Require, not Import.
This is another iteration on improving the performance of Coq/Rocq on 5.x: a proposal for fixing the GC pacing issues around unmarshalling discussed in #13300 (the root cause was identified by @NickBarnes in #13300 (comment)), offered as an alternative to #13320.
In short:
- #13320 (Avoid excessive GC work during unmarshalling-heavy ramp-up phases) uses a size-based heuristic to guess automatically whether a deserialization will be short-lived or long-lived; it gives good results on many OCaml programs (without changing their code), but it could result in (silent) memory blowup on some rarer programs.
- The current PR provides a manual control in the Gc module for programmers to express this knowledge explicitly. It is more robust and more expressive, but it requires users to change their code, so it will be rarely used.

I have not been able to evaluate the impact of this approach on Rocq itself, but results on a synthetic benchmark suggest that it does improve program performance.
The problem
By default, when we allocate 1Mio in the major heap (via large allocations or promotions from the minor heap), the major GC assumes that about 1Mio of memory has become dead at roughly the same time, because it assumes that the program is in a steady state at peak memory consumption. So it asks the mutator to traverse the major heap to find 1Mio of dead memory to collect.
As discussed in #13300, this assumption is wrong during initialization phases where programs allocate a lot more memory than they collect -- we call this a "ramp up" behavior. In particular, Coq/Rocq typically starts proof scripts with a Require Import instruction, which unmarshalls a lot of files recursively, allocating a lot of memory at the start of the program that typically remains live until the end. (OCaml may have similar behaviors with the .cmi files of module dependencies, which are long-lived.) The GC tries to collect nonexistent dead memory at the same time, and slows down the program. As observed in #13320 (comment), the OCaml 4 GC was more robust to this sort of situation: for some obscure technical reasons, it behaves better when the flawed heuristic demands a lot of useless work -- and so these "ramp up" situations become performance regressions of OCaml 5 compared to OCaml 4.

Past proposal
In #13320 I tried to fix the issue by saying bluntly that unmarshalling, above a certain payload size, should be handled as part of a ramp-up phase -- it should not hasten the GC. This is heuristically true for most OCaml programs that look like type-checkers or proof-checkers or modular analyzers, which deserialize information on program/proof dependencies that remain relevant until the end of processing.
But one could also imagine programs that use deserialization for message-passing, where each deserialized payload typically has a very short life: on each iteration of the event loop we deserialize input messages, do some work, and produce some output, and typically the input data is dead at this point. On such programs the heuristic of #13320 would be very bad: it could result in a blowup in memory consumption.
Current proposal
The current PR implements a more manual approach where the programmer is in charge of being explicit about which parts of the program are "ramp up" phases, where most of the memory allocated does not replace pre-existing memory that becomes dead, and will remain live for a long time. In the Gc module:
During the execution of the ramp_up callback, allocations into the major heap do not hasten collection work. When the callback returns, the user also gets a count of the amount of collection work that was suspended in this way. This lets them optionally call ramp_down at some later time, once they suspect that a corresponding amount of memory has become unreachable, typically when the memory allocated during ramp-up itself becomes dead.
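As a reading aid, the signatures below are my guess at the proposed interface, inferred from the usage in this thread (in particular fst (Gc.ramp_up f) in the Coq patch above); the actual types in the PR, especially the unit used to count suspended work, may differ.

```ocaml
(* Hypothetical sketch of the proposed additions; the integer used to count
   suspended work is an assumption on my part. *)
module type GC_RAMP = sig
  val ramp_up : (unit -> 'a) -> 'a * int
  (* [ramp_up f] runs [f ()]; major-heap allocations made during the call do
     not hasten collection work. Returns the result of [f] together with the
     amount of collection work that was suspended. *)

  val ramp_down : int -> unit
  (* [ramp_down n] re-injects [n] units of previously suspended collection
     work, once a corresponding amount of ramp-up memory is believed dead. *)
end
```

With such an interface, a caller that never intends to ramp down can simply ignore the returned count, which is what the Coq patch earlier in the thread does with fst.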
In the case of Rocq, ramp_up could be called exactly around the unmarshalling of .vo files, or possibly at the slightly larger granularity of Require Import work. (Note that using this PR requires a manual change to the source code, so programs do not get magically faster or slower, unlike with #13320. And using the new functions while supporting older OCaml versions will require configure-time hacks, meh.)

In the case of a message-passing event loop, it is possible to call ramp_up when loading input messages/events at the beginning of each turn of the event loop, and then ramp_down at the end of the turn. This should behave roughly like today in the case where the input message size is approximately constant from one turn to the next, but it would behave better if the sizes are very heterogeneous: without ramp_up, the first round with a "very large" input could make the GC waste collection time at the beginning of the round, while with ramp_up and ramp_down the collection work would be delayed to the end of the turn, where it is actually useful.
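A minimal sketch of this event-loop pattern, assuming the hypothetical interface sketched above (Gc.ramp_up returning the suspended-work count, Gc.ramp_down taking it back); read_events, handle and send are placeholders, not names from the PR:

```ocaml
(* Placeholders standing in for real message deserialization and handling. *)
let read_events () : string list = []
let handle (event : string) : string = event
let send (_reply : string) : unit = ()

let rec event_loop () =
  (* Deserialize this turn's input without hastening the GC. *)
  let events, suspended = Gc.ramp_up read_events in
  List.iter (fun e -> send (handle e)) events;
  (* The turn's input is now dead: give the suspended work back to the GC. *)
  Gc.ramp_down suspended;
  event_loop ()
```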
Benchmarks

I have not yet tried to experiment with using this feature in the Rocq codebase. For now I am using a synthetic benchmark that unmarshalls 30 lists of 50_000 integers, and then does some light work over them (List.map succ). On this benchmark, wrapping the unmarshalling part under a Gc.ramp_up call (whose deferred work is thus ignored) results in a speedup that seems to be around 15-20%. (I tried putting a ramp_down call later; the speedup is still there.)

The benchmark programs are available on my branch
https://github.com/gasche/ocaml/tree/gc-rampup-control-benchmark
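The programs on that branch are authoritative; the following is only my reconstruction of a benchmark of the same shape (30 marshalled lists of 50_000 integers unmarshalled under Gc.ramp_up, then light work), assuming the interface sketched earlier:

```ocaml
(* Build 30 marshalled payloads of 50_000 integers each. *)
let payloads =
  List.init 30 (fun _ -> Marshal.to_string (List.init 50_000 Fun.id) [])

let () =
  (* Unmarshal everything under ramp_up so this allocation burst does not
     hasten the major GC. *)
  let lists, _suspended =
    Gc.ramp_up (fun () ->
        List.map (fun s -> (Marshal.from_string s 0 : int list)) payloads)
  in
  (* Some light work over the unmarshalled data. *)
  let lists = List.map (List.map succ) lists in
  Printf.printf "total length: %d\n"
    (List.fold_left (fun acc l -> acc + List.length l) 0 lists)
```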
How to review
I have rewritten the history so that the PR is pleasant to read commit-by-commit, so I would recommend reviewing it that way. (The definition of ramp_up is a bit complex, and seeing it grow in complexity over time is better than reading it in one go.)

cc @NickBarnes who suggested this broad approach, and @damiendoligez with whom I discussed an interface yesterday. I also discussed the Rocq behavior with @OlivierNicole, @ejgallego and @gadmm.
Implementation notes
The general idea is to add more information to the domain state:

- whether we are currently in a ramp-up phase (… gc_policy);
- whenever we allocate in the major heap, check whether we are in a ramp-up phase and in that case "do the right thing": track the allocation in a way that will not hasten the GC, but correctly counts the amount of suspended work.
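A toy model in plain OCaml (not the actual C runtime code; all names here are illustrative) of this bookkeeping: some per-domain ramp-up state, plus a hook at each major-heap allocation that counts suspended words without hastening the GC.

```ocaml
type toy_domain_state = {
  mutable ramping_up : bool;                (* inside a ramp_up call? *)
  mutable allocated_words : int;            (* drives GC pacing *)
  mutable allocated_suspended_words : int;  (* subset that must not hasten the GC *)
}

(* Conceptually called on each major-heap allocation of [words] words. *)
let on_major_alloc st words =
  st.allocated_words <- st.allocated_words + words;
  if st.ramping_up then
    st.allocated_suspended_words <- st.allocated_suspended_words + words
```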
I found it more difficult than I thought to get the implementation right, for three reasons:
1. My first idea was to not increment domain_state->allocated_words during ramp-up phases (this is the counter that is used for the GC pacing calculation), but allocated_words is used in a lot of other statistics that would become completely wrong during ramp-up. For example, the GC would gladly report that no words have been promoted to the major heap. Instead what I do is that I increment allocated_words as always, but also increment an allocated_suspended_words counter in parallel, which denotes a subset of major allocations since the last slice that should be considered "suspended". Then I need to change the GC-pacing computation in update_major_slice_work to use allocated_words - allocated_suspended_words, but all other usages of allocated_words can remain unchanged.

2. allocated_words and other such counters are reset to 0 on each new slice of the major GC. (Major collections happen incrementally, one "slice" of the memory at a time.) But to count suspended allocations I want to track something during the full ramp-up phase, which (1) could start in the middle of a GC slice and (2) could last for several slices. I tried to have allocated_suspended_words not be reset on each slice, and it does not work. I ended up with two different variables, one with per-slice information and one with per-phase information. (See the documentation comments in domain_state.tbl.)

3. Initially I thought that ramp_down would just subtract from the counter of allocated_suspended_words. (And then that counter can become negative, in which case it adds more GC collection work at the next slice instead of reducing it.) This makes it tricky and I couldn't get the behavior I expected in the corner case where someone calls ramp_down on a large amount of suspended work in the middle of a short ramp_up phase. Now I track two separate unsigned counters: allocated_suspended_words, which grows on ramp_up, and allocated_resumed_words, which grows on ramp_down; the GC work calculation uses allocated_words - allocated_suspended_words + allocated_resumed_words.
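To make the final formula concrete, here is a tiny illustration (not runtime code): the slice paces on its actual allocations, minus what was suspended during ramp-up, plus previously suspended work that has since been resumed.

```ocaml
(* Words that the pacing computation counts, per the formula above. *)
let countable_words ~allocated_words ~allocated_suspended_words
    ~allocated_resumed_words =
  allocated_words - allocated_suspended_words + allocated_resumed_words

(* E.g. 1_000 words allocated in this slice, 600 of them suspended during a
   ramp_up phase, and 200 previously suspended words resumed via ramp_down:
   the GC paces as if 600 words had been allocated. *)
let () =
  assert (countable_words ~allocated_words:1_000
            ~allocated_suspended_words:600 ~allocated_resumed_words:200
          = 600)
```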