Possible to pare down code size? #2786

Closed
syg opened this issue Feb 23, 2024 · 16 comments

syg commented Feb 23, 2024

As discussed elsewhere, Chrome and V8 would ideally like to see the size of the proposal reduced to make shipping on low-resource devices like lower-end Android phones more palatable.

This issue focuses only on static code size: the size that ends up in the shipped Chrome binary.

The bulk of Temporal in V8 is implemented across 3 files: builtins-temporal.cc, js-temporal-objects.cc, and temporal-parser.cc. IIUC this isn't even complete yet. @FrankYFTang, is that the case?

The numbers we have as of V8 12.4 on a 32-bit release Android APK:

File                    Size of .text + .data + .rodata + .data.rel.ro
builtins-temporal.cc    125.92 KiB
js-temporal-objects.cc  108.83 KiB
temporal-parser.cc       14.80 KiB
Total                   249.55 KiB

V8 contributes a total of 9539.62 KiB, which makes Temporal at least 2.6% of all of V8.

This is after size optimizations. V8 has already disabled runtime callstats (a profiling functionality that bloats code size) on release channels due to Temporal.

The guideline from the Android folks is that a feature that adds 100–250 KiB is considered Medium, with 250+ KiB considered Large. "Features" in that context are not strictly language features: a new JIT tier is also a feature, and V8 uses a lot of its size budget for our compilers since we believe better performance has outsized impact.

So, ideally, any reduction in code size here would be welcome. The closer we get to 100 KiB the better.

ptomato added this to the Stage 4 milestone Feb 23, 2024
ptomato pinned this issue Feb 23, 2024

syg commented Feb 26, 2024

cc @msaboff if JSC has any numbers to share re: Apple's resource-constrained devices.

ptomato commented Mar 14, 2024

I have some good things to report on this topic. I've spent the last 2 weeks trying out some experiments in V8 to see what is most effective in reducing the binary size.

I've been measuring as follows. I use ./tools/dev/gm.py android_arm.release to build V8 for ARM32, and am looking at the object files and the built d8 executable. Here are the build args generated by this configuration:

is_component_build = false
is_debug = false
target_cpu = "arm"
v8_target_cpu = "arm"
target_os = "android"

v8_enable_backtrace = true
v8_enable_disassembler = true
v8_enable_object_print = true
v8_enable_verify_heap = true
dcheck_always_on = false

Edit: also make sure to add v8_enable_runtime_call_stats = false

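For example, one way to apply that extra flag to an existing gm.py build directory might be the following (a sketch; it assumes gn from depot_tools is on your PATH and that out/android_arm.release is the directory produced by the command above):

# Append the flag to the existing GN args and regenerate the build files.
echo 'v8_enable_runtime_call_stats = false' >> out/android_arm.release/args.gn
gn gen out/android_arm.release
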
To try to reproduce the size measurements of the object files above, I used the following:

# Sum the .text, .rodata, and .data* section sizes of each Temporal object file.
for file in builtins-temporal js-temporal-objects temporal-parser; do \
  size -A -d out/android_arm.release/obj/v8_base_without_compiler/$file.o | \
  grep -E '\.(text|rodata|data)' | cut -c20-25 | jq -s add; \
done

Here's what I get:

File                    Size
builtins-temporal.o     ~~181597~~ 67281
js-temporal-objects.o   157880
temporal-parser.o       23920
Total                   ~~363397 (355k)~~ 249081 (243k)

Unfortunately these numbers are different from what Shu posted above. (Edit: They are only slightly different now.) We talked about this offline and concluded that it's probably not possible to reproduce it exactly — the numbers in the OP were measured with an internal Google tool. So I'm treating this baseline number with some skepticism. Instead, I'll focus on comparing various built d8 binaries to the d8 built from 'vanilla' V8, using Bloaty.

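For reference, the Bloaty comparisons look roughly like this (a sketch; the two d8 paths are placeholders for an experimental build and the vanilla baseline build):

# Diff the section sizes of an experimental d8 against the vanilla d8.
bloaty -d sections out/android_arm.release.experiment/d8 -- out/android_arm.release.vanilla/d8
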
More to come in following comments!

ptomato commented Mar 14, 2024

A short overview of where the size goes.

JS builtins are the JS function objects, approximately equivalent to intrinsics using the %-notation in the spec text, e.g. %Temporal.PlainDate.prototype.add%. There is one per public API and this is pretty much fixed. Temporal has ~300 of them.

C++ builtins are C++ functions linked in such a way that when you call a JS builtin from JS, the C++ builtin is executed. Currently, Temporal has ~300 of them, one per public API, but unlike JS builtins that number is not fixed.

There is also C++ code called by the C++ builtins, but that isn't a builtin itself. Of course the machine code generated by the compiler also takes up space in the binary.

The only way you could reduce the number of JS builtins is by removing public API from the proposal. (Is it really the only way? You could try sharing the same JS builtin between more than one API entry point, but then you'd have e.g. Object.is(Temporal.PlainDate.prototype.valueOf, Temporal.PlainTime.prototype.valueOf) equal to true which would leak implementation details to user code.)

You can reduce the number of C++ builtins by reusing the same C++ builtin for multiple JS builtins. I'll call this "multiplexing". I've found that C++ builtins take up the most space in the binary, ~~60%~~ about 40% of the total space of Temporal, so there are good gains to be had here. More on this later.

You can also reduce the size of the C++ code, which hopefully leads to the compiled machine code being smaller. You really have to delete a lot of code for this to have any noticeable effect, I found.

ptomato commented Mar 14, 2024

Multiplexing. There seem to be many opportunities to combine similar C++ builtins into one. For example, the getters for the various types' .calendarId properties could be implemented something like this, in JS-like pseudocode:

function CalendarIdGetter() {
  let slotValue;
  switch (arguments.callee[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR]) {
    case PLAIN_DATE:
      if (!IsTemporalDate(this)) throw new TypeError();
      slotValue = this[PLAIN_DATE_CALENDAR_SLOT];
      break;
    case PLAIN_DATE_TIME:
      if (!IsTemporalDateTime(this)) throw new TypeError();
      slotValue = this[PLAIN_DATE_TIME_CALENDAR_SLOT];
      break;
    // ...
  }
  return ToTemporalCalendarIdentifier(slotValue);
}

Then each getter is a separate JS function object, but each calls the same C++ builtin as its body. We can stash the info needed to control the C++ builtin's behaviour on private properties of that JS function object; for example, here we'd stash the numeric constant PLAIN_DATE as Object.getOwnPropertyDescriptor(Temporal.PlainDate.prototype, 'calendarId').get[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR].

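For illustration, installing those getters might look something like this, in the same JS-like pseudocode (CreateBuiltinFunction and the constant names are hypothetical stand-ins for engine internals):

function InstallCalendarIdGetter(prototype, constructorTag) {
  // Each prototype gets its own JS function object...
  let getter = CreateBuiltinFunction(CalendarIdGetter);  // hypothetical helper
  // ...but they all share the same C++ builtin body; the per-type tag is
  // stashed on a private property of the function object.
  getter[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR] = constructorTag;
  Object.defineProperty(prototype, 'calendarId', { get: getter, configurable: true });
}

InstallCalendarIdGetter(Temporal.PlainDate.prototype, PLAIN_DATE);
InstallCalendarIdGetter(Temporal.PlainDateTime.prototype, PLAIN_DATE_TIME);
// ...
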
We can do similar things for the .valueOf() methods, for the .epochSeconds/.epochMilliseconds etc. getters, and for all the getters such as .year, .monthCode, .daysInMonth, .inLeapYear etc. that just invoke a calendar method when they are called.

The runtime overhead of this is basically 1 switch statement. That's a naive view, as branch prediction will likely interfere, so it'd probably be good to benchmark that.

Note that there is precedent in the V8 codebase for this: There isn't a separate builtin for each TypedArray constructor, instead they all share the same one. (Although it's a TurboFan builtin, not a C++ builtin.)

I have a branch at ptomato/v8@3cb3a6f that implements some of this multiplexing. We could do more or less of it; I just picked a few of the ones that seemed like obvious gains. d8 compiled from this branch is ~~68k~~ 36k smaller than my baseline, about a ~~19%~~ 15% reduction.

ptomato commented Mar 14, 2024

You can also take the multiplexing idea to its extreme conclusion and multiplex everything into one giant C++ builtin. That essentially looks like this, in JS-like pseudocode:

function TemporalMethodMultiplexer() {
  switch (arguments.callee[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR]) {
    case PLAIN_DATE:
      if (!IsTemporalDate(this)) throw new TypeError();
      switch (arguments.callee[PRIVATE_SYMBOL_METHOD_NAME]) {
        case 'add':
          return TemporalPlainDatePrototypeAdd(this,
                 arguments[0], arguments[1]);
        case 'subtract':
          return TemporalPlainDatePrototypeSubtract(this, 
                 arguments[0], arguments[1]);
        // ...
      }
    case PLAIN_DATE_TIME:
      if (!IsTemporalDateTime(this)) throw new TypeError();
      switch (arguments.callee[PRIVATE_SYMBOL_METHOD_NAME]) {
        // ...
      }
    // ...
  }
}

This makes the runtime overhead basically 2 switch statements, which is again a naive view. (In real code, the inner switch statements aren't really switch statements, they're string comparisons which sometimes collapse to pointer comparisons, because many but not all of the possible method names are internalized strings.)

This approach would probably need some fine-tuning based on usage data, because the order in which methods are considered does matter for performance...

I also have a branch implementing this approach, at v8/v8@fab0ae9. (For reasons, it multiplexes everything into 3 builtins: one for all constructors, one for all static methods, and one for all prototype methods.) d8 compiled from this branch is ~~199k~~ 92k smaller than my baseline, about a ~~56%~~ 38% reduction.

(This is what suggests to me that the most effective gains come from reducing the number of C++ builtins. This branch removes builtins but almost no actual implementation.)

syg commented Mar 15, 2024

Interesting, multiplexing is a much bigger reduction than I expected. Might be worth revisiting the decision.

Multiplexing was discussed and rejected in the past as a general technique because it muddies performance timers and tracing. One can argue that Temporal performance isn't so important, but the idea is if you're tracing a whole application, you'd want to know that X% of the app time is spent in, say, Date subtraction vs DateTime subtraction. That would be harder to tease out if there's just a multiplexed C++ frame for both.

syg commented Mar 15, 2024

Is there any interest in API simplification itself? It is extremely batteries-included.

ptomato commented Mar 15, 2024

I've been focusing primarily on whether we can achieve gains without touching the current state of the proposal and delaying it even more. But I did do a build where I simply ripped out a bunch of the APIs and all the code associated with them, probably about 30–40% of the total proposal, applied on top of the fully multiplexed build. That build was ~~56k~~ 60k smaller than the fully multiplexed build with all APIs intact. (So that number applies to just the code, since all the C++ builtins were already gone.) (Edit: Note that with and without runtime callstats it removed the same number of bytes of machine code, about 56k, but in the no-RCS case the result happened to just fit in one less 4k page.)

So I think it's not a question of ripping out a few of the included batteries.

For comparison, I also did a build where I implemented #2789 and #2798, which are PRs with potential observable optimizations that we were not planning to pursue as normative changes unless they yielded significant size reductions. That build was 4k smaller but I suspect most of that was padding, because the few hundred bytes less of machine code happened to fit in one less 4k page. (Edit: That build was not any smaller at all, because decreases in the .text and .data segments were offset by increases in .relro_padding.)

ptomato commented Mar 15, 2024

Of course only after I wrote up everything here did I figure out why my baseline number was so much higher than the one in the OP — my build had runtime callstats turned on, while Shu mentioned:

V8 has already disabled runtime callstats (a profiling functionality that bloats code size) on release channels due to Temporal.

(So I assumed that the release builds I'd been making also had runtime callstats disabled, but that was not the case.)

I'll go back through the thread and edit in the updated numbers, leaving the old ones visible with strikeout. But here is a summary.

The size reduction from multiplexing is not as dramatic as I originally thought, but it is still large and in my opinion worth doing.

Additionally, taking the full multiplexing approach could allow V8 to switch runtime callstats back on by default, if Temporal was the only reason it was disabled; once all the Temporal builtins have been removed by multiplexing, runtime callstats actually make builtins-temporal.o 2k smaller for some reason (although of course, the d8 binary is still considerably larger in total with runtime callstats).

The partial multiplexing approach saves 36k and the full multiplexing approach saves 92k, while ripping out 30–40% of the proposal still saves about a further 56k of machine code; this time, due to page alignment, we additionally shed a few k of padding.

ptomato commented May 2, 2024

See #2834 for a different approach to reducing the number of builtins.

@jakobkummerow

I've been asked to comment on this. Summary: size is not free; it does make sense to trim bloat when possible.

Multiplexing addresses one aspect of the cost of having many functions, by avoiding duplication of machine code compiled from C++ (or whatever language the engine is written in), and it's nice to see those savings. (FWIW, I'd use the size command to compare the .text section size; including padding in the results isn't representative because it'll change "randomly" on every unrelated commit, so you might just get lucky or unlucky at the particular measurement you take.)

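For instance, a .text-only comparison could look roughly like this (a sketch; the two d8 paths are placeholders for a baseline build and a multiplexed build):

# Print only the .text rows from the per-file section tables.
size -A out/android_arm.release.vanilla/d8 out/android_arm.release.multiplexed/d8 | grep -E '^\.text'
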
Multiplexing doesn't address another aspect of the cost of having many functions though, and that's simply having the JavaScript object hierarchy. Each object property increases the size of that object. Each closure (all of which have observable identity: (function() {}) !== (function() {})) takes some pointers to store in memory, even when its actual executable body is multiplexed/shared. For a single built-in function, that's of course not very much (I haven't measured it, something on the order of 100 bytes or so would sound about right), but if you have 320 Temporal.* functions installed on the global object, dozens of <iframe>s on your average website with their own global object each, and dozens of open tabs in the typical browser session, it does add up to quite significant memory consumption. Many users' systems are eternally close to an out-of-memory situation; we know from (sometimes accidental...) experiments that allocating just 20MB more causes a significant uptick in the out-of-memory renderer crash rate. (That's the key reason why adding things to the platform should be a very careful consideration: it just doesn't make sense to burden every <iframe> with every feature that anyone might ever need.)

As a related data point, V8 uses a "snapshotting" technique (deserializing a heap snapshot that's embedded in the shipped binary) because creating all of Math.*, Array.*, etc from scratch for every <iframe> would be way too expensive. While snapshotting turns this from "a big problem" to "not much of a problem", that doesn't mean that doubling (or quadrupling even?) the required size of the snapshot wouldn't matter.

Since you're already experimenting with custom V8 builds to measure things, you could relatively easily measure what dropping all or part of Temporal does to memory consumption after starting a d8 session. That's a single global object, so for an estimate of the impact on an average browser session you can multiply that number by 100 to 1000.

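A possible starting point for that experiment (a sketch; --expose-gc, --trace-gc, and -e are existing V8/d8 flags, but the exact methodology is up to whoever runs it) is to force a GC in an otherwise empty session and compare the heap sizes reported in the GC trace between a vanilla build and a build with some or all of Temporal removed:

out/android_arm.release/d8 --expose-gc --trace-gc -e 'gc()'
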

ptomato commented May 23, 2024

I haven't measured the runtime memory consumption. Those experiments could be done, but I focused on the static code size here because that's what @syg requested at the top of the thread. If we need to reverse course on that, I'd appreciate having some guidelines as to what kind of runtime size increase Google would consider acceptable and what would not.

Is the snapshot size not included in the stats that Bloaty gives when comparing d8 binaries?

@jakobkummerow

I didn't mean to suggest reversing course. Binary size is an important metric. My previous comment came from a more general "why proposal size reductions are worth considering" perspective: in addition to binary size, memory consumption matters too. Sorry that that was a bit of a deviation from the original topic of this thread; feel free to discuss it elsewhere if you prefer.

I don't know how to determine a number X such that "X bytes is acceptable, X+1 bytes is not" (for either binary size or memory consumption). The less, the better. If you can cover 98% of the use cases for 70% of the cost, go for it. There may be a few developers who will complain that some niche feature they liked is gone, but on the flip side there will be a few billion users who will be happy that the software on their devices is a little less bloated.

@littledan

@jakobkummerow Thanks for the explanation. This is really helpful. I share your goal of reducing complexity and surface area where possible.

I'm a little curious what your thoughts are on a different kind of multiplexing, more like TypedArray methods: what if we reused the same function identity, but then multiplexed internally among function bodies, as proposed for toJSON in #2858?

Not that this weakens your argument, but I'm just curious about the impact of making the snapshot bigger, and therefore requiring more snapshot restoration work: has the V8 team recently considered implementing lazy allocation of the objects backing some JS builtins, as other browsers do? I wonder if that could help mitigate the runtime memory usage/startup time cost. Is such a technique too complex/costly?

@jakobkummerow

what if we reused the same function identity, but then multiplexed internally among function bodies

I haven't really formed an opinion on which of the size reduction ideas under discussion are most worthwhile/impactful/acceptable.
As a quick estimate: each function with object identity needs a JSFunction (32 bytes), if it has its own JS-level source then it also needs a SharedFunctionInfo (48 bytes), if it has its own C++ implementation then the builtin tables probably add another two pointers or so. I'm not exactly sure which two scenarios are being compared at #2858, but that could mean a lower bound of 800 bytes saved. (The size impact of changes is notoriously hard to predict accurately; when in doubt, try it and measure!)

Has the V8 team recently considered implementing lazy allocation of the objects backing some JS builtins

I haven't heard of any such explorations recently, fwiw.

ptomato commented Jun 13, 2024

I'm assuming the slate of removals that was approved in the TC39 meeting on 2024-06-12 is sufficient to close this issue. Please let us know ASAP if that's not the case.

ptomato closed this as completed Jun 13, 2024
ptomato unpinned this issue Jun 13, 2024