-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Canonicalize away bit width and embed small integers into IntId
s
#4487
base: trunk
Are you sure you want to change the base?
Conversation
043e620
to
833c177
Compare
The first change here is to canonicalize away bit width when tracking integers in our shared value store. This lets us have a more definitive model of "what is the mathematical value". It also frees us to use more efficient bit widths when available, such as bits inside the ID itself. For canonicalizing, we try to minimize the width adjustments and maximize the use of the SSO in APInt, and so we never shrink belowe 64-bits and grow in multiples of the word bit width in the implementation. We also canonicalize to the signed 2s compliment representation so we can represent negative numbers in an intuitive way. The canonicalizing requires getting the bit width out of the type and adjusting to it within the toolchain when doing any kind of math, and this PR updates various places to do that, as well as adding some convenience APIs to assist. Then we take advantage of the canonical form and embed small integers into the ID itself rather than allocating storage for them and referencing them with an index. This is especially helpful for the pervasive small integers such as the sizes of types, arrays, etc. Those no longer require indirection at all. Various short-cut APIs to take advantage of this have also been added. This PR improves lexing by about 5% when there are lots of `i32` types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading through to try wrap my head around everything, noticed a few inconsequential things along the way.
IntId
s
Co-authored-by: Dana Jansens <[email protected]>
Co-authored-by: Dana Jansens <[email protected]>
Co-authored-by: Carbon Infra Bot <[email protected]>
What percentage of tokens/bytes being |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LG, sorry about my usual spread of comments. High level I think the IntId and IntStore changes look pretty much like what I'd expected after discussions, I'm glad for the noted performance improvements.
toolchain/base/value_ids.h
Outdated
|
||
static auto MakeIndexOrInvalid(int index) -> IntId { | ||
CARBON_DCHECK(index >= 0 && index <= InvalidIndex); | ||
return IntId(ZeroIndexId - index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there validation that this doesn't produce incorrect values? Is it possible to have a unit test that tries making too many unique integers, to check for graceful failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...
I don't think the unit test is easy to do here, as we don't even have the token payload size limitation, and so we can have a lot of unique integers. Should be 2 billion - 8 million or something, and each needs its own APInt.
But one thing that made me happy about the logic here is that we actually compute the ID from InvalidIndex
(which is the largest value of index
allowed) in a constexpr
context below. And that should ensure that this subtraction doesn't hit UB provided the assert above it holds, and produces the expected ID value even for the largest value. And for the smallest of 0
, its pretty easy to analyze.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More focused on lex, that's a lower limit of 2 million right? Is that feasible to test, like with a string of long integers one after the other?
I think 2B may be infeasible to reach until we get metaprogramming.
Note, fine to not address this in this PR, but I do lean towards that we should test lex thresholds given the low-ish limits.
toolchain/base/value_ids.h
Outdated
|
||
// Tries to make a signed APInt into an embedded value in the ID, and if | ||
// unable to do that returns the `Invalid` ID. | ||
static auto TryMakeSignedValue(llvm::APInt value) -> IntId { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, since you'd asked organizational comments, it might be worth moving these Make functions to IntStore (if the result is more compact)... for example, I'm having to flip back and forth between files in order to understand how IntStore::AddSigned works, and that might've been something that could be in one spot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, merged into one file.
Once there, I moved these all to be private helper functions in IntStore
.
I actually tried inlining most of them, but it felt slightly awkward. We end up wanting both Add...
and Lookup...
code paths in the store I think, at least for generality. And these helpers are useful to extract and make common between those.
I actually added another Lookup
to simplify one of the places where we unnecessarily were forming an APInt. Currently there aren't a lot of Lookup
calls, but it seems like an important API from a library design perspective so I didn't want to fully remove them.
That said, happy to revisit or discuss if there is a cleaner way to structure this... not super confident in the exact result I ended up with.
Co-authored-by: Jon Ross-Perkins <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed comments, I think I've gotten to them all, but let me know if I missed anything!
// This will always be a signed `APInt` with a canonical bit width for the | ||
// specific integer value in question. | ||
auto Get(IntId id) const -> llvm::APInt { | ||
if (id.is_value()) [[likely]] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that we have standard attributes now. Happy to either switch to LLVM ones until we can move the rest of the code, or move the rest of the code in a follow-up.
toolchain/check/handle_literal.cpp
Outdated
@@ -46,7 +46,7 @@ static auto MakeI32Literal(Context& context, Parse::NodeId node_id, | |||
return context.AddInst<SemIR::IntValue>( | |||
node_id, | |||
{.type_id = context.GetBuiltinType(SemIR::BuiltinInstKind::IntType), | |||
.int_id = context.ints().Add(i32_val)}); | |||
.int_id = context.ints().AddUnsigned(i32_val)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code path didn't get updated enough, all of this should have been simplified with this PR to just pass through the ID after verifying that the value fits into an i32
. The extending and creating a new ID all stemmed from when there was implicit bit width in the integer IDs themselves. The new code should be more clear.
That said, I have thought about removing AddUnsigned
and forcing the lexer to form the unsigned APInt, but I'm worried that would add cost due to needing a wider APInt ealier in the process.
Because we want to canonicalize the bit width inside the store, I didn't want clients to do any unnecessary resizing if possible, and the cleanest way I see to do that is to let them directly add an unsigned APInt if that's what they have.
// Exceptions. See /LICENSE for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
|
||
#ifndef CARBON_TOOLCHAIN_BASE_INT_STORE_H_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. I'll do the rename from int_store.h
to int.h
last to preserve review threads as much as I can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh, missed replying to one thread it seems, but found it now and replied below. (The code change was already in, just lost the thread.)
This is just in the compile_benchmark for the lex phase, using the generated source there:
Seems to fluctuate a bit between 2% and 5%. The 1% for the smallest file is because we spend more time in setup/teardown. The % of tokens that are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. I think my comments are pretty small, except for the one "lex file with 2M ints" test suggestion which I'm happy to split out. So feel free to merge when you've had a chance to go through remaining stuff.
@@ -70,6 +77,25 @@ class File : public Printable<File> { | |||
return types().GetAs<PointerType>(pointer_id).pointee_id; | |||
} | |||
|
|||
// Returns integer type information from a type ID. Abstracts away the | |||
// difference between an `IntType` instruction defined type and a builtin | |||
// instruction defined type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// instruction defined type. | |
// instruction defined type. Uses IntId::Invalid for types that have an | |
// invalid width. |
auto LookupLarge(int64_t value) const -> IntId; | ||
auto LookupSignedLarge(llvm::APInt value) const -> IntId; | ||
|
||
CanonicalValueStore<APIntId> values_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CanonicalValueStore<APIntId> values_; | |
// Stores values which don't fit in an IntId. These are always signed. | |
CanonicalValueStore<APIntId> values_; |
private: | ||
friend struct Testing::IntStoreTestPeer; | ||
|
||
struct APIntId : IdBase, Printable<APIntId> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct APIntId : IdBase, Printable<APIntId> { | |
// Used for `values_`; tracked using `IntId`'s index range. | |
struct APIntId : IdBase, Printable<APIntId> { |
return ZeroIndexId - id_; | ||
} | ||
|
||
constexpr auto AsTokenPayload() const -> uint32_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
constexpr auto AsTokenPayload() const -> uint32_t { | |
// Returns the ID formatted as a lex token payload. | |
constexpr auto AsTokenPayload() const -> uint32_t { |
// This will always be a signed `APInt` with a canonical bit width for the | ||
// specific integer value in question. | ||
auto Get(IntId id) const -> llvm::APInt { | ||
if (id.is_value()) [[likely]] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought is we've generally agreed to use C++ attribute forms so that seems the better choice. I don't think it makes sense to switch this code if the rest changes.
// Used to return information about an integer type in `GetIntTypeInfo`. | ||
struct IntTypeInfo { | ||
bool is_signed; | ||
IntId bit_width = IntId::Invalid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntId bit_width = IntId::Invalid; | |
IntId bit_width; |
Looks like this default is unused, suggesting removal.
// Because this is the first index ID, and we encoded indices as successive | ||
// negative numbers counting downwards, we can both use a comparison with | ||
// this ID to distinguish value and index IDs, and to compute the actual index | ||
// from the ID. The computation of an index in fact is just a subtraction: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble reading this due to the commas. What do you think of:
// Because this is the first index ID, and we encoded indices as successive | |
// negative numbers counting downwards, we can both use a comparison with | |
// this ID to distinguish value and index IDs, and to compute the actual index | |
// from the ID. The computation of an index in fact is just a subtraction: | |
// ZeroIndexId is the first index ID, and we encode indices as successive | |
// negative numbers counting downwards. The setup allows us to both use a comparison with | |
// this ID to distinguish value and index IDs, and to compute the actual index | |
// from the ID. | |
// | |
// The computation of an index in fact is just a subtraction: |
// only a few lines of code, but it ends up expensive and a lot of code so we | ||
// move these out-of-line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// only a few lines of code, but it ends up expensive and a lot of code so we | |
// move these out-of-line. | |
// only a few lines of code, but we move these out-of-line because the generated code is big and harms performance for the non-`Large` common case. |
Suggesting a slightly different comment due to discussion.
// Each bit is either `T` for part of the token or `P` as part | ||
// of the available payload that we use for the ID: | ||
// | ||
// clang-format off: visualizing bit positions | ||
// | ||
// 0bTTTT'TTTT'TPPP'PPPP'PPPP'PPPP'PPPP'PPPP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nuanced thing, the clang-format
throws me a little, maybe one of:
// Each bit is either `T` for part of the token or `P` as part | |
// of the available payload that we use for the ID: | |
// | |
// clang-format off: visualizing bit positions | |
// | |
// 0bTTTT'TTTT'TPPP'PPPP'PPPP'PPPP'PPPP'PPPP | |
// clang-format off: visualizing bit positions | |
// | |
// Each bit is either `T` for part of the token or `P` as part | |
// of the available payload that we use for the ID: | |
// | |
// 0bTTTT'TTTT'TPPP'PPPP'PPPP'PPPP'PPPP'PPPP |
// Each bit is either `T` for part of the token or `P` as part | |
// of the available payload that we use for the ID: | |
// | |
// clang-format off: visualizing bit positions | |
// | |
// 0bTTTT'TTTT'TPPP'PPPP'PPPP'PPPP'PPPP'PPPP | |
// Each bit is tagged either `T` for part of the token or `P` as part | |
// of the available payload that we use for the ID. | |
// | |
// clang-format off: visualizing bit positions | |
// | |
// 0bTTTT'TTTT'TPPP'PPPP'PPPP'PPPP'PPPP'PPPP |
The first change here is to canonicalize away bit width when tracking
integers in our shared value store. This lets us have a more definitive
model of "what is the mathematical value". It also frees us to use more
efficient bit widths when available, such as bits inside the ID itself.
For canonicalizing, we try to minimize the width adjustments and
maximize the use of the SSO in APInt, and so we never shrink belowe
64-bits and grow in multiples of the word bit width in the
implementation. We also canonicalize to the signed 2s compliment
representation so we can represent negative numbers in an intuitive way.
The canonicalizing requires getting the bit width out of the type and
adjusting to it within the toolchain when doing any kind of math, and
this PR updates various places to do that, as well as adding some
convenience APIs to assist.
Then we take advantage of the canonical form and embed small integers
into the ID itself rather than allocating storage for them and
referencing them with an index. This is especially helpful for the
pervasive small integers such as the sizes of types, arrays, etc. Those
no longer require indirection at all. Various short-cut APIs to take
advantage of this have also been added.
This PR improves lexing by about 5% when there are lots of
i32
types.