diff --git a/text/0000-generic-int.md b/text/0000-generic-int.md
new file mode 100644
index 00000000000..3239e3fd04e
--- /dev/null
+++ b/text/0000-generic-int.md
@@ -0,0 +1,665 @@
- Feature Name: generic_int
- Start Date: 2018-05-15
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary
[summary]: #summary

Adds the builtin types `uint<N>` and `int<N>`, allowing integers with an
arbitrary size in bits. For now, restricts N ≤ 128.

# Motivation
[motivation]: #motivation

## Bitfields

This feature will ultimately allow bitfields, which allow multiple fields in a
struct to be defined as a range of bits, rather than bytes. Here's an example
in C:

```C
struct MipsInstruction {
    int opcode: 6;
    int rs: 5;
    int rt: 5;
    int rd: 5;
    int shift: 5;
    int function: 6;
};
```

In this format, all the data we need (in this case, one particular kind of MIPS
instruction) is stored within 32 bits, but we don't have any particular way to
map these bits to fields. In today's Rust, if we wanted to represent this
struct, we'd have to write methods that extract the data for each field using
bit shifts and masking. Here's an example of what this looks like for the `rs`
field:

```rust
fn get_rs(inst: u32) -> u32 {
    (inst >> 21) & !(!0 << 5)
}
fn set_rs(inst: u32, rs: u32) -> u32 {
    inst & !(!(!0 << 5) << 21) | ((rs & !(!0 << 5)) << 21)
}
```

As you can see, getting the shift (`21`) and mask (`!(!0 << 5)`) right is not
as obvious as you'd think. The shift is actually the sum of the widths of all
of the fields after `rs` (`5 + 5 + 5 + 6 == 21`), and the mask is actually
`0b11111`, where the number of ones corresponds to the width of the field. It's
very easy for a human to mess these up, whereas in this case, C does all of the
work for you.

This RFC doesn't actually solve these problems, but it solves a subtler problem
that C doesn't even try to solve: what is the type of the `rs` field itself?
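To make the fragility concrete, here is a runnable round-trip check of such accessors, with the shift and mask pulled out as named constants (`RS_SHIFT` and `RS_MASK` are names invented here for readability, not part of any proposal):

```rust
const RS_SHIFT: u32 = 21;
const RS_MASK: u32 = !(!0 << 5); // == 0b11111

fn get_rs(inst: u32) -> u32 {
    (inst >> RS_SHIFT) & RS_MASK
}

fn set_rs(inst: u32, rs: u32) -> u32 {
    // Clear the old field first, then OR in the (masked) new value.
    (inst & !(RS_MASK << RS_SHIFT)) | ((rs & RS_MASK) << RS_SHIFT)
}

fn main() {
    // Round trip: what we set is what we get back.
    assert_eq!(get_rs(set_rs(0, 0b10110)), 0b10110);
    // Bits outside the field are left untouched.
    assert_eq!(set_rs(0xFFFF_FFFF, 0), !(RS_MASK << RS_SHIFT));
    // Out-of-range input is silently truncated to 5 bits.
    assert_eq!(get_rs(set_rs(0, 0b11_0110)), 0b1_0110);
}
```

Note the easy-to-miss outer `!` that clears the old field in `set_rs`; forgetting it silently corrupts every other field, which is exactly the kind of mistake this section is about.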
If we were to write out this hypothetical struct in Rust, we'd want something
like:

```rust
#[repr(bitpacked)]
struct MipsInstruction {
    opcode: u6,
    rs: u5,
    rt: u5,
    rd: u5,
    shift: u5,
    function: u6,
}
```

Unfortunately, `u5` and `u6` aren't valid types in Rust, and C's way of writing
`u8: 5` sounds even less desirable. Luckily, with const generics, we have a
better solution: `uint<N>`:

```rust
#[repr(bitpacked)]
struct MipsInstruction {
    opcode: uint<6>,
    rs: uint<5>,
    rt: uint<5>,
    rd: uint<5>,
    shift: uint<5>,
    function: uint<6>,
}
```

Additionally, methods can take or return `uint<5>` and `uint<6>` to statically
guarantee that a value does not have more bits than necessary. If a method to
set `opcode`, for example, took a `u8`, it's not clear whether the method
should simply ignore the upper two bits or panic if they're non-zero.

As stated before, `#[repr(bitpacked)]` is *not* proposed by this RFC and is not
necessarily the way this will work. However, `uint<N>` is what this RFC
proposes.

## Generalising code for integers

Right now, there's a *lot* of boilerplate involved in implementing methods for
the integer primitives. The standard library itself is a great example; almost
the entirety of `core::num` uses some gnarly macros to define all sorts of
traits for all the integer types. One example is `Shl` and `Shr`, which are
defined not just for every integer type, but for every *combination* of integer
types. We could easily do this with const generics instead:

```rust
impl Shl<uint<M>> for uint<N> {
    type Output = uint<N>;
    #[inline]
    fn shl(self, rhs: uint<M>) -> uint<N> {
        // implementation
    }
}
```

This will probably decrease compile time as well, as the compiler only needs to
understand one generic impl instead of 144 macro-generated impls. Because `Shl`
and `Shr` are also implemented for references, the generated impl count is
multiplied by four per trait, meaning that our macros generate 576 impls in
total!
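Although the built-in `uint<N>` doesn't exist yet, the shape of such a generic impl can be sketched today on stable Rust with a user-defined wrapper. `UInt` below is a hypothetical stand-in backed by `u128`, not the proposed built-in:

```rust
use std::ops::Shl;

/// Hypothetical stand-in for the proposed built-in `uint<N>`,
/// backed by a `u128` (hence the N <= 128 restriction).
#[derive(Clone, Copy, Debug, PartialEq)]
struct UInt<const N: usize>(u128);

// One impl covers every (N, M) combination at once.
impl<const N: usize, const M: usize> Shl<UInt<M>> for UInt<N> {
    type Output = UInt<N>;
    fn shl(self, rhs: UInt<M>) -> UInt<N> {
        // Truncate back down to N bits, since the backing store is wider.
        let mask = if N == 128 { u128::MAX } else { (1u128 << N) - 1 };
        UInt((self.0 << rhs.0) & mask)
    }
}

fn main() {
    let x: UInt<5> = UInt(0b10011);
    let by: UInt<3> = UInt(2);
    assert_eq!(x << by, UInt(0b01100)); // shifted left, truncated to 5 bits
}
```

The point of the sketch is the shape of the single generic impl, not the wrapper itself: with a language-level `uint<N>`, the mask-and-truncate bookkeeping would be the compiler's job.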
## Enum optimisations

Using a smaller number of bits for a value also has the benefit of niche enum
optimisations. For example, `uint<7>` represents a single ASCII character, and
`Option<uint<7>>` can be stored in a single byte. Additionally,
`Result<uint<7>, E>` also takes one byte if `E` is zero-sized.

## Documentation

Having generic impls would drastically reduce the noise in the
"implementations" section of rustdoc. For example, the number of
implementations of `Add` for integer types really drowns out the fact that it's
also implemented for strings and `std::time` types, which is useful to know
too.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Up until now, we've only seen *specific* integer types, like `u8` or `i32`.
What if we want to write a trait that works with all of them?

Let's go with a simple example: say we want to write our own version of the
`Default` trait which returns one instead of zero. We'll define it like this:

```rust
pub trait MyDefault: Sized {
    fn my_default() -> Self;
}
```

Right now, with what we know, we have to copy-paste our impl for every integer
type. It would look something like:

```rust
impl MyDefault for u8 {
    fn my_default() -> u8 { 1 }
}
```

Except, we'd have to replicate this for every single integer type. If we're
clever, we could use a macro:

```rust
macro_rules! impl_my_default {
    ($($int: ident),*) => {$(
        impl MyDefault for $int {
            fn my_default() -> $int { 1 }
        }
    )*}
}
impl_my_default!(i8, u8, i16, u16, i32, u32, i64, u64, i128, u128, isize, usize);
```

But really, that's just making the compiler do the copy-pasting for us.
Instead, we're going to use the special types `uint<N>` and `int<N>` to
generalise the code for us. For this, we're going to use const generics instead
of type generics.
The end result looks like:

```rust
impl<const N: usize> MyDefault for uint<N> {
    fn my_default() -> uint<N> { 1 }
}
impl<const N: usize> MyDefault for int<N> {
    fn my_default() -> int<N> { 1 }
}
impl MyDefault for usize {
    fn my_default() -> usize { 1 }
}
impl MyDefault for isize {
    fn my_default() -> isize { 1 }
}
```

That's a lot better! As you'll notice, we still have to separately implement
code for `usize` and `isize`; that's because they're separate types altogether.
For example, on 64-bit systems, `usize` is technically just a `u64`, but
`uint<64>` refers to `u64`, not `usize`. It's unfortunate, but we've reduced
our number of impls from twelve to four, which is still a very big deal!

Note that there's one slight caveat here: our `my_default` method might
overflow. This seems silly, but there's one type, `int<1>`, which can't hold
the value 1, and because of this, the compiler will emit an
`overflowing_literals` lint. In general, if you're casting a literal *to* a
generic integer, you can't expect any value other than zero to work. In the
future, we'll be able to annotate our `int<N>` impl with something like
`where N > 1`, but until then, we'll have to deal with this weird overflow
behaviour.

## Other benefits

One other side effect of having `uint<N>` and `int<N>` is that we can represent
a lot more types than before. For example, `uint<7>` is just a seven-bit
integer, which we might use to represent an ASCII character. Don't be fooled by
these, though: we can't actually make `N` bigger than 128 (for now), and
`uint<N>` is almost always stored as the integer whose size is the next power
of two. `uint<7>` is stored as `u8`, `uint<10>` is stored as `u16`, `uint<48>`
is stored as `u64`, etc. If you store a value with a nonstandard number of
bits, you're not actually saving any space, so keep that in mind!

That said, there are a few cases where we might want to use one of these
weirdly sized types, as many data formats store values in terms of bits instead
of bytes to save space.
Eventually, we'll have a way of representing these, but not today.

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

## Primitive behaviour

The compiler will gain the built-in integer types `uint<N>` and `int<N>`, where
`const N: usize`. These will alias to the existing `uN` and `iN` types when `N`
is a power of two no greater than 128. `usize`, `isize`, and `bool` remain
separate types because of coherence issues, i.e. they can have separate
implementations and additional restrictions applied.

`int<N>` and `uint<N>` will have the smallest power-of-two size and alignment
that fits `N` bits. For example, this means that `uint<48>` will take up 8
bytes and have an alignment of 8, even though it only has 6 bytes of data.

`int<N>` stores values between -2<sup>N-1</sup> and 2<sup>N-1</sup> - 1, and
`uint<N>` stores values between 0 and 2<sup>N</sup> - 1. It's valid to cast
`int<N>` or `uint<N>` to `int<M>` or `uint<M>` via `as`, and sign extension
will occur as expected. This conversion is lossless when converting `int<N>` to
`int<M>` or `uint<N>` to `uint<M>` where `M >= N`, or when converting `uint<N>`
to `int<M>` where `M > N`.

In addition to the usual casts, `uint<1>` can also be cast *to* `bool` via
`as`, whereas all other integer types can only be cast *from* `bool`. This does
not apply to `int<1>`, as the value `true` is equivalent to 1, and `int<1>`
does not include the value 1. `int<1>` will still allow casts *from* `bool`,
where `true` will overflow to -1 as expected.

For the moment, a monomorphisation error will occur if `N > 128`, to minimise
implementation burden. This error can be lifted in the future, and if any
future `uN` or `iN` types are added, they should alias with the corresponding
`int<N>` and `uint<N>` types. The main justification for this restriction is to
allow the compiler to continue to represent literals and other values using
`u128` internally, and to allow stabilising this feature *before* larger ints
are supported internally.

`uint<0>` and `int<0>` are allowed as constant zero types.
While this is not inconsistent for `uint<0>`, as zero fits in the range
[0, 2<sup>N</sup> - 1] when `N = 0`, it *is* somewhat inconsistent for
`int<0>`, as the expected range works out to [-½, -½], which doesn't contain
zero at all. It is decided to make both of these a constant zero type but *not*
identical types, as the invariant
`<uint<0>>::default() == <int<0>>::default() == 0` is desired.

Integer literals can be automatically coerced to `uint<N>` and `int<N>`,
although generic `iN` and `uN` suffixes are left to future RFCs. Because of the
restriction that `N <= 128`, compiler code can continue to store literals as
`u128`. When coercing literals to `uint<N>` or `int<N>`, bounds checks should
be done just like they are with regular integer literals.

Primitive operations on `int<N>` and `uint<N>` should work exactly like they do
on the existing integer types: overflows should panic when debug assertions are
enabled, but wrap when they are not. In general, `uint<N>` will be
zero-extended to the next power of two, and `int<N>` will be sign-extended to
the next power of two.

## Standard library

Right now, many methods on the integers are implemented using intrinsics that
only work on the existing integer types, which are natively supported by LLVM;
this will not change. Instead, methods that require these intrinsics will be
implemented using specialisation, i.e. by adding separate `default impl`
versions which ultimately call the intrinsic-based versions.

This way, instructions which work on the native sizes will be emitted for those
sizes, and non-native sizes will "emulate" the method using the native version.
For example, take the `count_zeros` function:

```rust
default impl<const N: usize> uint<N> {
    fn count_zeros(self) -> u32 {
        const M: usize = N.next_power_of_two();
        let zeros = (self as uint<M>).count_zeros();
        zeros - (M - N) as u32
    }
}
// impl u8, etc.
```

Let's look at the specific case of `uint<7>`:

```rust
fn count_zeros(self) -> u32 {
    const M: usize = 7.next_power_of_two(); // == 8
    let zeros = (self as uint<M>).count_zeros();
    zeros - (M - 7) as u32
}
```

Ultimately, this will optimise to:

```rust
fn count_zeros(self) -> u32 {
    (self as u8).count_zeros() - 1
}
```

This way, we can implement all of the expected integer functions without having
to mess with the underlying intrinsics.

## Niche optimisations for enums

### Niche?

Right now, Rust optimises enums for types which have unrepresentable values.
For example, `char` is basically a `u32` under the hood, but values in the
range `0xD800..0xE000` aren't allowed. The compiler can exploit this, allowing
`Option<char>` to be the same size as a `char`; the space of invalid values
(the niche) can be used to store enum discriminants.

As a hard rule, `uint<N>` will always have its upper
`N.next_power_of_two() - N` bits set to zero, and similarly, `int<N>`'s upper
bits will be sign-extended. To compute niche optimisations for enums, we'll
have to take this into account.

### `uint<N>`: the easy one

Niche values for `uint<N>` are simply any patterns with nonzero upper bits,
similarly to `bool`. For example, because `uint<7>` ranges from 0 to 127, the
values from 128 to 255 are free to use; this allows us to represent 128 unit
variants in those extra values. Imagine a weirder version of `Option<uint<7>>`
with 128 different versions of `None`: it would still be the same size as a
`uint<7>`.

### `int<N>`: the hard one

Niche values for `int<N>` are trickier. Because of sign extension, the extra
bits can be either all zeros or all ones. However, the sign is "extended",
which we can exploit: as long as at least one of the upper bits differs from
the sign bit, the pattern is free for a niche. This gives us exactly the same
number of values as we'd otherwise get from `uint<N>`, but we'll have to warp
our brains a bit to understand them.
Below, I'll only draw the upper bits, the sign bit, and the lower bits, so that
we can convince ourselves that this applies to any number of bits. The sign bit
is bit `N - 1`, drawn in the middle, with the bits around it. Let's start
counting from zero:

```
00...000...00 (0)
00...000...01 (1)
00...000...10 (2)
...
00...001...11 (2.pow(N-1) - 1)
```

Note that the next value sets our sign bit to one, but leaves the bits above
the sign at zero. This starts the niche space, where the sign bit isn't the
same as all the bits above it:

```
00...010...00 (2.pow(N-1))
00...010...01 (2.pow(N-1) + 1)
00...010...10 (2.pow(N-1) + 2)
...
00...011...11 (2.pow(N) - 1)
```

Note that now, once we add one, our sign bit will flip back to zero. However,
even as we keep counting, the sign won't be extended all the way to the top.
So, really, our niche extends quite far, until...

```
00...100...00 (2.pow(N))
00...100...01 (2.pow(N) + 1)
00...100...10 (2.pow(N) + 2)
...
11...101...11 (-2.pow(N-1) - 1, read as a signed value)
```

Finally, at this point, adding one flips our sign bit back to one, and the sign
is finally extended all the way. After this, our ordinary values resume,
starting at `-2.pow(N-1)` and ending at `-1`.

As you can see, we have exactly the same number of values to use as a niche,
and they're all in one contiguous range.

## Computations and storage

### Overflow and underflow

Because sign extension will always be applied, it's safe for the compiler to
internally treat `uint<N>` as `uint<N.next_power_of_two()>` when doing
computations. As a concrete example, this means that adding two `uint<48>`
values will work exactly like adding two `u64` values, generating exactly the
same code on targets with 64-bit registers. Right?

Actually, it turns out that things will be slightly more complicated. In debug
mode, the generated code will be very similar; overflow checks will still panic
as usual.
However, in release mode, as we saw in the above section, there will have to be
additional code to make sure that overflows and underflows extend the sign
correctly.

This is actually a good case for adding `add_unchecked` (and similar) methods,
which would require the caller to ensure that overflow does not occur, skipping
the extra code in release builds. Such methods would require their own RFC, but
this does provide a decent reason why they'd be useful.

Unfortunately, these generalisations aren't going to be free in all cases.

### Storage sizes

The restriction that `N <= 128` may allow for future changes where `uint<N>` is
not actually stored as `uint<N.next_power_of_two()>` for sufficiently large
`N`. Right now, even storing `uint<65>` as a `u128` makes the most sense,
because that will be the most efficient representation for performing many
operations on 64-bit CPUs. However, storing `uint<4097>` as a `uint<8192>`, for
example, may be total overkill.

The main goal for these representations is to avoid bit-wrangling as much as
possible. Although truncating the number of bits of an integer is no longer
free (e.g. `u64` to `uint<48>` requires an AND, and `i64` to `int<48>` requires
conditionally executing an AND or an OR), the goal is that in most cases, using
non-power-of-two lengths should not incur too much of a performance penalty.
Additionally, the compiler should be able to optimise the sign extension to
happen at the end of a series of operations, rather than after every operation.

## Documentation

For now, additional `primitive.uint` and `primitive.int` pages will be added to
rustdoc, and the existing `uN` and `iN` pages will be left as-is. Eventually,
if we're comfortable with it, we can remove the `uN` and `iN` pages entirely
and use `primitive.int` and `primitive.uint` as the single source of
documentation for these types, plus the pages for `usize` and `isize`.
There certainly is a precedent for this: as of right now, all of these pages
share the same documentation, with the examples modified to work for each type.
Removing the separate pages would help reduce documentation redundancy,
although `usize` and `isize` would still have to be kept separate.

# Potential future extensions

## `N > 128`

The most obvious extension is to allow arbitrarily sized integer types. One
clear use case is cryptographic code, which often works with
fixed-but-large-length integers and would no longer have to decide whether to
offer its own `u256` or `u4096` types. Additionally, generalisation over
integers could allow code that works with e.g. 1024-, 2048-, and 4096-bit keys
to have its arithmetic monomorphised and specialised for free, without writing
extra code.

Another benefit of this would be allowing arbitrary-length integer literals,
which would be useful for bignum libraries. Instead of parsing a string or byte
array to create a bignum, a bignum library could simply have:

```rust
fn new<const N: usize>(num: uint<N>) -> BigNum {
    // ...
}
```

Internally, the compiler could represent the larger power-of-two `uN` and `iN`
types as a `doubled_uint<N>` and `doubled_int<N>`, which could even be
implemented in the standard library. At that point, only a lang item would be
necessary to allow literal coercion.

Of course, at this point, how this extension would actually work is complete
speculation. It seems an inevitable follow-up to this RFC, although due to the
large number of unsolved problems, it should require one or several more RFCs
to stabilise.

## Bit sizes and `repr(bitpacked)`

In the future, types could be sized in terms of bits instead of bytes, with
`(bit_size_of::<T>() + 7) / 8 == size_of::<T>()`. All types would have a bit
size, allowing for a future `repr(bitpacked)` extension which packs all values
in a struct or enum variant into the smallest number of bytes possible, given
their bit sizes.
Doing so would prevent taking references to the fields of such a struct,
although getting and setting fields by value would still be possible.

For example, here's a modified version of our previous example:

```rust
#[repr(C, bitpacked)]
struct MipsInstruction {
    opcode: uint<6>,
    rs: uint<5>,
    rt: uint<5>,
    immediate: u16,
}
```

We could allow referencing `&self.immediate`, as it is aligned to a byte
boundary, while all of the other fields could only be copied (e.g. `self.rs`)
or set explicitly (e.g. `self.rt = 4`). Because of this restriction, we would
only allow `!Drop` types in bit-packed structs, as `Drop` requires a mutable
reference. We would probably initially want to limit fields to `Copy` types,
but I haven't thought much about why expanding to all `!Drop` types wouldn't be
doable.

## ASCII-specific methods for `uint<7>` and `[uint<7>]`

All of the ASCII-specific methods on `u8` could be copied to `uint<7>`, where
every value is valid ASCII. Longer-term, the potential for using `uint<7>` for
ASCII-specific matching and splitting in `str` could also be discussed.

Additionally, casting from `[uint<7>]` to `str` could be made safe, as all
ASCII strings are valid UTF-8. This could make code which interacts with ASCII
much, much safer, without needing to use `from_utf8_unchecked` or pay the cost
of `from_utf8`.

# Drawbacks
[drawbacks]: #drawbacks

This is a *big* change to the language, not to be taken lightly. Additionally,
many libraries may start using `uint<N>` with non-power-of-two `N` because it
reduces the number of invalid values, even if it's not the best solution. For
example, a library taking a percentage from 0 to 100 might use a `uint<7>`,
even though it really should just take a `u8` or `u32`.

This requires const generics, which have not even been implemented in the
compiler at the time of writing. This would also be the first builtin type
which uses generics at all.
No other language provides this kind of generalisation, and it may be more
effort on the compiler side than it's worth. For example, the existing 576
impls for `Shl` might just be better and more performant than the four new
generic ones.

# Rationale and alternatives
[alternatives]: #alternatives

## Use `u<N>` and `i<N>` instead of `uint<N>` and `int<N>`

This would seem closer to the existing types, but has a potential conflict: `u`
and `i`, two names which could easily show up in a program, may shadow the
builtin types. Considering how often the variable `i` is used for loops, this
seems like a no-go.

## Bound-based generalisation

Generalising integers over their number of bits is certainly a very logical way
to generalise integers *for computers*, but generalising based upon bounds is a
very natural way for humans to do it, *and* it is more general. For example,
instead of `uint<N>` and `int<N>` types, we could get away with just one
range-parameterised type, `int<L..=H>`. This would be more powerful than the
original: for example, a percentage could be represented exactly as
`int<0..=100>`. Whether an integer is signed simply depends on whether its
lower bound is negative.

The primary reason for leaving this out is that, well, it's a lot different
from the existing integer types in the language. Additionally, such types could
be added backwards-compatibly by guaranteeing that `int<0..=2.pow(N)-1>` is
equivalent to `uint<N>`, and `int<-2.pow(N-1)..=2.pow(N-1)-1>` is equivalent to
`int<N>`. As such, these types could be added as a future extension.

Quite simply, there are more questions to answer for range-based types than for
size-based types, and the goal of this RFC is to reap the benefits of generic
integers before such designs are fleshed out.

## Integer traits

Previously, Rust had traits which generalised the integer types and their
methods, but these were ultimately removed in favour of inherent methods.
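For reference, such a trait-based generalisation might look roughly like the following. This is a sketch, not the historical `std::num::Int` API; the `Int` trait and `sum_ones` helper are invented here. Note that generic *callers* get by with one definition, but the impls themselves still need a macro:

```rust
trait Int: Sized + Copy {
    const ZERO: Self;
    const ONE: Self;
    fn add_checked(self, rhs: Self) -> Option<Self>;
}

// The impls still have to be stamped out per type, exactly the
// boilerplate this RFC is trying to eliminate.
macro_rules! impl_int {
    ($($t:ty),*) => {$(
        impl Int for $t {
            const ZERO: Self = 0;
            const ONE: Self = 1;
            fn add_checked(self, rhs: Self) -> Option<Self> {
                self.checked_add(rhs)
            }
        }
    )*}
}
impl_int!(i8, u8, i16, u16, i32, u32, i64, u64, i128, u128, isize, usize);

/// One generic function works for every integer type.
fn sum_ones<T: Int>(n: u32) -> Option<T> {
    (0..n).try_fold(T::ZERO, |acc, _| acc.add_checked(T::ONE))
}

fn main() {
    assert_eq!(sum_ones::<u8>(200), Some(200));
    assert_eq!(sum_ones::<u8>(300), None); // overflows past u8::MAX
}
```

The trait moves the duplication from call sites into the impls, whereas `uint<N>`/`int<N>` would remove it from both.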
Going with a generic `uint<N>` over an `Int` trait avoids the problem of
determining which methods are suited to such traits; instead, the `uint<N>`
type would simply have all of them.

Additionally, integer traits alone do not allow non-power-of-two sizes at all,
and so this feature is a strict superset of that one.

Having separate `uint<N>` and `int<N>` types also solves the problem of
generalising over signed and unsigned integers: everything would require one
impl for signed integers, and one for unsigned. This would ensure that these
types have exactly the same behaviour for all `N`, differing only in the upper
bound on the number of bits.

## Macro-based bitfield structs

Many proposals for bitfields involve using macros to create structs, which
would ultimately leave the underlying mechanisms for bitfield-based structs to
be decided by the ecosystem, rather than the language itself.

This is actually orthogonal to this proposal: with language-level `uint<N>`
types, such macros would be able to return exactly the type necessary for each
field. While something like `repr(bitpacked)` would be very far off in terms of
stabilisation, these macros could simply get and set `uint<N>` values, and
users could work with them without having to worry about checking for overflow
or manually implementing code which has already been written.

So, as far as this RFC is concerned, macro-based bitfields are complementary
rather than competing.

## Offering as a library

Once const generics and specialisation are implemented and stable, almost all
of this could be offered as a crate providing `uint<N>` and `int<N>` types.
There are a couple of downsides to this:

* These library types won't unify with the existing `uN` and `iN` types,
  meaning that you'd still have to use macros to define impls for all integer
  types.
* Until compiler support for enum optimisations is made available to external
  crates, the unused values of these types won't be usable for enum variants.
* If the standard library were to eventually support generic integer literals,
  then these types would have to be built in anyway.
* The standard library won't be able to reap any of the benefits of these
  types listed in the rest of this RFC.

## Going without

In the [Rust 2018 call for community blog posts], a fair number of Rustaceans
expressed a desire for bitfields at the language level, for use in embedded
code. So, while bitfield support of some form seems non-negotiable, whether it
has to come through this RFC is definitely in question.

If this RFC were not accepted, the most suitable alternative would be to
develop a crate in the nursery (in the spirit of the existing
[`bitfield` crate]) for creating bitfield-based structs in a way that offers
robust error messages and reasonable overflow behaviour. However, this would
not solve the other problems this RFC mentions, and is likely not the ideal
path.

[Rust 2018 call for community blog posts]: https://readrust.net/rust-2018/
[`bitfield` crate]: https://crates.io/crates/bitfield

# Prior art
[prior-art]: #prior-art

At the time of writing, no known programming language offers this level of
integer generalisation; if this RFC were accepted, Rust would be the first.

A few other approaches to integer generalisation were explored in the
alternatives section, like integer traits. A number of languages, notably
Haskell and other functional languages, have generalised their integer types
over a single trait or type class.

Additionally, Ada offers range types similar to those mentioned in the
alternatives. I only noticed this near the completion of this RFC and haven't
elaborated much on it.
# Unresolved questions
[unresolved]: #unresolved-questions

This RFC has a number of unresolved questions, many of which are probably
better solved during the development of this feature, rather than in this RFC.
Here are just a few:

* Should `uN` and `iN` suffixes for integer literals, for arbitrary `N`, be
  allowed?
* What should the behaviour of `uint<N>` and `int<N>` be when casting to/from
  floats?
* Should `uint<N>` and `int<N>` be the smallest size possible, or have a
  power-of-two size?
* Should `uint<0>` and `int<0>` be allowed as constant zeroes?
* Should `uint<1>` and `int<1>` allow casting to `bool`?
* Should a sketch of a design for bitfields be proposed before this RFC is
  accepted? Would `repr(bitpacked)` even be feasible?