Skip to content

Compressed Integer

Ben edited this page May 19, 2019 · 5 revisions

The compressed integer is a way to store a positive-only number in a variable byte length. It is almost identical to LEB128.

Format definition

It is the same as LEB128, but with the first bit swapped. zchunk reinvented LEB128 by mistake.

How to read an compressed int:

  1. If the first (most significat) bit is 1, this is the last byte to read.
  2. Take the 7 least significt bits and add them to the nth byte of your array.
  3. Repeat

Conversion specifics (anomaly)

The int or long value from java is treated as unsigned. That means, that the adopted value may differ from the value which java prints via toString() conversion.

Example:

Type Value (Java) Java uncompressed Intepreted as CompressedInt
int 0 0x00000000 0 0x80
int 2_147_483_647 0x7FFFFFFF 2_147_483_647 0xFFFF_FEFE
int -1 0xFFFFFFFF 4_294_967_295 0x7F_7F7F_7F8F
long 4_294_967_295 0x00000000_FFFFFFFF 4_294_967_295 0x7F_7F7F_7F8F
long -1 0xFFFFFFFF_FFFFFFFF 18_446_744_073_709_551_615 0x7F7F_7F7F7F7F_7F7F7F81

Implementation details

  • The java value object "CompressedInt" will only know the compressed byte array after construction.
  • Each method is lazy. That means, that the corresponding value will be calculated only once.
  • If all methods on a compressed int were called, it would consume the space of byte[10] (maximum) + BigInteger + long + long + int.

Examples

0

Can be written as: 0x80 == 0b10000000.

1

0x81 == 0b10000001

394

In normal notation: 394 == 0b00000000_00000000_00000000_00000000_00000000_00000000_00000001_10001010. This exceeds one byte obviously. That's how a "normal" integer would be stored. For this example, the leading all-zero-bytes are omitted, as they won't make a difference in this algorithm.

Input: 0b00000001_10001010.

To compress:

  • Take the rightmost (least significant) 7 bytes, which are 0b00001010.
  • The most significant byte is a zero by default.
  • Unsign shift the byte notation by 7, which makes it 0b00000000_00000011
  • Repeat. This means the second byte will be 0b00000011.
  • Right shift by 7. Now we get 0b00000000_00000000. That's 0, so we stop here.
  • Our result is now an array of two bytes: 0b00001010, 0b00000011. Please note that the bytes also "swapped places".
  • The last byte will have the first (most significant) bit of the last byte set to 1. As a final result we get: 0b00001010_10000011.

To decompress:

  • Create a result int, which starts at 0b00000000.
  • Create a byte index counter, which starts at 0.
  • Read a byte. If the first bit is 1, this is the last step.
  • Read the least 7 significant bits, which are: 0b00001010.
  • Shift the read byte by 7 * byte index (which is 7 * 0 = 7).
  • Read the next byte and increase the byte index by 7.
  • Read the next byte, which is 0b10000011. As this starts with 1, this will be the last byte to read.
  • Read the 7 least significant bits, which are: 0b00000011.
  • Shift the read byte by 7 * byte index (which is 7 * 1 = 7). This will give you 0b0000001_1000000.
  • Add them to the result in by using "logical or".

The last operation is:

  0b00000000_00001010 (from step 4) 
| 0b00000001_10000000 (from step 8)
 ===========================
  0b00000001_10001010
Clone this wiki locally