diff --git a/docs/binary-format.md b/docs/binary-format.md index 7f9fc96..7fbfd2d 100644 --- a/docs/binary-format.md +++ b/docs/binary-format.md @@ -1,135 +1,370 @@ -# OnPair binary format - -The decode-side state of a compressed column is three byte buffers — the -**dictionary bytes**, the **dictionary offsets**, and the **codes** — plus the -scalar `bits` (code width, `9..=16`). This document specifies their binary -layout and the invariants a decoder relies on. - -Decoding is a gather-copy: the code stream is a sequence of dictionary indices; -each index names a token (a short byte string), and the decoded output is the +# OnPair in-memory format + +This document specifies the **common in-memory representation** of an +OnPair-compressed column: the layout of its constituent buffers, the invariants +over them, and the language-neutral structures used to address them. It is the +contract that independent implementations agree on so that a column produced by +one is intelligible to another. + +**Scope.** This is the *plain* interchange form — the representation +implementations share to exchange data. It fixes the buffers, their invariants, +and the in-memory addressing of a column whose codes are plain `u16`. Denser +*compressed* encodings of the codes (bit-packing or anything else) +and the on-disk serialization of a column are **out of scope**: each +implementation **MAY** use them internally, but they are not part of this shared +contract. What crosses between implementations is always the plain form defined +here. + +## Overview + +OnPair compresses a sequence of byte strings by replacing recurring substrings +("tokens") with fixed-width integer **codes** into a **dictionary**. Decoding is +a gather-copy: each code names a dictionary token, and the output is those tokens concatenated in code order. -## Conventions +The representation is defined in two layers: + +1. **Decodable data** (§3) — a dictionary plus a code stream. Decoding it + reconstructs the original bytes as a single **concatenated payload**: the + full content, with no boundaries between the records that produced it. This + layer is self-contained and carries no notion of rows. + +2. **Row partitioning** (§4) — an optional layer that records where each row + begins and ends within the code stream, recovering the **delimited records** + from the concatenated payload. It indexes into the code stream; the code + stream does not depend on it. + +The layering is one-directional: the decodable data is meaningful on its own, +and the row layer is meaningful only in terms of the data it partitions. + +## 1. Conventions + +- Multi-byte **integer** fields (offsets and `u16` codes) are + **little-endian**. Token bytes and dictionary bytes are raw byte strings, to + which byte order does not apply. Supported hosts are little-endian, so this + fixed byte order *is* the host's native representation — fixed-width buffers + can be indexed directly as arrays of their element type with no per-element + byte-swapping (§5). +- `MAX_TOKEN_SIZE = 16` — the maximum token length in bytes, and the fixed read + width a decoder uses per token (§3.1). +- `N` — the number of dictionary tokens, `256 <= N <= 2^16` (§3). +- `M` — the number of codes (= number of tokens emitted on decode). Since tokens + are at least one byte, `M` is at most the decoded byte length. +- `R` — the number of rows in a column's row layer (§4). +- `dict_offsets` — the `N + 1` offsets delimiting tokens within `dict_bytes` + (§3.2); `o_i` denotes its `i`-th entry. +- `row_offsets` — the `R + 1` offsets delimiting rows within the code stream + (§4); `r_k` denotes its `k`-th entry. +- **Element count vs. byte length.** Sequence sizes are stated as *logical + element counts*, independent of how the elements are encoded; raw byte buffers + are stated as *byte lengths*. The distinction is maintained deliberately + throughout. +- Normative requirements use **MUST**, **MUST NOT**, and **MAY**. + +## 2. Offsets + +Token and row boundaries are sequences of monotonically non-decreasing unsigned +integers, stored as **contiguous fixed-width little-endian integers** and indexed +directly. The width is fixed per field — `u32` for dictionary offsets (§3.2), +`u64` for row offsets (§4) — chosen generously rather than minimally, since this +is an interchange form (Scope, above) where a few bytes per offset do not matter +and a single fixed width keeps every reader trivial. + +Implementations that want a more compact or randomly-addressable boundary +representation use it *outside* this format, alongside the decodable data (§3), +not as an alternative encoding within it. + +## 3. Decodable data + +The self-contained unit that reconstructs the decompressed output. It comprises +a dictionary (§3.1, §3.2) and a code stream (§3.3). + +**Completeness.** A dictionary **MUST** contain all 256 single-byte tokens — one +length-1 token for every byte value `0x00`–`0xFF` — at arbitrary indices. This +guarantees that *any* byte string is encodable (a byte with no longer matching +token falls back to its single-byte token), and it underpins compressed-domain +search: a query string can always be encoded into codes and matched against the +stream (e.g. equality by comparing code sequences) because encoding can never +fail. Consequently `N >= 256` for every dictionary; there is no empty +dictionary. + +**Uniqueness.** No two dictionary tokens are equal: each byte string appears at +most once. This keeps the encoding of any input unambiguous — essential for +compressed-domain operations such as equality search, where distinct code +sequences must denote distinct byte strings. + +**Dictionary ordering.** A dictionary **MAY** have its tokens arranged in +ascending bytewise-lexicographic order, advertised by the `is_sorted` flag (§5). +Because tokens are unique, this order is strict: token `i` `<` token `i+1` under +unsigned byte comparison. When the flag is set, the ordering invariant **MUST** +hold; a producer **MUST NOT** set it otherwise. Ordering is a property of the +token arrangement only — it does not change the buffer layout or how the column +is decoded — and it enables order-dependent operations such as binary search, +prefix-range queries, and compressed pattern matching. + +### 3.1 Dictionary bytes + +The `N` token byte strings, concatenated in index order with no separators, +followed by trailing **read-padding**: -- All multi-byte integers are **little-endian**. -- `MAX_TOKEN_SIZE = 16` — the maximum token length in bytes and the decoder's - fixed read width. -- `N` — the number of dictionary tokens. `N <= 2^bits`. -- `M` — the number of codes (decoded values = number of tokens emitted). +``` +dict_bytes: token_0 ‖ token_1 ‖ … ‖ token_{N-1} ‖ + └────────── L logical bytes ─────────┘ + ↑ + o_N (first padding byte) +``` -## 1. Dictionary offsets — `dict_offsets` +`L` is the sum of all token lengths — the extent of the real token content, +ending exactly where padding begins. It equals the final dictionary offset `o_N` +(§3.2), which therefore points at the first padding byte (or one past the buffer +if there is no padding). -`N + 1` little-endian **`u32`** values delimiting the tokens within -`dict_bytes`: +**Read-padding (normative).** A decoder reads a fixed `MAX_TOKEN_SIZE` bytes +beginning at each token's offset and consumes only the token's true length — a +branchless "wide read, advance by length" pattern that avoids a per-token +branch. The buffer therefore **MUST** remain readable for `MAX_TOKEN_SIZE` bytes +past the *highest* token offset (that of the last token): ``` -dict_offsets: [ o0=0, o1, o2, …, oN ] (N+1 × u32 LE) - token i = dict_bytes[ o_i .. o_{i+1} ) +readable length of dict_bytes >= offset(last token) + MAX_TOKEN_SIZE ``` -Invariants: - -- `o0 == 0`. -- Strictly increasing: `o_i < o_{i+1}` for all `i` (every token is non-empty). -- Each token length `o_{i+1} - o_i` is in `1 ..= MAX_TOKEN_SIZE`. -- `oN` is the **logical length** of the dictionary (sum of all token lengths). +The slack past the logical end `L` is `MAX_TOKEN_SIZE - length(last token)`: +zero when the last token is a full `MAX_TOKEN_SIZE` bytes, up to +`MAX_TOKEN_SIZE - 1` when it is a single byte. A producer **MAY** emit exactly +this minimum or over-allocate to the worst case; both conform. Padding byte +**values are unspecified**; a decoder never consumes them as token data. -`N = dict_offsets.len() - 1`. An empty dictionary is `[0]` (`N == 0`). +This requirement is a property of token decoding and is independent of how the +code stream (§3.3) is encoded. -## 2. Dictionary bytes — `dict_bytes` +### 3.2 Dictionary offsets -The `N` token byte strings concatenated in index order, with no separators, -followed by **trailing decoder padding**: +`N + 1` little-endian **`u32`** offsets (§2) delimiting the tokens within +`dict_bytes`. Each adjacent pair `(o_i, o_{i+1})` brackets token `i`: ``` -dict_bytes: [ token0 ‖ token1 ‖ … ‖ token_{N-1} ‖ ] - └─────────── oN logical bytes ───────────┘ -``` +token_i = dict_bytes[ o_i .. o_{i+1} ) -The decoder reads a fixed `MAX_TOKEN_SIZE` bytes starting at every token offset -(then slices to the token's true length — the branchless "fat read, advance by -`len`" pattern), so the buffer must stay readable `MAX_TOKEN_SIZE` past the -**highest** offset, which is the last token's: +example: dict_bytes: c a t d o g f i s h + │ │ │ │ + offset: 0 3 6 10 -``` -dict_bytes.len() >= o_{N-1} + MAX_TOKEN_SIZE + dict_offsets = [ 0, 3, 6, 10 ] (N = 3, o_N = 10 = L) + token_0 = dict_bytes[0 .. 3) = "cat" + token_1 = dict_bytes[3 .. 6) = "dog" + token_2 = dict_bytes[6 .. 10) = "fish" ``` -Equivalently, the slack past the logical end `oN` is **variable**: +(The example is schematic; a conformant dictionary has `N >= 256` by +completeness, §3.) -``` -padding = MAX_TOKEN_SIZE - len(last token) // 0 .. MAX_TOKEN_SIZE-1 -``` +Invariants: -— `0` when the last token is a full `MAX_TOKEN_SIZE` bytes, up to -`MAX_TOKEN_SIZE - 1` (15) when it is a single byte. (For a width-`W` token cap -the range is `0 ..= W-1`.) Padding byte values are unspecified; the decoder -never uses them as token data. A writer may emit exactly this minimum or -over-allocate to the `MAX_TOKEN_SIZE - 1` worst case (what `compress` does); -both are valid. +- `o_0 == 0`. +- **Strictly increasing**: `o_i < o_{i+1}` for all `i` — every token is + non-empty. +- Each token length `o_{i+1} - o_i` lies in `1 ..= MAX_TOKEN_SIZE`. +- `o_N == L`, the logical byte length of the dictionary. +- Element count is `N + 1`, with `N >= 256` (completeness, §3). -## 3. Codes +**Width.** Dictionary offsets are `u32`. The largest offset `o_N <= N · +MAX_TOKEN_SIZE` and `N <= 2^16`, so `o_N` always fits in 32 bits; `u32` covers +every conformant dictionary. -`M` values, each a dictionary index in `[0, N)`. Every code fits in `bits` bits -because `N <= 2^bits`. +### 3.3 Code stream -**Packed form**: codes are bit-packed **LSB-first at `bits` bits each** into a -stream of little-endian `u64` words. A code whose bits straddle a 64-bit word -boundary is split — low bits in the current word, remaining high bits open the -next. +`M` codes, each a dictionary index in `[0, N)`, stored as `M` contiguous +little-endian `u16` — one per code. The code width is fixed: every code is a +`u16`, regardless of dictionary size. No tag, no alternative encoding, and no +trailing slack (`u16` elements are independently addressable and incur no +over-read). -``` -word k holds codes packed LSB-first: bit position of code j = j * bits -packed size = ceil(M * bits / 8) bytes -``` +This is the plain interchange form (Scope, above): the representation all +implementations share for exchange. An implementation **MAY** use a denser +internal encoding of the codes — bit-packing or anything else — +within its own code, but such encodings are outside this specification and never +cross between implementations; the shared form is always plain `u16`. -A word-at-a-time unpacker over-reads up to 7 bytes when it extracts the last -code (its `u64` load starts near the end of the buffer). This is a reader -concern, not a format one: the unpacker reads the bulk fast with full-width word -loads and switches to an exact, masked tail for the final few codes whose load -would cross `ceil(M * bits / 8)` — the same fast-region-plus-exact-tail split -the decode loop uses (§Decode, `fat::decode_loop`). No padding word is emitted. +Invariants: -Reference onpair appends one zero `u64` after the packed codes (its unpacker -over-reads unconditionally). OnPair neither writes nor requires it; a reader -ingesting such a buffer ignores any bytes past `ceil(M * bits / 8)`. +- Every code is `< N`. +- Element count is `M`. -**In-memory form** (this crate): `codes` is materialized as a `[u16]`, one -element per code, unpacked — the `bits`-wide packing is applied only when -serializing. The two are equivalent under `bits`. +### 3.4 Decode -Invariant: every code `< N` (indexes a real token). +``` +output = [] +for c in code stream (in order): + output ‖= dict_bytes[ dict_offsets[c] .. dict_offsets[c+1] ) +``` -## Related: row offsets +The fast path reads a fixed `MAX_TOKEN_SIZE` bytes from `dict_offsets[c]` and +advances the output cursor by the token's true length; the read-padding (§3.1) +keeps that read in bounds. -A column also carries `code_offsets`, `R + 1` offsets into the code stream that -delimit the `R` input rows (row `r`'s codes are `codes[ o_r .. o_{r+1} )`). -Their integer width matches the input offset type (`u32`/`u64`). The compressor -emits them because a token may span a row boundary, so the row structure cannot -otherwise be recovered. They index the codes and are not part of the code/dict -encoding. +The loop above is the **bulk decode** of the whole concatenated payload, using +only the dictionary and the code stream. Because each code indexes the +dictionary independently, decoding generalizes to any contiguous range of the +code stream: a consumer can decode a single row `k` by running the same loop over +`code_stream[r_k .. r_{k+1})` (§4), producing exactly that row's bytes without +reading codes outside the range or emitting any adjacent row's output. The row +offsets thus give random access to individual records at no extra decoding cost. -## Decode +## 4. Row partitioning + +The **row layer** records where each row begins and ends within the code +stream: `R + 1` little-endian **`u64`** offsets (§2) **into the code stream** +(code positions, not byte positions) that delimit `R` rows. Each adjacent pair +brackets one row: ``` -out = [] -for c in codes: # in order - out.extend( dict_bytes[ dict_offsets[c] .. dict_offsets[c+1] ) ) -``` +index: 0 1 2 3 … R +row_offsets: [ r_0 r_1 r_2 r_3 … r_R ] -The fast path reads a fixed `MAX_TOKEN_SIZE` bytes from `dict_offsets[c]` and -advances `out` by the true token length; the dictionary padding (§2) makes that -read in bounds. +row k's codes = code_stream[ r_k .. r_{k+1} ) + +e.g. M = 9 codes, offsets = [0, 4, 4, 9]: + row 0 = codes[0..4) (4 codes) + row 1 = codes[4..4) (empty row) + row 2 = codes[4..9) (5 codes) (R = 3, r_R = 9 = M) +``` -## Validation +The row layer enables random access to individual rows and per-row operations +(§3.4). Because the code stream is a flat concatenation with no in-band record +delimiters, row structure cannot be recovered from the codes alone; the row +layer carries it explicitly. A column with zero rows has the single-element +offset array `[0]` (so `R = 0` and `M = 0`). A consumer that needs no row +structure at all uses the decodable data (§3) on its own, without a row layer. -A decoder fed an untrusted/deserialized column should check, in `O(N)` (plus -`O(M)` for the codes), before decoding: +Invariants: -1. `dict_offsets` strictly increasing, every token length `<= MAX_TOKEN_SIZE`; -2. `dict_bytes.len() >= o_{N-1} + MAX_TOKEN_SIZE` (the padding above); -3. every code `< N`. +- `r_0 == 0` and `r_R == M` (the final offset is the total code count). For the + zero-row array `[0]`, these coincide: `r_0 == r_R == 0 == M`. +- **Non-decreasing**: `r_k <= r_{k+1}`. Unlike dictionary offsets these need not + strictly increase — an empty row is a legal `r_k == r_{k+1}`. +- Element count is `R + 1`. + +**Width.** Row offsets are `u64`, which indexes a code stream of any length +(offsets run up to `r_R == M`). + +## 5. In-memory layout + +The structures below are an in-memory addressing convenience, built locally to +point at buffers the consumer already holds; what crosses between implementations +is the buffer *contents* (§2–§4), not these structures. + +They are presented in C for definiteness. They are **non-owning views**: each +pointer borrows memory owned elsewhere and valid for that owner's lifetime; the +structures allocate nothing and free nothing, and pointers are read-only. + +```c +#include + +// The code stream (§3.3): M plain little-endian u16 codes. +typedef struct OnPairCodes { + const uint16_t* data; // M codes, each a dictionary index < N + uint64_t count; // M +} OnPairCodes; + +// The dictionary (§3.1, §3.2): token bytes plus their offsets. +typedef struct OnPairDictionary { + const uint8_t* dict_bytes; // §3.1; read-padded + uint64_t dict_bytes_len; // readable byte extent (bounds-check target) + const uint32_t* dict_offsets; // §3.2; N+1 u32 offsets into dict_bytes + uint64_t dict_offsets_len;// N + 1 + uint8_t is_sorted; // §3; 1 if tokens are bytewise-lexicographic, else 0 + uint8_t _reserved[7]; // MUST be zero +} OnPairDictionary; + +// Decodable data (§3): everything needed to reconstruct the output. +typedef struct OnPairData { + OnPairDictionary dict; // §3.1, §3.2 + OnPairCodes codes; // §3.3 +} OnPairData; + +// The row layer (§4): R+1 u64 offsets into the code stream. +typedef struct OnPairRowOffsets { + const uint64_t* data; // §4; R+1 offsets (the empty column is [0]) + uint64_t count; // R + 1 (>= 1) +} OnPairRowOffsets; + +// A row-partitioned column (§4): decodable data plus a row layer. +// A consumer that does not need row structure uses OnPairData directly. +typedef struct OnPairColumn { + OnPairData data; // §3 + OnPairRowOffsets rows; // §4; always present (R >= 0) +} OnPairColumn; +``` -In this crate these are `Parts::validate_dictionary` (1–2, dictionary only) and -`Parts::validate` (1–3). The safe decoders (`decompress`, `decompress_into`) -enforce 1–2 once up front and check 3 per code in the decode loop. +`OnPairData` yields the decompressed bytes as a single concatenated payload: it +recovers the token content but not the boundaries between the original strings, +so it supports bulk decompression but not reconstruction of individual records. +A consumer that needs only the payload — or that tracks record boundaries +separately — uses `OnPairData` directly. `OnPairColumn` pairs that data with a +row layer that delimits the records, enabling per-string reconstruction and +random access (§3.4). Its `rows` is always a valid `R + 1` offset array; a column +with zero rows has `rows = [0]` (`count == 1`, `M == 0`). + +### Layout rules (normative) + +- **Reserved bytes** (in `OnPairDictionary`) pad the `is_sorted` byte so the + surrounding pointer and 64-bit fields fall on their natural 8-byte boundaries + (on a 64-bit host), making the in-memory layout explicit rather than left to + implicit compiler padding. They **MUST** be written zero and **MAY** be + validated as zero; a future revision may assign them meaning. +- **Pointee alignment.** Each pointer **MUST** be aligned to its element width: + 2 bytes for the `u16` code stream, 4 for the `u32` dictionary offsets, 8 for + the `u64` row offsets. `dict_bytes` has no alignment requirement. +- **Endianness.** Integers within the addressed buffers are little-endian (§1). +- **`dict_bytes_len`** is the readable length including read-padding, and **MUST** + satisfy the §3.1 bound. (The logical length `L` is recoverable as the last + dictionary offset; `dict_bytes_len` records the padded extent the §3.1 + guarantee refers to.) +- **`is_sorted`** is `0` or `1` (§3). When `1`, the dictionary-ordering invariant + **MUST** hold. It is a property assertion only; it does not affect any layout. +- **Lifetime.** All pointers borrow memory owned elsewhere. A consumer **MUST + NOT** retain a view beyond the owner's lifetime, nor free through it. + +### Accessing a buffer + +Every buffer has a fixed, known element type (`u16` codes, `u32` dictionary +offsets, `u64` row offsets), so a consumer reads each through a typed pointer +with no per-buffer or per-element dispatch. Because supported hosts are +little-endian (§1), the buffers' fixed little-endian contents are already in +native form, so all of them are directly indexable as arrays of their element +type with no byte-swapping. The inner loop is monomorphic. + +## 6. Validation + +A column is conformant if and only if all of the following hold. + +**Dictionary** (§3): + +- `dict_offsets_len == N + 1`, with `N >= 256`. +- `o_0 == 0`. +- Strictly increasing: `o_i < o_{i+1}` for all `i`. +- Every token length `o_{i+1} - o_i` is in `1 ..= MAX_TOKEN_SIZE`. +- All 256 single-byte tokens are present (completeness, §3). +- No two tokens are equal (uniqueness, §3). +- `dict_bytes_len >= o_N + MAX_TOKEN_SIZE` (the read-padding bound, §3.1). +- `is_sorted` is `0` or `1`; if `1`, tokens are strictly increasing in + bytewise-lexicographic order. + +**Code stream** (§3.3): + +- Every code is `< N`. + +**Row layer** (§4), for an `OnPairColumn`: + +- `rows.count == R + 1` (so `rows.count >= 1`). +- `r_0 == 0` and `r_R == M`. +- Non-decreasing: `r_k <= r_{k+1}` for all `k`. + +**Other** (§5): + +- All reserved bytes are zero. + +A consumer of untrusted or deserialized input should check the invariants for +the parts it uses — the ordering check, for instance, matters only to consumers +that rely on `is_sorted` for order-dependent operations. \ No newline at end of file