docs: shared in-memory ABI (plain interchange form) #18
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | WallTime | decompress_all[12] | 753.6 µs | 891.2 µs | -15.44% |
| ❌ | WallTime | decompress_all[16] | 775.1 µs | 888.4 µs | -12.75% |
| ❌ | WallTime | decompress_all[("o_comment", 12)] | 15.2 ms | 17.1 ms | -10.66% |
| ⚡ | WallTime | train_and_compress[("l_comment", 12)] | 469.9 ms | 422.6 ms | +11.2% |
| ⚡ | WallTime | train_and_compress[("o_comment", 12)] | 392.1 ms | 355.6 ms | +10.25% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing docs/in-memory-abi (1766c09) with develop (cb4ea96)
Footnotes
- 2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
Before decoding untrusted or deserialized input, a consumer should verify, in
`O(N)` over the dictionary plus `O(M)` over the codes:
You can do this either before or during decompression.
I reworked §6 to list what must hold for a conformant column without prescribing when the checks must occur. I also made the list comprehensive (e.g. dictionary token uniqueness, which was missing).
Summary
Defines the shared in-memory format for an OnPair-compressed column so that independent implementations are interoperable: a column produced by one is intelligible to another.
It specifies the buffer layout (dictionary bytes + offsets, code stream, optional row offsets), the invariants a decoder relies on (256-token completeness, offset monotonicity, dictionary read-padding, every code < N), and the language-neutral structures for addressing a column in memory.
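As a rough illustration of the invariants listed above, here is a minimal Python sketch of a conformance check. It is not taken from the spec: the `Column` class, the field names, the `PADDING` value, the assumption that the offsets array has `N + 1` entries, and the reading of "256-token completeness" as "every single-byte value appears as a token" are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical read-padding size in bytes; the real value is defined by the spec.
PADDING = 16


@dataclass
class Column:
    """Hypothetical in-memory view of a compressed column (names are illustrative)."""
    dict_bytes: bytes        # concatenated token bytes, followed by read-padding
    dict_offsets: List[int]  # assumed N + 1 entries: token i is dict_bytes[off[i]:off[i + 1]]
    codes: List[int]         # plain u16 codes


def check_column(col: Column) -> Optional[str]:
    """Return None if the invariants hold, else a description of the first violation."""
    offs = col.dict_offsets
    if len(offs) < 2 or offs[0] != 0:
        return "offsets must start at 0 and describe at least one token"
    n = len(offs) - 1
    if n > 1 << 16:
        return "more tokens than a u16 code can address"

    # Offset monotonicity (read here as non-decreasing).
    for i in range(n):
        if offs[i] > offs[i + 1]:
            return f"offsets decrease at index {i}"

    # Dictionary read-padding: bytes must exist past the last token.
    if len(col.dict_bytes) < offs[n] + PADDING:
        return "dictionary is missing read-padding"

    # Token uniqueness, then completeness over all 256 single-byte values.
    seen = set()
    for i in range(n):
        token = col.dict_bytes[offs[i]:offs[i + 1]]
        if token in seen:
            return f"duplicate token at index {i}"
        seen.add(token)
    for b in range(256):
        if bytes([b]) not in seen:
            return f"single-byte token {b} is missing"

    # Every code must address an existing token.
    for pos, code in enumerate(col.codes):
        if not 0 <= code < n:
            return f"code {code} at position {pos} is out of range"
    return None
```

A caller could run `check_column` once before decoding, or fold the same checks into the decode loop; the sketch takes no position on which is preferable.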
Scope: this is the plain interchange form (codes are plain `u16`). Compressed code encodings (bit-packing, etc.) and on-disk serialization are out of scope and remain each implementation's own concern. This changes only the shared exchange contract, not how anyone serializes or stores data internally.

Replaces the previous `docs/binary-format.md`.
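
To make the plain interchange form concrete, the following Python sketch decodes a stream of plain `u16` codes by looking each one up in the dictionary. It assumes an offsets array with `N + 1` entries and input that already satisfies the column invariants. The `decode` function and its parameter names are hypothetical, and the optional row offsets are left out because the summary does not say what they index.

```python
from array import array
from typing import Sequence


def decode(dict_bytes: bytes, dict_offsets: Sequence[int], codes: array) -> bytes:
    """Concatenate the dictionary token addressed by each code.

    Assumes token i occupies dict_bytes[dict_offsets[i]:dict_offsets[i + 1]]
    and that every code is a valid token index (no bounds checks here).
    """
    out = bytearray()
    for code in codes:
        out += dict_bytes[dict_offsets[code]:dict_offsets[code + 1]]
    return bytes(out)


# Toy dictionary: the 256 single-byte tokens, then one two-byte token "ab" (index 256).
dict_bytes = bytes(range(256)) + b"ab"
dict_offsets = list(range(257)) + [258]

# array("H") stores unsigned shorts, which are 16 bits wide on mainstream platforms.
codes = array("H", [256, 99, 256])

print(decode(dict_bytes, dict_offsets, codes))  # b'abcab'
```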