docs: shared in-memory ABI (plain interchange form) #18

Merged: joseph-isaacs merged 2 commits into develop from docs/in-memory-abi on Jun 5, 2026

@gargiulofrancesco

Collaborator

Summary

Defines the shared in-memory format for an OnPair-compressed column so that independent implementations are interoperable: a column produced by one is intelligible to another.

It specifies the buffer layout (dictionary bytes + offsets, code stream, optional row offsets), the invariants a decoder relies on (256-token completeness, offset monotonicity, dictionary read-padding, every code < N), and the language-neutral structures for addressing a column in memory.
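
As a rough illustration of the layout described above, here is a minimal Python sketch of a plain-form column and a decoder over it. The class and field names (`PlainColumn`, `dict_bytes`, `dict_offsets`, `codes`, `row_offsets`) and the helper functions are hypothetical and are not taken from the spec. Treating the row offsets as indices into the code stream is likewise an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlainColumn:
    """Hypothetical in-memory view of a plain-form column (names are illustrative)."""
    dict_bytes: bytes                        # concatenated token bytes
    dict_offsets: List[int]                  # N + 1 entries delimiting the N tokens
    codes: List[int]                         # plain u16 values, each expected to be < N
    row_offsets: Optional[List[int]] = None  # assumed: R + 1 indices into `codes`


def token(col: PlainColumn, code: int) -> bytes:
    """Return the dictionary token addressed by `code`."""
    return col.dict_bytes[col.dict_offsets[code]:col.dict_offsets[code + 1]]


def decode_all(col: PlainColumn) -> bytes:
    """Decode the whole code stream into one byte string."""
    return b"".join(token(col, c) for c in col.codes)


def decode_rows(col: PlainColumn) -> List[bytes]:
    """Decode row by row; requires the optional row offsets."""
    if col.row_offsets is None:
        raise ValueError("column has no row offsets")
    ro = col.row_offsets
    return [
        b"".join(token(col, c) for c in col.codes[ro[r]:ro[r + 1]])
        for r in range(len(ro) - 1)
    ]
```

The sketch performs no validation: it relies on the invariants listed above (monotone offsets, every code < N) already holding for the column it is given.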

Scope: this is the plain interchange form (codes are plain u16). Compressed code encodings (bit-packing, etc.) and on-disk serialization are out of scope and remain each implementation's own concern — this changes only the shared exchange contract, not how anyone serializes or stores data internally.

Replaces the previous docs/binary-format.md.

@CLAassistant


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@gargiulofrancesco added the `documentation` label (Improvements or additions to documentation) on Jun 4, 2026
@codspeed-hq

codspeed-hq Bot commented Jun 4, 2026

Contributor

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 2 improved benchmarks
❌ 3 regressed benchmarks
✅ 27 untouched benchmarks
⏩ 2 skipped benchmarks (see footnote 1)

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| WallTime | `decompress_all[12]` | 753.6 µs | 891.2 µs | -15.44% |
| WallTime | `decompress_all[16]` | 775.1 µs | 888.4 µs | -12.75% |
| WallTime | `decompress_all[("o_comment", 12)]` | 15.2 ms | 17.1 ms | -10.66% |
| WallTime | `train_and_compress[("l_comment", 12)]` | 469.9 ms | 422.6 ms | +11.2% |
| WallTime | `train_and_compress[("o_comment", 12)]` | 392.1 ms | 355.6 ms | +10.25% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing docs/in-memory-abi (1766c09) with develop (cb4ea96)

Footnotes

  1. 2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@gargiulofrancesco removed the `documentation` label (Improvements or additions to documentation) on Jun 4, 2026
Comment thread docs/binary-format.md Outdated
Comment on lines +317 to +318
Before decoding untrusted or deserialized input, a consumer should verify, in
`O(N)` over the dictionary plus `O(M)` over the codes:

Member

You can do this either before or during decompression.

Collaborator Author

I reworked §6 to list what must hold for a conformant column without prescribing when the checks must occur. I also made the list comprehensive (e.g. dictionary token uniqueness, which was missing).
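
To make the conformance conditions discussed in this thread concrete, here is a minimal Python sketch of a checker that runs in `O(N)` over the dictionary plus `O(M)` over the codes. It is not the text of §6. The function name `check_column` and its parameters are hypothetical, "256-token completeness" is interpreted here as "every single-byte value appears as a dictionary token", and requiring the first offset to be 0 is also an assumption. Dictionary read-padding is not checked, because the required padding size is not stated in this conversation.

```python
from typing import List, Sequence


def check_column(dict_bytes: bytes,
                 dict_offsets: Sequence[int],
                 codes: Sequence[int]) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    if len(dict_offsets) < 1:
        return ["dict_offsets is empty"]
    n = len(dict_offsets) - 1  # number of dictionary tokens (N)
    problems: List[str] = []

    # O(N): offsets start at 0 (assumed), never decrease, and stay in bounds.
    if dict_offsets[0] != 0:
        problems.append("first offset is not 0")
    for i in range(n):
        if dict_offsets[i] > dict_offsets[i + 1]:
            problems.append(f"offsets decrease at token {i}")
            break
    if dict_offsets[-1] > len(dict_bytes):
        problems.append("last offset exceeds dictionary length")
    if problems:
        return problems  # the token slicing below needs sane offsets

    # O(N) tokens: uniqueness and single-byte completeness (assumed reading).
    tokens = [dict_bytes[dict_offsets[i]:dict_offsets[i + 1]] for i in range(n)]
    if len(set(tokens)) != n:
        problems.append("duplicate dictionary token")
    single_bytes = {t[0] for t in tokens if len(t) == 1}
    if len(single_bytes) != 256:
        problems.append("missing single-byte tokens")

    # O(M): every code fits in a u16 and addresses an existing token.
    for pos, c in enumerate(codes):
        if not 0 <= c < n or c > 0xFFFF:
            problems.append(f"code {c} at position {pos} out of range")
            break
    return problems
```

Because the function only reports problems, a consumer could call it before decoding or fold the same tests into the decode loop, in line with the point that the spec does not prescribe when the checks occur.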

@joseph-isaacs joseph-isaacs merged commit e3f73f6 into develop Jun 5, 2026
9 of 11 checks passed
@gargiulofrancesco gargiulofrancesco deleted the docs/in-memory-abi branch June 5, 2026 14:29