Merged
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -23,6 +23,8 @@ jobs:
run: cargo fmt --all --check
- name: Rust Lint - Clippy
run: cargo clippy --all-features --all-targets
+- name: Rust Lint - Docs
+run: cargo doc --no-deps --all-features --document-private-items
- name: Rust Test
run: cargo test --workspace --all-features

4 changes: 4 additions & 0 deletions Cargo.toml
@@ -19,6 +19,10 @@ name = "onpair"
warnings = "deny"
missing_docs = "deny"

+[lints.rustdoc]
+broken_intra_doc_links = "deny"
+private_intra_doc_links = "deny"
+
[lints.clippy]
all = { level = "deny", priority = -1 }
if_then_some_else_none = { level = "deny" }
10 changes: 7 additions & 3 deletions README.md
@@ -3,10 +3,14 @@
OnPair is a dictionary-based string compression algorithm designed for on-disk and in-memory database workloads that need both strong compression ratios and fast random access to individual values.
It builds its dictionary in a single sequential pass by incrementally merging frequent adjacent substrings, achieving compression comparable to BPE while being substantially faster and more memory-efficient.

-## Format
+## Interchange format

-The binary layout of a compressed column — dictionary bytes, dictionary
-offsets, and codes — is specified in [docs/binary-format.md](docs/binary-format.md).
+OnPair defines a shared in-memory representation — the *plain interchange form*
+that independent implementations exchange so a column produced by one is
+readable by another. It fixes the buffers (dictionary bytes, dictionary
+offsets, codes, and row offsets) and their invariants; denser internal
+encodings and on-disk serialization are out of scope. See
+[docs/interchange-format.md](docs/interchange-format.md).

## References

2 changes: 1 addition & 1 deletion benches/tpch.rs
@@ -19,7 +19,7 @@
//
// Run with: cargo bench --bench tpch
//
-// Targets the slim public API in PUBLIC_API.md
+// Targets the slim public API
// (`compress` / `decompress` free fns + `Column::as_parts()`).

use std::collections::HashMap;
2 changes: 1 addition & 1 deletion benchmarks/onpair-bench/README.md
@@ -24,7 +24,7 @@ The bench is a uv workspace member of the repo-root `pyproject.toml`. Sync
once from the repo root, then drive it with `uv run`:

```bash
-# from /Users/joeisaacs/git/spiraldb/onpair (one-time):
+# from the repo root (one-time):
uv sync

# drop a corpus in:
3 changes: 2 additions & 1 deletion docs/binary-format.md → docs/interchange-format.md
@@ -347,7 +347,8 @@ A column is conformant if and only if all of the following hold.
- Every token length `o_{i+1} - o_i` is in `1 ..= MAX_TOKEN_SIZE`.
- All 256 single-byte tokens are present (completeness, §3).
- No two tokens are equal (uniqueness, §3).
-- `dict_bytes_len >= o_N + MAX_TOKEN_SIZE` (the read-padding bound, §3.1).
+- `dict_bytes_len >= o_{N-1} + MAX_TOKEN_SIZE` (the read-padding bound, §3.1;
+`o_{N-1}` is the offset of the last token).
- `is_sorted` is `0` or `1`; if `1`, tokens are strictly increasing in
bytewise-lexicographic order.
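
The dictionary conditions in this list can be checked mechanically. The following is a minimal sketch under stated assumptions, not the crate's own validator: the function name `check_dictionary`, the `u32` offset type, and the value `MAX_TOKEN_SIZE = 16` are hypothetical choices made for illustration. It uses the revised padding bound `o_{N-1} + MAX_TOKEN_SIZE`, where `o_{N-1}` is the start of the last token.

```rust
use std::collections::HashSet;

/// Assumed for illustration; take the real value from the specification.
const MAX_TOKEN_SIZE: usize = 16;

/// Check the dictionary conditions for `n = offsets.len() - 1` tokens, where
/// token `i` is `dict_bytes[offsets[i]..offsets[i + 1]]`.
fn check_dictionary(dict_bytes: &[u8], offsets: &[u32], is_sorted: u8) -> Result<(), String> {
    if offsets.len() < 2 {
        return Err("dictionary has no tokens".to_string());
    }
    let n = offsets.len() - 1;
    let mut tokens: Vec<&[u8]> = Vec::with_capacity(n);
    let mut seen: HashSet<&[u8]> = HashSet::with_capacity(n);
    let mut has_byte = [false; 256];

    for i in 0..n {
        let (start, end) = (offsets[i] as usize, offsets[i + 1] as usize);
        // Every token length must be in 1 ..= MAX_TOKEN_SIZE.
        if end <= start || end - start > MAX_TOKEN_SIZE {
            return Err(format!("token {i}: invalid length"));
        }
        let token = dict_bytes
            .get(start..end)
            .ok_or_else(|| format!("token {i}: out of bounds"))?;
        // Uniqueness: no two tokens are equal.
        if !seen.insert(token) {
            return Err(format!("token {i}: duplicate"));
        }
        if token.len() == 1 {
            has_byte[token[0] as usize] = true;
        }
        tokens.push(token);
    }

    // Completeness: all 256 single-byte tokens are present.
    if has_byte.iter().any(|&present| !present) {
        return Err("missing a single-byte token".to_string());
    }

    // Read-padding bound, measured from the start of the last token.
    let last_start = offsets[n - 1] as usize;
    if dict_bytes.len() < last_start + MAX_TOKEN_SIZE {
        return Err("dictionary bytes are not padded".to_string());
    }

    // `is_sorted` is 0 or 1; if 1, tokens are strictly increasing bytewise.
    match is_sorted {
        0 => Ok(()),
        1 if tokens.windows(2).all(|w| w[0] < w[1]) => Ok(()),
        1 => Err("is_sorted is set but tokens are not strictly increasing".to_string()),
        _ => Err("is_sorted must be 0 or 1".to_string()),
    }
}
```

Measuring the bound from the last token's start means a full `MAX_TOKEN_SIZE`-byte read at any token offset stays inside `dict_bytes`, which is the property a fixed-width load relies on.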

9 changes: 5 additions & 4 deletions src/column.rs
@@ -28,13 +28,14 @@ pub struct Column<O: Offset> {
pub codes: Vec<u16>,
/// `R + 1` offsets into `codes` delimiting the `R` input rows: row `r`'s
/// codes are `codes[code_offsets[r]..code_offsets[r + 1]]`. The compressor
-/// emits these because a token may span a row boundary, so the row
-/// structure cannot be recovered from the codes alone.
+/// emits these because the codes are a flat concatenation with no in-band
+/// row delimiter, so the row structure cannot be recovered from the codes
+/// alone.
pub code_offsets: Vec<O>,
}
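
To illustrate the `code_offsets` contract described in the doc comment, here is a minimal sketch of slicing one row's codes out of the flat `codes` array. The helper `row_codes` is hypothetical and not part of the crate, and `u32` stands in for the generic `O: Offset` parameter.

```rust
/// Return the codes of row `r`, i.e. `codes[code_offsets[r]..code_offsets[r + 1]]`,
/// or `None` if `r` is out of range or the offsets are inconsistent.
fn row_codes<'a>(codes: &'a [u16], code_offsets: &[u32], r: usize) -> Option<&'a [u16]> {
    let start = *code_offsets.get(r)? as usize;
    let end = *code_offsets.get(r.checked_add(1)?)? as usize;
    // `get` rejects both `start > end` and `end > codes.len()`.
    codes.get(start..end)
}
```

With `R + 1` offsets, an empty row is simply two equal adjacent offsets, which is why a flat code stream with no in-band delimiter can still represent it.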

/// Borrowed view of the data the decoder needs, consumed by
-/// [`crate::decompress`] and [`crate::decompress_into`].
+/// [`fn@crate::decompress`] and [`crate::decompress_into`].
/// Downstream consumers deserializing from storage build this via struct
/// literal — there is no constructor.
#[derive(Copy, Clone, Debug)]
@@ -56,7 +57,7 @@ pub struct Parts<'a> {

impl<O: Offset> Column<O> {
/// Zero-copy view over this column's decode arrays. Pass directly to
-/// [`crate::decompress`] or [`crate::decompress_into`]. `code_offsets` is
+/// [`fn@crate::decompress`] or [`crate::decompress_into`]. `code_offsets` is
/// compressor metadata and is not part of the view.
#[inline]
pub fn as_parts(&self) -> Parts<'_> {
8 changes: 3 additions & 5 deletions src/decompress/fat.rs
@@ -4,11 +4,9 @@
//! "Fat" token table layout.
//!
//! Each token is materialized into a 16-byte-strided row, so a decode load
-//! addresses `data + code * 16` straight from the code — replacing the
-//! `code → entry → dict[offset]` dependent-load chain of the
-//! [`super::DecodeEntry`] layout with a single independent load. Costs
-//! `dict_tokens * 16` bytes of table; whether that pays is a cache-residency
-//! question the [`super::plan`] index decides per host.
+//! addresses `data + code * 16` straight from the code — a single independent
+//! load, with no `code → entry → dict[offset]` indirection. Costs
+//! `dict_tokens * 16` bytes of table, rebuilt once per decode call.
//!
//! Loop structure: a 16-byte over-copy fast region ([`super::scalar::copy16`])
//! plus an exact, length-aware tail.
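
The layout described in this module comment can be sketched as follows. This is an illustration under assumptions, not the module's code: `build_fat_table`, `decode_fat`, and the separate `lens` array are hypothetical, and the real decoder may store token lengths differently and split the fast region from the tail as two loops rather than one branch.

```rust
const STRIDE: usize = 16;

/// Materialize each token into its own zero-padded 16-byte row, so token
/// `code` starts at `data[code * STRIDE]`. Also returns each token's length.
fn build_fat_table(tokens: &[&[u8]]) -> (Vec<u8>, Vec<u8>) {
    let mut data = vec![0u8; tokens.len() * STRIDE];
    let mut lens = Vec::with_capacity(tokens.len());
    for (code, token) in tokens.iter().enumerate() {
        assert!(!token.is_empty() && token.len() <= STRIDE);
        let row = code * STRIDE;
        data[row..row + token.len()].copy_from_slice(token);
        lens.push(token.len() as u8);
    }
    (data, lens)
}

/// Decode `codes` against the fat table built above.
fn decode_fat(codes: &[u16], data: &[u8], lens: &[u8]) -> Vec<u8> {
    let total: usize = codes.iter().map(|&c| lens[c as usize] as usize).sum();
    let mut out = vec![0u8; total];
    let mut pos = 0;
    for &code in codes {
        let row = code as usize * STRIDE;
        let len = lens[code as usize] as usize;
        if pos + STRIDE <= out.len() {
            // Fast region: copy the whole row. The bytes past `len` are
            // padding and get overwritten by the tokens that follow.
            out[pos..pos + STRIDE].copy_from_slice(&data[row..row + STRIDE]);
        } else {
            // Tail: fewer than 16 bytes of room remain, so copy exactly `len`.
            out[pos..pos + len].copy_from_slice(&data[row..row + len]);
        }
        pos += len;
    }
    out
}
```

The over-copy is safe because output positions only move forward: every byte is last written by the token that owns it, so stray padding from an earlier row never survives.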
4 changes: 2 additions & 2 deletions src/decompress/mod.rs
@@ -583,8 +583,8 @@ mod tests {
}

/// Exercise the full decode width sweep against a corpus large enough to
-/// drive the batched AVX-512 prefix, the scalar 16-byte remainder, and the
-/// exact tail in a single call.
+/// drive the 16-byte over-copy fast region and the exact, length-aware
+/// tail in a single call.
#[test]
fn decompress_matches_input_across_widths() {
let mut bytes = Vec::new();
2 changes: 1 addition & 1 deletion src/lpm.rs
@@ -266,7 +266,7 @@ impl LongestPrefixMatcher {
/// prefix's length.
///
/// Precondition: `!data.is_empty()` and the matcher contains every
-/// single-byte token (always true after [`new`] or [`from_dictionary`]
+/// single-byte token (always true after [`Self::new`] or [`Self::from_dictionary`]
/// with a complete dictionary).
#[inline]
pub fn find_longest_match(&self, data: &[u8]) -> (Token, usize) {
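
To show why the precondition matters, here is a naive linear-scan sketch of longest-prefix matching. It is not the crate's `LongestPrefixMatcher`, which this diff does not show; `find_longest_match_naive` and its `(token index, length)` return shape are hypothetical. It returns `None` exactly where the real function's precondition would be violated.

```rust
/// Return the index and length of the longest token that is a prefix of
/// `data`, or `None` if no token matches (empty input, or a dictionary that
/// lacks the single-byte token for `data[0]`).
fn find_longest_match_naive(tokens: &[&[u8]], data: &[u8]) -> Option<(usize, usize)> {
    tokens
        .iter()
        .enumerate()
        .filter(|(_, t)| !t.is_empty() && data.starts_with(t))
        .max_by_key(|(_, t)| t.len())
        .map(|(i, t)| (i, t.len()))
}
```

With a complete dictionary and non-empty input, the single-byte token for `data[0]` always matches, so a matcher under that precondition can return a plain tuple instead of an `Option`.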
12 changes: 6 additions & 6 deletions src/parser.rs
@@ -31,7 +31,7 @@ impl Parser {
/// Train a dictionary against `bytes` / `offsets` and build the matching
/// LPM. `offsets` has length `n + 1`. Returns [`Error::InvalidArg`] if
/// `offsets` is empty or its last (maximum) offset cannot be represented in
-/// `usize` or exceeds `bytes.len()` — see [`validate_offsets`]. The `cfg`
+/// `usize` or exceeds `bytes.len()`. The `cfg`
/// is valid by construction ([`Bits`](crate::Bits) /
/// [`Threshold`](crate::Threshold)).
pub fn train<O: Offset>(bytes: &[u8], offsets: &[O], cfg: Config) -> Result<Self, Error> {
@@ -52,8 +52,8 @@ impl Parser {
/// Encode `bytes` / `offsets` using this parser. The dictionary is cloned
/// into the returned [`Column`] so the column is fully decode-self-
/// contained — the strings need not be the corpus the parser was trained
-/// on. Returns [`Error::InvalidArg`] on invalid offsets — see
-/// [`validate_offsets`].
+/// on. Returns [`Error::InvalidArg`] if `offsets` is empty or its last
+/// offset cannot be represented in `usize` or exceeds `bytes.len()`.
pub fn parse<O: Offset>(&self, bytes: &[u8], offsets: &[O]) -> Result<Column<O>, Error> {
validate_offsets(bytes, offsets)?;
Ok(self.parse_unchecked(bytes, offsets))
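
The three failure conditions named in the doc comment can be sketched as below. The body of the crate's private `validate_offsets` is not shown in this diff, so this is an assumption-laden illustration only: `validate_offsets_sketch`, the `u64` offset type, and the `String` error are hypothetical stand-ins for the generic `O: Offset` and `Error::InvalidArg`.

```rust
/// Reject `offsets` if it is empty, if its last offset does not fit in
/// `usize`, or if that offset points past the end of `bytes`.
fn validate_offsets_sketch(bytes: &[u8], offsets: &[u64]) -> Result<(), String> {
    let last = *offsets.last().ok_or("offsets is empty")?;
    let last = usize::try_from(last).map_err(|_| "last offset does not fit in usize")?;
    if last > bytes.len() {
        return Err("last offset exceeds bytes.len()".to_string());
    }
    Ok(())
}
```

Checking only the last offset matches what the doc comment promises; it treats the last offset as the maximum and does not by itself establish that the offsets are non-decreasing.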
@@ -81,9 +81,9 @@ impl Parser {

/// Encode every string into a flat `Vec<u16>` of codes plus per-row
/// `code_offsets`. Offset `[i]..[i + 1]` indexes the codes for row `i`. The
-/// offsets are compressor metadata — a token may span a row boundary, so the
-/// row structure cannot be recovered from the codes alone — and are not needed
-/// to decode the column as one flat stream.
+/// offsets are compressor metadata — the codes are a flat concatenation with no
+/// in-band row delimiter, so the row structure cannot be recovered from the
+/// codes alone — and are not needed to decode the column as one flat stream.
pub(crate) fn encode_strings<O: Offset>(
bytes: &[u8],
offsets: &[O],