Skip to content

Conversation

gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented Sep 10, 2025

Description

This PR splits off the documentation changes from #1176 to slightly ease the review burden. These docs changes comprise:

  • Describing how the lossless encoding of floating point numbers works to developers
  • Telling users how to use the feature

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Validated that the modified docs render correctly

Summary by CodeRabbit

  • New Features

    • Added an option to retain original floating-point textual representations during compression; encoder prefers a compact encoded form and falls back to full-string storage when needed.
  • Limitations

    • Not supported for KV‑IR inputs; numeric comparisons use the nearest IEEE‑754 binary64 value and may lose textual precision at query time.
  • Documentation

    • Added user guidance, CLI flag docs, and a new design document describing retention strategies and behaviour.

@gibber9809 gibber9809 requested a review from a team as a code owner September 10, 2025 20:45
Copy link
Contributor

coderabbitai bot commented Sep 10, 2025

Walkthrough

Adds a new design document describing two lossless JSON float encodings, updates the dev-docs index to include it, and updates user docs to document a new CLI flag --no-retain-float-format and floating-point limitations.

Changes

Cohort / File(s) Summary
Design: Retain Float Format
docs/src/dev-docs/design-retain-float-format.md
New design document specifying two lossless JSON float encodings: FormattedFloat (IEEE‑754 binary64 + 2‑byte little‑endian format field) and DictionaryFloat (store full string in a variable dictionary). Defines 2‑byte bit-field layout, acceptance rules for scientific/non‑scientific forms, selection/fallback rules, canonical vs nearest‑binary64 conditions, digit/exponent constraints, examples, and references.
Dev Docs Index Update
docs/src/dev-docs/index.md
Adds design-retain-float-format entry to the Design section toctree (documentation navigation-only addition).
User Docs: CLP Flag & Limitations
docs/src/user-docs/core-clp-s.md
Adds --no-retain-float-format CLI flag documentation (disable storing extra info to preserve original textual float representations), example usage, and updates Current limitations noting KV‑IR ingestion does not preserve printed float formats and query-time comparisons use nearest representable double.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as clp-s
  participant Encoder as Compressor
  participant Store as Archive
  participant Decoder as Decompressor
  participant Out as Output

  rect rgba(230,245,255,0.6)
    note over User,CLI: Compression with optional format retention (flag: --no-retain-float-format)
    User->>CLI: c [input] [output] [--no-retain-float-format?]
    CLI->>Encoder: Parse numbers
    alt format retention enabled
      Encoder->>Encoder: Try FormattedFloat (binary64 + 2‑byte format)
      alt fits spec
        Encoder->>Store: Write FormattedFloat
      else fallback
        Encoder->>Store: Write DictionaryFloat (dict ID -> original string)
      end
    else retention disabled
      Encoder->>Store: Write numeric value only
    end
  end

  rect rgba(235,255,235,0.6)
    note over User,CLI: Decompression
    User->>CLI: d [input]
    CLI->>Decoder: Read records
    alt FormattedFloat
      Decoder->>Out: Reconstruct original string using format bits
    else DictionaryFloat
      Decoder->>Out: Lookup original string by dictionary ID
    else numeric only
      Decoder->>Out: Emit default formatted value (nearest double)
    end
    note right of Decoder: KV‑IR inputs: original printed formats not preserved
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately summarizes the primary change: adding documentation for the lossless float representation feature in the clp-s component, and it follows the repository's conventional "scope: summary" style; it is concise, specific, and directly related to the files changed so a reviewer can understand the PR's main purpose at a glance.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c098515 and da06d72.

📒 Files selected for processing (1)
  • docs/src/user-docs/core-clp-s.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-docs/core-clp-s.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
docs/src/user-docs/core-clp-s.md (2)

157-160: Clarify limitation scope; align “floating point” style

Make it explicit this applies at decompression for archives built from KV‑IR inputs, and drop the hyphen for consistency.

-* For floating-point numbers:
-  * KV-IR inputs currently don't support preserving the original printed float formats.
-  * Comparisons against floating point numbers at query time treat each stored number as if it were
-    the nearest representable double-precision value, which can potentially lose precision.
+* For floating point numbers:
+  * For archives built from KV-IR inputs, decompression does not currently preserve the original
+    printed float formats.
+  * Comparisons against floating point numbers at query time treat each stored number as its nearest
+    representable double-precision (IEEE 754) value, which can lead to loss of precision or
+    unintuitive results.

27-28: Tighten wording: use "floating point" (no hyphen) and mention decompression

File: docs/src/user-docs/core-clp-s.md — lines 27–28

Preserved "that" per docs preference; clarify effect occurs on decompression.

-  * `--no-retain-float-format` specifies that `clp-s` should not store extra information to
-    preserve the original textual representation of floating-point numbers.
+  * `--no-retain-float-format` specifies that `clp-s` should not store metadata needed to
+    preserve the original textual representation of floating point numbers when decompressing.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/src/dev-docs/design-retain-float-format.md (1)

119-120: Prefer a normative JSON reference

Link to RFC 8259 or ECMA‑404 instead of an informal page.

-[json_grammar]: https://www.crockford.com/mckeeman.html
+[json_grammar]: https://www.rfc-editor.org/rfc/rfc8259
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61f74cc and f0c4036.

📒 Files selected for processing (3)
  • docs/src/dev-docs/design-retain-float-format.md (1 hunks)
  • docs/src/dev-docs/index.md (1 hunks)
  • docs/src/user-docs/core-clp-s.md (3 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-docs/design-retain-float-format.md

4-4: Inline HTML
Element: i

(MD033, no-inline-html)

🔇 Additional comments (2)
docs/src/dev-docs/index.md (1)

86-86: Approve design doc entry Verified that docs/src/dev-docs/design-retain-float-format.md exists and the slug resolves.

docs/src/user-docs/core-clp-s.md (1)

88-96: Usage example reads clearly

Example is consistent and actionable.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (6)
docs/src/user-docs/core-clp-s.md (3)

29-31: Tighten wording; prefer “floating‑point”; concise KV‑IR caveat

Clarify outcome and keep terminology consistent.

-  * `--retain-float-format` specifies that floating-point numbers should be stored with extra
-    metadata to preserve their textual representation after decompression. This feature is currently
-    not supported when ingesting KV-IR.
+  * `--retain-float-format` stores floating‑point numbers with format metadata so decompression
+    preserves their original textual form. Not supported for inputs ingested as KV‑IR.

82-86: Tip: focus on user outcome; note overhead, avoid internals

Remove implementation details; mention size/CPU trade‑off.

-:::{tip}
-Use the `--retain-float-format` flag during compression. Internally, switch to using a different
-encoding approach for floating point numbers that always retains their original formats. For
-example, values like `1.000e+00` or `0.000000012300` will be decompressed unchanged.
-:::
+:::{tip}
+Use `--retain-float-format` during compression to preserve the original textual representation of
+floating‑point values (e.g., `1.000e+00`, `0.000000012300`). This may slightly increase archive
+size and encode/decode CPU due to extra metadata.
+:::

175-176: Clarify scope: KV‑IR limitation applies at decompression for KV‑IR‑built archives

Make the limitation explicit.

-* When using the `--retain-float-format` flag:
-  * KV-IR inputs currently don't support preserving the original printed float formats.
+* When using `--retain-float-format`:
+  * For archives built from KV‑IR inputs, decompression does not currently preserve the original
+    printed floating‑point formats.
docs/src/dev-docs/design-retain-float-format.md (3)

26-32: Specify bit numbering and endianness unambiguously

Call out wire endianness and intra‑field bit numbering to avoid misinterpretation.

 - A 2-byte little-endian *format* field encoding the necessary output formatting information so
   that, upon decompression, the value can be decompressed exactly to the original text.
-  Note that the unused lowest 5 bits of the 2‑byte field are currently reserved.
+  The 2‑byte field is serialized little‑endian on the wire. Within the 16‑bit field, bit 0 is the
+  least‑significant bit (LSB) and bit 15 is the most‑significant bit (MSB).
+  Note that the lowest (LSB) 5 bits of the 2‑byte field are currently reserved.
   Encoders must write them as 0, and decoders must ignore them (treat as “don’t care”) for forward
   compatibility.

40-52: Clarify admissible formats; tighten zero rules

Make constraints precise and consistent.

-* For non-scientific numbers we accept:
-  * Any number that has at most 16 digits after the first non-zero digit
-  * Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero
-* For scientific numbers we accept:
-  * Single digit numbers with no decimal, followed by an exponent
-  * Or numbers with **1** digit preceding the decimal and up to 16 digits following the
-    decimal, followed by an exponent
-  * Where zero can not be the digit before the decimal, unless the number is a zero
-  * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-`
-  * With at most **4** exponent digits, which must be left-padded with `0`
+* For non‑scientific forms:
+  * Up to 16 digits spanning from the first non‑zero digit (inclusive) through the last digit of
+    the integer or fractional part.
+  * Zero is allowed with at most one zero before the decimal point and up to 16 zeros after it.
+* For scientific forms:
+  * Exactly one digit before the decimal (non‑zero unless the numeric value is zero),
+    optionally followed by a decimal point and up to 16 digits.
+  * Exponent marker `e` or `E` with optional `+` or `-` sign.
+  * Up to **4** exponent digits, left‑padded with `0`.

97-111: Fix inconsistency in “digits” field semantics

Spec text says “non‑zero digits” but examples count all digits; define it consistently.

-This counts the digits from the first non-zero digit up to the last digit of the integer or
+This counts the digits from the first non‑zero digit up to the last digit of the integer or
 fractional part (excluding the exponent). Examples:
 ...
-Per the [JSON grammar][json_grammar], the integer part of a floating point number cannot be empty,
+Per the [JSON grammar][json_grammar], the integer part of a floating‑point number cannot be empty,
 so the minimum number of digits is **1**. To take advantage of this fact, we store this field as
-**actual number of non-zero digits to end of number - 1**; for the numeric value zero we store
-**actual number of digits - 1**.
+**(count of digits from the first non‑zero digit to the end of the significand) − 1**; for the
+numeric value zero, count all digits in the significand and subtract 1.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0c4036 and 359f295.

📒 Files selected for processing (2)
  • docs/src/dev-docs/design-retain-float-format.md (1 hunks)
  • docs/src/user-docs/core-clp-s.md (3 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-docs/design-retain-float-format.md

5-5: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


12-12: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


26-26: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


27-27: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


42-42: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


54-54: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


63-63: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


64-64: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


65-65: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


74-74: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


75-75: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


76-76: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


92-92: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


93-93: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


94-94: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


95-95: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


102-102: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


103-103: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


104-104: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


105-105: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: antlr-code-committed (ubuntu-24.04)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
docs/src/dev-docs/design-retain-float-format.md (2)

85-88: LGTM: exponent digit mapping and rationale are clear

Formatting and justification read well.


112-116: LGTM: 17‑digit limit and 5‑bit mapping explained

The 17‑digit rationale and encoded‑value mapping are correct.

@anlowee
Copy link
Member

anlowee commented Sep 11, 2025

Finished the first round of the review.

@gibber9809 gibber9809 requested a review from anlowee September 11, 2025 18:43
anlowee
anlowee previously approved these changes Sep 11, 2025
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we restructure the doc and add a bit more motivation? I'm thinking something like:

  • Motivation
    • Explain that a float value with the same binary encoding can have multiple string encodings, and show some examples.
    • Explain that to support, truly lossless decompression, we want to store not only the float’s value but the format as well.
    • Explain the complexities (e.g., IEEE-754 bin64 has its own limitations but also some values are intractable to retain the format for, etc.)
  • Solution
    • Two encodings
    • When FormattedFloat cannot represent losslessly, we fall back to DictionaryFloat
  • Overview of what formats we can represent and examples
  • Specification

Overall, we want this doc to be a bit more readable for new developers and/or interested users.

Comment on lines 34 to 38
+-------------------------------------+------------------------+--------------------------+------------------------------------------------------+-------------------+
| Scientific Notation Marker (2 bits) | Exponent Sign (2 bits) | Exponent Digits (2 bits) | Digits from First Non-Zero to End of Number (5 bits) | Reserved (5 bits) |
+-------------------------------------+------------------------+--------------------------+------------------------------------------------------+-------------------+
MSB LSB
```
Copy link
Member

@kirkrodrigues kirkrodrigues Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a bullet list instead and link to each subsection?

- `0.000000123000`**6** (from first `1` to last `0`)
- `0.00`**3** (counts all zeros for zero value)

Per the [JSON grammar][json_grammar], the integer part of a floating point number cannot be empty,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to kind of summarize what the JSON grammar really supports and what it doesn't.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (5)
docs/src/dev-docs/design-retain-float-format.md (5)

130-139: Clarify admissible formats; tighten zero rules; make wording unambiguous.

Reword the admissible format bullets to precisely define mantissa/exponent and zero handling.

-* For non-scientific numbers we accept:
-  * Any number that has at most 16 digits after the first non-zero digit
-  * Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero
-* For scientific numbers we accept:
-  * Single digit numbers with no decimal, followed by an exponent
-  * Or numbers with **1** digit preceding the decimal and up to 16 digits following the
-    decimal, followed by an exponent
-  * Where zero can not be the digit before the decimal, unless every digit in the number is zero
-  * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-`
-  * With at most **4** exponent digits, which can be left-padded with `0`
+* For non‑scientific numbers:
+  - Any number with at most 16 digits spanning from the first non‑zero digit (inclusive) through the last digit of the integer or fractional part.
+  - Zero is allowed with at most one zero before the decimal point and up to 16 zeros after it.
+* For scientific numbers:
+  - Exactly one digit before the decimal (non‑zero unless the numeric value is zero), optionally followed by a decimal point and up to 16 digits.
+  - Exponent marker `e` or `E` with optional `+` or `-` sign.
+  - Up to 4 exponent digits, left‑padded with `0` if present.

178-181: Minor phrasing: drop “respectively” and keep rationale concise.

Current text is fine technically; consider slightly tighter phrasing.

-Since the maximum and minimum decimal exponents for a double, `308` and `-324` respectively, are
-both three digits, two bits are enough to represent the digit count. We allow up to 4 digits to
-support exponents left-padded with `0`.
+Since the maximum and minimum decimal exponents for a double, `308` and `-324`, are both three
+digits, two bits are enough to encode the digit count. We allow up to 4 digits to support
+left‑padded exponents.

205-208: Hyphenate “floating‑point” and tighten punctuation.

Small copyedit.

-As well, according to IEEE-754, only 17 decimal significant digits are needed to represent all
-binary64 floating point numbers without precision loss. As a result, we currently allow a maximum of
-**17 digits**. Because the stored value is **digits - 1** the maximum encoded value is 16, which
-requires 5 bits.
+According to IEEE‑754, only 17 decimal significant digits are needed to represent all
+binary64 floating‑point numbers without precision loss. As a result, we currently allow a maximum of
+**17 digits**. Because the stored value is **digits − 1**, the maximum encoded value is 16, which
+requires 5 bits.

200-204: Fix broken link label and resolve “non‑zero digits” ambiguity.

Use the defined [json_spec] label and clarify the stored value definition to match examples.

-Per the [JSON specification][json_grammar], the integer part of a floating point number cannot be
-empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field
-as **actual number of non-zero digits to end of number - 1**; for the numeric value zero we store
-**actual number of digits - 1**.
+Per the [JSON specification][json_spec], the integer part of a floating‑point number cannot be
+empty, so the minimum number of digits is **1**. We therefore store this field as
+**(count of digits from the first non‑zero digit to the end of the significand) − 1**; for the
+numeric value zero, count all zeros in the significand and subtract 1.

131-133: Spell “zeros” (not “zeroes”).

Align with house style.

-  * Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero
+  - Zero is allowed with at most one zero before the decimal and up to 16 zeros after it.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc49e38 and aa008af.

📒 Files selected for processing (2)
  • docs/src/dev-docs/design-retain-float-format.md (1 hunks)
  • docs/src/dev-docs/index.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-docs/design-retain-float-format.md

7-7: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


33-33: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


47-47: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


63-63: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


76-76: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


93-93: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


93-93: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


95-95: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


100-100: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


100-100: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


102-102: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


115-115: Emphasis style
Expected: underscore; Actual: asterisk

(MD049, emphasis-style)


115-115: Emphasis style
Expected: underscore; Actual: asterisk

(MD049, emphasis-style)


122-122: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


130-130: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


130-130: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


131-131: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


132-132: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


133-133: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


134-134: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


135-135: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


137-137: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


138-138: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


139-139: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


142-142: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


142-142: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


143-143: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


200-200: Reference links and images should use a label that is defined
Missing link or image reference definition: "jsongrammar"

(MD052, reference-links-images)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (1)
docs/src/dev-docs/index.md (1)

102-103: LGTM: design doc added to toctree.

Entry renders fine in the Design section.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (4)
docs/src/dev-docs/design-retain-float-format.md (4)

213-216: Fix spec contradiction and missing reference label (json_grammar → json_spec).

The stored-field definition wrongly says “non‑zero digits,” which contradicts the examples and earlier definition (the count includes zeros after the first non‑zero). Also, the link label json_grammar is undefined; use json_spec.

Apply:

-Per the [JSON specification][json_grammar], the integer part of a floating-point number cannot be
-empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field
-as **actual number of non-zero digits to end of number - 1**; for the numeric value zero we store
-**actual number of digits - 1**.
+Per the [JSON specification][json_spec], the integer part of a floating‑point number cannot be
+empty, so the minimum number of digits is **1**. We store this field as
+**(count of digits from the first non‑zero digit to the end of the significand) − 1**; for the
+numeric value zero, we store **(count of digits in the significand) − 1**.

83-86: Surround list with blank lines (MD032).

Insert a blank line before the list to satisfy markdownlint.

-Overall the implications here are that:
+Overall the implications here are that:
+
 - For any given number there are many possible representations (infinitely many in fact); and
 - Not all floating-point numbers that are valid JSON correspond to values from a standard like
   IEEE-754

56-60: Surround list with blank lines (MD032).

Add blank lines before and after the two-item list.

-For example:
-- `0.160e2`; and
-- `0.00000000160e10`
+For example:
+
+- `0.160e2`; and
+- `0.00000000160e10`
+

142-152: Consistency: “cannot” and “zeros”.

Use “cannot” (single word) and prefer “zeros” for consistency.

-  - Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero
+  - Or at most 1 zero before the decimal and 16 zeros after the decimal, if the number is a zero
@@
-  - Where zero can not be the digit before the decimal, unless every digit in the number is zero
+  - Where zero cannot be the digit before the decimal, unless every digit in the number is zero
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa008af and b4d624f.

📒 Files selected for processing (1)
  • docs/src/dev-docs/design-retain-float-format.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-docs/design-retain-float-format.md

84-84: Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


213-213: Reference links and images should use a label that is defined
Missing link or image reference definition: "jsongrammar"

(MD052, reference-links-images)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's looking really good! easy to navigate, and informative without being overwhelming.

a number of comments to address; let me know if you need clarification.

high-level nit: I think that unordered lists should use * as the bullet instead of -; I didn't bother commenting every time, as everything renders correctly.

gibber9809 and others added 3 commits September 22, 2025 13:48
Co-authored-by: Quinn Taylor Mitchell <[email protected]>
Co-authored-by: Quinn Taylor Mitchell <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4d624f and 288ce21.

📒 Files selected for processing (1)
  • docs/src/dev-docs/design-retain-float-format.md (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
docs/src/dev-docs/design-retain-float-format.md (7)

42-44: Avoid double negative; streamline phrasing
Prior feedback noted this; the older wording appears to have resurfaced.

-What it doesn't do is place any restrictions on how a given floating-point number should be written,
-or whether floating-point numbers have to correspond to values from a standard such as IEEE-754
-binary64.
+It does not restrict how a floating‑point number is written, nor require that numbers correspond to
+values from a standard such as IEEE‑754 binary64.

191-194: Inline code and “respectively” phrasing regression

Earlier fix removed “respectively” from the code spans; please restore the improved wording.

-Since the maximum and minimum decimal exponents for a double (`308` and `-324` respectively) are
-both three digits, two bits are enough to represent the digit count. We allow up to 4 digits to
-support exponents left-padded with `0`.
+Since the maximum and minimum decimal exponents for a double (`308` and `-324`) are
+both three digits, two bits are enough to represent the digit count. We allow up to four digits to
+support left‑padded exponents (e.g., `+0003`).

213-216: Field definition contradicts examples; remove “non‑zero digits” wording

This should count all digits from the first non‑zero through the end of the significand; zero is a special case.

-Per the [JSON specification][json_spec], the integer part of a floating-point number cannot be
-empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field
-as **actual number of non-zero digits to end of number - 1**; for the numeric value zero we store
-**actual number of digits - 1**.
+Per the [JSON specification][json_spec], the integer part of a floating‑point number cannot be
+empty, so the minimum number of digits is **1**. To leverage this, we store this field as
+**(count of digits from the first non‑zero digit to the end of the significand) − 1**. For the
+numeric value zero, count all zeros in the significand and store **(that count − 1)**.

1-1: Shorten H1 for sidebar/toctree readability

The current title is long and wraps in the nav. Prefer a concise H1.

-# What does it take to losslessly retain JSON floating-point numbers?
+# Lossless JSON float storage

141-144: Tighten non‑scientific acceptance rule; fix “zeroes” → “zeros”

Make the 17‑significant‑digit cap explicit and use consistent spelling.

-- Any number that has at most 16 digits after the first non-zero digit
-- Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero
+- Any number with at most 16 digits following the first non‑zero digit (i.e., up to 17 significant digits total)
+- A zero written with at most 1 zero before the decimal and up to 16 zeros after the decimal

145-151: Unify list marker style (MD004) and minor phrasing

Nested items currently use “*”. Use “-” to match configured style; small grammar tweak.

-  1. The significand is either
-      * a **single** digit with no decimal point, followed by an exponent
-      * **one** digit before the decimal point, and up to **16** digits after the decimal point,
-        followed by an exponent
+  1. The significand is either:
+      - a **single** digit with no decimal point, followed by an exponent
+      - **one** digit before the decimal point and up to **16** digits after the decimal point,
+        followed by an exponent

73-81: Drop ambiguous claim; keep the demonstrative round‑trip example (verified)

Verified (Python): 1.23456789012345681.2345678901234567.

-we know that only the first and third number correspond to IEEE-754 binary64 floating-point numbers.
-The reason we can tell is that the [IEEE-754 specification][ieee754] requires that when converting a
+We can illustrate the distinction via round‑trip behaviour. The [IEEE‑754 specification][ieee754]
+requires that when converting a
 floating-point number to a decimal string at a given precision, the decimal string must correspond
 to the nearest decimal representation. Likewise, when converting a decimal string to a
 floating-point number, the standard requires that the number be converted to the nearest
 floating-point representation. If you use any standards-compliant implementation to turn
 `1.2345678901234568` into a binary64 floating-point number, and back to a decimal string, you will
 find that it has been rounded to `1.2345678901234567`.

Comment on lines +124 to +127
- The double value in IEEE-754 binary64 format.
- A 2-byte little-endian _format_ field encoding the necessary output formatting information so
that, upon decompression, the value can be decompressed exactly to the original text.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Wording: “reconstructed exactly” vs “decompressed exactly”

We’re reconstructing the original text; suggest clearer terminology.

-- A 2-byte little-endian _format_ field encoding the necessary output formatting information so
-  that, upon decompression, the value can be decompressed exactly to the original text.
+- A 2‑byte little‑endian _format_ field encoding the necessary output formatting information so
+  that, upon decompression, the original text can be reconstructed exactly.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- The double value in IEEE-754 binary64 format.
- A 2-byte little-endian _format_ field encoding the necessary output formatting information so
that, upon decompression, the value can be decompressed exactly to the original text.
- The double value in IEEE-754 binary64 format.
- A 2byte littleendian _format_ field encoding the necessary output formatting information so
that, upon decompression, the original text can be reconstructed exactly.
🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 124 to 127, the
wording uses "decompressed exactly" which is ambiguous; replace it with
"reconstructed exactly" (or "reconstructed to the original text") for clarity,
updating the sentence to say the 2-byte little-endian format field encodes
output formatting info so that, upon decompression, the value can be
reconstructed exactly to the original text.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 288ce21 and b864ab5.

📒 Files selected for processing (1)
  • docs/src/dev-docs/design-retain-float-format.md (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (5)
docs/src/dev-docs/design-retain-float-format.md (5)

6-41: JSON number grammar section reads correctly

Good summary and constraints; matches RFC 8259.


126-127: Wording: “decompressed exactly” → “reconstructed exactly”

We’re reconstructing original text, not “decompressing” it.

-- A 2-byte little-endian _format_ field encoding the necessary output formatting information so
-  that, upon decompression, the value can be decompressed exactly to the original text.
+- A 2-byte little-endian _format_ field encoding the necessary output formatting information so
+  that, upon decompression, the original text can be reconstructed exactly.

171-177: Reserve codepoint semantics: make normative (encoders MUST NOT emit 10; decoders MUST reject)

Interoperability needs explicit write/read rules for reserved values.

 `10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition
-checks cleaner.
+checks cleaner. Encoders MUST NOT emit `10`; decoders MUST treat `10` as invalid input and
+fail parsing (or handle per a documented error policy).

186-190: Normative rules for unused/irrelevant subfields

Specify canonical write-zero for irrelevant fields and reject unused codes.

-`11` is unused by the current implementation.
+`11` is unused by the current implementation. Encoders MUST NOT emit `11`; decoders MUST
+treat it as invalid input.
+
+When the [Scientific notation marker] is `00` (not scientific), encoders MUST set both
+[Exponent sign] and [Exponent digits] to `00`. Decoders MUST ignore those two fields when
+the marker is `00`.

142-146: Tighten wording to match the digit-count definition

Mirror the “span from first non‑zero through last digit” phrasing used later.

-- Numbers not written in scientific notation are accepted if **either** of the following are true:
-  1. Any non-zero number with at most **17** digits starting at the first non-zero digit
+- Numbers not written in scientific notation are accepted if **either** of the following are true:
+  1. Any non-zero number with at most **17** digits spanning from the first non-zero digit
+     through the last digit of the integer or fractional part

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other than the comment we discussed offline, lgtm!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1a4903 and 4909490.

📒 Files selected for processing (1)
  • docs/src/dev-docs/design-retain-float-format.md (1 hunks)
🔇 Additional comments (5)
docs/src/dev-docs/design-retain-float-format.md (5)

121-123: Use “reconstructed exactly”; avoid redundant “decompressed”.

The current sentence repeats “decompressed” and is less precise than “reconstructed”.

Apply this diff:

-– A 2-byte little-endian _format_ field encoding the necessary output formatting information so
-  that, upon decompression, the value can be decompressed exactly to the original text.
+– A 2‑byte little‑endian _format_ field encoding the necessary output formatting information so
+  that, upon decompression, the original text can be reconstructed exactly.

166-185: Make reserved/irrelevant codepoints normative to prevent interop drift.

State MUST/MUST NOT for writers and readers, and define canonical values when not scientific.

Apply this diff:

 ### Scientific notation marker
@@
-`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition
-checks cleaner.
+`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition
+checks cleaner. Encoders MUST NOT emit `10`; decoders MUST reject it as invalid input.
+
+When the marker is `00` (not scientific), encoders MUST set both “Exponent sign” and
+“Exponent digits” to `00`. Decoders MUST ignore those two fields when the marker is `00`.

 ### Exponent sign
@@
-`11` is unused by the current implementation.
+`11` is unused by the current implementation. Encoders MUST NOT emit `11`; decoders SHOULD
+reject it.

156-160: Tighten prose; fix comma splice; prefer “±Infinity”.

Improves readability and consistency with earlier usage.

Apply this diff:

-These restrictions really correspond to "canonical" representations of floating-point numbers with
-up to 17 digits of precision. This means that our formatting scheme can always represent numbers
-produced by format specifiers such as `%f`, `%e`, and `%g`, so long as they don't use too many
-digits of precision, and the underlying number isn't NaN, or +/- Infinity.
+These restrictions correspond to “canonical” representations with up to 17 digits of precision.
+Our formatting scheme can represent numbers produced by `%f`, `%e`, and `%g`, provided the
+precision is within the supported limit and the value is neither NaN nor ±Infinity.

127-134: Add a compact bit‑layout figure (MSB..LSB across two LE bytes).

A visual removes residual ambiguity about bit indices vs byte order on the wire.

Apply this diff right after the list:

 From MSB to LSB, the 2-byte format field contains the following sections:
@@
 - Reserved for future use (5 bits)
 
+```
+Bit index (field wide, MSB→LSB): 15                                           0
+                                   +----+----+----+-----------+---------------+
+                                   | SN | ES | ED |   DIGITS  |   RESERVED    |
+                                   +----+----+----+-----------+---------------+
+                                   15-14 13-12 11-10     9-5          4-0
+
+Wire order: the 16-bit field is encoded little‑endian (least‑significant byte first).
+```

30-30: Consistency: use “±Infinity”.

Matches the symbol used elsewhere and tightens wording.

Apply this diff:

-- NaN and +/- Infinity are not allowed
+- NaN and ±Infinity are not allowed

Since our goal is to losslessly retain floating-point numbers that come from JSON input, it is worth
taking a look at what kinds of floating-point numbers can appear in JSON.

The [JSON specification][json_spec] treats fields that match the following grammar as number values:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Optionally cite ECMA‑404 alongside RFC 8259.

Adds the concise syntax reference many implementers expect.

Apply this diff:

-The [JSON specification][json_spec] treats fields that match the following grammar as number values:
+The [JSON specification][json_spec] (see also [ECMA‑404][ecma404]) treats fields that match the
+following grammar as number values:

And append this reference:

 [json_spec]: https://datatracker.ietf.org/doc/html/rfc8259
 [ieee754]: https://ieeexplore.ieee.org/document/4610935/
+[ecma404]: https://www.ecma-international.org/publications-and-standards/standards/ecma-404/

Also applies to: 225-227

🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 6 (and also
apply the same change to lines 225-227), the JSON spec reference is incomplete;
add an additional citation to ECMA-404 alongside RFC 8259 by appending a brief
parenthetical or bracketed reference to ECMA-404 after the existing RFC 8259
mention so the sentence reads as referencing both RFC 8259 and ECMA-404;
replicate the same edit at lines 225-227 to ensure both places cite ECMA-404.

Comment on lines +202 to +204
This counts the digits from the first non-zero digit up to the last digit of the integer or
fractional part (excluding the exponent). Examples:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Explicitly state the decimal point is excluded from the count.

Removes a small ambiguity in implementer interpretation.

Apply this diff:

-This counts the digits from the first non-zero digit up to the last digit of the integer or
-fractional part (excluding the exponent). Examples:
+This counts the digits from the first non-zero digit up to the last digit of the integer or
+fractional part (excluding the decimal point and the exponent). Examples:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
This counts the digits from the first non-zero digit up to the last digit of the integer or
fractional part (excluding the exponent). Examples:
This counts the digits from the first non-zero digit up to the last digit of the integer or
fractional part (excluding the decimal point and the exponent). Examples:
🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 202 to 204, the
description of how digits are counted is ambiguous about whether the decimal
point is counted; update the sentence to explicitly state that the decimal point
is excluded from the count. Modify the wording so it clearly reads that digits
are counted from the first non-zero digit to the last digit of the integer or
fractional part, excluding both the exponent and the decimal point, and add a
short clarifying example if desired.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4909490 and c098515.

📒 Files selected for processing (1)
  • docs/src/user-docs/core-clp-s.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-docs/core-clp-s.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
docs/src/user-docs/core-clp-s.md (2)

157-160: KV‑IR limitation: scope at decompression; align terminology (“floating point”)

Make it explicit this applies at decompression for archives built from KV‑IR inputs, and use “floating point” (no hyphen) per project preference.

Apply this diff:

-* For floating-point numbers:
-  * KV-IR inputs currently don't support preserving the original printed float formats.
-  * Comparisons against floating point numbers at query time treat each stored number as if it were
-    the nearest representable double-precision value, which can potentially lose precision.
+* For floating point numbers:
+  * For archives built from KV‑IR inputs, decompression does not currently preserve the original
+    printed formats of floating point numbers.
+  * Comparisons against floating point numbers at query time treat each stored number as the nearest
+    representable double-precision value; precision may be lost.

27-28: Verify default and consistency across docs/CLI help

docs/src/user-docs/core-clp-s.md documents --no-retain-float-format and there’s a design doc (docs/src/dev-docs/design-retain-float-format.md); searches found no CLI/arg definition for this flag. Confirm whether retaining float-format is the default and update CLI help and all docs to use the same flag name and explicitly state the default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants