-
Notifications
You must be signed in to change notification settings - Fork 82
docs(clp-s): Add documentation for the lossless float representation feature. #1298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughAdds a new design document describing two lossless JSON float encodings, updates the dev-docs index to include it, and updates user docs to document a new CLI flag Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant CLI as clp-s
participant Encoder as Compressor
participant Store as Archive
participant Decoder as Decompressor
participant Out as Output
rect rgba(230,245,255,0.6)
note over User,CLI: Compression with optional format retention (flag: --no-retain-float-format)
User->>CLI: c [input] [output] [--no-retain-float-format?]
CLI->>Encoder: Parse numbers
alt format retention enabled
Encoder->>Encoder: Try FormattedFloat (binary64 + 2‑byte format)
alt fits spec
Encoder->>Store: Write FormattedFloat
else fallback
Encoder->>Store: Write DictionaryFloat (dict ID -> original string)
end
else retention disabled
Encoder->>Store: Write numeric value only
end
end
rect rgba(235,255,235,0.6)
note over User,CLI: Decompression
User->>CLI: d [input]
CLI->>Decoder: Read records
alt FormattedFloat
Decoder->>Out: Reconstruct original string using format bits
else DictionaryFloat
Decoder->>Out: Lookup original string by dictionary ID
else numeric only
Decoder->>Out: Emit default formatted value (nearest double)
end
note right of Decoder: KV‑IR inputs: original printed formats not preserved
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧠 Learnings (1)📚 Learning: 2025-08-15T21:48:40.228Z
Applied to files:
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (2)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/src/dev-docs/design-retain-float-format.md (1)
119-120
: Prefer a normative JSON referenceLink to RFC 8259 or ECMA‑404 instead of an informal page.
-[json_grammar]: https://www.crockford.com/mckeeman.html +[json_grammar]: https://www.rfc-editor.org/rfc/rfc8259
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
docs/src/dev-docs/design-retain-float-format.md
(1 hunks)docs/src/dev-docs/index.md
(1 hunks)docs/src/user-docs/core-clp-s.md
(3 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-docs/design-retain-float-format.md
4-4: Inline HTML
Element: i
(MD033, no-inline-html)
🔇 Additional comments (2)
docs/src/dev-docs/index.md (1)
86-86
: Approve design doc entry Verified that docs/src/dev-docs/design-retain-float-format.md exists and the slug resolves.docs/src/user-docs/core-clp-s.md (1)
88-96
: Usage example reads clearlyExample is consistent and actionable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (6)
docs/src/user-docs/core-clp-s.md (3)
29-31
: Tighten wording; prefer “floating‑point”; concise KV‑IR caveatClarify outcome and keep terminology consistent.
- * `--retain-float-format` specifies that floating-point numbers should be stored with extra - metadata to preserve their textual representation after decompression. This feature is currently - not supported when ingesting KV-IR. + * `--retain-float-format` stores floating‑point numbers with format metadata so decompression + preserves their original textual form. Not supported for inputs ingested as KV‑IR.
82-86
: Tip: focus on user outcome; note overhead, avoid internalsRemove implementation details; mention size/CPU trade‑off.
-:::{tip} -Use the `--retain-float-format` flag during compression. Internally, switch to using a different -encoding approach for floating point numbers that always retains their original formats. For -example, values like `1.000e+00` or `0.000000012300` will be decompressed unchanged. -::: +:::{tip} +Use `--retain-float-format` during compression to preserve the original textual representation of +floating‑point values (e.g., `1.000e+00`, `0.000000012300`). This may slightly increase archive +size and encode/decode CPU due to extra metadata. +:::
175-176
: Clarify scope: KV‑IR limitation applies at decompression for KV‑IR‑built archivesMake the limitation explicit.
-* When using the `--retain-float-format` flag: - * KV-IR inputs currently don't support preserving the original printed float formats. +* When using `--retain-float-format`: + * For archives built from KV‑IR inputs, decompression does not currently preserve the original + printed floating‑point formats.docs/src/dev-docs/design-retain-float-format.md (3)
26-32
: Specify bit numbering and endianness unambiguouslyCall out wire endianness and intra‑field bit numbering to avoid misinterpretation.
- A 2-byte little-endian *format* field encoding the necessary output formatting information so that, upon decompression, the value can be decompressed exactly to the original text. - Note that the unused lowest 5 bits of the 2‑byte field are currently reserved. + The 2‑byte field is serialized little‑endian on the wire. Within the 16‑bit field, bit 0 is the + least‑significant bit (LSB) and bit 15 is the most‑significant bit (MSB). + Note that the lowest (LSB) 5 bits of the 2‑byte field are currently reserved. Encoders must write them as 0, and decoders must ignore them (treat as “don’t care”) for forward compatibility.
40-52
: Clarify admissible formats; tighten zero rulesMake constraints precise and consistent.
-* For non-scientific numbers we accept: - * Any number that has at most 16 digits after the first non-zero digit - * Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero -* For scientific numbers we accept: - * Single digit numbers with no decimal, followed by an exponent - * Or numbers with **1** digit preceding the decimal and up to 16 digits following the - decimal, followed by an exponent - * Where zero can not be the digit before the decimal, unless the number is a zero - * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-` - * With at most **4** exponent digits, which must be left-padded with `0` +* For non‑scientific forms: + * Up to 16 digits spanning from the first non‑zero digit (inclusive) through the last digit of + the integer or fractional part. + * Zero is allowed with at most one zero before the decimal point and up to 16 zeros after it. +* For scientific forms: + * Exactly one digit before the decimal (non‑zero unless the numeric value is zero), + optionally followed by a decimal point and up to 16 digits. + * Exponent marker `e` or `E` with optional `+` or `-` sign. + * Up to **4** exponent digits, left‑padded with `0`.
97-111
: Fix inconsistency in “digits” field semanticsSpec text says “non‑zero digits” but examples count all digits; define it consistently.
-This counts the digits from the first non-zero digit up to the last digit of the integer or +This counts the digits from the first non‑zero digit up to the last digit of the integer or fractional part (excluding the exponent). Examples: ... -Per the [JSON grammar][json_grammar], the integer part of a floating point number cannot be empty, +Per the [JSON grammar][json_grammar], the integer part of a floating‑point number cannot be empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field as -**actual number of non-zero digits to end of number - 1**; for the numeric value zero we store -**actual number of digits - 1**. +**(count of digits from the first non‑zero digit to the end of the significand) − 1**; for the +numeric value zero, count all digits in the significand and subtract 1.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
docs/src/dev-docs/design-retain-float-format.md
(1 hunks)docs/src/user-docs/core-clp-s.md
(3 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-docs/design-retain-float-format.md
5-5: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
12-12: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
26-26: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
27-27: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
42-42: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
54-54: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
63-63: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
64-64: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
65-65: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
74-74: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
75-75: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
76-76: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
92-92: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
93-93: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
94-94: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
95-95: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
102-102: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
103-103: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
104-104: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
105-105: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: antlr-code-committed (ubuntu-24.04)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
docs/src/dev-docs/design-retain-float-format.md (2)
85-88
: LGTM: exponent digit mapping and rationale are clearFormatting and justification read well.
112-116
: LGTM: 17‑digit limit and 5‑bit mapping explainedThe 17‑digit rationale and encoded‑value mapping are correct.
Finished the first round of the review. |
Co-authored-by: Xiaochong(Eddy) Wei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we restructure the doc and add a bit more motivation? I'm thinking something like:
- Motivation
- Explain that a float value with the same binary encoding can have multiple string encodings, and show some examples.
- Explain that to support, truly lossless decompression, we want to store not only the float’s value but the format as well.
- Explain the complexities (e.g., IEEE-754 bin64 has its own limitations but also some values are intractable to retain the format for, etc.)
- Solution
- Two encodings
- When FormattedFloat cannot represent losslessly, we fall back to DictionaryFloat
- Overview of what formats we can represent and examples
- Specification
Overall, we want this doc to be a bit more readable for new developers and/or interested users.
+-------------------------------------+------------------------+--------------------------+------------------------------------------------------+-------------------+ | ||
| Scientific Notation Marker (2 bits) | Exponent Sign (2 bits) | Exponent Digits (2 bits) | Digits from First Non-Zero to End of Number (5 bits) | Reserved (5 bits) | | ||
+-------------------------------------+------------------------+--------------------------+------------------------------------------------------+-------------------+ | ||
MSB LSB | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a bullet list instead and link to each subsection?
- `0.000000123000` → **6** (from first `1` to last `0`) | ||
- `0.00` → **3** (counts all zeros for zero value) | ||
|
||
Per the [JSON grammar][json_grammar], the integer part of a floating point number cannot be empty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to kind of summarize what the JSON grammar really supports and what it doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
♻️ Duplicate comments (5)
docs/src/dev-docs/design-retain-float-format.md (5)
130-139
: Clarify admissible formats; tighten zero rules; make wording unambiguous.Reword the admissible format bullets to precisely define mantissa/exponent and zero handling.
-* For non-scientific numbers we accept: - * Any number that has at most 16 digits after the first non-zero digit - * Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero -* For scientific numbers we accept: - * Single digit numbers with no decimal, followed by an exponent - * Or numbers with **1** digit preceding the decimal and up to 16 digits following the - decimal, followed by an exponent - * Where zero can not be the digit before the decimal, unless every digit in the number is zero - * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-` - * With at most **4** exponent digits, which can be left-padded with `0` +* For non‑scientific numbers: + - Any number with at most 16 digits spanning from the first non‑zero digit (inclusive) through the last digit of the integer or fractional part. + - Zero is allowed with at most one zero before the decimal point and up to 16 zeros after it. +* For scientific numbers: + - Exactly one digit before the decimal (non‑zero unless the numeric value is zero), optionally followed by a decimal point and up to 16 digits. + - Exponent marker `e` or `E` with optional `+` or `-` sign. + - Up to 4 exponent digits, left‑padded with `0` if present.
178-181
: Minor phrasing: drop “respectively” and keep rationale concise.Current text is fine technically; consider slightly tighter phrasing.
-Since the maximum and minimum decimal exponents for a double, `308` and `-324` respectively, are -both three digits, two bits are enough to represent the digit count. We allow up to 4 digits to -support exponents left-padded with `0`. +Since the maximum and minimum decimal exponents for a double, `308` and `-324`, are both three +digits, two bits are enough to encode the digit count. We allow up to 4 digits to support +left‑padded exponents.
205-208
: Hyphenate “floating‑point” and tighten punctuation.Small copyedit.
-As well, according to IEEE-754, only 17 decimal significant digits are needed to represent all -binary64 floating point numbers without precision loss. As a result, we currently allow a maximum of -**17 digits**. Because the stored value is **digits - 1** the maximum encoded value is 16, which -requires 5 bits. +According to IEEE‑754, only 17 decimal significant digits are needed to represent all +binary64 floating‑point numbers without precision loss. As a result, we currently allow a maximum of +**17 digits**. Because the stored value is **digits − 1**, the maximum encoded value is 16, which +requires 5 bits.
200-204
: Fix broken link label and resolve “non‑zero digits” ambiguity.Use the defined
[json_spec]
label and clarify the stored value definition to match examples.-Per the [JSON specification][json_grammar], the integer part of a floating point number cannot be -empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field -as **actual number of non-zero digits to end of number - 1**; for the numeric value zero we store -**actual number of digits - 1**. +Per the [JSON specification][json_spec], the integer part of a floating‑point number cannot be +empty, so the minimum number of digits is **1**. We therefore store this field as +**(count of digits from the first non‑zero digit to the end of the significand) − 1**; for the +numeric value zero, count all zeros in the significand and subtract 1.
131-133
: Spell “zeros” (not “zeroes”).Align with house style.
- * Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero + - Zero is allowed with at most one zero before the decimal and up to 16 zeros after it.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
docs/src/dev-docs/design-retain-float-format.md
(1 hunks)docs/src/dev-docs/index.md
(1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-docs/design-retain-float-format.md
7-7: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
33-33: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
47-47: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
63-63: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
76-76: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
93-93: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
93-93: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
95-95: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
100-100: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
100-100: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
102-102: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
115-115: Emphasis style
Expected: underscore; Actual: asterisk
(MD049, emphasis-style)
115-115: Emphasis style
Expected: underscore; Actual: asterisk
(MD049, emphasis-style)
122-122: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
130-130: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
130-130: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
131-131: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
132-132: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
133-133: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
134-134: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
135-135: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
137-137: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
138-138: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
139-139: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
142-142: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
142-142: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
143-143: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
200-200: Reference links and images should use a label that is defined
Missing link or image reference definition: "jsongrammar"
(MD052, reference-links-images)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (1)
docs/src/dev-docs/index.md (1)
102-103
: LGTM: design doc added to toctree.Entry renders fine in the Design section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (4)
docs/src/dev-docs/design-retain-float-format.md (4)
213-216
: Fix spec contradiction and missing reference label (json_grammar → json_spec).The stored-field definition wrongly says “non‑zero digits,” which contradicts the examples and earlier definition (the count includes zeros after the first non‑zero). Also, the link label
json_grammar
is undefined; usejson_spec
.Apply:
-Per the [JSON specification][json_grammar], the integer part of a floating-point number cannot be -empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field -as **actual number of non-zero digits to end of number - 1**; for the numeric value zero we store -**actual number of digits - 1**. +Per the [JSON specification][json_spec], the integer part of a floating‑point number cannot be +empty, so the minimum number of digits is **1**. We store this field as +**(count of digits from the first non‑zero digit to the end of the significand) − 1**; for the +numeric value zero, we store **(count of digits in the significand) − 1**.
83-86
: Surround list with blank lines (MD032).Insert a blank line before the list to satisfy markdownlint.
-Overall the implications here are that: +Overall the implications here are that: + - For any given number there are many possible representations (infinitely many in fact); and - Not all floating-point numbers that are valid JSON correspond to values from a standard like IEEE-754
56-60
: Surround list with blank lines (MD032).Add blank lines before and after the two-item list.
-For example: -- `0.160e2`; and -- `0.00000000160e10` +For example: + +- `0.160e2`; and +- `0.00000000160e10` +
142-152
: Consistency: “cannot” and “zeros”.Use “cannot” (single word) and prefer “zeros” for consistency.
- - Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero + - Or at most 1 zero before the decimal and 16 zeros after the decimal, if the number is a zero @@ - - Where zero can not be the digit before the decimal, unless every digit in the number is zero + - Where zero cannot be the digit before the decimal, unless every digit in the number is zero
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/dev-docs/design-retain-float-format.md
(1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-docs/design-retain-float-format.md
84-84: Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
213-213: Reference links and images should use a label that is defined
Missing link or image reference definition: "jsongrammar"
(MD052, reference-links-images)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (macos-15)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's looking really good! easy to navigate, and informative without being overwhelming.
a number of comments to address; let me know if you need clarification.
high-level nit: I think that unordered lists should use *
as the bullet instead of -
; I didn't bother commenting every time, as everything renders correctly.
Co-authored-by: Quinn Taylor Mitchell <[email protected]>
Co-authored-by: Quinn Taylor Mitchell <[email protected]>
Co-authored-by: Quinn Taylor Mitchell <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/dev-docs/design-retain-float-format.md
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
docs/src/dev-docs/design-retain-float-format.md (7)
42-44
: Avoid double negative; streamline phrasing
Prior feedback noted this; the older wording appears to have resurfaced.-What it doesn't do is place any restrictions on how a given floating-point number should be written, -or whether floating-point numbers have to correspond to values from a standard such as IEEE-754 -binary64. +It does not restrict how a floating‑point number is written, nor require that numbers correspond to +values from a standard such as IEEE‑754 binary64.
191-194
: Inline code and “respectively” phrasing regressionEarlier fix removed “respectively” from the code spans; please restore the improved wording.
-Since the maximum and minimum decimal exponents for a double (`308` and `-324` respectively) are -both three digits, two bits are enough to represent the digit count. We allow up to 4 digits to -support exponents left-padded with `0`. +Since the maximum and minimum decimal exponents for a double (`308` and `-324`) are +both three digits, two bits are enough to represent the digit count. We allow up to four digits to +support left‑padded exponents (e.g., `+0003`).
213-216
: Field definition contradicts examples; remove “non‑zero digits” wordingThis should count all digits from the first non‑zero through the end of the significand; zero is a special case.
-Per the [JSON specification][json_spec], the integer part of a floating-point number cannot be -empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field -as **actual number of non-zero digits to end of number - 1**; for the numeric value zero we store -**actual number of digits - 1**. +Per the [JSON specification][json_spec], the integer part of a floating‑point number cannot be +empty, so the minimum number of digits is **1**. To leverage this, we store this field as +**(count of digits from the first non‑zero digit to the end of the significand) − 1**. For the +numeric value zero, count all zeros in the significand and store **(that count − 1)**.
1-1
: Shorten H1 for sidebar/toctree readabilityThe current title is long and wraps in the nav. Prefer a concise H1.
-# What does it take to losslessly retain JSON floating-point numbers? +# Lossless JSON float storage
141-144
: Tighten non‑scientific acceptance rule; fix “zeroes” → “zeros”Make the 17‑significant‑digit cap explicit and use consistent spelling.
-- Any number that has at most 16 digits after the first non-zero digit -- Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero +- Any number with at most 16 digits following the first non‑zero digit (i.e., up to 17 significant digits total) +- A zero written with at most 1 zero before the decimal and up to 16 zeros after the decimal
145-151
: Unify list marker style (MD004) and minor phrasingNested items currently use “*”. Use “-” to match configured style; small grammar tweak.
- 1. The significand is either - * a **single** digit with no decimal point, followed by an exponent - * **one** digit before the decimal point, and up to **16** digits after the decimal point, - followed by an exponent + 1. The significand is either: + - a **single** digit with no decimal point, followed by an exponent + - **one** digit before the decimal point and up to **16** digits after the decimal point, + followed by an exponent
73-81
: Drop ambiguous claim; keep the demonstrative round‑trip example (verified)Verified (Python):
1.2345678901234568
→1.2345678901234567
.-we know that only the first and third number correspond to IEEE-754 binary64 floating-point numbers. -The reason we can tell is that the [IEEE-754 specification][ieee754] requires that when converting a +We can illustrate the distinction via round‑trip behaviour. The [IEEE‑754 specification][ieee754] +requires that when converting a floating-point number to a decimal string at a given precision, the decimal string must correspond to the nearest decimal representation. Likewise, when converting a decimal string to a floating-point number, the standard requires that the number be converted to the nearest floating-point representation. If you use any standards-compliant implementation to turn `1.2345678901234568` into a binary64 floating-point number, and back to a decimal string, you will find that it has been rounded to `1.2345678901234567`.
- The double value in IEEE-754 binary64 format. | ||
- A 2-byte little-endian _format_ field encoding the necessary output formatting information so | ||
that, upon decompression, the value can be decompressed exactly to the original text. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Wording: “reconstructed exactly” vs “decompressed exactly”
We’re reconstructing the original text; suggest clearer terminology.
-- A 2-byte little-endian _format_ field encoding the necessary output formatting information so
- that, upon decompression, the value can be decompressed exactly to the original text.
+- A 2‑byte little‑endian _format_ field encoding the necessary output formatting information so
+ that, upon decompression, the original text can be reconstructed exactly.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- The double value in IEEE-754 binary64 format. | |
- A 2-byte little-endian _format_ field encoding the necessary output formatting information so | |
that, upon decompression, the value can be decompressed exactly to the original text. | |
- The double value in IEEE-754 binary64 format. | |
- A 2‑byte little‑endian _format_ field encoding the necessary output formatting information so | |
that, upon decompression, the original text can be reconstructed exactly. |
🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 124 to 127, the
wording uses "decompressed exactly" which is ambiguous; replace it with
"reconstructed exactly" (or "reconstructed to the original text") for clarity,
updating the sentence to say the 2-byte little-endian format field encodes
output formatting info so that, upon decompression, the value can be
reconstructed exactly to the original text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/dev-docs/design-retain-float-format.md
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (5)
docs/src/dev-docs/design-retain-float-format.md (5)
6-41
: JSON number grammar section reads correctlyGood summary and constraints; matches RFC 8259.
126-127
: Wording: “decompressed exactly” → “reconstructed exactly”We’re reconstructing original text, not “decompressing” it.
-- A 2-byte little-endian _format_ field encoding the necessary output formatting information so - that, upon decompression, the value can be decompressed exactly to the original text. +- A 2-byte little-endian _format_ field encoding the necessary output formatting information so + that, upon decompression, the original text can be reconstructed exactly.
171-177
: Reserve codepoint semantics: make normative (encoders MUST NOT emit10
; decoders MUST reject)Interoperability needs explicit write/read rules for reserved values.
`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition -checks cleaner. +checks cleaner. Encoders MUST NOT emit `10`; decoders MUST treat `10` as invalid input and +fail parsing (or handle per a documented error policy).
186-190
: Normative rules for unused/irrelevant subfieldsSpecify canonical write-zero for irrelevant fields and reject unused codes.
-`11` is unused by the current implementation. +`11` is unused by the current implementation. Encoders MUST NOT emit `11`; decoders MUST +treat it as invalid input. + +When the [Scientific notation marker] is `00` (not scientific), encoders MUST set both +[Exponent sign] and [Exponent digits] to `00`. Decoders MUST ignore those two fields when +the marker is `00`.
142-146
: Tighten wording to match the digit-count definitionMirror the “span from first non‑zero through last digit” phrasing used later.
-- Numbers not written in scientific notation are accepted if **either** of the following are true: - 1. Any non-zero number with at most **17** digits starting at the first non-zero digit +- Numbers not written in scientific notation are accepted if **either** of the following are true: + 1. Any non-zero number with at most **17** digits spanning from the first non-zero digit + through the last digit of the integer or fractional part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other than the comment we discussed offline, lgtm!
Co-authored-by: Quinn Taylor Mitchell <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/dev-docs/design-retain-float-format.md
(1 hunks)
🔇 Additional comments (5)
docs/src/dev-docs/design-retain-float-format.md (5)
121-123
: Use “reconstructed exactly”; avoid redundant “decompressed”.The current sentence repeats “decompressed” and is less precise than “reconstructed”.
Apply this diff:
-– A 2-byte little-endian _format_ field encoding the necessary output formatting information so - that, upon decompression, the value can be decompressed exactly to the original text. +– A 2‑byte little‑endian _format_ field encoding the necessary output formatting information so + that, upon decompression, the original text can be reconstructed exactly.
166-185
: Make reserved/irrelevant codepoints normative to prevent interop drift.State MUST/MUST NOT for writers and readers, and define canonical values when not scientific.
Apply this diff:
### Scientific notation marker @@ -`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition -checks cleaner. +`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition +checks cleaner. Encoders MUST NOT emit `10`; decoders MUST reject it as invalid input. + +When the marker is `00` (not scientific), encoders MUST set both “Exponent sign” and +“Exponent digits” to `00`. Decoders MUST ignore those two fields when the marker is `00`. ### Exponent sign @@ -`11` is unused by the current implementation. +`11` is unused by the current implementation. Encoders MUST NOT emit `11`; decoders SHOULD +reject it.
156-160
: Tighten prose; fix comma splice; prefer “±Infinity”.Improves readability and consistency with earlier usage.
Apply this diff:
-These restrictions really correspond to "canonical" representations of floating-point numbers with -up to 17 digits of precision. This means that our formatting scheme can always represent numbers -produced by format specifiers such as `%f`, `%e`, and `%g`, so long as they don't use too many -digits of precision, and the underlying number isn't NaN, or +/- Infinity. +These restrictions correspond to “canonical” representations with up to 17 digits of precision. +Our formatting scheme can represent numbers produced by `%f`, `%e`, and `%g`, provided the +precision is within the supported limit and the value is neither NaN nor ±Infinity.
127-134
: Add a compact bit‑layout figure (MSB..LSB across two LE bytes).A visual removes residual ambiguity about bit indices vs byte order on the wire.
Apply this diff right after the list:
From MSB to LSB, the 2-byte format field contains the following sections: @@ - Reserved for future use (5 bits) +``` +Bit index (field wide, MSB→LSB): 15 0 + +----+----+----+-----------+---------------+ + | SN | ES | ED | DIGITS | RESERVED | + +----+----+----+-----------+---------------+ + 15-14 13-12 11-10 9-5 4-0 + +Wire order: the 16-bit field is encoded little‑endian (least‑significant byte first). +```
30-30
: Consistency: use “±Infinity”.Matches the symbol used elsewhere and tightens wording.
Apply this diff:
-- NaN and +/- Infinity are not allowed +- NaN and ±Infinity are not allowed
Since our goal is to losslessly retain floating-point numbers that come from JSON input, it is worth | ||
taking a look at what kinds of floating-point numbers can appear in JSON. | ||
|
||
The [JSON specification][json_spec] treats fields that match the following grammar as number values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Optionally cite ECMA‑404 alongside RFC 8259.
Adds the concise syntax reference many implementers expect.
Apply this diff:
-The [JSON specification][json_spec] treats fields that match the following grammar as number values:
+The [JSON specification][json_spec] (see also [ECMA‑404][ecma404]) treats fields that match the
+following grammar as number values:
And append this reference:
[json_spec]: https://datatracker.ietf.org/doc/html/rfc8259
[ieee754]: https://ieeexplore.ieee.org/document/4610935/
+[ecma404]: https://www.ecma-international.org/publications-and-standards/standards/ecma-404/
Also applies to: 225-227
🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 6 (and also
apply the same change to lines 225-227), the JSON spec reference is incomplete;
add an additional citation to ECMA-404 alongside RFC 8259 by appending a brief
parenthetical or bracketed reference to ECMA-404 after the existing RFC 8259
mention so the sentence reads as referencing both RFC 8259 and ECMA-404;
replicate the same edit at lines 225-227 to ensure both places cite ECMA-404.
This counts the digits from the first non-zero digit up to the last digit of the integer or | ||
fractional part (excluding the exponent). Examples: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Explicitly state the decimal point is excluded from the count.
Removes a small ambiguity in implementer interpretation.
Apply this diff:
-This counts the digits from the first non-zero digit up to the last digit of the integer or
-fractional part (excluding the exponent). Examples:
+This counts the digits from the first non-zero digit up to the last digit of the integer or
+fractional part (excluding the decimal point and the exponent). Examples:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
This counts the digits from the first non-zero digit up to the last digit of the integer or | |
fractional part (excluding the exponent). Examples: | |
This counts the digits from the first non-zero digit up to the last digit of the integer or | |
fractional part (excluding the decimal point and the exponent). Examples: |
🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 202 to 204, the
description of how digits are counted is ambiguous about whether the decimal
point is counted; update the sentence to explicitly state that the decimal point
is excluded from the count. Modify the wording so it clearly reads that digits
are counted from the first non-zero digit to the last digit of the integer or
fractional part, excluding both the exponent and the decimal point, and add a
short clarifying example if desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-docs/core-clp-s.md
(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.
Applied to files:
docs/src/user-docs/core-clp-s.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
docs/src/user-docs/core-clp-s.md (2)
157-160
: KV‑IR limitation: scope at decompression; align terminology (“floating point”)Make it explicit this applies at decompression for archives built from KV‑IR inputs, and use “floating point” (no hyphen) per project preference.
Apply this diff:
-* For floating-point numbers: - * KV-IR inputs currently don't support preserving the original printed float formats. - * Comparisons against floating point numbers at query time treat each stored number as if it were - the nearest representable double-precision value, which can potentially lose precision. +* For floating point numbers: + * For archives built from KV‑IR inputs, decompression does not currently preserve the original + printed formats of floating point numbers. + * Comparisons against floating point numbers at query time treat each stored number as the nearest + representable double-precision value; precision may be lost.
27-28
: Verify default and consistency across docs/CLI helpdocs/src/user-docs/core-clp-s.md documents
--no-retain-float-format
and there’s a design doc (docs/src/dev-docs/design-retain-float-format.md); searches found no CLI/arg definition for this flag. Confirm whether retaining float-format is the default and update CLI help and all docs to use the same flag name and explicitly state the default.
Description
This PR splits off the documentation changes from #1176 to slightly ease the review burden. These docs changes comprise:
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Limitations
Documentation