[RFC] Switch to JSON #3
Comments
Fwiw I'm not opposed to the idea; while the current format is really easy to write a parser for, JSON would certainly be more accessible for people/projects that don't want to write a parser in the first place. I haven't looked deeply into the name escaping issue yet, but I'll take your word for it that it's a potential source of problems. I'm interested to see what you come up with.
JSON has a surprisingly complicated spec, with lots of extensions that different libraries implement differently (or not strictly according to the spec). Perhaps something slightly more limited and more standardized would be better? (Ideally one would also use a formally verified parser; after searching for a bit, I'm still not sure what the status of that is for JSON.) Perhaps a subset of JSON would work? (One would specify it separately for the purposes of lean4export.)
I'm not sure what you are referring to; JSON is an exceedingly simple format. The only tricky bits I'm aware of in JSON concern the representation of large integers and floating-point numbers, but these are unlikely to be issues here: we don't need floats at all, and numbers only get that large when representing Lean bignums, and string-encoding those solves the problem completely.
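To illustrate the string-encoding workaround mentioned above, here is a minimal sketch in Python. The field names (`"kind"`, `"value"`) are purely illustrative, not part of any actual lean4export schema; the point is only that a bignum carried as a JSON string survives any parser, while a bare JSON number may be rounded by double-based implementations.

```python
import json

# A Lean natural-number literal far beyond the 2**53 range in which
# IEEE-754 doubles can represent every integer exactly.
big = 10**40

# Hypothetical record shape: carry the bignum as a string, not a number.
record = json.dumps({"kind": "natLit", "value": str(big)})

# Round-tripping through the string representation is exact:
decoded = int(json.loads(record)["value"])
assert decoded == big

# Whereas a parser that stores numbers as doubles would see a rounded value:
assert int(float(big)) != big
```

This sidesteps the one genuinely underspecified corner of JSON number handling without requiring anything of downstream parsers beyond standard string support.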
All I'm saying is that there isn't an ironclad standard for how to parse JSON, and many (even "standard") parser implementations differ. Some examples: here. I thought it might be important for this export format to have an ironclad standard. Am I wrong? I guess it's also reasonable to say something like: "let the parser fail on exported valid Lean proofs in extremely rare cases (e.g. if they contain numbers that are too big), let it crash on malformed input, let it output whatever it wants; the kernel will check everything in the end, and that's all that matters".
This is not true. What that page shows is that multiple documents specify JSON, and parsers sometimes accept more or less than the spec due to implementation limits or "extensions". I don't see why any of this matters; I think you should be more specific. As for having an ironclad standard, I don't see anything here that prevents one. But this is a proof format, which means it actually doesn't matter if there are edge cases that are interpreted oddly, because proof checkers are allowed to spuriously fail for implementation-limit reasons, or even just for not liking the shape of the proof. That's a quality-of-implementation issue, not a correctness issue.
No, you are not wrong. I don't think anyone wants to allow spurious errors to be introduced by a poorly designed serialization format. With this shared goal in mind, I maintain that imperfect serialization formats, such as JSON, are fine for all pragmatic purposes, provided their shortcomings are dealt with by the serializer, in this case lean4export.

Moreover, "ironclad" serialization formats are rarely truly ironclad. Even the most strictly specified, such as BER/DER, widely used in public key cryptography systems, are replete with deliberately non-conformant implementations. Example: RustCrypto/formats#779.

Finally, the advantages of an export format that can be understood by practically every programming language far outweigh the downsides. Sure, using JSON makes ambiguity and/or invalidity possible, though unlikely. The principal advantage of JSON (or TOML, my preferred format) is ease of implementation. I am optimistic that making it easy to get started will result in more implementations, which will result in more scrutiny of the proof terms constructed by Lean. A more diverse set of kernels will necessarily result in more scrutiny of the serializer itself. By this logic, JSON is a reasonable candidate.
The concrete syntax of the export format is troublesome to parse, being a homegrown textual format with a heavy reliance on newline separators and lots of "clever" sequence encodings. Moreover, the encoding of names is unquoted, which is just plain wrong in the presence of names with escaped characters: these names can include arbitrary characters, including newlines, keywords and everything else, making this a classic injection vulnerability in the style of SQL injection.
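The injection problem above can be demonstrated in a few lines. The `#NAME` directive below is a made-up stand-in for the real export directives, purely for illustration; the point is that a raw newline inside an unquoted name corrupts a newline-delimited record, while JSON's mandatory string escaping keeps the record intact.

```python
import json

# A name containing a newline, as Lean's escaped «...» identifiers permit.
name = "weird\nname"

# In a line-based record the raw newline splits one entry into two bogus lines
# ("#NAME" is a hypothetical directive, not lean4export's actual syntax):
line_record = f"#NAME {name}"
assert len(line_record.splitlines()) == 2  # one record became two lines

# JSON escapes the newline inside the string, so the record stays one line
# and round-trips losslessly:
json_record = json.dumps({"name": name})
assert "\\n" in json_record
assert json.loads(json_record)["name"] == name
```

Any format with quoted, escaped strings fixes this; JSON simply gets it for free in every implementation.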
I propose we drop this ad hoc encoding entirely and switch to a JSON-based format. This is much easier to get right, and libraries for parsing it are plentiful (though writing a parser directly is also feasible).
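As a rough sketch of what a JSON-based record might look like (every field name here is hypothetical, not a concrete schema proposal; bignums would be string-encoded as discussed above):

```json
{
  "kind": "definition",
  "name": ["Nat", "myLemma"],
  "levelParams": ["u"],
  "type": 42,
  "value": 43
}
```

Names become arrays of escaped string components, and `type`/`value` reference previously emitted expression indices, mirroring the indexing scheme the current format already uses.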