-
Notifications
You must be signed in to change notification settings - Fork 658
Structured cbor #3036
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Structured cbor #3036
Conversation
Full disclosure: This PR incorporates code from a draft generated by Junie (albeit an impressive draft that saved a day of work). This is not a dumb copypasta of AI-generated code. Even if it were already feature-complete It would still not yet be marked ready for review because we have yet to review everything internally. I also want to stress that "we" is not a euphemism. There will be at least two of us reviewing and discussing internally, almost certainly with additional input from other humans in the process of readying this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the code at a general level. There is a lot of repetition, so I've only commented on the first case, not every one. I guess some of the AI generation is visible in the details.
formats/cbor/commonMain/src/kotlinx/serialization/cbor/CborElement.kt
Outdated
Show resolved
Hide resolved
formats/cbor/commonMain/src/kotlinx/serialization/cbor/CborElement.kt
Outdated
Show resolved
Hide resolved
formats/cbor/commonMain/src/kotlinx/serialization/cbor/CborElement.kt
Outdated
Show resolved
Hide resolved
formats/cbor/commonMain/src/kotlinx/serialization/cbor/internal/CborElementSerializers.kt
Show resolved
Hide resolved
formats/cbor/commonMain/src/kotlinx/serialization/cbor/internal/CborElementSerializers.kt
Outdated
Show resolved
Hide resolved
formats/cbor/commonMain/src/kotlinx/serialization/cbor/internal/CborElementSerializers.kt
Show resolved
Hide resolved
674e7ef
to
aea3f4b
Compare
cd963c7
to
c89fd46
Compare
Performance seems to be OK (
My hot takes:
|
I just noticed something that looks weird to me. See this test case here that is failing and closely compare expected vs actual. the byte string is wrapped twice for the reference. |
7cdbe4d
to
9b3f0e5
Compare
e3c207b
to
8d49f2c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round of review from my side :)
Mostly reviewed API surface, will take a look on the implementation details later
} | ||
} | ||
@Serializable(with = CborIntSerializer::class) | ||
public sealed class CborInt<T : Any>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to RFC 8949, integers are:
- type 0 -
0..2^64-1
- type 1 -
-2^64..-1
This means that only ULong
can really fit both positive and negative values.
So I believe that it will be more correct to have a single CborInt
implementation, with an API similar to:
public class CborInt(sign: Int, absoluteValue: ULong, tags) {
constructor(value: Long)
fun toLong(): Long
fun toLongOrNull(): Long?
}
where:
- sign is: -1, 0, 1
- toLong - will throw if can't fit - probably a very rare case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I agree. The constructors with the current signature (simply passing a number) should stay for convenience.
* traditional methods like [List.get] or [List.size] to obtain CBOR elements. | ||
*/ | ||
@Serializable(with = CborListSerializer::class) | ||
public class CborList( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to call it CborArray
, as spec says, it's array
, and Json format also calls it JsonArray
because it's called array
in spec :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, there is an annotation by the very same name ins the same package. Now moving the cbor elements is easily done, but it will probably not be very intuitive (thinking of autocompletion here).
In any case: Good point but not my call!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, how did I miss that :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is on my anyhow because git blame says "Prünster" for the whole file containing the @CborArray
annotation. So my short-sighted naming is to blame.
* See [RFC 8949 3.4. Tagging of Items](https://datatracker.ietf.org/doc/html/rfc8949#name-tagging-of-items). | ||
*/ | ||
@OptIn(ExperimentalUnsignedTypes::class) | ||
public var tags: ULongArray = tags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should make the implementation in a way, that it will be not mutable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make it so that there is a val with a custom getter that delegates to the actual field, but the tags should still be simply written to a CborElement as they are collected upon deserialisation, to avoid serialization overhead. Actually, the elements
should probably be mutable too internally so the whole container structure that now collects tags and elements for Cbor structures can go away.
In the end this would mean that we have:
- an
internal var
making it possible to collect tags during deserialisation, but apublic val tags
with a custom getter and no backing field - an
internal val elements
for array and map so we can also collect elements as we go and get rid of another instantiation, but apublic val elements: List<CborElement>
This should boost performance for deserialisation significantly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, this is related to #3036 (comment), but it would also be nice to compare it to how JSON works with lists. Maybe there is some pattern there that could be used inside the CBOR implementation.
* This method is allowed to invoke only as the part of the whole deserialization process of the class, | ||
* calling this method after invoking [beginStructure] or any `decode*` method will lead to unspecified behaviour. | ||
*/ | ||
public fun decodeCborElement(): CborElement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that there should also be CborEncoder.encoderCborElement
as in Json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That just delegates to encodeToByteArray<CborElement>
, right?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most probably to encodeSerializableValue(CborElementSerializer, element)
, but yes, just for symmetry
) | ||
|
||
/*need to expose writer to access encodeTag()*/ | ||
internal fun Encoder.asCborEncoder() = this as? CborWriter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think that we should probably make it possible to cast only to CborEncoder
here so that we use the same public API as the user.
As far as I see, the only thing needed is encodeTags
—so is this a sign that encodeTags
might be exposed on the CborEncoder
interface?
I'm not sure, though, how useful it could be in reality, but it seems like it should be possible to encode tags manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are probably on to something here. I should investigate tag handling some more
@Serializable(with = CborPositiveIntSerializer::class) | ||
public class CborPositiveInt( | ||
value: ULong, | ||
vararg tags: ULong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This vararg
feels very unfortunate. For example, currently, it's possible to call CborPositiveInt(1u, 2u, 3u)
, and it's really hard to tell what's going on here.
UlongArray
will probably be better
Overall, it's applied to all other declarations; e.g., CborString("1", 1, 2, 3)
doesn't feel better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the first part, but the second part is simply in line with how the tagging annotations behave and it is convenient for the user (much more so that ulongArrrayOf()
.
Yet, I am aware that it should be consistent, so making ints work differently from everything else is also not really nice… In the end, don't really have an opinion. Sort it out internally, it's a straight-forward refactor anyways.
What would help in such cases is a language feature that forces parameter name specification at the call site, but that is not going to happen anytime soon, if ever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case, for me additional tags
field looks very similar to how annotations are applied currently.
Additionally, with Collection Literals it will look like CborString("hello", [1u, 2u, 3u])
or with Named-only parameters it could even be forced to call it like CborString("hello", tags = [1u, 2u, 3u])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those two features would help, but in the meantime, we need something else for a solution
* The whole hierarchy is [serializable][Serializable] only by [Cbor] format. | ||
*/ | ||
@Serializable(with = CborElementSerializer::class) | ||
public sealed class CborElement( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about the alternative design:
sealed class CborElement {
abstract val tags: ULongArray
abstract fun withTags(tags: ULongArray): CborElement
}
// and all other
class CborString(value: String): CborElement() {
override fun withTags(tags: ULongArray): CborString
}
Or, even having CborElement
have no tags at all, but have a wrapper element CborTagged
?
sealed class CborElement
// and all other
class CborString(value: String): CborElement()
class CborTagged(element: CborElement, tags: ULongArray): CborElement
Both variants may apply some performance penalty, though.
It's just that, reading the spec and current implementation of CborElementSerializers
, the current placement of tags seems unfortunate and does not result in straightforward logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first one solves the issues nicely and it would work well with the internally mutable tags array s.t. it would avoid instantiation overhead just for modifying tags. BUT, that would be a footgun for the consumer.
I have thought about CborTagged, also because of the asymmetric serialization/deserialization behaviour, which really is not all that nice, but also very niche (why would anyone in their right mind do this??)
When I started with the implementation, I was imagining how the most straight-forward and usable way would look like: Tags are attached to something and have a hierarchy. An array is enough to represent this and makes for a nice API that mirrors the pattern of @ValueTags
, @ObjectTags
, and @KeyTags
. Now you could get rid of that pesky asymmetry, if you disallow those annotations on CborElement
. Then I'd assume we'd be back to straightforward logic again?!
Bottom line: there are issues that need sorting out. The inconsistency issue is documented in the readme, so I was not trying to cover anything up, but I wanted unbiased opinions and fresh ideas because I could not get to a satisfactory solution by myself. It worked once already. I just hope Leonid checks the code first, and the comments later ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, all new declarations/APIs should probably have an @ExperimentalSerializationApi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is OK to not annotate them all, since the whole CBOR part is formally experimental. Although people probably don't perceive it so without annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, it will just feel a bit odd if those are not annotated, while all other declarations in CBOR are annotated.
I would treat that as if the CborElement
hierarchy is stable (or stabilized), but other parts are not(yet).
fixes #2975
This PR introduces structured CBOR encoding and decoding
Encoding from/to
CborElement
Bytes can be decoded into an instance of
CborElement
with the [Cbor.decodeFromByteArray] function by either manuallyspecifying [CborElement.serializer()] or specifying [CborElement] as generic type parameter.
It is also possible to encode arbitrary serializable structures to a
CborElement
through [Cbor.encodeToCborElement].Since these operations use the same code paths as regular serialization (but with specialized serializers), the config flags
behave as expected
Newly introduced CBOR-specific structures
[CborPrimitive] represents primitive CBOR elements, such as string, integer, float boolean, and null.
CBOR byte strings are also treated as primitives
Each primitive has a [value][CborPrimitive.value]. Depending on the concrete type of the primitive, it maps
to corresponding Kotlin Types such as
String
,Int
,Double
, etc.Note that Cbor discriminates between positive ("unsigned") and negative ("signed") integers!
CborPrimitive
is itself an umbrella type (a sealed class) for the following concrete primitives:null
Boolean
(it is still possible to instantiate it as the
invoke
operator on its companion is overridden accordingly):Long
numbers≥0
Long
numbers<0
String
Double
ByteArray
and is used to encode them as CBOR byte string (in contrast to a listof individual bytes)
[CborList] represents a CBOR array. It is a Kotlin [List] of
CborElement
items.[CborMap] represents a CBOR map/object. It is a Kotlin [Map] from
CborElement
keys toCborElement
values.This is typically the result of serializing an arbitrary
Example
Decoding it results in the following CborElement (shown in manually formatted diagnostic notation):
Implementation Details
I tried to stick to the existing CBOR codepaths as closely as possible, and the approach to add tags directly to CborElements is the most pragmatic way of getting expressiveness and convenient use. It does come with a caveat (also taken from the Readme:
Tags are properties of
CborElements
, and it is possible to mixing arbitrary serializable values withCborElement
s thatcontain tags inside a serializable structure. It is also possible to annotate any [CborElement] property
of a generic serializable class with
@ValueTags
.This can lead to asymmetric behavior when serializing and deserializing such structures!
The test cases (and comments in the test cases reflect this
Closing Remarks
I also fixed a faulty hex input test vector that I introduced myself, last year, if I pieced it together correctly (see here) and I amended the benchmarks. (see here).
Since the commits from here will be squashed anyways, I did not care for a clean history.