PoC (Query): `RowBinaryWithNamesAndTypes` for enhanced type safety #221

slvrtrn · 2025-05-20T10:51:16Z

Summary

Warning

This is a work-in-progress implementation and may change significantly.
It implements RBWNAT for Query only; Insert should be handled in a separate PR.

First of all, let's abbreviate RowBinaryWithNamesAndTypes format as RBWNAT, and the regular RowBinary as just RB for simplicity.

A significant number of issues in this repository concern schema incompatibility or obscure error messages (full list TBD). The reason is that deserialization is effectively implemented in a "data-driven" way: the user's struct definitions dictate how the RB stream is (de)serialized. As a result, two UInt32 values can be deserialized as a single UInt64, which in the worst case leads to corrupted data. For example:

On the main branch, this test deserializes a wrong value, because DateTime64 is streamed as 8 bytes (Int64), and 2x(U)Int32 are also streamed as 8 bytes in total. On this branch, with validation mode enabled, it correctly returns an error.

#[tokio::test]
#[cfg(feature = "time")]
async fn test_serde_with() {
    #[derive(Debug, Row, Serialize, Deserialize, PartialEq)]
    struct Data {
        #[serde(with = "clickhouse::serde::time::datetime64::millis")]
        n1: OffsetDateTime, // underlying is still Int64; should not compose it from two (U)Int32
    }

    let client = prepare_database!().with_struct_validation_mode(StructValidationMode::EachRow);
    let result = client
        .query("SELECT 42 :: UInt32 AS n1, 144 :: Int32 AS n2")
        .fetch_one::<Data>()
        .await;

    assert!(result.is_err());
    assert!(matches!(
        result.unwrap_err(),
        Error::InvalidColumnDataType { .. }
    ));
}
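To make the failure mode concrete, here is a minimal standalone sketch (std only, not code from this PR) of how the eight bytes of a UInt32 followed by an Int32 can be reinterpreted as a single Int64 when the reader trusts the struct definition rather than the column types. It relies on RowBinary encoding fixed-width integers as little-endian; the function names `encode_row` and `read_as_int64` are hypothetical.

```rust
/// Encodes one row of `SELECT 42 :: UInt32, 144 :: Int32` the way RowBinary
/// lays out fixed-width integers: little-endian, back to back, no delimiters.
/// (Hypothetical helper, for illustration only.)
fn encode_row(n1: u32, n2: i32) -> Vec<u8> {
    let mut row = Vec::with_capacity(8);
    row.extend_from_slice(&n1.to_le_bytes());
    row.extend_from_slice(&n2.to_le_bytes());
    row
}

/// Reads the first 8 bytes as a single Int64, which is roughly what a
/// struct-driven reader does when the field is declared as DateTime64.
/// Returns `None` if fewer than 8 bytes are available.
fn read_as_int64(row: &[u8]) -> Option<i64> {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(row.get(..8)?);
    Some(i64::from_le_bytes(bytes))
}
```

Calling `read_as_int64(&encode_row(42, 144))` succeeds and yields `42 + 144 * 2^32`, a plausible-looking but meaningless timestamp value, which is the silent corruption the validation modes are meant to catch.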

This PR introduces:

- Optional RBWNAT format usage instead of RB, which allows for stronger type safety guarantees. This is controlled by the StructValidationMode client option, which has three possible modes:
  - Disabled - like the current main branch; uses RB without validation.
  - FirstRow - new mode; uses RBWNAT and checks the types for the first row only, so it retains most of the performance of the Disabled mode while still providing significantly stronger guarantees.
  - EachRow - every single row is validated. It is the slowest of the three.
- New rowbinary internal crate that contains utils for parsing RBWNAT and Native data type strings into a proper AST. Partially ported to Rust from https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-common/src/parse/column_types.ts. The most important parts are the correctness and the tests; the implementation details can be adjusted in a follow-up.
- The ability to conveniently deserialize a map as a HashMap<K, V>, and not only as Vec<(K, V)>, which was confusing.
- Clearer error messages when using RBWNAT.
- A lot of tests, with more to come, especially for difficult corner cases (nested nullable, multi-dimensional mixed arrays/maps/tuples/enum, etc).
- Benchmarks to highlight the difference in performance between validation modes are WIP and will be added to this PR.
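The relationship between the three modes, the wire format, and the rows that get checked can be sketched as follows. This is a simplified stand-in, not the PR's implementation: the local `ValidationMode` enum only mirrors the variant names listed above, and `format_name` and `should_validate` are hypothetical helpers.

```rust
/// Local stand-in that mirrors the variant names of the client option
/// described above; it is not the crate's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ValidationMode {
    Disabled,
    FirstRow,
    EachRow,
}

/// ClickHouse format to request: only the validating modes need the header
/// with column names and types.
fn format_name(mode: ValidationMode) -> &'static str {
    match mode {
        ValidationMode::Disabled => "RowBinary",
        ValidationMode::FirstRow | ValidationMode::EachRow => "RowBinaryWithNamesAndTypes",
    }
}

/// Whether the row at `row_index` (0-based) should be checked against the header.
fn should_validate(mode: ValidationMode, row_index: u64) -> bool {
    match mode {
        ValidationMode::Disabled => false,
        ValidationMode::FirstRow => row_index == 0,
        ValidationMode::EachRow => true,
    }
}
```

Under this model, FirstRow pays the validation cost once per query, which is why it is expected to stay close to Disabled in throughput.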

Likely possible to implement:

- Stricter enum value validation, e.g. checking that the deserialized Int repr matches exactly.
- Support for "shuffled" struct definitions, where the order of the fields does not match the DB, but the names and types are correct. It should be possible by leveraging (perhaps optionally) the visit_map API, which is allowed for deserialize_struct, instead of the current visit_seq, which processes a struct as a tuple.
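One way to picture the "shuffled" case: since the RBWNAT header carries the column names, the stream order can be matched to the struct fields by name before any value is read. The sketch below is an assumption about how such a lookup could work, not something this PR implements; `field_order` is a hypothetical helper and it does not handle duplicate column names.

```rust
/// For each DB column (in stream order), returns the index of the struct
/// field with the same name. Fails if the counts differ or a column has no
/// matching field. Hypothetical helper, for illustration only.
fn field_order(struct_fields: &[&str], db_columns: &[&str]) -> Result<Vec<usize>, String> {
    if struct_fields.len() != db_columns.len() {
        return Err(format!(
            "struct has {} fields, but the header has {} columns",
            struct_fields.len(),
            db_columns.len()
        ));
    }
    db_columns
        .iter()
        .map(|column| {
            struct_fields
                .iter()
                .position(|field| field == column)
                .ok_or_else(|| format!("column `{}` has no matching struct field", column))
        })
        .collect()
}
```

For a struct declared as `a, b, c` and a query returning `c, a, b`, the result is `[2, 0, 1]`: the first value in the stream belongs to the third field, and so on.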

Source files to look at:

- data_types.rs - RBWNAT/Native data type strings parsed into a proper AST. See the tests for more output examples, like Variant data type parsing.
- rbwnat.rs - main integration tests (WIP).
- validation.rs - validates Serde calls against the data types provided in the RBWNAT format header.
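As a rough idea of what parsing a data type string into an AST involves, here is a deliberately tiny sketch covering only a handful of types. It is not the code in data_types.rs: the `DataType` enum, `parse_type`, and both helpers are hypothetical, and real type strings (Enum values, Decimal precision, DateTime64 timezones, Tuple, Variant, etc.) need far more care.

```rust
/// Tiny illustrative AST; the real crate covers many more types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    UInt32,
    Int32,
    Int64,
    String,
    Nullable(Box<DataType>),
    Array(Box<DataType>),
    Map(Box<DataType>, Box<DataType>),
}

/// Parses strings such as `Map(String, Array(Nullable(UInt32)))`.
fn parse_type(s: &str) -> Result<DataType, String> {
    let s = s.trim();
    match s {
        "UInt32" => return Ok(DataType::UInt32),
        "Int32" => return Ok(DataType::Int32),
        "Int64" => return Ok(DataType::Int64),
        "String" => return Ok(DataType::String),
        _ => {}
    }
    if let Some(inner) = strip_wrapper(s, "Nullable") {
        return Ok(DataType::Nullable(Box::new(parse_type(inner)?)));
    }
    if let Some(inner) = strip_wrapper(s, "Array") {
        return Ok(DataType::Array(Box::new(parse_type(inner)?)));
    }
    if let Some(inner) = strip_wrapper(s, "Map") {
        let (key, value) = split_top_level_comma(inner)
            .ok_or_else(|| format!("expected two type arguments in `{}`", s))?;
        return Ok(DataType::Map(
            Box::new(parse_type(key)?),
            Box::new(parse_type(value)?),
        ));
    }
    Err(format!("unsupported type: `{}`", s))
}

/// Returns the text between `Name(` and the final `)`, if `s` has that shape.
fn strip_wrapper<'a>(s: &'a str, name: &str) -> Option<&'a str> {
    s.strip_prefix(name)?.strip_prefix('(')?.strip_suffix(')')
}

/// Splits at the first comma that is not nested inside parentheses.
fn split_top_level_comma(s: &str) -> Option<(&str, &str)> {
    let mut depth = 0usize;
    for (i, c) in s.char_indices() {
        match c {
            '(' => depth += 1,
            ')' => depth = depth.checked_sub(1)?,
            ',' if depth == 0 => return Some((&s[..i], &s[i + 1..])),
            _ => {}
        }
    }
    None
}
```

The recursive shape is the point here: wrappers such as Nullable, Array, and Map carry their inner types as child nodes, which is what lets a validator descend into nested types.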

slvrtrn

Added a few comments regarding the intermediate implementation.

slvrtrn · 2025-05-21T21:34:03Z

Cargo.toml

@@ -98,7 +98,7 @@ rustls-tls-native-roots = [

 [dependencies]
 clickhouse-derive = { version = "0.2.0", path = "derive" }
-
+clickhouse-rowbinary = { version = "*", path = "rowbinary" }


Perhaps this should be named something like clickhouse-data-types-utils, because this logic is shared across RBWNAT and Native, and the produced AST can be used in libraries such as ch2rs.

slvrtrn · 2025-05-21T21:34:19Z

Cargo.toml

@@ -139,6 +140,6 @@ serde_bytes = "0.11.4"
 serde_json = "1"
 serde_repr = "0.1.7"
 uuid = { version = "1", features = ["v4", "serde"] }
-time = { version = "0.3.17", features = ["macros", "rand"] }
+time = { version = "0.3.17", features = ["macros", "rand", "parsing"] }


for easy testing

slvrtrn · 2025-05-21T21:35:39Z

rowbinary/src/decoders.rs

+    }
+    let result = String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string();
+    Ok(result)
+}


more or less the same as the implementation in the deserializer. Perhaps as a follow-up, all the reader logic can be extracted in similar functions with #[inline(always)]?

slvrtrn · 2025-05-21T21:35:54Z

rowbinary/src/error.rs

+
+    #[error("Type parsing error: {0}")]
+    TypeParsingError(String),
+}


Needs revising.

slvrtrn · 2025-05-21T21:36:13Z

rowbinary/src/leb128.rs

+}
+
+// FIXME: do not use Vec<u8>
+pub fn encode_leb128(value: u64) -> Vec<u8> {


used for tests only
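Regarding the FIXME, one possible direction is to write into a fixed-size stack buffer, since an unsigned LEB128 encoding of a u64 never needs more than 10 bytes. The sketch below is standalone (std only) and is not the crate's code; `encode_leb128_into` and `decode_leb128` are hypothetical names.

```rust
/// Writes the unsigned LEB128 encoding of `value` into `out` and returns the
/// number of bytes used. A u64 needs at most 10 bytes (ceil(64 / 7)).
fn encode_leb128_into(mut value: u64, out: &mut [u8; 10]) -> usize {
    let mut len = 0;
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last group: the continuation bit stays clear.
            out[len] = byte;
            return len + 1;
        }
        // More groups follow: set the continuation bit.
        out[len] = byte | 0x80;
        len += 1;
    }
}

/// Decodes an unsigned LEB128 value from the start of `input`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends early or the encoding is longer than 10 bytes.
fn decode_leb128(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

The caller then uses `&buf[..len]` as the encoded bytes, so no heap allocation is needed even outside of tests.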

slvrtrn · 2025-05-21T21:49:57Z

src/rowbinary/de.rs

+            0 => visitor.visit_some(&mut RowBinaryDeserializer {
+                input: self.input,
+                validator: inner_data_type_validator,
+            }),
            1 => visitor.visit_none(),


This is currently the main drawback of the validation implementation if we want to disable it after the first N rows for better performance. If these first rows are all NULLs, then we do not properly validate the inner type.
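The drawback follows from the wire layout matched in the hunk above: a Nullable value is a one-byte flag (0 means a value follows, 1 means NULL), and a NULL carries no inner bytes at all. A minimal standalone sketch for Nullable(UInt32), with the hypothetical helper `read_nullable_u32`, shows why a NULL row never exercises the inner type:

```rust
/// Reads one Nullable(UInt32) value from the start of `input`.
/// Returns the decoded value and the number of bytes consumed.
/// Hypothetical helper, for illustration only.
fn read_nullable_u32(input: &[u8]) -> Result<(Option<u32>, usize), String> {
    match input.first().copied() {
        // NULL: only the flag byte is present, so there is nothing to
        // check the inner type against.
        Some(1) => Ok((None, 1)),
        // Non-NULL: the flag is followed by the 4 little-endian value bytes.
        Some(0) => {
            let raw = input.get(1..5).ok_or("not enough bytes for UInt32")?;
            let mut bytes = [0u8; 4];
            bytes.copy_from_slice(raw);
            Ok((Some(u32::from_le_bytes(bytes)), 5))
        }
        Some(other) => Err(format!("invalid Nullable flag: {}", other)),
        None => Err("empty input".to_string()),
    }
}
```

If every row seen while validation is active takes the first branch, the deserializer for the inner type is never invoked, so a mismatch between the struct's `Option<T>` and the column's `Nullable(...)` inner type would go unnoticed until the first non-NULL value.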

slvrtrn · 2025-05-21T21:51:50Z

src/rowbinary/de.rs

    }

    #[inline]
-    fn deserialize_map<V: Visitor<'data>>(self, _visitor: V) -> Result<V::Value> {
-        panic!("maps are unsupported, use `Vec<(A, B)>` instead");
+    fn deserialize_map<V: Visitor<'data>>(self, visitor: V) -> Result<V::Value> {


adds HashMap support with proper validation, including nested types
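For context, a Map in RowBinary is laid out like a sequence of pairs: a LEB128 entry count, then key and value alternating. That is why `Vec<(K, V)>` already worked and why a HashMap can be filled from the same bytes. The following standalone sketch (std only) decodes a `Map(String, UInt32)`; it bypasses Serde entirely, and all function names are hypothetical.

```rust
use std::collections::HashMap;

/// Reads an unsigned LEB128 integer and advances `pos`.
fn read_varint(input: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    for shift in (0..64).step_by(7) {
        let byte = *input.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
    }
    None
}

/// Reads a length-prefixed string (LEB128 length, then raw bytes).
fn read_string(input: &[u8], pos: &mut usize) -> Option<String> {
    let len = read_varint(input, pos)? as usize;
    let end = pos.checked_add(len)?;
    let bytes = input.get(*pos..end)?;
    *pos = end;
    Some(String::from_utf8_lossy(bytes).into_owned())
}

/// Reads a little-endian UInt32.
fn read_u32(input: &[u8], pos: &mut usize) -> Option<u32> {
    let end = pos.checked_add(4)?;
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(input.get(*pos..end)?);
    *pos = end;
    Some(u32::from_le_bytes(bytes))
}

/// Decodes one Map(String, UInt32) value: entry count, then key/value pairs.
fn read_map_string_u32(input: &[u8]) -> Option<HashMap<String, u32>> {
    let mut pos = 0;
    let len = read_varint(input, &mut pos)?;
    let mut map = HashMap::new();
    for _ in 0..len {
        let key = read_string(input, &mut pos)?;
        let value = read_u32(input, &mut pos)?;
        map.insert(key, value);
    }
    Some(map)
}
```

With the header available, the key and value types of each pair can additionally be checked against the `Map(K, V)` type from the column definition, which is the validation part this change adds.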

slvrtrn · 2025-05-21T21:52:59Z

src/rowbinary/de.rs

@@ -318,33 +639,3 @@ impl<'data> Deserializer<'data> for &mut RowBinaryDeserializer<'_, 'data> {
        false
    }
 }
-
-fn get_unsigned_leb128(mut buffer: impl Buf) -> Result<u64> {


Moved to utils. Perhaps it should just be moved to the rowbinary crate.

slvrtrn and others added 7 commits May 7, 2025 21:00

Draft RowBinaryWNAT/Native header parser

31d109a

Add RBWNAT header parser

3a66d7a

RBWNAT deserializer WIP

cf72759

RBWNAT deserializer - more types WIP

5a60295

RBWNAT deserializer - validation WIP

b338d88

RBWNAT deserializer - validation WIP

8ae3629

Merge branch 'main' into row-binary-header-check

acced9e

slvrtrn mentioned this pull request May 20, 2025

Consideration of Type Safety #199

Open

mshustov requested review from Copilot and loyd May 20, 2025 12:36

RBWNAT deserializer - validation, benches WIP

c20af77

slvrtrn commented May 21, 2025

slvrtrn added 4 commits May 22, 2025 17:58

RBWNAT deserializer - improve performance

c4a608e

RBWNAT deserializer - clearer error messages on panics

0d416cf

Fix clippy and build

65cb92f

Fix core::mem::size_of import

fbfbd99
