PoC (Query): RowBinaryWithNamesAndTypes for enhanced type safety #221

Draft
wants to merge 12 commits into
base: main

@slvrtrn slvrtrn commented May 20, 2025

Summary

Warning

This is a work-in-progress implementation and may change significantly.
It implements RBWNAT for Query only; Insert should be handled in a separate PR.

First of all, let's abbreviate RowBinaryWithNamesAndTypes format as RBWNAT, and the regular RowBinary as just RB for simplicity.

A significant number of issues in the repository concern schema incompatibility or obscure error messages (full list TBD). The reason is that deserialization is effectively "data-driven": the user's struct definitions dictate how the RB stream is (de)serialized. As a result, two UInt32 values may be silently deserialized as a single UInt64, which in the worst case leads to corrupted data. For example:

This test deserializes a wrong value on the main branch, because DateTime64 is streamed as 8 bytes (Int64), and 2x(U)Int32 are also streamed as 8 bytes in total. On this branch, with validation mode enabled, it correctly returns an error.

#[tokio::test]
#[cfg(feature = "time")]
async fn test_serde_with() {
    #[derive(Debug, Row, Serialize, Deserialize, PartialEq)]
    struct Data {
        #[serde(with = "clickhouse::serde::time::datetime64::millis")]
        n1: OffsetDateTime, // underlying is still Int64; should not compose it from two (U)Int32
    }

    let client = prepare_database!().with_struct_validation_mode(StructValidationMode::EachRow);
    let result = client
        .query("SELECT 42 :: UInt32 AS n1, 144 :: Int32 AS n2")
        .fetch_one::<Data>()
        .await;

    assert!(result.is_err());
    assert!(matches!(
        result.unwrap_err(),
        Error::InvalidColumnDataType { .. }
    ));
}
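
The aliasing that this test guards against can be reproduced without a server. The following is a minimal sketch, assuming only that RowBinary encodes fixed-width integers in little-endian order; the function name `misread_as_i64` is hypothetical and not part of the crate.

```rust
/// Concatenates the RowBinary encoding of a `UInt32` and an `Int32`
/// (4 little-endian bytes each) and reinterprets the 8 bytes as one `Int64`,
/// which is what an unvalidated DateTime64 field would consume.
/// Hypothetical helper for illustration only.
fn misread_as_i64(n1: u32, n2: i32) -> i64 {
    let mut row = [0u8; 8];
    row[..4].copy_from_slice(&n1.to_le_bytes());
    row[4..].copy_from_slice(&n2.to_le_bytes());
    i64::from_le_bytes(row)
}
```

For the query in the test, `misread_as_i64(42, 144)` evaluates to `42 + 144 * 2^32`, so without validation the two columns would be accepted as a single millisecond timestamp instead of being rejected.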

This PR introduces:

  • Optional use of RBWNAT instead of RB, which allows for stronger type safety guarantees. This is controlled by the StructValidationMode client option, which has three modes:
    • Disabled - same as the current main branch: uses RB without validation.
    • FirstRow - new mode: uses RBWNAT and checks the types for the first row only, so it retains most of the performance of the Disabled mode while still providing significantly stronger guarantees.
    • EachRow - validates every single row. It is the slowest of the three.
  • New internal rowbinary crate that contains utilities for parsing RBWNAT and Native data type strings into a proper AST. Ported to Rust from https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-common/src/parse/column_types.ts, though not line for line. The most important parts are the correctness and the tests; the implementation details can be adjusted in a follow-up.
  • The ability to deserialize a map directly into a HashMap<K, V>, and not only into Vec<(K, V)>, which was a confusing restriction.
  • Clearer error messages when using RBWNAT.
  • A lot of tests and more to come, especially with difficult corner cases (nested nullable, multi-dimensional mixed arrays/maps/tuples/enum, etc).
  • Benchmarks to highlight the difference in performance between validation modes are WIP and will be added to this PR.
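
To make the difference between RB and RBWNAT concrete, here is a rough sketch of reading the RBWNAT header that precedes the rows. It assumes the documented layout: a LEB128 column count N, then N length-prefixed column names, then N length-prefixed type names. The function names `read_leb128`, `read_string`, and `read_header` are hypothetical and do not mirror the code in this PR.

```rust
/// Reads an unsigned LEB128 integer and advances `input` past it.
/// Returns `None` on truncated or overlong (more than 10 bytes) input.
fn read_leb128(input: &mut &[u8]) -> Option<u64> {
    let mut value = 0u64;
    for shift in (0..64).step_by(7) {
        let (&byte, rest) = input.split_first()?;
        *input = rest;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
    }
    None
}

/// Reads a LEB128 length followed by that many UTF-8 bytes.
fn read_string(input: &mut &[u8]) -> Option<String> {
    let len = usize::try_from(read_leb128(input)?).ok()?;
    if input.len() < len {
        return None;
    }
    let (bytes, rest) = input.split_at(len);
    *input = rest;
    String::from_utf8(bytes.to_vec()).ok()
}

/// Reads the header and returns `(column name, type name)` pairs,
/// leaving `input` positioned at the first row.
fn read_header(input: &mut &[u8]) -> Option<Vec<(String, String)>> {
    let columns = usize::try_from(read_leb128(input)?).ok()?;
    let names = (0..columns)
        .map(|_| read_string(input))
        .collect::<Option<Vec<_>>>()?;
    let types = (0..columns)
        .map(|_| read_string(input))
        .collect::<Option<Vec<_>>>()?;
    Some(names.into_iter().zip(types).collect())
}
```

The type names returned here are plain strings such as `UInt32`; turning them into an AST is the job of the new crate described above and is not attempted in this sketch.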

Likely possible to implement:

  • Stricter enum values validation, e.g. that the deserialized Int repr matches exactly
  • Support for "shuffled" structure definition, where the order of the fields does not match the DB, but the names and types are correct; it should be possible by leveraging (perhaps optionally) visit_map API allowed for deserialize_struct instead of current visit_seq, which processes a struct as a tuple.
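
As an illustration of the "shuffled" structure idea, the name lookup it depends on could look roughly like the sketch below. It assumes the column names come from the RBWNAT header and the field names from the `fields` slice that serde passes to `deserialize_struct`; the function `field_order` is hypothetical and is not part of this PR.

```rust
/// For each column, in the order the server sends them, returns the index of
/// the struct field with the same name. Returns `None` if a column has no
/// matching field, if two columns map to the same field, or if the counts
/// differ. Hypothetical helper for illustration only.
fn field_order(columns: &[&str], fields: &[&str]) -> Option<Vec<usize>> {
    if columns.len() != fields.len() {
        return None;
    }
    let mut seen = vec![false; fields.len()];
    let mut order = Vec::with_capacity(columns.len());
    for column in columns {
        let idx = fields.iter().position(|field| field == column)?;
        // `replace` returns the previous flag: `true` means a duplicate column.
        if std::mem::replace(&mut seen[idx], true) {
            return None;
        }
        order.push(idx);
    }
    Some(order)
}
```

With columns `["b", "a"]` and fields `["a", "b"]` this returns `Some(vec![1, 0])`, which a map-style visitor could use to hand each decoded value to the right field.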

Source files to look at:

@mshustov mshustov requested review from Copilot and loyd May 20, 2025 12:36
Copilot

This comment was marked as resolved.

@slvrtrn slvrtrn left a comment

Added a few comments regarding the intermediate implementation.

@@ -98,7 +98,7 @@ rustls-tls-native-roots = [

[dependencies]
clickhouse-derive = { version = "0.2.0", path = "derive" }

clickhouse-rowbinary = { version = "*", path = "rowbinary" }

Perhaps it is more of a clickhouse-data-types-utils crate, because this logic is shared across RBWNAT and Native, and the produced AST can be used in libraries such as ch2rs.

@@ -139,6 +140,6 @@ serde_bytes = "0.11.4"
serde_json = "1"
serde_repr = "0.1.7"
uuid = { version = "1", features = ["v4", "serde"] }
-time = { version = "0.3.17", features = ["macros", "rand"] }
+time = { version = "0.3.17", features = ["macros", "rand", "parsing"] }

for easy testing

}
let result = String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string();
Ok(result)
}

More or less the same as the implementation in the deserializer. Perhaps, as a follow-up, all the reader logic could be extracted into similar functions marked #[inline(always)]?


#[error("Type parsing error: {0}")]
TypeParsingError(String),
}

Needs revising.

}

// FIXME: do not use Vec<u8>
pub fn encode_leb128(value: u64) -> Vec<u8> {

used for tests only
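
Regarding the FIXME above, one possible allocation-free shape is sketched below. It relies on the fact that an unsigned LEB128 encoding of a `u64` never exceeds 10 bytes; the name `encode_leb128_array` and the `(buffer, length)` return type are assumptions made for illustration, not the signature used in this PR.

```rust
/// Encodes `value` as unsigned LEB128 into a stack buffer.
/// Returns the buffer and the number of bytes used (1 to 10 for a `u64`).
/// Hypothetical alternative to a `Vec<u8>`-returning encoder.
fn encode_leb128_array(mut value: u64) -> ([u8; 10], usize) {
    let mut buf = [0u8; 10];
    let mut len = 0;
    loop {
        // Take the low 7 bits of what is left.
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last group: continuation bit stays clear.
            buf[len] = byte;
            return (buf, len + 1);
        }
        // More groups follow: set the continuation bit.
        buf[len] = byte | 0x80;
        len += 1;
    }
}
```

A caller would use `&buf[..len]` as the encoded bytes; since the helper is only used in tests, the simpler `Vec<u8>` version may well be good enough.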

0 => visitor.visit_some(&mut RowBinaryDeserializer {
input: self.input,
validator: inner_data_type_validator,
}),
1 => visitor.visit_none(),

This is currently the main drawback of the validation implementation if we want to disable it after the first N rows for better performance. If these first rows are all NULLs, then we do not properly validate the inner type.
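
The gap can be stated in terms of the leading flag byte of each Nullable value, which, as in the match arms above, is 0 when a value follows and 1 for NULL. The sketch below is a simplified model rather than the validator from this PR, and `rows_until_inner_validated` is a hypothetical name.

```rust
/// Given the Nullable flag byte of one column for each row (0 = value
/// follows, 1 = NULL), returns how many rows must be validated before the
/// inner type is checked at least once, or `None` if every row is NULL.
/// Simplified model for illustration only.
fn rows_until_inner_validated(flags: &[u8]) -> Option<usize> {
    flags.iter().position(|&flag| flag == 0).map(|index| index + 1)
}
```

Validating only the first N rows covers the inner type only when this function returns `Some(k)` with `k <= N`; for flags `[1, 1, 0, 1]` it returns `Some(3)`, so validating just the first row would not check the inner type of that column.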

}

#[inline]
-fn deserialize_map<V: Visitor<'data>>(self, _visitor: V) -> Result<V::Value> {
-panic!("maps are unsupported, use `Vec<(A, B)>` instead");
+fn deserialize_map<V: Visitor<'data>>(self, visitor: V) -> Result<V::Value> {

adds HashMap support with proper validation, including nested types
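
For context, this is roughly the conversion users had to write by hand while only `Vec<(K, V)>` was supported. It is a standard-library-only sketch with concrete `String` and `u32` types chosen for illustration; `pairs_to_map` is a hypothetical helper, not an API of the crate.

```rust
use std::collections::HashMap;

/// Converts the pair-list shape that a `Map(String, UInt32)` column
/// previously had to be deserialized into, into a `HashMap`.
/// If a key repeats, the last value wins, as with any `HashMap` collect.
fn pairs_to_map(pairs: Vec<(String, u32)>) -> HashMap<String, u32> {
    pairs.into_iter().collect()
}
```

With `deserialize_map` implemented, a struct field can presumably be declared as `HashMap<K, V>` directly, which makes this extra step unnecessary.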

@@ -318,33 +639,3 @@ impl<'data> Deserializer<'data> for &mut RowBinaryDeserializer<'_, 'data> {
false
}
}

fn get_unsigned_leb128(mut buffer: impl Buf) -> Result<u64> {

Moved to utils. Perhaps it should just be moved to the rowbinary crate.
