PoC (Query): RowBinaryWithNamesAndTypes for enhanced type safety #221
Added a few comments regarding the intermediate implementation.
@@ -98,7 +98,7 @@ rustls-tls-native-roots = [

[dependencies]
clickhouse-derive = { version = "0.2.0", path = "derive" }

clickhouse-rowbinary = { version = "*", path = "rowbinary" }
Perhaps it is more of a clickhouse-data-types-utils crate, because this logic is shared across RBWNAT and Native, and the produced AST can be used in libraries such as ch2rs.
@@ -139,6 +140,6 @@ serde_bytes = "0.11.4"
 serde_json = "1"
 serde_repr = "0.1.7"
 uuid = { version = "1", features = ["v4", "serde"] }
-time = { version = "0.3.17", features = ["macros", "rand"] }
+time = { version = "0.3.17", features = ["macros", "rand", "parsing"] }
for easy testing
    }
    let result = String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string();
    Ok(result)
}
more or less the same as the implementation in the deserializer. Perhaps as a follow-up, all the reader logic can be extracted into similar functions with #[inline(always)]?
    #[error("Type parsing error: {0}")]
    TypeParsingError(String),
}
Needs revising.
}

// FIXME: do not use Vec<u8>
pub fn encode_leb128(value: u64) -> Vec<u8> {
used for tests only
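As a possible direction for the FIXME above, here is a minimal sketch of an unsigned LEB128 encoder that writes into a fixed 10-byte buffer instead of allocating a Vec<u8>. The names encode_leb128_fixed and decode_leb128 are hypothetical and not part of this PR; the decoder is included only to show the round trip.

```rust
/// Hypothetical allocation-free variant: encodes `value` as unsigned LEB128.
/// Returns the buffer and the number of bytes used (at most 10 for a u64).
fn encode_leb128_fixed(mut value: u64) -> ([u8; 10], usize) {
    let mut buf = [0u8; 10];
    let mut len = 0;
    loop {
        // Take the low 7 bits of the remaining value.
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last byte: continuation bit stays clear.
            buf[len] = byte;
            return (buf, len + 1);
        }
        // More bytes follow: set the continuation bit.
        buf[len] = byte | 0x80;
        len += 1;
    }
}

/// Decodes an unsigned LEB128 value, returning it with the bytes consumed.
/// Returns None if the input ends before a terminating byte is found.
fn decode_leb128(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in bytes.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

fn main() {
    let (buf, len) = encode_leb128_fixed(300);
    // 300 = 0b1_0010_1100: low 7 bits 0x2C plus continuation bit, then 0x02.
    assert_eq!(&buf[..len], &[0xAC, 0x02]);
    assert_eq!(decode_leb128(&buf[..len]), Some((300, 2)));
}
```

Since the helper is used for tests only, the allocation is harmless in practice; the fixed buffer mostly matters if the function ever moves onto a hot path.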
0 => visitor.visit_some(&mut RowBinaryDeserializer {
    input: self.input,
    validator: inner_data_type_validator,
}),
1 => visitor.visit_none(),
This is currently the main drawback of the validation implementation if we want to disable it after the first N rows for better performance. If these first rows are all NULLs, then we do not properly validate the inner type.
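The drawback can be illustrated with a simplified model. The sketch below is not the PR's validator: ColumnType, Requested, Cell, and both functions are hypothetical stand-ins, and a row is reduced to a single nullable cell. It only shows why a NULL row never reaches the inner type check.

```rust
/// Hypothetical, simplified database column type (not the PR's AST).
#[derive(Debug)]
#[allow(dead_code)]
enum ColumnType {
    UInt32,
    String,
    Nullable(Box<ColumnType>),
}

/// Hypothetical description of what the user's struct field asks for.
#[derive(Debug)]
#[allow(dead_code)]
enum Requested {
    U32,
    Str,
    Option(Box<Requested>),
}

/// A cell reduced to "NULL or not" for the purpose of this sketch.
enum Cell {
    Null,
    Present,
}

fn validate(column: &ColumnType, requested: &Requested, cell: &Cell) -> Result<(), String> {
    match (column, requested) {
        (ColumnType::Nullable(inner), Requested::Option(req_inner)) => match cell {
            // Mirrors `visit_none`: the inner type is never looked at.
            Cell::Null => Ok(()),
            // Mirrors `visit_some`: only now is the inner type checked.
            Cell::Present => validate(inner, req_inner, cell),
        },
        (ColumnType::UInt32, Requested::U32) | (ColumnType::String, Requested::Str) => Ok(()),
        (col, req) => Err(format!("column is {:?}, struct asks for {:?}", col, req)),
    }
}

/// Validates only the first `n` rows, then stops checking.
fn validate_first_rows(
    column: &ColumnType,
    requested: &Requested,
    rows: &[Cell],
    n: usize,
) -> Result<(), String> {
    rows.iter()
        .take(n)
        .try_for_each(|cell| validate(column, requested, cell))
}

fn main() {
    // The column is Nullable(String), but the struct field is Option<u32>.
    let column = ColumnType::Nullable(Box::new(ColumnType::String));
    let requested = Requested::Option(Box::new(Requested::U32));
    let rows = [Cell::Null, Cell::Null, Cell::Present];

    // Both validated rows are NULL, so the mismatch goes unnoticed.
    assert!(validate_first_rows(&column, &requested, &rows, 2).is_ok());
    // Only when a non-NULL row is validated does the error surface.
    assert!(validate_first_rows(&column, &requested, &rows, 3).is_err());
}
```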
 }

 #[inline]
-fn deserialize_map<V: Visitor<'data>>(self, _visitor: V) -> Result<V::Value> {
-    panic!("maps are unsupported, use `Vec<(A, B)>` instead");
+fn deserialize_map<V: Visitor<'data>>(self, visitor: V) -> Result<V::Value> {
adds HashMap support with proper validation, including nested types
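For context, a rough sketch of what reading a map into a HashMap involves at the byte level. It assumes a Map(String, UInt32) column laid out as an unsigned LEB128 entry count followed by alternating keys and values, the same layout as a sequence of pairs. All function names are hypothetical, and this does not go through serde or the PR's deserializer.

```rust
use std::collections::HashMap;

/// Reads an unsigned LEB128 varint and advances `input` past it.
fn read_varint<'a>(input: &mut &'a [u8]) -> Option<u64> {
    let mut value = 0u64;
    for shift in (0..64).step_by(7) {
        let slice: &'a [u8] = *input;
        let (&byte, rest) = slice.split_first()?;
        *input = rest;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
    }
    None
}

/// Reads a length-prefixed UTF-8 string.
fn read_string<'a>(input: &mut &'a [u8]) -> Option<String> {
    let len = read_varint(input)? as usize;
    let slice: &'a [u8] = *input;
    if slice.len() < len {
        return None;
    }
    let (head, tail) = slice.split_at(len);
    *input = tail;
    String::from_utf8(head.to_vec()).ok()
}

/// Reads a little-endian u32.
fn read_u32<'a>(input: &mut &'a [u8]) -> Option<u32> {
    let slice: &'a [u8] = *input;
    if slice.len() < 4 {
        return None;
    }
    let (head, tail) = slice.split_at(4);
    *input = tail;
    let mut raw = [0u8; 4];
    raw.copy_from_slice(head);
    Some(u32::from_le_bytes(raw))
}

/// Reads an entry count, then that many (key, value) pairs.
fn read_map(input: &mut &[u8]) -> Option<HashMap<String, u32>> {
    let len = read_varint(input)?;
    let mut map = HashMap::new();
    for _ in 0..len {
        let key = read_string(input)?;
        let value = read_u32(input)?;
        map.insert(key, value);
    }
    Some(map)
}

fn main() {
    // Two entries: "a" => 1 and "bc" => 2.
    let bytes = [2, 1, b'a', 1, 0, 0, 0, 2, b'b', b'c', 2, 0, 0, 0];
    let mut input: &[u8] = &bytes;
    let map = read_map(&mut input).expect("well-formed input");
    assert_eq!(map.get("a"), Some(&1));
    assert_eq!(map.get("bc"), Some(&2));
    assert!(input.is_empty());
}
```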
@@ -318,33 +639,3 @@ impl<'data> Deserializer<'data> for &mut RowBinaryDeserializer<'_, 'data> {
        false
    }
}

fn get_unsigned_leb128(mut buffer: impl Buf) -> Result<u64> {
moved to utils. Perhaps it should just be moved to the rowbinary crate.
Summary
Warning
This is a work-in-progress implementation and may change significantly.
It implements RBWNAT for Query only; Insert should be a new PR.
First of all, let's abbreviate the RowBinaryWithNamesAndTypes format as RBWNAT, and the regular RowBinary as just RB for simplicity.
A significant number of issues in the repository concern schema incompatibility or obscure error messages (full list TBD). The reason is that deserialization is effectively implemented in a "data-driven" way, where the user structures dictate how the RB stream is (de)serialized. It is therefore possible to have a hiccup where two UInt32 values are deserialized as a single UInt64, which in the worst-case scenario may lead to corrupted data. For example:
This test will deserialize a wrong value on the main branch, because DateTime64 is streamed as 8 bytes (Int64), and 2x(U)Int32 are also streamed as 8 bytes in total. On this branch, it now correctly throws an error when validation mode is enabled.
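The byte-level ambiguity can be reproduced without the client at all. The following standalone sketch is not the test from this PR: the helper split_as_two_u32 and the sample value are made up for illustration. It takes the 8 little-endian bytes of an Int64, as a DateTime64 would be streamed, and reads them as two u32 fields without any error.

```rust
/// Hypothetical helper: reads 8 wire bytes as two little-endian u32 values,
/// the way a struct with two u32 fields would consume them.
fn split_as_two_u32(wire: [u8; 8]) -> (u32, u32) {
    let mut lo = [0u8; 4];
    let mut hi = [0u8; 4];
    lo.copy_from_slice(&wire[0..4]);
    hi.copy_from_slice(&wire[4..8]);
    (u32::from_le_bytes(lo), u32::from_le_bytes(hi))
}

fn main() {
    // A DateTime64 value arrives as an Int64: 8 bytes, little-endian.
    // The concrete number is arbitrary.
    let ticks: i64 = 1_700_000_000_123;
    let wire = ticks.to_le_bytes();

    // The same 8 bytes decode "successfully" as two unrelated integers.
    let (lo, hi) = split_as_two_u32(wire);
    println!("ticks = {}, read as ({}, {})", ticks, lo, hi);

    // Nothing was lost on the wire; the bytes were merely misinterpreted.
    assert_eq!((((hi as u64) << 32) | (lo as u64)) as i64, ticks);
}
```

Because the total width matches, plain RB has no way to detect the mismatch; only the type information carried by RBWNAT makes it visible.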
This PR introduces:
- StructValidationMode client option, which has three possible modes:
  - Disabled - like the current main branch, uses RB without validation.
  - FirstRow - new mode, uses RBWNAT and checks the types for the first row only, so it retains most of the performance compared to the Disabled mode, while still providing significantly stronger guarantees.
  - EachRow - every single row is validated. It is the slowest of the three.
- rowbinary internal crate that contains utils for parsing RBWNAT and Native data type strings into a proper AST. Rustified from https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-common/src/parse/column_types.ts, but not entirely. The most important parts are the correctness and the tests; the actual implementation details can be adjusted in a follow-up.
- Maps can now be deserialized as HashMap<K, V>, and not only as Vec<(K, V)>, which was confusing.

Likely possible to implement:
- visit_map API allowed for deserialize_struct instead of the current visit_seq, which processes a struct as a tuple.

Source files to look at: