You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Parquet files benefit greatly when similar data values are colocated in the same page, row group, or file:
Low-cardinality, high-repetition columns create the most efficient dictionary encodings, which can drastically reduce total file size and improve predicate pushdown performance
Statistics filtering, which can be applied at the page or row group level, can potentially filter out entire files if the relevant filter columns are colocated properly
This is very hard to do in Scio, or in distributed data processing engines in general, because the data is by default parallelized and unordered. The closest we have right now is SMB, where you can group and sort by up to 2 columns.
However, for non-SMB use cases, we should be able to leverage Beam's ShardingFunction to colocate data efficiently. We could offer a custom implementation of ShardingFunction that could assign shard # based on a hash of user-specified column(s), for example:
(...basically a low-powered SMB that doesn't cost much on the write side.)
We should also look into z-ordering, which would incur more penalty on write performance but potentially unlock even greater downstream performance gains.
The text was updated successfully, but these errors were encountered:
Parquet files benefit greatly when similar data values are colocated in the same page, row group, or file:
This is very hard to do in Scio, or in distributed data processing engines in general, because the data is by default parallelized and unordered. The closest we have right now is SMB, where you can group and sort by up to 2 columns.
However, for non-SMB use cases, we should be able to leverage Beam's ShardingFunction to colocate data efficiently. We could offer a custom implementation of
ShardingFunction
that could assign shard # based on a hash of user-specified column(s), for example:(...basically a low-powered SMB that doesn't cost much on the write side.)
We should also look into z-ordering, which would incur more penalty on write performance but potentially unlock even greater downstream performance gains.
The text was updated successfully, but these errors were encountered: