Improve data colocation support for Parquet writes #5466

Open · clairemcginty opened this issue Aug 29, 2024 · 0 comments

clairemcginty commented Aug 29, 2024

Parquet files benefit greatly when similar data values are colocated in the same page, row group, or file:

  • Low-cardinality, high-repetition columns create the most efficient dictionary encodings, which can drastically reduce total file size and improve predicate pushdown performance
  • Statistics filtering, which can be applied at the page or row group level, can filter out entire row groups, or even whole files, if the relevant filter columns are properly colocated (see the sketch after this list)

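To make the statistics-filtering point concrete, here's an illustrative sketch; RowGroupStats and canSkip are made-up names modeling a reader's min/max pruning check, not Parquet's actual implementation:

// Models how per-row-group min/max statistics let a reader skip row
// groups entirely for an equality predicate on a single Int column.
final case class RowGroupStats(min: Int, max: Int)

def canSkip(stats: RowGroupStats, value: Int): Boolean =
  value < stats.min || value > stats.max

// Colocated (sorted) data: each row group covers a narrow, disjoint range.
val sorted = Seq(RowGroupStats(0, 9), RowGroupStats(10, 19), RowGroupStats(20, 29))
// Unordered data: every row group spans nearly the whole value domain.
val shuffled = Seq(RowGroupStats(0, 29), RowGroupStats(1, 28), RowGroupStats(0, 27))

assert(sorted.count(canSkip(_, 12)) == 2)   // 2 of 3 row groups pruned
assert(shuffled.count(canSkip(_, 12)) == 0) // nothing pruned

The same logic applies per page and per file: the narrower each unit's [min, max] range, the more of them a pushed-down predicate can skip.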
This is very hard to do in Scio, or in distributed data processing engines in general, because data is parallelized and unordered by default. The closest we have right now is SMB, where you can group and sort by up to two columns.

However, for non-SMB use cases, we should be able to leverage Beam's ShardingFunction to colocate data efficiently. We could offer a custom implementation that assigns the shard number based on a hash of user-specified column(s), for example:

case class User(userId: String, date: DateTime, age: Int)

val data: SCollection[User] = ...
data.saveAsTypedParquetFile(
  path,
  shardBy = ShardBy[User](numShards = 1024, columns = Set(_.userId, _.age))
)

case class ShardBy[T](numShards: Int, columns: Set[T => Any]) extends ShardingFunction[T, Void] { ... }

(...basically a low-powered SMB that doesn't cost much on the write side.)
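Filling in the { ... } stub above: Beam's ShardingFunction interface (org.apache.beam.sdk.io.ShardingFunction) is parameterized on both the element and destination types, and its assignShardKey method returns a ShardedKey[Integer]. A minimal sketch, assuming a single destination (Void) and a simple hash-then-mod scheme; the exact key semantics are an assumption, not settled design:

import org.apache.beam.sdk.io.ShardingFunction
import org.apache.beam.sdk.values.ShardedKey

// Sketch only: routes each element to a shard derived from the hash of
// its user-selected column values, so equal values land in the same
// shard (and therefore the same output file).
case class ShardBy[T](numShards: Int, columns: Set[T => Any])
    extends ShardingFunction[T, Void] {

  override def assignShardKey(
      destination: Void,
      element: T,
      shardCount: Int
  ): ShardedKey[Integer] = {
    // Hash the set of extracted column values, then map it onto a shard.
    val hash  = columns.map(_.apply(element)).hashCode()
    val shard = Math.floorMod(hash, numShards)
    ShardedKey.of(Integer.valueOf(shard), shard)
  }
}

What goes into the ShardedKey's key versus its shard number would need to match what WriteFiles expects of its default RandomShardingFunction.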

We should also look into z-ordering, which would incur a larger write-time penalty but could unlock even greater downstream performance gains.
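For illustration, a minimal sketch of a two-column Z-order (Morton) key: interleaving the bits of two keys produces a single sort key that preserves locality in both columns at once, so sorting or range-sharding by it keeps rows that are close in either column near each other. The helpers below are hypothetical, not an existing Scio/Beam API:

// Interleaves the bits of two 32-bit keys into one 64-bit Morton key.
def zOrder(a: Int, b: Int): Long = {
  // Spreads the 32 bits of x into the even bit positions of a Long.
  def spread(x: Int): Long = {
    var v = x.toLong & 0xffffffffL
    v = (v | (v << 16)) & 0x0000ffff0000ffffL
    v = (v | (v << 8))  & 0x00ff00ff00ff00ffL
    v = (v | (v << 4))  & 0x0f0f0f0f0f0f0f0fL
    v = (v | (v << 2))  & 0x3333333333333333L
    v = (v | (v << 1))  & 0x5555555555555555L
    v
  }
  spread(a) | (spread(b) << 1)
}

The extra write-side cost comes from computing these keys and sorting by them before writing, rather than just hashing each element to a shard.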
