Improve data colocation support for Parquet writes #5466

Open · clairemcginty opened this issue Aug 29, 2024 · 0 comments

clairemcginty commented Aug 29, 2024

Parquet files benefit greatly when similar data values are colocated in the same page, row group, or file:

  • Low-cardinality, high-repetition columns create the most efficient dictionary encodings, which can drastically reduce total file size and improve predicate pushdown performance
  • Statistics filtering, which can be applied at the page or row group level, can filter out entire row groups, or even whole files, if the relevant filter columns are properly colocated (see the sketch after this list)

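To make the statistics-filtering point concrete, here's an illustrative sketch; RowGroupStats and canSkip are made-up names modeling a reader's min/max pruning check, not Parquet's actual implementation:

// Models how per-row-group min/max statistics let a reader skip row
// groups entirely for an equality predicate on a single Int column.
final case class RowGroupStats(min: Int, max: Int)

def canSkip(stats: RowGroupStats, value: Int): Boolean =
  value < stats.min || value > stats.max

// Colocated (sorted) data: each row group covers a narrow, disjoint range.
val sorted = Seq(RowGroupStats(0, 9), RowGroupStats(10, 19), RowGroupStats(20, 29))
// Unordered data: every row group spans nearly the whole value domain.
val shuffled = Seq(RowGroupStats(0, 29), RowGroupStats(1, 28), RowGroupStats(0, 27))

assert(sorted.count(canSkip(_, 12)) == 2)   // 2 of 3 row groups pruned
assert(shuffled.count(canSkip(_, 12)) == 0) // nothing pruned

The same logic applies per page and per file: the narrower each unit's [min, max] range, the more of them a pushed-down predicate can skip.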
This is very hard to do in Scio, or in distributed data processing engines in general, because data is parallelized and unordered by default. The closest we have right now is SMB, where you can group and sort by up to two columns.

However, for non-SMB use cases, we should be able to leverage Beam's ShardingFunction to colocate data efficiently. We could offer a custom implementation that assigns the shard number based on a hash of user-specified column(s), for example:

case class User(userId: String, date: DateTime, age: Int)

val data: SCollection[User] = ...
data.saveAsTypedParquetFile(
  path,
  shardBy = ShardBy[User](numShards = 1024, columns = Set(_.userId, _.age))
)

case class ShardBy[T](numShards: Int, columns: Set[T => Any]) extends ShardingFunction[T, Void] { ... }

(...basically a low-powered SMB that doesn't cost much on the write side.)
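Filling in the { ... } stub above: Beam's ShardingFunction interface (org.apache.beam.sdk.io.ShardingFunction) is parameterized on both the element and destination types, and its assignShardKey method returns a ShardedKey[Integer]. A minimal sketch, assuming a single destination (Void) and a simple hash-then-mod scheme; the exact key semantics are an assumption, not settled design:

import org.apache.beam.sdk.io.ShardingFunction
import org.apache.beam.sdk.values.ShardedKey

// Sketch only: routes each element to a shard derived from the hash of
// its user-selected column values, so equal values land in the same
// shard (and therefore the same output file).
case class ShardBy[T](numShards: Int, columns: Set[T => Any])
    extends ShardingFunction[T, Void] {

  override def assignShardKey(
      destination: Void,
      element: T,
      shardCount: Int
  ): ShardedKey[Integer] = {
    // Hash the set of extracted column values, then map it onto a shard.
    val hash  = columns.map(_.apply(element)).hashCode()
    val shard = Math.floorMod(hash, numShards)
    ShardedKey.of(Integer.valueOf(shard), shard)
  }
}

What goes into the ShardedKey's key versus its shard number would need to match what WriteFiles expects of its default RandomShardingFunction.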

We should also look into z-ordering, which would incur a larger write-time penalty but could unlock even greater downstream performance gains.
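For illustration, a minimal sketch of a two-column Z-order (Morton) key: interleaving the bits of two keys produces a single sort key that preserves locality in both columns at once, so sorting or range-sharding by it keeps rows that are close in either column near each other. The helpers below are hypothetical, not an existing Scio/Beam API:

// Interleaves the bits of two 32-bit keys into one 64-bit Morton key.
def zOrder(a: Int, b: Int): Long = {
  // Spreads the 32 bits of x into the even bit positions of a Long.
  def spread(x: Int): Long = {
    var v = x.toLong & 0xffffffffL
    v = (v | (v << 16)) & 0x0000ffff0000ffffL
    v = (v | (v << 8))  & 0x00ff00ff00ff00ffL
    v = (v | (v << 4))  & 0x0f0f0f0f0f0f0f0fL
    v = (v | (v << 2))  & 0x3333333333333333L
    v = (v | (v << 1))  & 0x5555555555555555L
    v
  }
  spread(a) | (spread(b) << 1)
}

The extra write-side cost comes from computing these keys and sorting by them before writing, rather than just hashing each element to a shard.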
