Parquet vs ORC
TLDR: SageWorks runs a LOT of column queries for its Exploratory Data Analysis (EDA) and column statistics computations, and in all of our use cases Parquet performed better (faster queries and fewer bytes scanned by Athena). The classes that write a DataSource support BOTH formats; simply use the output_format parameter if you'd like to specify ORC instead of Parquet.
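As a minimal sketch, writing a DataSource as ORC might look like the following. The class name, import path, and method names here are illustrative assumptions and may not match your SageWorks version; only the output_format parameter comes from the text above.

```python
# Illustrative sketch only: class/method names are assumptions, not the
# verified SageWorks API; output_format is the parameter described above.
from sageworks.transforms.pandas_transforms import PandasToData

loader = PandasToData("abalone_data")   # hypothetical DataSource name
loader.set_input(df)                    # df: a pandas DataFrame you've loaded
loader.transform(output_format="orc")   # default would be "parquet"
```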
Parquet and ORC (Optimized Row Columnar) are both popular columnar storage formats, but they handle data and optimize queries in slightly different ways.
Apache Parquet is a columnar storage file format designed for efficient storage and retrieval compared to row-based formats like CSV; it also integrates well with in-memory formats such as Apache Arrow. When querying, Parquet readers fetch only the columns needed by the query, which greatly reduces I/O and minimizes the amount of data scanned.
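To see what column pruning buys you outside of Athena, here is a small pyarrow example (the file name is hypothetical) that reads only the requested columns from a Parquet file:

```python
import pyarrow.parquet as pq

# Only the "length" and "weight" column chunks are read from disk;
# every other column's data pages are skipped entirely.
table = pq.read_table("datasource.parquet", columns=["length", "weight"])
print(table.num_rows, table.column_names)
```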
ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It's optimized for large streaming reads, but with integrated support for finding required rows quickly.
When querying data stored in ORC, reads are organized stripe by stripe (each stripe containing many rows). ORC can skip irrelevant stripes based on per-stripe column statistics (min, max, sum, count), but in our Athena testing this was not as efficient as Parquet's column pruning for queries that touch only specific columns.
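The same column-selective read against ORC, using pyarrow's ORC reader (file name again hypothetical), shows the stripe-oriented layout described above:

```python
from pyarrow import orc

orc_file = orc.ORCFile("datasource.orc")  # hypothetical local file
print(orc_file.nstripes)                  # ORC lays data out in stripes
# Readers can project columns, but the per-stripe statistics
# (min/max/sum/count) are what allow skipping whole stripes.
table = orc_file.read(columns=["length", "weight"])
```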
So, while ORC has some advantages, such as good compression and lightweight indexes within each stripe, Parquet's ability to read only the required columns generally makes it more efficient and faster for column-specific queries.