Ibis benchmarking: DuckDB, DataFusion, Polars – Ibis #10179

lostmygithubaccount · 2024-09-20T00:25:29Z

giscus[bot]
bot Sep 20, 2024

Ibis benchmarking: DuckDB, DataFusion, Polars – Ibis

the portable Python dataframe library

https://ibis-project.org/posts/ibis-bench/

Sep 20, 2024


alberttwong · 2024-09-20T00:25:30Z

alberttwong
Sep 20, 2024 — with giscus

Did I miss the part where you write the execution times to a Parquet file? If not, which system was faster (sum of all the queries for a given engine)?

2 replies

lostmygithubaccount Sep 20, 2024
Maintainer

hi @alberttwong, the results of the TPC-H queries are written out to Parquet files and discarded (to ensure the results are materialized uniformly), but those files do not contain the runtimes

the runtimes are stored as JSON and compacted into Parquet files, then uploaded to a public GCS bucket so you can perform your own analysis. the results are a bit old at this point and I plan on improving the benchmarking (e.g. capturing memory usage) and running on newer versions soon

it's also not necessarily straightforward to compare each system, as sometimes queries fail on one but not others. some are better at some scale factors. in general, Polars is the best when data size is small relative to RAM, with DuckDB and DataFusion not far behind. DuckDB is the only one that consistently works on larger-than-RAM workloads and the one I would overall call "fastest". DataFusion also does pretty well on larger-than-RAM workloads, but can still fail (particularly on queries 9 and 18). it's on average a little slower than DuckDB, but sometimes faster

if you haven't seen, the following benchmarking blog has some more detailed comparisons between the three on ~1 TB scale factor: https://ibis-project.org/posts/1tbc/

this ad-hoc analysis might also be of interest, you can see the crossover point: https://lostmygithubaccount.github.io/ibis-bench/analysis.html (though note Polars fails a lot of queries w/ decimals due to not being able to round them, so it looks faster than it is for that section)
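to make the point about failed queries concrete, here is a minimal sketch (not the code used for the benchmark or the linked analysis) of why a plain sum of runtimes per engine can mislead. it assumes a failed query simply has no row in the runtime data, which may not match how the real logs record failures. the engine names and numbers are made up for illustration and are not benchmark results, and the scale factor is ignored to keep it short.

```python
from collections import defaultdict

# Toy rows: (system, query_number, execution_seconds).
# Hypothetical engines and made-up numbers, NOT real benchmark results.
rows = [
    ("engine_a", 1, 2.0),
    ("engine_a", 2, 3.0),
    ("engine_a", 3, 4.0),
    ("engine_b", 1, 1.0),
    ("engine_b", 2, 1.5),
    # Assumption: engine_b failed query 3, so it has no row for it.
]


def totals(rows, only_common=False):
    """Sum execution_seconds per system.

    With only_common=True, count only the queries that every system completed.
    """
    # Collect the set of completed queries for each system.
    completed = defaultdict(set)
    for system, query, _ in rows:
        completed[system].add(query)
    common = set.intersection(*completed.values()) if completed else set()

    result = defaultdict(float)
    for system, query, seconds in rows:
        if only_common and query not in common:
            continue
        result[system] += seconds
    return dict(result)


naive = totals(rows)
fair = totals(rows, only_common=True)
print(naive)  # {'engine_a': 9.0, 'engine_b': 2.5}
print(fair)   # {'engine_a': 5.0, 'engine_b': 2.5}
```

the naive total rewards engine_b for skipping query 3. restricting the sum to queries that every engine completed narrows the gap, and for real data the same filter would also need to be applied per scale factor.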

Answer selected by cpcloud

lostmygithubaccount Sep 20, 2024
Maintainer

forgot code, to connect and read it in:

pip install 'ibis-framework[duckdb]' gcsfs

import os  # needed for os.path.join below

import ibis
import gcsfs


ibis.options.interactive = True
ibis.options.repr.interactive.max_rows = 22
ibis.options.repr.interactive.max_length = 22
ibis.options.repr.interactive.max_columns = None


BUCKET = "ibis-bench"

dir_name = os.path.join(BUCKET, "bench_logs_v2", "cache")

fs = gcsfs.GCSFileSystem()

con = ibis.get_backend()
con.register_filesystem(fs)

t = (
    ibis.read_parquet(f"gs://{dir_name}/file_id=*.parquet")
    .mutate(
        timestamp=ibis._["timestamp"].cast("timestamp"),
    )
    .relocate(
        "instance_type",
        "system",
        "sf",
        "query_number",
        "execution_seconds",
        "timestamp",
    )
    .cache()
)
t
