
Processing on a laptop with polars streaming API (untested on full dataset) #4

marius-mather opened this issue Feb 13, 2024 · 2 comments


@marius-mather

I've only had time (and disk space) to try this on 10 billion rows, but on my M1 MacBook (with 10 cores) I can process those 10 billion rows in 85 seconds. As far as I can tell, the time scales more or less linearly with size, which would put 1 trillion rows at roughly 2.5 hours (85 s × 100 ≈ 8,500 s). This could probably be brought down with more cores, but it already makes processing the dataset viable on a single laptop.

If anyone is able to give it a try on the full dataset, please do!

import glob
import time
import polars as pl


def aggregate(df: pl.LazyFrame) -> pl.LazyFrame:
    return (
        df.group_by("station")
        .agg(pl.col("measure").min().alias("min"),
             pl.col("measure").max().alias("max"),
             pl.col("measure").mean().alias("mean"))
        .sort("station")
    )


if __name__ == "__main__":
    print("Polars config:", pl.Config.state())
    start = time.time()
    files = glob.glob("data/*.parquet")
    # Lazily scan all parquet files; nothing is read until collect() is called.
    data = pl.scan_parquet(list(files), rechunk=True)
    query = aggregate(data)
    # streaming=True runs the query through the out-of-core streaming engine,
    # so the full dataset never has to fit in memory at once.
    result = query.collect(streaming=True)
    end = time.time()
    print(f"Time: {end - start}")
    print(result)
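
A quick way to sanity-check that the query really runs through the streaming engine (rather than quietly falling back to the default in-memory engine) is to print the optimized plan before collecting. A minimal sketch, assuming the same data/*.parquet layout and column names as above:

import glob
import polars as pl

# Rebuild the same lazy query so this snippet stands on its own.
files = glob.glob("data/*.parquet")
query = (
    pl.scan_parquet(files)
    .group_by("station")
    .agg(
        pl.col("measure").min().alias("min"),
        pl.col("measure").max().alias("max"),
        pl.col("measure").mean().alias("mean"),
    )
    .sort("station")
)

# Parts of the plan wrapped in a STREAMING section will run out-of-core;
# anything outside it still has to fit in memory.
print(query.explain(streaming=True))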
@karlwiese

Did you consider streaming from S3 with something like

source = "s3://bucket/*.parquet"
storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}

data = pl.scan_parquet(source, storage_options=storage_options)  

(copy/pasted from documentation: https://docs.pola.rs/py-polars/html/reference/api/polars.scan_parquet.html)

Sure, most of the time will be I/O and it mainly depends on the internet connection, but at least it works around your disk space issue.
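
For completeness, here is a rough sketch of what the end-to-end run could look like against an S3 source, with the aggregation from the first comment folded in. The bucket path and credentials are placeholders, not a tested configuration:

import time
import polars as pl

# Placeholder bucket and credentials; substitute real values, or omit
# storage_options entirely to fall back on the default AWS credential chain.
source = "s3://bucket/*.parquet"
storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}

start = time.time()
result = (
    pl.scan_parquet(source, storage_options=storage_options)
    .group_by("station")
    .agg(
        pl.col("measure").min().alias("min"),
        pl.col("measure").max().alias("max"),
        pl.col("measure").mean().alias("mean"),
    )
    .sort("station")
    # Stream batches as they arrive instead of materialising the whole scan.
    .collect(streaming=True)
)
print(f"Time: {time.time() - start}")
print(result)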

@MurrayData

I had a go with your code @marius-mather on my Dell workstation (see my solution in the other issue). 10 billion rows took 59 seconds, so I then scaled up and ran the entire 1 trillion row dataset.

This took 1 hour, 40 minutes and 15 seconds using your code unaltered.

Time: 6014.877286672592
shape: (412, 4)
┌───────────────┬───────┬──────┬───────────┐
│ station       ┆ min   ┆ max  ┆ mean      │
│ ---           ┆ ---   ┆ ---  ┆ ---       │
│ str           ┆ f64   ┆ f64  ┆ f64       │
╞═══════════════╪═══════╪══════╪═══════════╡
│ Abha          ┆ -43.8 ┆ 83.4 ┆ 18.000158 │
│ Abidjan       ┆ -34.4 ┆ 87.8 ┆ 25.999815 │
│ Abéché        ┆ -33.6 ┆ 92.1 ┆ 29.400132 │
│ Accra         ┆ -34.1 ┆ 86.7 ┆ 26.400023 │
│ Addis Ababa   ┆ -58.0 ┆ 79.6 ┆ 16.000132 │
│ …             ┆ …     ┆ …    ┆ …         │
│ Yinchuan      ┆ -53.1 ┆ 72.0 ┆ 9.000168  │
│ Zagreb        ┆ -52.8 ┆ 72.1 ┆ 10.699659 │
│ Zanzibar City ┆ -34.2 ┆ 87.0 ┆ 26.000115 │
│ Ürümqi        ┆ -52.9 ┆ 66.7 ┆ 7.399789  │
│ İzmir         ┆ -45.3 ┆ 79.6 ┆ 17.899969 │
└───────────────┴───────┴──────┴───────────┘
