
Processing on a laptop with polars streaming API (untested on full dataset) #4

marius-mather opened this issue Feb 13, 2024 · 2 comments


@marius-mather

I've only had time (and disk space) to try this on 10 billion rows, but on my M1 MacBook (with 10 cores) I can process those 10 billion rows in 85 seconds. As far as I can tell, the time scales more or less linearly with size, which would put 1 trillion rows at roughly 2.5 hours (85 s × 100 ≈ 8,500 s). This could probably be brought down with more cores, but it already makes processing the dataset viable on a single laptop.

If anyone is able to give it a try on the full dataset, please do!

import glob
import time
import polars as pl


def aggregate(df: pl.LazyFrame) -> pl.LazyFrame:
    return (
        df.group_by("station")
        .agg(pl.col("measure").min().alias("min"),
             pl.col("measure").max().alias("max"),
             pl.col("measure").mean().alias("mean"))
        .sort("station")
    )


if __name__ == "__main__":
    print("Polars config:", pl.Config.state())
    start = time.time()
    files = glob.glob("data/*.parquet")
    # Lazily scan all parquet files; nothing is read until collect() is called.
    data = pl.scan_parquet(list(files), rechunk=True)
    query = aggregate(data)
    # streaming=True runs the query through the out-of-core streaming engine,
    # so the full dataset never has to fit in memory at once.
    result = query.collect(streaming=True)
    end = time.time()
    print(f"Time: {end - start}")
    print(result)
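
A quick way to sanity-check that the query really runs through the streaming engine (rather than quietly falling back to the default in-memory engine) is to print the optimized plan before collecting. A minimal sketch, assuming the same data/*.parquet layout and column names as above:

import glob
import polars as pl

# Rebuild the same lazy query so this snippet stands on its own.
files = glob.glob("data/*.parquet")
query = (
    pl.scan_parquet(files)
    .group_by("station")
    .agg(
        pl.col("measure").min().alias("min"),
        pl.col("measure").max().alias("max"),
        pl.col("measure").mean().alias("mean"),
    )
    .sort("station")
)

# Parts of the plan wrapped in a STREAMING section will run out-of-core;
# anything outside it still has to fit in memory.
print(query.explain(streaming=True))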
@karlwiese

Did you consider streaming from S3 with something like

source = "s3://bucket/*.parquet"
storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}

data = pl.scan_parquet(source, storage_options=storage_options)  

(copy/pasted from documentation: https://docs.pola.rs/py-polars/html/reference/api/polars.scan_parquet.html)

Sure, most of the time will be I/O and it mainly depends on the internet connection, but at least it works around your disk space issue.
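
For completeness, here is a rough sketch of what the end-to-end run could look like against an S3 source, with the aggregation from the first comment folded in. The bucket path and credentials are placeholders, not a tested configuration:

import time
import polars as pl

# Placeholder bucket and credentials; substitute real values, or omit
# storage_options entirely to fall back on the default AWS credential chain.
source = "s3://bucket/*.parquet"
storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}

start = time.time()
result = (
    pl.scan_parquet(source, storage_options=storage_options)
    .group_by("station")
    .agg(
        pl.col("measure").min().alias("min"),
        pl.col("measure").max().alias("max"),
        pl.col("measure").mean().alias("mean"),
    )
    .sort("station")
    # Stream batches as they arrive instead of materialising the whole scan.
    .collect(streaming=True)
)
print(f"Time: {time.time() - start}")
print(result)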

@MurrayData

I had a go with your code @marius-mather on my Dell workstation (see my solution in the other issue). 10 billion rows took 59 seconds, so I then scaled up and ran the entire 1 trillion row dataset.

This took 1 hour, 40 minutes and 15 seconds using your code unaltered.

Time: 6014.877286672592
shape: (412, 4)
┌───────────────┬───────┬──────┬───────────┐
│ station       ┆ min   ┆ max  ┆ mean      │
│ ---           ┆ ---   ┆ ---  ┆ ---       │
│ str           ┆ f64   ┆ f64  ┆ f64       │
╞═══════════════╪═══════╪══════╪═══════════╡
│ Abha          ┆ -43.8 ┆ 83.4 ┆ 18.000158 │
│ Abidjan       ┆ -34.4 ┆ 87.8 ┆ 25.999815 │
│ Abéché        ┆ -33.6 ┆ 92.1 ┆ 29.400132 │
│ Accra         ┆ -34.1 ┆ 86.7 ┆ 26.400023 │
│ Addis Ababa   ┆ -58.0 ┆ 79.6 ┆ 16.000132 │
│ …             ┆ …     ┆ …    ┆ …         │
│ Yinchuan      ┆ -53.1 ┆ 72.0 ┆ 9.000168  │
│ Zagreb        ┆ -52.8 ┆ 72.1 ┆ 10.699659 │
│ Zanzibar City ┆ -34.2 ┆ 87.0 ┆ 26.000115 │
│ Ürümqi        ┆ -52.9 ┆ 66.7 ┆ 7.399789  │
│ İzmir         ┆ -45.3 ┆ 79.6 ┆ 17.899969 │
└───────────────┴───────┴──────┴───────────┘
