I've only had time (and disk space) to try this on 10 billion rows, but on my M1 MacBook (with 10 cores) I can process those 10 billion rows in 85 seconds. As far as I can tell, the time scales more or less linearly with dataset size, which would put 1 trillion rows at roughly 2.5 hours. That could probably be brought down with more cores, but even as it stands it makes processing the dataset viable on a single laptop.
If anyone is able to give it a try on the full dataset, please do!
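For reference, the 2.5 hour figure is just a straight linear extrapolation from the 10 billion row timing. A minimal sketch of the arithmetic (numbers taken from the comment above; it only assumes throughput stays constant as the dataset grows):

```python
# Linear extrapolation from the 10 billion row run (illustrative only;
# the 85 s figure comes from the comment above, nothing else is measured).
rows_measured = 10_000_000_000       # 10 billion rows
seconds_measured = 85                # observed on the 10-core M1 MacBook
rows_target = 1_000_000_000_000      # the full 1 trillion row dataset

# Assume constant per-row cost, i.e. time scales linearly with row count.
estimated_seconds = seconds_measured * rows_target / rows_measured
print(f"Estimated time: {estimated_seconds / 3600:.1f} hours")  # ~2.4 hours
```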
I had a go with your code, @marius-mather, on my Dell workstation (see my solution in the other issue). 10 billion rows took 59 seconds, so I scaled up and ran the entire 1 trillion row dataset.
This took 1 hour, 40 minutes and 15 seconds with your code unaltered.
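That result also lines up well with linear scaling from the 10 billion row run. A quick back-of-envelope check, using only the timings reported above and assuming constant per-row cost:

```python
# Sanity check: does the full run match a linear extrapolation of the 10B run?
# (Timings are taken from the comments above; this is arithmetic, not a benchmark.)
seconds_10b = 59                           # 10 billion rows on the Dell workstation
predicted_full = seconds_10b * 100         # 1 trillion rows = 100x the data
observed_full = 1 * 3600 + 40 * 60 + 15    # 1:40:15 as reported above

print(f"predicted: {predicted_full / 60:.0f} min")  # ~98 min
print(f"observed:  {observed_full / 60:.0f} min")   # ~100 min
```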