Parallel processing of GLRBT dataset #63

srkirkland · 2024-11-06T18:39:52Z

Our complete 2025 CSV dataset has 16,902,955 rows, with a few rows for each cluster (cluster + treatment).

We should split the CSV into multiple CSVs and then recombine them after processing. I think splitting by county would be clean -- some counties probably have a high share of the biomass, but if that turns out to be a problem we can reassess.

Alternatively, if it's easier to just split by cluster number into equally sized buckets, that's fine too.
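A minimal sketch of the split step, assuming the county is in a known column and that no field contains quoted commas or embedded newlines (both are assumptions; the real column layout is not given here). The function name `split_by_column` and the output file naming are hypothetical. Each output file gets its own copy of the header row.

```shell
# split_by_column INPUT.csv COLUMN_NUMBER OUT_DIR
# Writes OUT_DIR/<value>.csv for each distinct value in the chosen column,
# each starting with the header row from INPUT.csv.
# Assumes OUT_DIR is empty or does not exist yet, because rows are appended.
split_by_column() {
  input=$1
  col=$2
  outdir=$3
  mkdir -p "$outdir"
  awk -F, -v col="$col" -v outdir="$outdir" '
    NR == 1 { header = $0; next }          # remember the header, do not emit it as data
    {
      key = $col
      gsub(/[^A-Za-z0-9_-]/, "_", key)     # make the value safe to use as a file name
      file = outdir "/" key ".csv"
      if (file != prev && prev != "") close(prev)   # keep only one output file open at a time
      if (!(file in seen)) { print header >> file; seen[file] = 1 }
      print >> file
      prev = file
    }
  ' "$input"
}
```

For example, `split_by_column all.csv 2 parts` would split on the second column. Because only one file is held open at a time, this runs fastest when rows for the same county are already adjacent. Splitting by cluster number into buckets would be a variation on the same idea, with the key computed from the cluster column instead.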

The trick is that each CSV needs its own header row, and when you recombine them the header should appear only once, at the top. I think this can be done in bash without needing JavaScript code, but give it a shot.
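The recombine step could be sketched like this, assuming every part file starts with the same header row. The function name `combine_csvs` is hypothetical. It keeps the first line of the first file and skips the first line of every later file.

```shell
# combine_csvs OUTPUT.csv PART1.csv [PART2.csv ...]
# Concatenates the parts, keeping only the header from the first one.
# OUTPUT.csv must not be one of the inputs (watch out for globs that match it).
combine_csvs() {
  output=$1
  shift
  # FNR is the line number within the current file, NR the overall line number:
  # skip line 1 of every file except the very first line read.
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$output"
}
```

Usage would look like `combine_csvs combined.csv done/*.csv`, as long as `combined.csv` lives outside `done/`. Using `awk` rather than `head`/`tail` also means a part file without a trailing newline does not get its last row glued to the next file's first row.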

And finally, once this code is written and tested, we'll need to write some scripts to actually run it. Perhaps it could go through every mini CSV, moving each one into an 'inprogress' folder while working on it and then into a 'done' folder when finished. The key is to be repeatable and resilient so we don't accidentally process the same cluster/treatment twice.
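One possible shape for that runner, as a sketch only: the `pending` folder name, the `process_queue` function, and the idea of passing the processing command as arguments are all assumptions, not existing code. It relies on `mv` within a single filesystem being a rename, so if two workers try to claim the same file only one `mv` succeeds.

```shell
# process_queue WORK_DIR COMMAND [ARGS...]
# For each WORK_DIR/pending/*.csv: claim it by moving it to inprogress/,
# run COMMAND [ARGS...] <file>, and move it to done/ only if the command succeeds.
process_queue() {
  work=$1
  shift
  mkdir -p "$work/pending" "$work/inprogress" "$work/done"
  for f in "$work"/pending/*.csv; do
    [ -e "$f" ] || continue                 # empty folder, or another worker already took it
    name=$(basename "$f")
    # Claim the file; if the move fails, someone else claimed it first.
    mv "$f" "$work/inprogress/$name" 2>/dev/null || continue
    if "$@" "$work/inprogress/$name"; then
      mv "$work/inprogress/$name" "$work/done/$name"
    else
      # Leave failures in inprogress/ so they are never silently reprocessed.
      echo "failed: $name (left in inprogress)" >&2
    fi
  done
}
```

Re-running the function is safe because it only ever looks at `pending/`. Anything left in `inprogress/` after a crash or failure has to be moved back to `pending/` by hand, which makes retrying an explicit decision rather than an accident.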

srkirkland assigned aunshx Nov 6, 2024
