Parallel processing of GLRBT dataset #63

Open
srkirkland opened this issue Nov 6, 2024 · 0 comments
Our complete 2025 CSV data set has 16,902,955 rows, with a few rows for each cluster (cluster + treatment).

We should split the CSV into multiple CSVs and then recombine them after processing. I think splitting by county would be clean. Some counties probably hold a large share of the biomass, but if that becomes a problem we can reassess.
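A minimal sketch of the by-county split, written in Python purely for illustration. It assumes the CSV has a column named `county` (a hypothetical name; substitute the real one) and that its values are safe to use as file names. The function name `split_csv_by_column` is also made up for this example. The input is streamed row by row, so the full 16.9M rows never need to sit in memory, and one output file stays open per distinct county.

```python
import csv
from pathlib import Path


def split_csv_by_column(src, out_dir, column="county"):
    """Write one CSV per distinct value of `column`, each with its own header.

    `column="county"` is an assumed column name, not taken from the real data.
    Returns the sorted list of distinct values that were seen.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # column value -> open file handle
    writers = {}  # column value -> csv.DictWriter for that file
    try:
        with open(src, newline="") as f:
            reader = csv.DictReader(f)
            for row in reader:
                key = row[column]
                if key not in writers:
                    # First row for this value: create the file and write the header.
                    fh = open(out_dir / f"{key}.csv", "w", newline="")
                    handles[key] = fh
                    writers[key] = csv.DictWriter(fh, fieldnames=reader.fieldnames)
                    writers[key].writeheader()
                writers[key].writerow(row)
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(handles)
```

Something like `split_csv_by_column("all_clusters.csv", "parts")` (both paths are placeholders) would then produce one file per county under `parts/`.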

Alternatively, if it's easier to split by cluster number into equally sized buckets, that's fine too.
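The bucket variant could look roughly like this in Python. It assumes an integer column named `cluster_no` (a hypothetical name) and assigns each row to bucket `cluster_no % n_buckets`, so all treatments for a cluster land in the same file. Buckets come out only approximately equal, and only if cluster numbers are spread fairly evenly. The function name and the `bucket_000.csv` naming are illustrative choices, not anything from the existing code.

```python
import csv
from pathlib import Path


def split_csv_into_buckets(src, out_dir, n_buckets, column="cluster_no"):
    """Split `src` into `n_buckets` CSVs keyed on an integer cluster column.

    `column="cluster_no"` is an assumed column name. Every output file gets
    the header row, even if no data rows end up in it.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"bucket_{i:03d}.csv" for i in range(n_buckets)]
    files = [open(p, "w", newline="") for p in paths]
    try:
        with open(src, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)  # raises StopIteration if src is empty
            idx = header.index(column)
            writers = [csv.writer(fh) for fh in files]
            for w in writers:
                w.writerow(header)
            for row in reader:
                if not row:
                    continue  # skip blank lines
                # Same cluster number always maps to the same bucket.
                writers[int(row[idx]) % n_buckets].writerow(row)
    finally:
        for fh in files:
            fh.close()
    return paths
```

The modulo rule keeps the split deterministic, so rerunning it on the same input produces the same buckets.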

The trick is that each CSV needs its own header row, and when you recombine them the header should appear only once, at the top. I think this can be done in bash without any JavaScript code, but give it a shot.
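For reference, the usual bash approach is to take the header from the first file with `head -n 1` and then append `tail -n +2` of every file. The sketch below shows the same idea in Python, with two extra safety checks that are my own additions: it refuses to combine files whose headers differ, and it repairs a missing trailing newline so rows from adjacent files cannot run together. `combine_csvs` is a hypothetical name.

```python
def combine_csvs(parts, dest):
    """Concatenate CSV files into `dest`, writing the header only once.

    Assumes the header occupies exactly the first line of each part.
    Raises ValueError if a part's header differs from the first one seen.
    """
    header = None
    with open(dest, "w", newline="") as out:
        for part in parts:
            with open(part, newline="") as f:
                first = f.readline().rstrip("\r\n")
                if not first:
                    continue  # empty file: nothing to add
                if header is None:
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    raise ValueError(f"header mismatch in {part}")
                for line in f:
                    # Guard against a final line with no trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return dest
```

Passing the parts in sorted order, for example `combine_csvs(sorted(paths), "combined.csv")`, keeps the output order reproducible between runs.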

Finally, once this code is written and tested, we'll need scripts to actually run it. Perhaps a script could go through every mini CSV, moving each one into an 'inprogress' folder while it's being worked on and then into a 'done' folder when finished. The key is to be repeatable and resilient so we don't accidentally process the same cluster/treatment twice.
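One way the folder-based runner might work, sketched in Python under a few assumptions: the mini CSVs start in a `pending` folder (a name I made up; only 'inprogress' and 'done' come from the description above), all three folders live on the same filesystem so that `os.rename` is an atomic move, and `process` is a placeholder for whatever actually handles one file. A file that fails is deliberately left in `inprogress` rather than retried, so a half-processed cluster is never picked up again without someone looking at it.

```python
import os
from pathlib import Path


def run_queue(root, process):
    """Process every CSV in root/pending exactly once.

    `process` is a caller-supplied function taking the path of one claimed
    file. Returns (finished, failed) lists of file names.
    """
    root = Path(root)
    pending, inprogress, done = (root / n for n in ("pending", "inprogress", "done"))
    for d in (pending, inprogress, done):
        d.mkdir(parents=True, exist_ok=True)

    finished, failed = [], []
    for src in sorted(pending.glob("*.csv")):
        claimed = inprogress / src.name
        try:
            # The rename is the claim: only one worker can move a given file.
            os.rename(src, claimed)
        except FileNotFoundError:
            continue  # another worker claimed it first
        try:
            process(claimed)
        except Exception:
            failed.append(src.name)  # stays in inprogress/ for inspection
            continue
        os.rename(claimed, done / src.name)
        finished.append(src.name)
    return finished, failed
```

Because a file only ever moves forward (pending, then inprogress, then done), rerunning the script after a crash picks up only the files that were never claimed. Anything stranded in `inprogress` needs a manual decision about whether to move it back.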
