…n prepare_ml_batches.py. Renamed DataSourceList to Manager. Started fleshing out Manager class.
Hi @jacobbieker. As we discussed yesterday, I've sketched out the very rough design in |
LGTM! I like the plan and the simplicity of it. Just one comment
scripts/prepare_ml_data.py
Outdated
manager.make_destination_paths()
manager.check_paths_exist()
# TODO: If not overwrite, for each DataSource, get the maximum_batch_id already on disk.
# TODO: Check if the spatial_and_temporal_locations_of_each_example.csv files exist. If not, create these files.
If the paths exist but this file doesn't, how would the script know what the spatial and temporal locations of each example are? Should this throw an error in that case? I think this check should go before getting the max batch ID, and error out if it's not overwrite, batches do exist, and this file does not.
Very good point! As you suggested, I've moved "Check if the spatial_and_temporal_locations_of_each_example.csv files exist. If not, create these files." above checking for max_batch_id. Thanks! Good spot!
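For reference, here is a minimal sketch of the ordering agreed above: check the locations file first, then look up the maximum batch ID. This is not the code in this PR. The function names, the `.nc` batch file naming, and the flat directory layout are all assumptions made for illustration; only the CSV filename comes from the TODO.

```python
import re
from pathlib import Path

# Filename taken from the TODO in scripts/prepare_ml_data.py.
LOCATIONS_FILENAME = "spatial_and_temporal_locations_of_each_example.csv"

# Hypothetical batch naming: zero-padded integer plus ".nc", e.g. "000123.nc".
BATCH_FILENAME_PATTERN = re.compile(r"(\d+)\.nc")


def get_max_batch_id(split_dir: Path) -> int:
    """Return the largest batch ID found in split_dir, or -1 if there are none."""
    max_batch_id = -1
    for path in split_dir.iterdir():
        match = BATCH_FILENAME_PATTERN.fullmatch(path.name)
        if match:
            max_batch_id = max(max_batch_id, int(match.group(1)))
    return max_batch_id


def first_batch_id_to_create(split_dir: Path, overwrite: bool) -> int:
    """Hypothetical helper: validate the split directory, then pick the next batch ID."""
    locations_file = split_dir / LOCATIONS_FILENAME
    max_batch_id = get_max_batch_id(split_dir)

    # Step 1: check the locations file BEFORE relying on max_batch_id.
    if not locations_file.exists():
        if max_batch_id >= 0 and not overwrite:
            # Batches exist but we cannot know which locations they cover.
            raise FileNotFoundError(
                f"{split_dir} contains batches but {LOCATIONS_FILENAME} is missing."
            )
        # Placeholder for real creation of the locations file.
        locations_file.touch()

    # Step 2: start from zero when overwriting, otherwise continue after
    # the last batch already on disk.
    return 0 if overwrite else max_batch_id + 1
```

With this ordering, a directory that holds batches but no locations file fails fast instead of silently appending new batches whose locations are inconsistent with the old ones.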
…f_each_example.csv for each split
Hi @jacobbieker OK, I think I'll stop here in this PR; and continue in a subsequent PR tomorrow! This PR implements a rough draft of (almost) all the steps listed in
Tomorrow, I'll start a new PR which builds off this one, and implements the second half of the "big new design": Reading in the |
Yeah, this looks great! I like how it's set up and going!
Pull Request
Description
Implement roughly the first half of the "Big New Design"! This is quite a big PR, sorry, because it's plugging together the new code.
Broadly implements an updated version of the design first sketched out in #213 (comment)
Also implements / fixes some other issues which were blocking this PR:
- load_solar_pv_data_from_gcs() should use fsspec and hence be able to load data from any compute environment #286
- split() function #299
- DataSourceList into Manager; and maintain DataSources in a dict instead of a list? #298
- prepare_ml_data.py #171
How Has This Been Tested?
The new prepare_ml_data.py runs successfully. But the unit tests are currently failing (deliberately! I haven't updated the tests yet!)
Checklist: