Big new design part 1 :) #300

JackKelly · 2021-10-28T11:06:19Z

Pull Request

Description

Implement roughly the first half of the "Big New Design"! This is quite a big PR, sorry, because it's plugging together the new code.

Broadly implements an updated version of the design first sketched out in #213 (comment)

Also implements / fixes some other issues which were blocking this PR:

load_solar_pv_data_from_gcs() should use fsspec and hence be able to load data from any compute environment #286
Assert there's no overlap between train, test and validation datetimes at end of split() function #299
Change DataSourceList into Manager; and maintain DataSources in a dict instead of a list? #298
Allow user to configure the frequency of the t0 datetimes in the config yaml #277
Pass command-line-arguments into prepare_ml_data.py #171
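As an illustration of the overlap check from #299, here is a minimal sketch using only the standard library. The function name `assert_no_overlap`, the split names, and the example datetimes are assumptions for illustration; they are not taken from the code in this PR.

```python
from datetime import datetime, timedelta
from itertools import combinations
from typing import Mapping, Sequence


def assert_no_overlap(splits: Mapping[str, Sequence[datetime]]) -> None:
    """Raise AssertionError if any two splits share a datetime."""
    for (name_a, dts_a), (name_b, dts_b) in combinations(splits.items(), 2):
        shared = set(dts_a) & set(dts_b)
        if shared:
            raise AssertionError(
                f"{len(shared)} datetimes appear in both {name_a!r} and {name_b!r}"
            )


# Hypothetical t0 datetimes at a 30-minute frequency.
start = datetime(2021, 1, 1)
t0_datetimes = [start + i * timedelta(minutes=30) for i in range(12)]
splits = {
    "train": t0_datetimes[:8],
    "validation": t0_datetimes[8:10],
    "test": t0_datetimes[10:],
}
assert_no_overlap(splits)  # Passes silently: the three slices are disjoint.
```

Running the check once at the end of split() means a leaky split fails immediately, before any batches are written.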

How Has This Been Tested?

The new prepare_ml_data.py runs successfully.

But the unit tests are currently failing (deliberately! I haven't updated the tests yet!)


Checklist:

My code follows OCF's coding style guidelines
I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have checked my code and corrected any misspellings

JackKelly · 2021-10-28T11:08:12Z

Hi @jacobbieker. As we discussed yesterday, I've sketched out the very rough design in main() in prepare_ml_data.py. The code is pretty broken at the moment so please don't pay any attention to the details yet! But, if you get a minute, please do take a look at the broad sequence of steps sketched out in main() in prepare_ml_data.py and let me know if you have any comments! Thanks!

jacobbieker

LGTM! I like the plan and the simplicity of it. Just one comment

jacobbieker · 2021-10-28T11:14:39Z

scripts/prepare_ml_data.py

+    manager.make_destination_paths()
+    manager.check_paths_exist()
+    # TODO: If not overwrite, for each DataSource, get the maximum_batch_id already on disk.
+    # TODO: Check if the spatial_and_temporal_locations_of_each_example.csv files exist. If not, create these files.


If the paths exist but this file doesn't, how would the script know what the spatial and temporal locations of each example are? Should this throw an error if the paths exist but this file doesn't? I think this check should go before getting the max batch ID, and should error out if it's not in overwrite mode, batches do exist, and this file does not.

Very good point! As you suggested, I've moved "Check if the spatial_and_temporal_locations_of_each_example.csv files exist. If not, create these files." above checking for max_batch_id. Thanks! Good spot!
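The ordering agreed above could be sketched roughly as follows. This is only an illustration: the function name `max_batch_id_on_disk`, the directory layout (`<split>/<data_source>/<batch_id>.nc`), and the choice of exception are assumptions, not the actual Manager code.

```python
import tempfile
from pathlib import Path
from typing import Optional

LOCATIONS_CSV = "spatial_and_temporal_locations_of_each_example.csv"


def max_batch_id_on_disk(split_dir: Path, overwrite: bool) -> Optional[int]:
    """Return the largest batch ID on disk, or None if starting from scratch.

    Assumed (hypothetical) layout: <split_dir>/<data_source>/<batch_id>.nc
    """
    if overwrite:
        return None  # Existing batches are ignored and will be replaced.
    batch_ids = [int(p.stem) for p in split_dir.glob("*/*.nc") if p.stem.isdigit()]
    # Check for the CSV *before* trusting any batches already on disk.
    if batch_ids and not (split_dir / LOCATIONS_CSV).exists():
        raise FileNotFoundError(
            f"Batches exist in {split_dir} but {LOCATIONS_CSV} is missing."
        )
    # With no batches on disk, the caller is free to create the CSV.
    return max(batch_ids) if batch_ids else None


with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp) / "train"
    (split_dir / "satellite").mkdir(parents=True)
    (split_dir / "satellite" / "000003.nc").touch()
    try:
        max_batch_id_on_disk(split_dir, overwrite=False)
    except FileNotFoundError as err:
        print(f"Refusing to continue: {err}")
    (split_dir / LOCATIONS_CSV).touch()
    print(max_batch_id_on_disk(split_dir, overwrite=False))
```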

JackKelly · 2021-10-28T17:24:43Z

Hi @jacobbieker OK, I think I'll stop here in this PR; and continue in a subsequent PR tomorrow!

This PR implements a rough draft of (almost) all the steps listed in prepare_ml_data.main() up to and including creating spatial_and_temporal_locations_of_each_example.csv for each split.

prepare_ml_data.py runs. But the unittests still fail, and there are a bunch of linter errors, and I need to write a bunch of new unittests. But, if you fancy it, please do skim-read the code to make sure you're happy with the broad direction (but please don't worry about linter errors, missing docstrings etc... I'll get to those tomorrow, hopefully!)

Tomorrow, I'll start a new PR which builds off this one, and implements the second half of the "big new design": Reading in the spatial_and_temporal_locations_of_each_example.csv files and starting separate processes for each DataSource to prepare batches.
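To make the planned second half concrete, here is a rough sketch of the idea: read the locations CSV once, split the rows into batches, then start one process per DataSource. Everything specific here is an assumption for illustration: the CSV column names, the data source names, the batch size, and the `prepare_batches` stand-in. None of it is the code from the follow-up PR.

```python
import csv
import io
from multiprocessing import Process
from typing import Dict, List

# Hypothetical column names; the real CSV schema is not shown in this PR.
EXAMPLE_CSV = """\
t0_datetime_UTC,x_center_OSGB,y_center_OSGB
2021-01-01T12:00,100.0,200.0
2021-01-01T12:30,110.0,210.0
2021-01-01T13:00,120.0,220.0
2021-01-01T13:30,130.0,230.0
"""

Row = Dict[str, str]


def read_locations(csv_text: str) -> List[Row]:
    """Parse the locations CSV into one dict per example."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def split_into_batches(rows: List[Row], batch_size: int) -> List[List[Row]]:
    """Group consecutive examples into batches of at most batch_size."""
    return [rows[i : i + batch_size] for i in range(0, len(rows), batch_size)]


def prepare_batches(data_source_name: str, batches: List[List[Row]]) -> None:
    """Stand-in for a DataSource loading data and writing its batches."""
    for batch_id, batch in enumerate(batches):
        print(f"{data_source_name}: batch {batch_id} has {len(batch)} examples")


if __name__ == "__main__":
    batches = split_into_batches(read_locations(EXAMPLE_CSV), batch_size=2)
    # Every data source receives the same locations, so examples stay aligned.
    processes = [
        Process(target=prepare_batches, args=(name, batches))
        for name in ("satellite", "nwp", "pv")
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
```

Because every process reads the same locations, batch N from each data source covers the same examples, even though the sources are prepared independently.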

jacobbieker · 2021-10-28T17:48:01Z

Yeah, this looks great! I like how it's set up and going!

Making a start on the big new design! Sketched out the basic design i…

b102da8

…n prepare_ml_batches.py. Renamed DataSourceList to Manager. Started fleshing out Manager class.

JackKelly added enhancement New feature or request refactoring labels Oct 28, 2021

JackKelly requested a review from jacobbieker October 28, 2021 11:06

JackKelly self-assigned this Oct 28, 2021

JackKelly linked an issue Oct 28, 2021 that may be closed by this pull request

"Big new design" for nowcasting_dataset #213

Closed

38 tasks

jacobbieker reviewed Oct 28, 2021

JackKelly added 8 commits October 28, 2021 13:16

Implement arg_logger decorator

63f0a2a

enable load_solar_pv_data to load from any compute environment. Fixes #…

663852d

…286

Successfully gets t0 datetimes

61be554

fix incorrect logger message

ff18699

successfully checks for CSV file

8d5043b

Check there is no overlap between split datetimes. Fixes #299

8bef05c

Successfully creates directories and spatial_and_temporal_locations_o…

4d28923

…f_each_example.csv for each split

tidy up check_directories

33318b3

JackKelly changed the title ~~Big new design :)~~ Big new design part 1 :) Oct 28, 2021

JackKelly mentioned this pull request Oct 28, 2021

BUG: Fails to write last few batches to GCS #62

Closed

JackKelly marked this pull request as ready for review October 28, 2021 17:20

jacobbieker self-requested a review October 28, 2021 17:48

jacobbieker approved these changes Oct 28, 2021

Fix merge conflicts with main

856fe64

This was referenced Oct 29, 2021

Big new design Part 2 :) #307

Merged

Run validation script at the end of prepare_ml_data.py? #317

Open

JackKelly closed this Nov 1, 2021

JackKelly deleted the jack/big-new-design branch November 16, 2021 21:54
