Skip to content
This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Big new design part 1 :) #300

Closed
wants to merge 10 commits into from
Closed

Big new design part 1 :) #300

wants to merge 10 commits into from

Conversation

JackKelly
Copy link
Member

@JackKelly JackKelly commented Oct 28, 2021

Pull Request

Description

Imlement roughly the first half of the "Big New Design"! This is quite a big PR, sorry, because it's plugging together the new code.

Broadly implements an updated version of the design first sketched out in #213 (comment)

Also implements / fixes some other issues which were blocking this PR:

How Has This Been Tested?

The new prepare_ml_data.py runs succesfully.

But the unit ests are currently failing (deliberately! I haven't updated the tests yet!)

  • No
  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

…n prepare_ml_batches.py. Renamed DataSourceList to Manager. Started fleshing out Manager class.
@JackKelly JackKelly added enhancement New feature or request refactoring labels Oct 28, 2021
@JackKelly JackKelly requested a review from jacobbieker October 28, 2021 11:06
@JackKelly JackKelly self-assigned this Oct 28, 2021
@JackKelly
Copy link
Member Author

JackKelly commented Oct 28, 2021

Hi @jacobbieker. As we discussed yesterday, I've sketched out the very rough design in main() in prepare_ml_data.py. The code is pretty broken at the moment so please don't pay any attention to the details yet! But, if you get a minute, please do take a look at the broad sequence of steps sketched out in main() in prepare_ml_data.py and let me know if you have any comments! Thanks!

@JackKelly JackKelly linked an issue Oct 28, 2021 that may be closed by this pull request
38 tasks
Copy link
Member

@jacobbieker jacobbieker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I like the plan and the simplicity of it. Just one comment

manager.make_destination_paths()
manager.check_paths_exist()
# TODO: If not overwrite, for each DataSource, get the maximum_batch_id already on disk.
# TODO: Check if the spatial_and_temporal_locations_of_each_example.csv files exist. If not, create these files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the paths exist, but this doesn't, how would the script know what the spatial and temporal locations of each example is? Should this throw an error if the paths exist, but this doesn't? I guess, I think think this should go before the getting the max batch ID, and error out if its not overwrite, but batches do exist, and this file does not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point! As you suggested, I've moved "Check if the spatial_and_temporal_locations_of_each_example.csv files exist. If not, create these files." above checking for max_batch_id. Thanks! Good spot!

@JackKelly
Copy link
Member Author

Hi @jacobbieker OK, I think I'll stop here in this PR; and continue in a subsequent PR tomorrow!

This PR implements a rough draft of (almost) all the steps listed in prepare_ml_data.main() up to and including creating spatial_and_temporal_locations_of_each_example.csv for each split.

prepare_ml_data.py runs. But the unittests still fail, and there are a bunch of linter errors, and I need to write a bunch of new unittests. But, if you fancy it, please do skim-read the code to make sure you're happy with the broad direction (but please don't worry about linter errors, missing docstrings etc... I'll get to those tomorrow, hopefully!)

Tomorrow, I'll start a new PR which builds off this one, and implements the second half of the "big new design": Reading in the spatial_and_temporal_locations_of_each_example.csv files and starting separate processes for each DataSource to prepare batches.

@jacobbieker
Copy link
Member

Hi @jacobbieker OK, I think I'll stop here in this PR; and continue in a subsequent PR tomorrow!

This PR implements a rough draft of (almost) all the steps listed in prepare_ml_data.main() up to and including creating spatial_and_temporal_locations_of_each_example.csv for each split.

prepare_ml_data.py runs. But the unittests still fail, and there are a bunch of linter errors, and I need to write a bunch of new unittests. But, if you fancy it, please do skim-read the code to make sure you're happy with the broad direction (but please don't worry about linter errors, missing docstrings etc... I'll get to those tomorrow, hopefully!)

Tomorrow, I'll start a new PR which builds off this one, and implements the second half of the "big new design": Reading in the spatial_and_temporal_locations_of_each_example.csv files and starting separate processes for each DataSource to prepare batches.

Yeah, this looks great! I like how its set up and going!

@jacobbieker jacobbieker self-requested a review October 28, 2021 17:48
@JackKelly JackKelly closed this Nov 1, 2021
@JackKelly JackKelly deleted the jack/big-new-design branch November 16, 2021 21:54
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.