03 cifar pipeline #9
Conversation
Nice! I've added a couple of comments, mostly minor Python-style things.
And a general ARC code-style thing to discuss separately: maybe we should have recommended ARC styles for docstrings and type hints? E.g. I like type hints in function definitions but personally tend not to go as far as things like `*args: Any` or `-> None`.
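To make the distinction concrete, here is a hypothetical example (function names and bodies are made up for illustration, not taken from the PR) contrasting the fully annotated style with a lighter-touch one:

from typing import Any

# Fully annotated style: everything is typed, including *args and the None return.
def log_metrics(prefix: str, *args: Any, **kwargs: Any) -> None:
    print(prefix, args, kwargs)

# Lighter-touch style: annotate the arguments that benefit most, skip the rest.
def scale(value: float, factor: float = 2.0):
    return value * factor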
index, _ = train_test_split(
    self.dataset_train.indices,
    test_size=self.drop,
    stratify=labels,
    random_state=self.seed,
)
Are you likely to need the indices of the dropped rows for anything later? You could figure it out from the remaining ones that are kept but might make your life easier to store them here somehow if you might need them.
Also, if this function will be called multiple times from the same script you'd need to be wary that the random seed will always have the same value (unless you change `self.seed` between calls).
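For illustration, a minimal sketch of how the dropped indices could be kept around (the `self.dropped_indices` attribute name is hypothetical, not from the PR):

# Sketch only: self.dropped_indices is an assumed attribute name.
index, dropped = train_test_split(
    self.dataset_train.indices,
    test_size=self.drop,
    stratify=labels,
    random_state=self.seed,
)
self.dropped_indices = dropped  # stored in case it's needed later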
I don't think we will need them later - we just need a certain % to be dropped via the dataloader in a consistent way.
I did the seed bit intentionally - the idea is that given a certain seed, you can create the same datasets A and B, but then set one to drop a % of the training data. My understanding is that in this case we would want that stage to be replicable.
Definitely want them to be replicable 🙂
It was more a general comment than a suggestion to change anything, just that doing the below will give identical A and B, so in a situation where you want A and B to have different indices dropped you'd need to create a new instance of the class (or change the implementation):
cifar = CIFAR10DataModulePlus(drop=0.2, seed=123)
A = cifar.train_dataloader()
B = cifar.train_dataloader()
# A and B are identical
That's the next case to implement, and I guess it will require either a new class or a refactoring of this one. Here, though, the idea is to have A and B be identical except for dropping some examples from one but not both of them.
I think I'm ok with `A` and `B` being separate instances of `CIFAR10DataModulePlus` rather than separate calls to `train_dataloader`.
A = CIFAR10DataModulePlus(drop=0.0, seed=123) # could just use CIFAR10DataModule, of course!
B = CIFAR10DataModulePlus(drop=0.2, seed=123)
It wouldn't be the most efficient if you needed to hold A and B in memory together but you should only need one at a time for our experiments.
Also, maybe a good opportunity to try to add a test for this?
Co-authored-by: Jack Roberts <[email protected]>
I've pushed a couple more commits that:
I've also accepted the suggested move of the comment into the docstring.
I've further added a test folder with a simple test for the dataloader. I'll think about some more tests to add tomorrow.
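For reference, a rough sketch of what a test along these lines might look like (the import path, the LightningDataModule lifecycle calls, and the tolerance are assumptions, not taken from the repository):

from cifar_pipeline import CIFAR10DataModulePlus  # import path is hypothetical

def test_drop_loader():
    full = CIFAR10DataModulePlus(drop=0.0, seed=123)
    reduced = CIFAR10DataModulePlus(drop=0.2, seed=123)
    # Assumes the usual LightningDataModule lifecycle methods are available.
    for dm in (full, reduced):
        dm.prepare_data()
        dm.setup()

    n_full = len(full.train_dataloader().dataset)
    n_reduced = len(reduced.train_dataloader().dataset)

    # Dropping 20% of the training data should leave roughly 80% of it.
    assert n_reduced < n_full
    assert abs(n_reduced - 0.8 * n_full) <= 1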
Looks good to me! That test is probably enough for now IMO. Being picky, Python style convention would be to call the function `test_drop_loader` rather than `testDropLoader`.
Looks great, Phil, really clean and a nice size of PR to review. Great stuff on the typing too :D
A few minor points in comments
Also, not strictly this project but could you port over appropriate bits of the README from https://github.com/alan-turing-institute/ARC-project-template ?
Agree @jack89roberts, we should have a recommendation for style here. I'm personally a big fan of typing and like the clarity it brings. Let's discuss at the next team meeting how we do style guide stuff.
Looks great. One very minor comment, but happy for this to be merged in after that!
This pull request adds: