03 cifar pipeline #9
Conversation
Nice! I've added a couple of comments, mostly minor Python-style things.
And a general ARC code-style thing to discuss separately: maybe we should have recommended ARC styles for docstrings and type hints? E.g. I like type hints in function definitions but personally tend not to go as far as things like `*args: Any` or `-> None`.
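To make the distinction concrete, here is a hypothetical example (function names and bodies are made up for illustration, not taken from the PR) contrasting the fully annotated style with a lighter-touch one:

from typing import Any

# Fully annotated style: everything is typed, including *args and the None return.
def log_metrics(prefix: str, *args: Any, **kwargs: Any) -> None:
    print(prefix, args, kwargs)

# Lighter-touch style: annotate the arguments that benefit most, skip the rest.
def scale(value: float, factor: float = 2.0):
    return value * factor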
index, _ = train_test_split(
    self.dataset_train.indices,
    test_size=self.drop,
    stratify=labels,
    random_state=self.seed,
)
Are you likely to need the indices of the dropped rows for anything later? You could figure it out from the remaining ones that are kept but might make your life easier to store them here somehow if you might need them.
Also, if this function will be called multiple times from the same script you'd need to be wary that the random seed will always have the same value (unless you change `self.seed` between calls).
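For illustration, a minimal sketch of how the dropped indices could be kept around (the `self.dropped_indices` attribute name is hypothetical, not from the PR):

# Sketch only: self.dropped_indices is an assumed attribute name.
index, dropped = train_test_split(
    self.dataset_train.indices,
    test_size=self.drop,
    stratify=labels,
    random_state=self.seed,
)
self.dropped_indices = dropped  # stored in case it's needed later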
I don't think we will need them later - we just need a certain % to be dropped via the dataloader in a consistent way.
I did the seed bit intentionally - the idea is that given a certain seed, you can create the same datasets A and B, but then set one to drop a % of the training data. My understanding is that in this case we would want that stage to be replicable.
Definitely want them to be replicable 🙂
It was more a general comment than a suggestion to change anything, just that doing the below will give identical A and B, so in a situation where you want A and B to have different indices dropped you'd need to create a new instance of the class (or change the implementation):
cifar = CIFAR10DataModulePlus(drop=0.2, seed=123)
A = cifar.train_dataloader()
B = cifar.train_dataloader()
# A and B are identical
That's the next case to implement, and I guess it will require either a new class or a refactoring of this one. Here, though, the idea is to have A and B be identical except for dropping some examples from one but not both of them.
I think I'm ok with `A` and `B` being separate instances of `CIFAR10DataModulePlus` rather than separate calls to `train_dataloader`.
A = CIFAR10DataModulePlus(drop=0.0, seed=123) # could just use CIFAR10DataModule, of course!
B = CIFAR10DataModulePlus(drop=0.2, seed=123)
It wouldn't be the most efficient if you needed to hold A and B in memory together but you should only need one at a time for our experiments.
Also, maybe a good opportunity to try to add a test for this?
Co-authored-by: Jack Roberts <[email protected]>
I've pushed a couple more commits that:
I've also accepted the suggested move of the comment into the docstring.
I've further added a test folder with a simple test for the dataloader. I'll think about some more tests to add tomorrow.
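For reference, a rough sketch of what a test along these lines might look like (the import path, the LightningDataModule lifecycle calls, and the tolerance are assumptions, not taken from the repository):

from cifar_pipeline import CIFAR10DataModulePlus  # import path is hypothetical

def test_drop_loader():
    full = CIFAR10DataModulePlus(drop=0.0, seed=123)
    reduced = CIFAR10DataModulePlus(drop=0.2, seed=123)
    # Assumes the usual LightningDataModule lifecycle methods are available.
    for dm in (full, reduced):
        dm.prepare_data()
        dm.setup()

    n_full = len(full.train_dataloader().dataset)
    n_reduced = len(reduced.train_dataloader().dataset)

    # Dropping 20% of the training data should leave roughly 80% of it.
    assert n_reduced < n_full
    assert abs(n_reduced - 0.8 * n_full) <= 1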
Looks good to me! That test is probably enough for now IMO. Being picky, Python style convention would be to call the function `test_drop_loader` rather than `testDropLoader`.
Looks great, Phil, really clean and a nice size of PR to review. Great stuff on the typing too :D
A few minor points in comments
Also, not strictly this project but could you port over appropriate bits of the README from https://github.com/alan-turing-institute/ARC-project-template ?
Agree @jack89roberts, we should have a recommendation for style here. I'm personally a big fan of typing and like the clarity it brings. Let's discuss at the next team meeting how we do style guide stuff.
Looks great. One very minor comment, but happy for this to be merged in after that!
This pull request adds: