
Train on 2020. Test on 2021. No validation set? #390

Closed
Tracked by #393
JackKelly opened this issue Nov 16, 2021 · 9 comments · Fixed by #424
Labels: data (New data source or feature; or modification of existing data source), discussion, enhancement (New feature or request)

Comments

@JackKelly
Member

To align with ESO's PV forecasts, let's configure nowcasting_dataset to use 2020 for the training set, and use 2021-01-01 to 2021-08-31 for the test set.

Do we need a separate validation set? I'm not sure we do, given that we're unlikely to use early stopping. And, "just" a year of training data feels a bit tight, so I'm keen to make sure we use as much of our data as possible for training. What do you guys think?

@JackKelly added the data, discussion and enhancement labels on Nov 16, 2021
@JackKelly moved this to Todo in Nowcasting on Nov 16, 2021
@peterdudfield
Contributor

It is nice to have a separate validation set for the ML models to use - i.e. so they don't over-fit.

Perhaps as a balance we could do

  • train: 90% of 2020
  • validation: 10% of 2020
  • test: all of 2021

The split code can already do this - we just need to turn it on:
https://github.com/openclimatefix/nowcasting_dataset/blob/main/nowcasting_dataset/dataset/split/split.py#L152
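
To make the idea concrete, here's a rough sketch of the kind of 90/10/test split I mean (illustrative only - this is not the actual API in split.py):

```python
# Illustrative sketch only - not the actual nowcasting_dataset split API.
import numpy as np
import pandas as pd


def split_datetimes(datetimes: pd.DatetimeIndex, train_frac: float = 0.9, seed: int = 42):
    """Train/validation from 2020 (90/10), test from 2021."""
    dt_2020 = datetimes[datetimes.year == 2020]
    test = datetimes[datetimes.year == 2021]

    # Shuffle the 2020 datetimes, then take the first 90% for training.
    rng = np.random.default_rng(seed)
    shuffled = dt_2020[rng.permutation(len(dt_2020))]
    n_train = int(train_frac * len(dt_2020))
    train, validation = shuffled[:n_train], shuffled[n_train:]
    return train, validation, test
```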

@JackKelly
Member Author

It is nice to have a separate validation set for the ML models to use - i.e. so they don't over-fit

I'm really sorry, I haven't had enough coffee yet 🙂 please could you explain how the 10% validation data would be used during model training?

@peterdudfield
Contributor

Ah, maybe I've misunderstood something. But I thought you can give most ML models a training dataloader and a validation dataloader; then, at the end of each epoch, the validation dataloader is used to measure some metric, which can be used for 'early stopping' when training the model. I was using this a bit in predict_pv_yield.

Perhaps there is a more modern way to do it, or just a different way to handle this now.
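
For reference, this is the pattern I mean - a minimal PyTorch Lightning sketch (predict_pv_yield uses Lightning; the model and dataloader names here are placeholders, not code from our repos):

```python
# Minimal early-stopping sketch with PyTorch Lightning.
# Assumes `model` is a LightningModule that logs "val_loss" in its
# validation_step, and `train_dl` / `val_dl` are torch DataLoaders.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",  # metric measured on the validation dataloader
    patience=3,          # stop after 3 epochs with no improvement
    mode="min",
)

trainer = pl.Trainer(max_epochs=100, callbacks=[early_stop])
trainer.fit(model, train_dl, val_dl)  # val_dl is evaluated each epoch
```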

@JackKelly
Member Author

Yeah, I'm honestly not sure what's best!

You're right that, if we're using early stopping, then we should have separate test and validation sets.

But, if we're not using early stopping, then I think it's OK to give PyTorch 2021 as PyTorch's "validation DataLoader", so we can see metrics on the "real" test set (although perhaps that's a bit naughty, because we'll be optimising hyperparameters based on the score on the 2021 data).

But, if I've understood correctly, most self-attention papers don't seem to use early stopping: They just train for as long as they can! (But, that said, the Perceiver IO paper did go to great lengths to limit over-fitting... maybe that included early stopping, I don't quite remember!)

@jacobbieker do you plan to use early stopping? What do you think about whether we should split our data two or three ways (train, test & validation)? 🙂

@jacobbieker
Member

I am fine without a validation set. It's nice to have one, though, so that we aren't optimizing towards the 'future' we are predicting for - that might skew the results and make our models look better than they actually are. Just training as long as possible and seeing what happens would also work. I am just a bit hesitant about optimizing the model directly against the data we are forecasting for, since that biases our results to look better. For now, since we don't have huge amounts of data ready, it's fine, but I'd prefer a validation set once we start training on the whole time range we have.

@peterdudfield
Contributor

So maybe for now we go for:

  • train: 100% of 2020
  • validation: 0% of 2020
  • test: all of 2021
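
In other words (a sketch only - these names are hypothetical, not the real nowcasting_dataset config schema; the test end date is taken from the issue description):

```python
# Hypothetical sketch of the agreed split as explicit date ranges
# (variable names are illustrative, not the real config schema).
import pandas as pd

TRAIN_PERIOD = (pd.Timestamp("2020-01-01"), pd.Timestamp("2020-12-31"))
TEST_PERIOD = (pd.Timestamp("2021-01-01"), pd.Timestamp("2021-08-31"))
# No validation period for now.


def assign_split(t: pd.Timestamp) -> str:
    """Return which split a timestamp falls into."""
    if TRAIN_PERIOD[0] <= t <= TRAIN_PERIOD[1]:
        return "train"
    if TEST_PERIOD[0] <= t <= TEST_PERIOD[1]:
        return "test"
    raise ValueError(f"{t} is outside both the train and test periods")
```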

@peterdudfield self-assigned this on Nov 17, 2021
@peterdudfield
Contributor

I'll create a PR for this - shouldn't be too much work.

@JackKelly
Member Author

SGTM! Thanks!

@JackKelly moved this from Todo to In Progress in Nowcasting on Nov 17, 2021
Repository owner moved this from In Progress to Done in Nowcasting on Nov 17, 2021
@peterdudfield
Contributor

Relates to #322.
