Train on 2020. Test on 2021. No validation set? #390
It would be nice to have a separate validation set for the ML models to use, i.e. so they don't over-fit. Perhaps as a balance we could hold out ~10% of the training data as a validation set.
The code would be able to do this - we'd just need to turn it on.
I'm really sorry, I haven't had enough coffee yet 🙂 please could you explain how the 10% validation data would be used during model training?
Ah, maybe I've misunderstood something, but I thought you can give the ML model a training dataloader and a validation dataloader. Then, at the end of each epoch, the validation dataloader is used to measure some sort of metric, which can then be used for 'early stopping' when training the model. I was using this a bit in 'predict_pv_yield'. Perhaps there is a more modern way to do it, or just a different way to handle this now.
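For what it's worth, here's a minimal sketch of that pattern as it might look in PyTorch Lightning; the toy model, random data, and the `val_loss` metric name are purely illustrative, not anything from 'predict_pv_yield' or `nowcasting_dataset`:

```python
# Hedged sketch: a validation DataLoader driving early stopping in PyTorch Lightning.
# The toy model and random data are placeholders, not actual project code.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # Logged once per epoch; this is the metric EarlyStopping monitors.
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def random_loader():
    x, y = torch.randn(256, 8), torch.randn(256, 1)
    return DataLoader(TensorDataset(x, y), batch_size=32)


train_loader, val_loader = random_loader(), random_loader()

# Stop training once "val_loss" hasn't improved for 3 consecutive epochs.
trainer = pl.Trainer(
    max_epochs=100,
    callbacks=[EarlyStopping(monitor="val_loss", patience=3, mode="min")],
)

# The validation DataLoader never contributes gradients; it's only used to
# compute the metric that decides when to stop.
trainer.fit(ToyModel(), train_loader, val_loader)
```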
Yeah, I'm honestly not sure what's best! You're right that, if we're using early stopping, then we should have separate test and validation sets. But, if we're not using early stopping, then I think it's OK to give PyTorch the 2021 data as its "validation DataLoader", so we can see metrics on the "real" test set (although perhaps that's a bit naughty because we'll be optimising hyperparameters based on the score on the 2021 data).

But, if I've understood correctly, most self-attention papers don't seem to use early stopping: they just train for as long as they can! (That said, the Perceiver IO paper did go to great lengths to limit over-fitting... maybe that included early stopping, I don't quite remember!)

@jacobbieker do you plan to use early stopping? What do you think about whether we should split our data two or three ways (train, test & validation)? 🙂
I am fine without a validation set, but I think it's nice to have one so that we aren't optimizing against the 'future' we are predicting for - I think that might skew the results and make our models look better than they actually are. But yeah, just training as long as possible and seeing what happens would also work. I am just a bit hesitant about optimizing the model directly against the data we are forecasting for; that seems to bias our results to look better than they are. For now, since we don't have huge amounts of data ready, it's fine, but I'd prefer a validation set once we start training on the whole time period we have.
So maybe for now we go for: train on 100% of 2020, test on 2021-01-01 to 2021-08-31, and no separate validation set.
I'll create a PR for this - shouldn't be too much work.
SGTM! Thanks!
Relates to #322
To align with ESO's PV forecasts, let's configure `nowcasting_dataset` to use 2020 for the training set, and use 2021-01-01 to 2021-08-31 for the test set.

Do we need a separate validation set? I'm not sure we do, given that we're unlikely to use early stopping. And "just" a year of training data feels a bit tight, so I'm keen to make sure we use as much of our data as possible for training. What do you guys think?
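For concreteness, here's a rough sketch of that split, assuming the available timestamps live in a pandas DatetimeIndex; the variable names and the 5-minute frequency are assumptions for illustration, not `nowcasting_dataset`'s actual API:

```python
# Hedged sketch of the proposed train/test split; the variable names and the
# 5-minute frequency are assumptions, not nowcasting_dataset's actual config.
import pandas as pd

# All available timestamps (placeholder range and frequency).
all_datetimes = pd.date_range("2020-01-01", "2021-08-31 23:55", freq="5min")

# Train: all of 2020.  Test: 2021-01-01 to 2021-08-31.  No separate validation set.
train_datetimes = all_datetimes[all_datetimes < "2021-01-01"]
test_datetimes = all_datetimes[all_datetimes >= "2021-01-01"]
```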