
Train on 2020. Test on 2021. No validation set? #390

Closed
Tracked by #393
JackKelly opened this issue Nov 16, 2021 · 9 comments · Fixed by #424
Labels: data (New data source or feature; or modification of existing data source), discussion, enhancement (New feature or request)

Comments

@JackKelly
Member

To align with ESO's PV forecasts, let's configure nowcasting_dataset to use 2020 for the training set, and use 2021-01-01 to 2021-08-31 for the test set.

Do we need a separate validation set? I'm not sure we do, given that we're unlikely to use early stopping. And, "just" a year of training data feels a bit tight, so I'm keen to make sure we use as much of our data as possible for training. What do you guys think?

@JackKelly added the data, discussion and enhancement labels on Nov 16, 2021
@JackKelly moved this to Todo in Nowcasting on Nov 16, 2021
@peterdudfield
Contributor

It is nice to have a separate validation set for the ML models to use - i.e. so they don't over-fit.

Perhaps as a balance we could do

  • train: 90% of 2020
  • validation: 10% of 2020
  • test: all of 2021

The split code can already do this - we just need to turn it on:
https://github.com/openclimatefix/nowcasting_dataset/blob/main/nowcasting_dataset/dataset/split/split.py#L152
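
To make the idea concrete, here's a rough sketch of the kind of 90/10/test split I mean (illustrative only - this is not the actual API in split.py):

```python
# Illustrative sketch only - not the actual nowcasting_dataset split API.
import numpy as np
import pandas as pd


def split_datetimes(datetimes: pd.DatetimeIndex, train_frac: float = 0.9, seed: int = 42):
    """Train/validation from 2020 (90/10), test from 2021."""
    dt_2020 = datetimes[datetimes.year == 2020]
    test = datetimes[datetimes.year == 2021]

    # Shuffle the 2020 datetimes, then take the first 90% for training.
    rng = np.random.default_rng(seed)
    shuffled = dt_2020[rng.permutation(len(dt_2020))]
    n_train = int(train_frac * len(dt_2020))
    train, validation = shuffled[:n_train], shuffled[n_train:]
    return train, validation, test
```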

@JackKelly
Member Author

It is nice to have a separate validation set for the ML models to use - i.e. so they don't over-fit

I'm really sorry, I haven't had enough coffee yet 🙂 please could you explain how the 10% validation data would be used during model training?

@peterdudfield
Contributor

Ah, maybe I've misunderstood something. But I thought you can give most ML models a training dataloader and a validation dataloader; then, at the end of each epoch, the validation dataloader is used to measure some metric, which can be used for 'early stopping' when training the model. I was using this a bit in predict_pv_yield.

Perhaps there is a more modern way to do it, or just a different way to handle this now.
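
For reference, this is the pattern I mean - a minimal PyTorch Lightning sketch (predict_pv_yield uses Lightning; the model and dataloader names here are placeholders, not code from our repos):

```python
# Minimal early-stopping sketch with PyTorch Lightning.
# Assumes `model` is a LightningModule that logs "val_loss" in its
# validation_step, and `train_dl` / `val_dl` are torch DataLoaders.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",  # metric measured on the validation dataloader
    patience=3,          # stop after 3 epochs with no improvement
    mode="min",
)

trainer = pl.Trainer(max_epochs=100, callbacks=[early_stop])
trainer.fit(model, train_dl, val_dl)  # val_dl is evaluated each epoch
```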

@JackKelly
Member Author

Yeah, I'm honestly not sure what's best!

You're right that, if we're using early stopping, then we should have separate test and validation sets.

But, if we're not using early stopping, then I think it's OK to give PyTorch 2021 as PyTorch's "validation DataLoader", so we can see metrics on the "real" test set (although perhaps that's a bit naughty, because we'll be optimising hyperparameters based on the score on the 2021 data).

But, if I've understood correctly, most self-attention papers don't seem to use early stopping: They just train for as long as they can! (But, that said, the Perceiver IO paper did go to great lengths to limit over-fitting... maybe that included early stopping, I don't quite remember!)

@jacobbieker do you plan to use early stopping? What do you think about whether we should split our data two or three ways (train, test & validation)? 🙂

@jacobbieker
Member

I am fine without a validation set. It's nice to have one, though, so that we aren't optimizing towards the 'future' we are predicting for - that might skew the results and make our models look better than they actually are. Just training as long as possible and seeing what happens would also work. I am just a bit hesitant about optimizing the model directly against the data we are forecasting for, since that biases our results to look better. For now, since we don't have huge amounts of data ready, it's fine, but I'd prefer a validation set once we start training on the whole time range we have.

@peterdudfield
Contributor

So maybe for now we go for:

  • train: 100% of 2020
  • validation: 0% of 2020
  • test: all of 2021
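
In other words (a sketch only - these names are hypothetical, not the real nowcasting_dataset config schema; the test end date is taken from the issue description):

```python
# Hypothetical sketch of the agreed split as explicit date ranges
# (variable names are illustrative, not the real config schema).
import pandas as pd

TRAIN_PERIOD = (pd.Timestamp("2020-01-01"), pd.Timestamp("2020-12-31"))
TEST_PERIOD = (pd.Timestamp("2021-01-01"), pd.Timestamp("2021-08-31"))
# No validation period for now.


def assign_split(t: pd.Timestamp) -> str:
    """Return which split a timestamp falls into."""
    if TRAIN_PERIOD[0] <= t <= TRAIN_PERIOD[1]:
        return "train"
    if TEST_PERIOD[0] <= t <= TEST_PERIOD[1]:
        return "test"
    raise ValueError(f"{t} is outside both the train and test periods")
```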

@peterdudfield self-assigned this on Nov 17, 2021
@peterdudfield
Contributor

I'll create a PR for this - shouldn't be too much work.

@JackKelly
Member Author

SGTM! Thanks!

@JackKelly moved this from Todo to In Progress in Nowcasting on Nov 17, 2021
Repository owner moved this from In Progress to Done in Nowcasting on Nov 17, 2021
@peterdudfield
Contributor

Relates to #322.
