This repository was archived by the owner on Nov 27, 2023. It is now read-only.

Add Positional Encoders #33

Closed
wants to merge 9 commits into from

Conversation

@jacobbieker (Member) commented Oct 12, 2021

Pull Request

Description

This adds the positional encoders to fix #30, as well as utilities to sub-select Fourier features for different modalities from one "main" position encoding. It also gives the option of using absolute or relative positional encodings.

Fixes #30
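For illustration only (this is not the code added in this PR; the function name, shapes, and frequency settings are assumptions), a Fourier-feature position encoding along the lines described above might look like:

```python
import math
import torch

def fourier_encode(positions: torch.Tensor, max_freq: float = 10.0, num_bands: int = 4) -> torch.Tensor:
    """Encode positions in [-1, 1] as the raw values plus sin/cos Fourier features.

    positions: (..., d) tensor, one value per axis (e.g. time, y, x).
    Returns:   (..., d * (2 * num_bands + 1)) tensor.
    """
    positions = positions.unsqueeze(-1)                                     # (..., d, 1)
    freqs = torch.linspace(1.0, max_freq / 2, num_bands, device=positions.device)
    scaled = positions * freqs * math.pi                                    # (..., d, num_bands)
    features = torch.cat([positions, scaled.sin(), scaled.cos()], dim=-1)
    return features.flatten(-2)

# One "main" encoding over (time, y, x); a modality that only needs some axes
# (e.g. a PV time series) could sub-select the corresponding feature columns.
t, y, x = torch.meshgrid(
    torch.linspace(-1, 1, 6),    # time steps
    torch.linspace(-1, 1, 32),   # image rows
    torch.linspace(-1, 1, 32),   # image columns
    indexing="ij",
)
encoding = fourier_encode(torch.stack([t, y, x], dim=-1))  # (6, 32, 32, 27)
```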

How Has This Been Tested?

Unit tests

  • No
  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

This will be changed more, but it serves as a baseline. I don't particularly want to add perceiver-model as a dependency for this repo, so some code duplication is probably acceptable. Additionally, since this will be extended and modified more than the version in perceiver-model, I think it should be okay.
@jacobbieker added the enhancement (New feature or request) label Oct 12, 2021
@jacobbieker self-assigned this Oct 12, 2021
@jacobbieker (Member, Author)

@JackKelly @peterdudfield For the datetime features, if we are switching to computing them on the fly, would we want them to be computed here? Or computed in nowcasting_dataset, so that nowcasting-utils can just assume it will be passed the already-computed features? This relates to this code in terms of how I structure the absolute position-encoding code.

@JackKelly (Member)

Good questions!

It would be good to be able to use the datetime encodings for CNN models (as well as for self-attention models).

So I guess there are two slightly separate issues:

  1. Encode the absolute datetimes (using sin and cos) for any ML architecture (CNNs, self-attention, fully-connected, etc.).
  2. Encode the full "position" of each row of input data for self-attention models (i.e. encode the position in space and time, perhaps re-using the temporal encoding from step 1).
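For point 1 above, a rough sketch of the kind of cyclical sin/cos datetime encoding being discussed (the column names, periods, and use of pandas here are assumptions, not the actual interface):

```python
import numpy as np
import pandas as pd

def encode_datetimes(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Map each timestamp onto sin/cos pairs so that cyclical features
    (hour of day, day of year) stay continuous across midnight / new year."""
    hour = index.hour + index.minute / 60          # fractional hour of day
    day_of_year = index.dayofyear
    return pd.DataFrame(
        {
            "hour_sin": np.sin(2 * np.pi * hour / 24),
            "hour_cos": np.cos(2 * np.pi * hour / 24),
            "day_sin": np.sin(2 * np.pi * day_of_year / 365.25),
            "day_cos": np.cos(2 * np.pi * day_of_year / 365.25),
        },
        index=index,
    )

# Example: 5-minutely timestamps for one afternoon.
times = pd.date_range("2021-10-12 12:00", periods=12, freq="5min")
features = encode_datetimes(times)  # shape (12, 4), usable by any architecture
```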

@JackKelly (Member) commented Oct 12, 2021

In terms of where to put the code...

My initial guess would be to put the "datetime encoding" in nowcasting_dataset's thin wrapper which loads the pre-prepared batches off disk (as a function which just does datetime encoding).

I'm less sure about where to put code that computes the attention-specific encoding of the "full" position. Perhaps that code should also live in nowcasting_dataset's thin wrapper which loads the pre-prepared batches off disk?

Maybe one way to distinguish between nowcasting_dataset and nowcasting_utils is that:

  • nowcasting_dataset's thin wrapper for loading pre-prepared batches is all about preparing the input data for our ML models (including computing position encodings).
  • In contrast, nowcasting_utils is for loss functions, plotting functions, etc.?

(Sorry for not thinking more about this earlier!)

And, also, I'm increasingly thinking that maybe we should create a new repo for nowcasting_dataset's thin wrapper which loads the pre-prepared batches off disk; if only because it's tiring to type "nowcasting_dataset's thin wrapper which loads the pre-prepared batches off disk" every time we want to refer to that bit of code! And it would make it super-clear that nowcasting_dataset is just for pre-preparing batches.

But I really don't have strong feelings about any of this. What do you guys think?

@jacobbieker (Member, Author)


I think being able to encode the full 'position' for any architecture would be useful too, essentially doing what CoordConv does, which can also help CNNs. But yeah, I agree they are slightly separate! The code as it currently stands does them completely separately and just concatenates them at the end.
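As a sketch of the CoordConv-style idea mentioned here (shapes and names are illustrative, not this PR's implementation), appending normalised coordinate channels to an image tensor looks roughly like:

```python
import torch

def add_coord_channels(images: torch.Tensor) -> torch.Tensor:
    """Append normalised y/x coordinate channels to a (batch, channels, H, W)
    tensor, in the spirit of CoordConv, so a CNN can also see absolute position."""
    b, _, h, w = images.shape
    ys = torch.linspace(-1, 1, h, device=images.device)
    xs = torch.linspace(-1, 1, w, device=images.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")                 # (H, W) each
    coords = torch.stack([yy, xx]).unsqueeze(0).expand(b, -1, -1, -1)  # (B, 2, H, W)
    return torch.cat([images, coords], dim=1)                      # (B, C + 2, H, W)

satellite = torch.randn(4, 12, 64, 64)        # e.g. 12 satellite channels
with_coords = add_coord_channels(satellite)   # (4, 14, 64, 64)
```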

@jacobbieker (Member, Author)


I think wherever we put the encoding for the datetime we should also put the encoding for space, just so that there is one place where all this encoding comes from. As for where, I don't mind too much.

As for splitting nowcasting-dataset's thin wrapper out into its own repo, I kinda agree. Making a repo like nowcasting-dataloader, where all the PyTorch parts live, could work well. I could move the special SatFlow extensions to that repo as well, so we would have our two model repos, a common PyTorch dataloader repo, nowcasting-utils for the non-dataloading utilities, and then nowcasting-dataset and Satip for getting, transforming, and preparing data.

I still do like keeping the dataloader code near the code that generates the data the dataloader is loading, but if we can set up automated testing that makes sure changes to nowcasting-dataset don't break nowcasting-dataloader without us knowing about it, I think it would be fine.

@JackKelly (Member)

Sounds good to me! Do you have any concerns about this approach, @peterdudfield?

@peterdudfield (Contributor)


I'm always a fan of breaking repos up. But we should be sure there is an easy way to check that 'nowcasting-dataloader' can be triggered when 'nowcasting-dataset' runs. Do either of you know a good way to do this?

I'm personally OK with torch being in dataset, and having it as an optional thing.

But if we do want to split it up, we need a common place where the interface is defined, i.e. how these files are structured, such as what is in these .nc files. It feels like the interface is pretty fluid at the moment, so it might be better not to split until it's a bit more settled.

Generally, we should also be careful about whether 'utils' depends on 'dataset' or the other way round.

import pytest

def test_fourier_encoding():
    pass
Contributor

todo?

@jacobbieker (Member, Author)

Yeah, this PR is very much not done!

@jacobbieker (Member, Author)

I just wanted to get more thoughts on the design before I actually finish this, in case we want to move it elsewhere, simplify it, etc.

@jacobbieker (Member, Author)


The easiest way would be to install nowcasting-dataloader in the nowcasting-dataset tests and run through some tests with it, so it is always tested on changes to nowcasting-dataset. But it seems you can trigger CI/CD from other repos; it's a little finicky, but should work: https://github.community/t/triggering-by-other-repository/16163/5
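For context, the cross-repo triggering discussed in the linked thread typically goes through GitHub's repository_dispatch event; a hypothetical sketch (the repo name, event type, and token variable are assumptions) of firing it from a nowcasting-dataset CI job:

```python
import os
import requests

# Hypothetical: fire a repository_dispatch event at nowcasting_dataloader from a
# nowcasting-dataset CI job, so the dataloader's tests re-run on dataset changes.
response = requests.post(
    "https://api.github.com/repos/openclimatefix/nowcasting_dataloader/dispatches",
    headers={
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"token {os.environ['GH_PAT']}",  # a PAT with repo scope
    },
    json={"event_type": "nowcasting-dataset-updated"},
)
response.raise_for_status()  # GitHub returns 204 No Content on success

# The receiving repo's workflow would then listen with:
#   on:
#     repository_dispatch:
#       types: [nowcasting-dataset-updated]
```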

nowcasting-dataset does not rely on any of the other repos, and nowcasting-utils relies on dataset, I think? The way I've been roughly structuring it, it goes Satip/nowcasting-dataset -> nowcasting-utils/a potential nowcasting-dataloader -> SatFlow/predict_pv_yield.

As for the interface being fluid, yeah, that's a bit of a concern I have too, but I don't think it's too difficult: the .nc files are defined by nowcasting-dataset, and that's the source of truth. If that changes, we have to update the other places, but that's where the interface is defined, validated, etc.

@peterdudfield (Contributor)

https://github.community/t/triggering-by-other-repository/16163/5

Sounds like it's worth giving a go. Perhaps we can copy things out to nowcasting-dataloader and get the various CI working, and if it's all OK, it can then be removed from nowcasting-dataset.

@jacobbieker (Member, Author)


Sounds good! I'll start on that and move this PR over to that repo once it's created.

@jacobbieker (Member, Author)

It's started here: https://github.com/openclimatefix/nowcasting_dataloader

@jacobbieker (Member, Author)

I'll move this PR over soon, so closing this for now.

@jacobbieker deleted the jacob/position-encoding branch on October 12, 2021 at 15:33
Labels: enhancement (New feature or request)
Projects: none yet
Development: successfully merging this pull request may close "Add PV/Unified Position encoding"
3 participants