This repository was archived by the owner on Nov 27, 2023. It is now read-only.

Add Positional Encoders #33

Closed
wants to merge 9 commits into from

Conversation

@jacobbieker (Member) commented Oct 12, 2021

Pull Request

Description

This adds the positional encoders to fix #30, as well as utilities to sub-select Fourier features for different modalities from one "main" position encoding. It also gives the option of using absolute or relative positional encodings.

Fixes #30
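For illustration only (this is not the code added in this PR; the function name, shapes, and frequency settings are assumptions), a Fourier-feature position encoding along the lines described above might look like:

```python
import math
import torch

def fourier_encode(positions: torch.Tensor, max_freq: float = 10.0, num_bands: int = 4) -> torch.Tensor:
    """Encode positions in [-1, 1] as the raw values plus sin/cos Fourier features.

    positions: (..., d) tensor, one value per axis (e.g. time, y, x).
    Returns:   (..., d * (2 * num_bands + 1)) tensor.
    """
    positions = positions.unsqueeze(-1)                                     # (..., d, 1)
    freqs = torch.linspace(1.0, max_freq / 2, num_bands, device=positions.device)
    scaled = positions * freqs * math.pi                                    # (..., d, num_bands)
    features = torch.cat([positions, scaled.sin(), scaled.cos()], dim=-1)
    return features.flatten(-2)

# One "main" encoding over (time, y, x); a modality that only needs some axes
# (e.g. a PV time series) could sub-select the corresponding feature columns.
t, y, x = torch.meshgrid(
    torch.linspace(-1, 1, 6),    # time steps
    torch.linspace(-1, 1, 32),   # image rows
    torch.linspace(-1, 1, 32),   # image columns
    indexing="ij",
)
encoding = fourier_encode(torch.stack([t, y, x], dim=-1))  # (6, 32, 32, 27)
```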

How Has This Been Tested?

Unit tests

  • No
  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

This will be changed more, but it serves as a baseline. I don't particularly want to add perceiver-model as a dependency for this repo, so some code duplication is probably acceptable. Additionally, since this will be extended and modified more than the version in perceiver-model, I think it should be okay.
@jacobbieker added the enhancement (New feature or request) label Oct 12, 2021
@jacobbieker self-assigned this Oct 12, 2021
@jacobbieker (Member, Author)

@JackKelly @peterdudfield For the datetime features, if we are switching to computing them on the fly, would we want them to be computed here? Or computed in nowcasting_dataset, so that nowcasting-utils can just assume it will be passed the already-computed features? This relates to this code in terms of how I structure the absolute position-encoding code.

@JackKelly (Member)

Good questions!

It would be good to be able to use the datetime encodings for CNN models (as well as for self-attention models).

So I guess there are two slightly separate issues:

  1. Encode the absolute datetimes (using sin and cos) for any ML architecture (CNNs, self-attention, fully-connected, etc.).
  2. Encode the full "position" of each row of input data for self-attention models (i.e. encode the position in space and time, perhaps re-using the temporal encoding from step 1).
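For point 1 above, a rough sketch of the kind of cyclical sin/cos datetime encoding being discussed (the column names, periods, and use of pandas here are assumptions, not the actual interface):

```python
import numpy as np
import pandas as pd

def encode_datetimes(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Map each timestamp onto sin/cos pairs so that cyclical features
    (hour of day, day of year) stay continuous across midnight / new year."""
    hour = index.hour + index.minute / 60          # fractional hour of day
    day_of_year = index.dayofyear
    return pd.DataFrame(
        {
            "hour_sin": np.sin(2 * np.pi * hour / 24),
            "hour_cos": np.cos(2 * np.pi * hour / 24),
            "day_sin": np.sin(2 * np.pi * day_of_year / 365.25),
            "day_cos": np.cos(2 * np.pi * day_of_year / 365.25),
        },
        index=index,
    )

# Example: 5-minutely timestamps for one afternoon.
times = pd.date_range("2021-10-12 12:00", periods=12, freq="5min")
features = encode_datetimes(times)  # shape (12, 4), usable by any architecture
```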

@JackKelly (Member) commented Oct 12, 2021

In terms of where to put the code...

My initial guess would be to put the "datetime encoding" in nowcasting_dataset's thin wrapper which loads the pre-prepared batches off disk (as a function which just does datetime encoding).

I'm less sure about where to put code that computes the attention-specific encoding of the "full" position. Perhaps that code should also live in nowcasting_dataset's thin wrapper which loads the pre-prepared batches off disk?

Maybe one way to distinguish between nowcasting_dataset and nowcasting_utils is that:

  • nowcasting_dataset's thin wrapper for loading pre-prepared batches is all about preparing the input data for our ML models (including computing position encodings).
  • In contrast, nowcasting_utils is for loss functions, plotting functions, etc.?

(Sorry for not thinking more about this earlier!)

And, also, I'm increasingly thinking that maybe we should create a new repo for nowcasting_dataset's thin wrapper which loads the pre-prepared batches off disk; if only because it's tiring to type "nowcasting_dataset's thin wrapper which loads the pre-prepared batches off disk" every time we want to refer to that bit of code! And it would make it super-clear that nowcasting_dataset is just for pre-preparing batches.

But I really don't have strong feelings about any of this. What do you guys think?

@jacobbieker (Member, Author)


I think being able to encode the full 'position' for any architecture would be useful too, essentially doing what CoordConv does, which can also help CNNs. But yeah, I agree they are slightly separate! The code as it currently stands does them completely separately and just concatenates them at the end.
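As a sketch of the CoordConv-style idea mentioned here (shapes and names are illustrative, not this PR's implementation), appending normalised coordinate channels to an image tensor looks roughly like:

```python
import torch

def add_coord_channels(images: torch.Tensor) -> torch.Tensor:
    """Append normalised y/x coordinate channels to a (batch, channels, H, W)
    tensor, in the spirit of CoordConv, so a CNN can also see absolute position."""
    b, _, h, w = images.shape
    ys = torch.linspace(-1, 1, h, device=images.device)
    xs = torch.linspace(-1, 1, w, device=images.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")                 # (H, W) each
    coords = torch.stack([yy, xx]).unsqueeze(0).expand(b, -1, -1, -1)  # (B, 2, H, W)
    return torch.cat([images, coords], dim=1)                      # (B, C + 2, H, W)

satellite = torch.randn(4, 12, 64, 64)        # e.g. 12 satellite channels
with_coords = add_coord_channels(satellite)   # (4, 14, 64, 64)
```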

@jacobbieker (Member, Author)


I think wherever we put the encoding for the datetime we should also put the encoding for space, just so that there is one place where all this encoding comes from. As for where, I don't mind too much.

As for splitting nowcasting-dataset's thin wrapper out into its own repo, I kinda agree. Making a repo like nowcasting-dataloader, where all the PyTorch parts live, could work well. I could move the special SatFlow extensions to that repo as well, so we would have our two model repos, a common PyTorch dataloader repo, nowcasting-utils for the non-dataloading utilities, and then nowcasting-dataset and Satip for getting, transforming, and preparing data.

I still do like keeping the dataloader code near the code that generates the data the dataloader is loading, but if we can set up automated testing that makes sure changes to nowcasting-dataset don't break nowcasting-dataloader without us knowing about it, I think it would be fine.

@JackKelly (Member)

Sounds good to me! Do you have any concerns about this approach, @peterdudfield?

@peterdudfield (Contributor)


I'm always a fan of breaking repos up. But we should be sure there is an easy way to check that 'nowcasting-dataloader' can be triggered when 'nowcasting-dataset' runs. Do either of you know a good way to do this?

I'm personally OK with torch being in dataset, and having it as an optional thing.

But if we do want to split it up, we need a common place where the interface is defined, i.e. how these files are structured, such as what is in these .nc files. It feels like the interface is pretty fluid at the moment, so it might be better not to split until it's a bit more settled.

Generally, we should also be careful about whether 'utils' depends on 'dataset' or the other way round.

import pytest

def test_fourier_encoding():
    pass
Contributor

todo?

@jacobbieker (Member, Author)

Yeah, this PR is very much not done!

@jacobbieker (Member, Author)

I just wanted to get more thoughts on the design before I actually finish this, in case we want to move it elsewhere, simplify it, etc.

@jacobbieker (Member, Author)


The easiest way would be to install nowcasting-dataloader in the nowcasting-dataset tests and run through some tests with it, so it is always tested on changes to nowcasting-dataset. But it seems you can trigger CI/CD from other repos; it's a little finicky, but should work: https://github.community/t/triggering-by-other-repository/16163/5
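For context, the cross-repo triggering discussed in the linked thread typically goes through GitHub's repository_dispatch event; a hypothetical sketch (the repo name, event type, and token variable are assumptions) of firing it from a nowcasting-dataset CI job:

```python
import os
import requests

# Hypothetical: fire a repository_dispatch event at nowcasting_dataloader from a
# nowcasting-dataset CI job, so the dataloader's tests re-run on dataset changes.
response = requests.post(
    "https://api.github.com/repos/openclimatefix/nowcasting_dataloader/dispatches",
    headers={
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"token {os.environ['GH_PAT']}",  # a PAT with repo scope
    },
    json={"event_type": "nowcasting-dataset-updated"},
)
response.raise_for_status()  # GitHub returns 204 No Content on success

# The receiving repo's workflow would then listen with:
#   on:
#     repository_dispatch:
#       types: [nowcasting-dataset-updated]
```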

nowcasting-dataset does not rely on any of the other repos, and nowcasting-utils relies on dataset, I think? The way I've been roughly structuring it, it goes Satip/nowcasting-dataset -> nowcasting-utils/a potential nowcasting-dataloader -> SatFlow/predict_pv_yield.

As for the interface being fluid, yeah, that's a bit of a concern I have too, but I don't think it's too difficult: the .nc files are defined by nowcasting-dataset, and that's the source of truth. If that changes, we have to update the other places, but that's where the interface is defined, validated, etc.

@peterdudfield (Contributor)

https://github.community/t/triggering-by-other-repository/16163/5

Sounds like it's worth giving a go. Perhaps we can copy things out to nowcasting-dataloader and get the various CI working, and if it's all OK, it can then be removed from nowcasting-dataset.

@jacobbieker (Member, Author)


Sounds good! I'll start on that and move this PR over to that repo once it's created.

@jacobbieker (Member, Author)

It's started here: https://github.com/openclimatefix/nowcasting_dataloader

@jacobbieker (Member, Author)

I'll move this PR over soon, so closing this for now.

@jacobbieker deleted the jacob/position-encoding branch on October 12, 2021 at 15:33
Labels: enhancement (New feature or request)
Projects: none yet
Development: successfully merging this pull request may close "Add PV/Unified Position encoding"
3 participants