Temporal downsampling in loss computation

Hi,
thanks for releasing this work! I am having trouble understanding the tensor dimensions in the loss computation. I manually stepped through the code and observed the following shapes:

`forward_encoder()` receives an `imgs` tensor of shape `(N, 3, T, H, W)`, where `T` is the temporal grid size (i.e. how many patches we have in a clip).

`forward_decoder()` outputs a tensor of shape `(N, t*h*w, p*p*3)`, which is consistent with the comment in the `forward()` function but not with the comment in `forward_loss()`, which expects the prediction to be `(N, t*h*w, u*p*p*3)`, where `u` is the temporal component.
  - **Question**: Why does the decoder output only patches with temporal dimension 1? From my understanding, since we try to reconstruct the entire input, we would also need the temporal component in the prediction.
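
To make the mismatch concrete, here is a small sketch of the shape arithmetic as I understand it. The helper `patchified_shape` and all sizes (batch of 2, 16 or 8 frames, 224x224 resolution, `p=16`) are hypothetical values I picked for illustration; they are not taken from the repo's config.

```python
def patchified_shape(n, frames, height, width, p, u, channels=3):
    """Shape of a clip (n, channels, frames, height, width) after cutting it
    into non-overlapping u x p x p patches and flattening each patch."""
    assert frames % u == 0 and height % p == 0 and width % p == 0
    t, h, w = frames // u, height // p, width // p
    return (n, t * h * w, u * p * p * channels)


# Hypothetical sizes, not read from the repo.
# 16 frames with a temporal patch size of 2:
full_clip = patchified_shape(2, 16, 224, 224, p=16, u=2)
# 8 frames with a temporal patch size of 1:
subsampled = patchified_shape(2, 8, 224, 224, p=16, u=1)

print(full_clip)   # (2, 1568, 1536)
print(subsampled)  # (2, 1568, 768)
```

Under these assumed sizes both cases give the same number of tokens (`t*h*w = 1568`), and only the last dimension differs by the factor `u`. That is the difference I see between the decoder output and the comment in `forward_loss()`.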
  
`forward_loss()` temporally downsamples the images from the clip, which has already been downsampled, and then converts the result into patches.
  - **Question**: Since `imgs` is the reconstruction target, why is only a subset used? Shouldn't all input frames be used?
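
For reference, this is a minimal sketch of the kind of subsampling I mean, assuming the target frames are picked at evenly spaced indices between the first and last frame. The function name `evenly_spaced_indices` and the frame counts are my own illustration, not code from the repo.

```python
def evenly_spaced_indices(num_frames, num_kept):
    """Pick num_kept frame indices spread evenly over [0, num_frames - 1],
    rounding each position down to an integer index."""
    if num_kept == 1:
        return [0]
    last = num_frames - 1
    return [(i * last) // (num_kept - 1) for i in range(num_kept)]


# Hypothetical example: keep 8 of 16 input frames as the target.
kept = evenly_spaced_indices(16, 8)
print(kept)  # [0, 2, 4, 6, 8, 10, 12, 15]
```

With these assumed numbers, half of the input frames never enter the reconstruction target, which is what prompted the question above.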

https://github.com/facebookresearch/mae_st/blob/c5dec1bc01062097906ea67fe47935ec33fd46df/models_mae.py#L407
