
Temporal downsampling in loss computation #27

@git-bauerseb

Hi,
thanks for releasing this work! I am having trouble understanding the tensor dimensions in the loss computation. I manually stepped through the code and observed the following shapes:

forward_encoder() receives an imgs tensor of shape (N, 3, T, H, W), where T is the temporal grid size (i.e. how many patches we have in a clip).

forward_decoder() outputs a tensor of shape (N, t*h*w, p*p*3), which is consistent with the comment in forward() but not with the comment in forward_loss(), which expects the prediction to be (N, t*h*w, u*p*p*3), where u is the temporal component.

  • Question: Why does the decoder output only patches with temporal dimension 1? From my understanding, since we try to reconstruct the entire input, we would also need the temporal component in the prediction.
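To make the mismatch concrete, here is a minimal shape-bookkeeping sketch in plain Python. It assumes that each token covers u frames of p x p pixels and that t, h, w are the token counts along time, height and width. The helper `token_shapes` and the parameter `num_frames` are hypothetical names used only for illustration; they are not taken from the repository.

```python
def token_shapes(n, num_frames, height, width, p, u):
    """Return the two candidate prediction shapes for a clip of shape
    (n, 3, num_frames, height, width) cut into tokens of u frames x p x p pixels.

    Assumption: num_frames, height and width divide evenly by u, p and p.
    """
    if num_frames % u or height % p or width % p:
        raise ValueError("clip dimensions must be divisible by the patch sizes")
    t, h, w = num_frames // u, height // p, width // p
    num_tokens = t * h * w
    # Shape named in the forward_loss() comment: every frame of a token is predicted.
    full = (n, num_tokens, u * p * p * 3)
    # Shape observed at the decoder output: one frame per token.
    single = (n, num_tokens, p * p * 3)
    return full, single


full, single = token_shapes(n=2, num_frames=16, height=224, width=224, p=16, u=2)
print(full)    # (2, 1568, 1536)
print(single)  # (2, 1568, 768)
```

With these example numbers the two shapes differ by exactly the factor u in the last dimension, which is the discrepancy the question is about.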

forward_loss() temporally downsamples the frames of the clip, which has already been downsampled, and then converts them into patches.

  • Question: Since imgs is the reconstruction target, why is only a subset used? Shouldn't all input frames be used?

_imgs = torch.index_select(
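For illustration, `torch.index_select(input, dim, index)` keeps only the entries of `input` at the given indices along `dim`. The call above is truncated, so the indices actually used are not visible here. The plain-Python sketch below assumes evenly spaced frame indices; `evenly_spaced_indices` and the frame counts are hypothetical and only show how such a selection shortens the temporal axis of the target.

```python
def evenly_spaced_indices(num_frames, num_keep):
    """Pick num_keep frame indices spread evenly over [0, num_frames - 1]."""
    if not 1 <= num_keep <= num_frames:
        raise ValueError("num_keep must be between 1 and num_frames")
    if num_keep == 1:
        return [0]
    # Integer arithmetic avoids floating-point rounding at the endpoints.
    return [(i * (num_frames - 1)) // (num_keep - 1) for i in range(num_keep)]


# Stand-in for the temporal axis of a clip: 16 frame labels.
frames = [f"frame{i}" for i in range(16)]
idx = evenly_spaced_indices(len(frames), 8)
target_frames = [frames[i] for i in idx]

print(idx)                 # [0, 2, 4, 6, 8, 10, 12, 15]
print(len(target_frames))  # 8
```

Under this assumption only 8 of the 16 input frames would end up in the reconstruction target, which is why the second question asks whether all input frames should be used.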
