Hi,
thanks for releasing this work! I'm having trouble understanding the dimensions used in the loss computation. I manually stepped through the code and found the following shapes:
forward_encoder() receives an imgs tensor of shape (N, 3, T, H, W), where T is the temporal grid size (i.e. the number of temporal patches in a clip).
forward_decoder() outputs a tensor of shape (N, t*h*w, p*p*3). This is consistent with the comment in forward() but
not with the comment in forward_loss(), which expects the prediction to have shape (N, t*h*w, u*p*p*3), where u is the
temporal component of each patch.
- Question: Why does the decoder output only patches with temporal dimension 1? From my understanding, since we try to reconstruct the entire input, we would also need the temporal component in the prediction.
forward_loss() temporally downsamples the frames of the clip, which has already been downsampled, and then converts the result into patches.
- Question: Since imgs is the reconstruction target, why is only a subset used? Shouldn't all input frames be used?
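To make the mismatch concrete, here is a minimal sketch of the shape arithmetic as I understand it. The helper patchified_shape and all the numbers are my own illustration and are not taken from the repository; the only assumption is that a clip is cut into non-overlapping patches of u frames by p x p pixels.

```python
def patchified_shape(n, t_frames, height, width, p, u, channels=3):
    """Shape of a clip (n, channels, t_frames, height, width) after cutting it
    into non-overlapping patches of u frames by p x p pixels.

    Hypothetical helper for illustration only; not part of the repository.
    """
    if t_frames % u or height % p or width % p:
        raise ValueError("clip dimensions must be divisible by the patch size")
    # Grid size along time, height and width.
    t, h, w = t_frames // u, height // p, width // p
    # One token per grid cell; each token holds u * p * p * channels values.
    return (n, t * h * w, u * p * p * channels)


# Illustrative numbers (assumptions, not the repository's config).
# A 16-frame clip with temporal patch size u = 2:
full = patchified_shape(n=2, t_frames=16, height=224, width=224, p=16, u=2)
# The same clip subsampled to 8 frames, patchified with u = 1:
sub = patchified_shape(n=2, t_frames=8, height=224, width=224, p=16, u=1)

print(full)  # (2, 1568, 1536)
print(sub)   # (2, 1568, 768)
```

Both cases yield the same number of tokens (t*h*w), but the last dimension equals p*p*3 only when u = 1, which is the case I observe for the decoder output.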
_imgs = torch.index_select(
(mae_st/models_mae.py, line 407 in c5dec1b)