[RFC] Decoder-native transforms design #885
Conversation
> ## Open Questions:
>
> 1. Is `torchcodec.transforms` the right namespace?
> 2. For random transforms, when should the value be fixed?
It will probably be transform-dependent, but I agree with your call that in general this will be per-video. I think that ideally this would be "per-scene", but that's not in scope.
> 3. Transforms such as Resize don't actually implement a `make_params()`
>    method. How does TorchVision get their parameters? How will TorchCodec?
`make_params()` is mostly relevant for random transforms, either those that are randomly applied, or those that do random sampling. Resize isn't random. The relevant parameters are just stored as public attributes on the Transform object.
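For illustration, a minimal sketch, assuming current torchvision v2 attribute names (these may differ across versions):

```python
from torchvision.transforms import v2

# Deterministic transforms expose their parameters as public
# attributes; there is no make_params() to call.
resize = v2.Resize(size=(480, 640))
print(resize.size)           # [480, 640]
print(resize.interpolation)  # InterpolationMode.BILINEAR (the default)
```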
> 2. The parameters in a format that the C++ layer will understand. We
>    obtain them by calling `make_params()` on the Transform object.
Three things come to mind:

- For non-random transforms `make_params()` doesn't exist; the relevant attributes are just attributes on the Transform object (mentioned below as well). That's not an issue, we can easily work around that.
- `make_params()` in its current state requires the input frames to be passed (`flat_inputs`). Those are expected to be materialized tensors, which we definitely won't have by the time we need to call `make_params()` in TorchCodec. Most of the time, all that's actually needed from `flat_inputs` is the H and W of the frames. So we should be able to get around that by slightly modifying the torchvision part to accept H and W instead of `flat_inputs` as input to `make_params()` (or a similar new hook; see the sketch after this list). We'll probably want to keep this private for the time being.
- The logic in `make_params()` is often dependent on the H and W of the input frames, as described above. For most streams that's not an issue because H and W are constant. For streams with variable resolution, that quickly becomes very hairy. Do we call `make_params()` again on the new H and W? Where? We're risking sampling new random values that may lead to nonsensical transforms for a given video!
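To make the second point concrete, here is a minimal sketch of the size-based hook. `make_params_from_size` is hypothetical, not an existing torchvision API, and it assumes `make_params()` only ever queries the spatial size of `flat_inputs`:

```python
import torch
from torchvision.transforms import v2

def make_params_from_size(transform: v2.Transform, height: int, width: int):
    # Hypothetical hook: stand in for the real frames with an empty
    # tensor of the right spatial size, so the existing sampling logic
    # in make_params() can run unchanged, before any frame is decoded.
    placeholder = torch.empty(3, height, width)
    return transform.make_params([placeholder])

# Sample crop parameters for a 1280x720 stream without decoding frames.
params = make_params_from_size(v2.RandomCrop(size=(128, 128)), 720, 1280)
```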
@NicolasHug, on the three things:

- We can work around it, but it does mean we'll probably need to do a lot of `if isinstance(t, Resize)`, as that means there is no uniform way to discover a transform's parameters. We'll need to know that transform X has params Y and Z (see the sketch after this list).
- Noted. I think we can start the implementation with `Resize`, and punt on these issues until we do `RandomCrop`.
- FFmpeg filters such as `crop` actually accept expressions, not just number values: https://ffmpeg.org/ffmpeg-filters.html#crop. Some are evaluated per-frame, so we may be able to use those to deal with variable resolution videos. For first versions, I also think it's reasonable to say variable resolution videos are not fully supported.
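A minimal sketch of what that per-transform dispatch might look like; the filter names and parameter mapping here are illustrative, not a settled design:

```python
from torchvision.transforms import v2

def to_filter_spec(t):
    # No uniform parameter-discovery API: we must know that Resize has
    # `size`, RandomCrop has `size`, and so on, per transform class.
    if isinstance(t, v2.Resize):
        return ("scale", {"size": t.size})
    elif isinstance(t, v2.RandomCrop):
        return ("crop", {"size": t.size})
    raise ValueError(f"unsupported transform: {type(t).__name__}")
```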
> 1. A name which matches the FFmpeg filter name.
> 2. One member for each supported parameter.
> 3. A virtual member function that knows how to produce a string that
>    can be passed to FFmpeg's filtergraph.
Sounds reasonable to stick to filtergraph for this, but I wonder how this affects our current use of libswscale for color conversion? Maybe we'll have to declare that any use of the `transforms` parameter means filtergraph is used for color conversion, despite libswscale being faster according to our past benchmarks?
I think we can do a bit of both. Some transforms (such as `Resize`) we can support on libswscale. I assume there are many more involved transforms that are not going to work on libswscale. We can add a virtual member function to this C++ class that returns a libswscale string, if possible. If all transforms are supportable on libswscale and all of its other criteria are met, we use that. If not, we have to use filtergraph. A sketch of this selection logic follows below.

We can also start with just filtergraph, and do libswscale as an optimization.
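A minimal sketch of that selection logic, in Python for brevity even though the design describes a C++ class; all names here are hypothetical:

```python
from typing import Optional

class DecoderTransform:
    def filtergraph_string(self) -> str:
        raise NotImplementedError

    def swscale_params(self) -> Optional[dict]:
        # None means "not supportable on libswscale".
        return None

class Resize(DecoderTransform):
    def __init__(self, height: int, width: int):
        self.height, self.width = height, width

    def filtergraph_string(self) -> str:
        # "scale" is FFmpeg's resizing filter.
        return f"scale={self.width}:{self.height}"

    def swscale_params(self) -> Optional[dict]:
        return {"width": self.width, "height": self.height}

def choose_backend(transforms: list[DecoderTransform]) -> str:
    # Use libswscale only when every transform in the chain supports it;
    # otherwise fall back to filtergraph for the whole chain.
    if all(t.swscale_params() is not None for t in transforms):
        return "swscale"
    return "filtergraph"
```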
> 1. We define a new module, `torchcodec.decoders.transforms`.
> 2. All transforms we define in there inherit from
>    `torchvision.transforms.v2.Transform`.
This is a good solution for Design Consideration 3. Do you foresee any issues from adding a dependency on torchvision in torchcodec?
Good call out. I'd like to explore whether we can avoid an explicit dependency and instead go a duck-typing route. We absolutely want to accept TorchVision transforms, but it would be great to do so without an explicit dependency from TorchCodec to TorchVision. A sketch follows below.
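A minimal duck-typing sketch, assuming we only need a handful of attributes per transform; this is entirely hypothetical, not a committed design:

```python
def extract_resize_params(t):
    # Recognize anything that quacks like torchvision.transforms.v2.Resize
    # by class name and attributes, with no import of torchvision.
    if type(t).__name__ == "Resize" and hasattr(t, "size"):
        return {"size": list(t.size)}
    return None
```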
Sketching a design and implementation plan for #526.