[RFC] Decoder-native transforms design #885
Conversation
> ## Open Questions:
>
> 1. Is `torchcodec.transforms` the right namespace?
> 2. For random transforms, when should the value be fixed?
It will probably be transform-dependent, but I agree with your call that in general this will be per-video. I think that ideally this would be "per-scene", but that's not in scope.
> 3. Transforms such as Resize don't actually implement a `make_params()`
>    method. How does TorchVision get their parameters? How will TorchCodec?
`make_params()` is mostly relevant for random transforms, either those that are randomly applied, or those that do random sampling. Resize isn't random. The relevant parameters are just stored as public attributes on the Transform object.
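For illustration, a minimal sketch, assuming current torchvision v2 attribute names (these may differ across versions):

```python
from torchvision.transforms import v2

# Deterministic transforms expose their parameters as public
# attributes; there is no make_params() to call.
resize = v2.Resize(size=(480, 640))
print(resize.size)           # [480, 640]
print(resize.interpolation)  # InterpolationMode.BILINEAR (the default)
```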
> 2. The parameters in a format that the C++ layer will understand. We
>    obtain them by calling `make_params()` on the Transform object.
Three things come to mind:

- For non-random transforms `make_params()` doesn't exist; the relevant attributes are just attributes on the Transform object (mentioned below as well). That's not an issue, we can easily work around that.
- `make_params()` in its current state requires the input frames to be passed (`flat_inputs`). Those are expected to be materialized tensors, which we definitely won't have by the time we need to call `make_params()` in TorchCodec. Most of the time, all that's actually needed from `flat_inputs` is the H and W of the frames. So we should be able to get around that by slightly modifying the torchvision part to accept H and W instead of `flat_inputs` as input to `make_params()` (or a similar new hook; see the sketch after this list). We'll probably want to keep this private for the time being.
- The logic in `make_params()` is often dependent on the H and W of the input frames, as described above. For most streams that's not an issue because H and W are constant. For streams with variable resolution, that quickly becomes very hairy. Do we call `make_params()` again on the new H and W? Where? We're risking sampling new random values that may lead to nonsensical transforms for a given video!
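To make the second point concrete, here is a minimal sketch of the size-based hook. `make_params_from_size` is hypothetical, not an existing torchvision API, and it assumes `make_params()` only ever queries the spatial size of `flat_inputs`:

```python
import torch
from torchvision.transforms import v2

def make_params_from_size(transform: v2.Transform, height: int, width: int):
    # Hypothetical hook: stand in for the real frames with an empty
    # tensor of the right spatial size, so the existing sampling logic
    # in make_params() can run unchanged, before any frame is decoded.
    placeholder = torch.empty(3, height, width)
    return transform.make_params([placeholder])

# Sample crop parameters for a 1280x720 stream without decoding frames.
params = make_params_from_size(v2.RandomCrop(size=(128, 128)), 720, 1280)
```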
@NicolasHug, on the three things:

- We can work around it, but it does mean we'll probably need to do a lot of `if isinstance(t, Resize)`, as that means there is no uniform way to discover a transform's parameters. We'll need to know that transform X has params Y and Z (see the sketch after this list).
- Noted. I think we can start the implementation with `Resize`, and punt on these issues until we do `RandomCrop`.
- FFmpeg filters such as `crop` actually accept expressions, not just number values: https://ffmpeg.org/ffmpeg-filters.html#crop. Some are evaluated per-frame, so we may be able to use those to deal with variable resolution videos. For first versions, I also think it's reasonable to say variable resolution videos are not fully supported.
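A minimal sketch of what that per-transform dispatch might look like; the filter names and parameter mapping here are illustrative, not a settled design:

```python
from torchvision.transforms import v2

def to_filter_spec(t):
    # No uniform parameter-discovery API: we must know that Resize has
    # `size`, RandomCrop has `size`, and so on, per transform class.
    if isinstance(t, v2.Resize):
        return ("scale", {"size": t.size})
    elif isinstance(t, v2.RandomCrop):
        return ("crop", {"size": t.size})
    raise ValueError(f"unsupported transform: {type(t).__name__}")
```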
> 1. A name which matches the FFmpeg filter name.
> 2. One member for each supported parameter.
> 3. A virtual member function that knows how to produce a string that
>    can be passed to FFmpeg's filtergraph.
Sounds reasonable to stick to filtergraph for this, but I wonder how this affects our current use of libswscale for color conversion? Maybe we'll have to declare that any use of the `transforms` parameter means filtergraph is used for color conversion, despite libswscale being faster according to our past benchmarks?
I think we can do a bit of both. Some transforms (such as `Resize`) we can support on libswscale. I assume there are many more involved transforms that are not going to work on libswscale. We can add a virtual member function to this C++ class that returns a libswscale string, if possible. If all transforms are supportable on libswscale and all of its other criteria are met, we use that. If not, we have to use filtergraph. A sketch of this selection logic follows below.

We can also start with just filtergraph, and do libswscale as an optimization.
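A minimal sketch of that selection logic, in Python for brevity even though the design describes a C++ class; all names here are hypothetical:

```python
from typing import Optional

class DecoderTransform:
    def filtergraph_string(self) -> str:
        raise NotImplementedError

    def swscale_params(self) -> Optional[dict]:
        # None means "not supportable on libswscale".
        return None

class Resize(DecoderTransform):
    def __init__(self, height: int, width: int):
        self.height, self.width = height, width

    def filtergraph_string(self) -> str:
        # "scale" is FFmpeg's resizing filter.
        return f"scale={self.width}:{self.height}"

    def swscale_params(self) -> Optional[dict]:
        return {"width": self.width, "height": self.height}

def choose_backend(transforms: list[DecoderTransform]) -> str:
    # Use libswscale only when every transform in the chain supports it;
    # otherwise fall back to filtergraph for the whole chain.
    if all(t.swscale_params() is not None for t in transforms):
        return "swscale"
    return "filtergraph"
```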
> 1. We define a new module, `torchcodec.decoders.transforms`.
> 2. All transforms we define in there inherit from
>    `torchvision.transforms.v2.Transform`.
This is a good solution for Design Consideration 3. Do you foresee any issues from adding a dependency on torchvision in torchcodec?
Good call out. I'd like to explore whether we can avoid an explicit dependency and instead go a duck-typing route. We absolutely want to accept TorchVision transforms, but it would be great to do so without an explicit dependency from TorchCodec to TorchVision. A sketch follows below.
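A minimal duck-typing sketch, assuming we only need a handful of attributes per transform; this is entirely hypothetical, not a committed design:

```python
def extract_resize_params(t):
    # Recognize anything that quacks like torchvision.transforms.v2.Resize
    # by class name and attributes, with no import of torchvision.
    if type(t).__name__ == "Resize" and hasattr(t, "size"):
        return {"size": list(t.size)}
    return None
```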
Sketching a design and implementation plan for #526.