BETA CUDA interface: NVCUVID decoder implementation 1/N #910
Conversation
@NicolasHug has imported this pull request. If you are a Meta employee, you can view this in D83333067.
```cpp
unsigned int frameRateNum = videoFormat_.frame_rate.numerator;
unsigned int frameRateDen = videoFormat_.frame_rate.denominator;
int64_t duration = static_cast<int64_t>((frameRateDen * timeBase_.den)) /
    (frameRateNum * timeBase_.num);
```
We should probably pull the time utilities out of `SingleStreamDecoder` and put them in `FFMPEGCommon`. It's reasonable to do since they're really time utilities for FFmpeg.
I don't think we have similar logic within `SingleStreamDecoder` where we compute a duration from the frame rate and the timebase. Maybe I misunderstand?
We don't do this same calculation in `SingleStreamDecoder`? My thinking was that we should try to keep all calculations with `AVRational`s in convenience functions to make sure we're doing the right thing.
Hm, the only uses of `AVRational` I can find in `SingleStreamDecoder` are:

```cpp
double ptsToSeconds(int64_t pts, const AVRational& timeBase) {
  // To perform the multiplication before the division, av_q2d is not used
  return static_cast<double>(pts) * timeBase.num / timeBase.den;
}

int64_t secondsToClosestPts(double seconds, const AVRational& timeBase) {
  return static_cast<int64_t>(
      std::round(seconds * timeBase.den / timeBase.num));
}
```

but that's converting ints to and from floats. Here, we're computing a duration while keeping all pts as `int`.
BTW, the code above will fail with a zero-division error if `videoFormat_.frame_rate` is not set, which happens in our AV1 video. I'll fix that ASAP when adding AV1 support.
That the only use of `AVRational` is in one place means we successfully localized it all to those functions. :) At a higher level, what I'm advocating for is that we not have to do dimensional analysis when reading the decoder code. I can do it, of course, but I always have to stop and think carefully about whether we're multiplying and dividing correctly to get the right dimensions after the computation. Abstracting that into a function, I think, makes a mistake less likely and is easier to read.
Makes sense - I will add a P0 TODO to move all these into FFmpegCommon. I just want to slightly delay that if that's OK, because that part will slightly change soon when I address the potential zero division issue.
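For illustration, such a convenience function could look roughly like this (a minimal sketch, with a hypothetical name and signature, including a guard for the zero-division case mentioned above; not the code in the PR):

```cpp
// Hypothetical FFmpegCommon-style helper: duration of one frame expressed
// in timeBase units. AVRational comes from libavutil/rational.h.
// Guards against an unset frame rate (e.g. the AV1 case above).
int64_t frameDurationInTimeBaseUnits(
    const AVRational& frameRate,
    const AVRational& timeBase) {
  if (frameRate.num == 0 || timeBase.num == 0) {
    return 0; // placeholder fallback; the real handling is TBD
  }
  // One frame lasts (frameRate.den / frameRate.num) seconds; convert
  // that to timeBase units by dividing by (timeBase.num / timeBase.den).
  return static_cast<int64_t>(frameRate.den) * timeBase.den /
      (static_cast<int64_t>(frameRate.num) * timeBase.num);
}
```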
```cpp
// Free the original packet's data which isn't needed anymore, and move the
// fields of the filtered packet into the original packet. The filtered packet
// fields are re-set by av_packet_move_ref, so when it goes out of scope and
// gets destructed, it's not going to affect the original packet.
av_packet_unref(packet.get());
av_packet_move_ref(packet.get(), filteredPacket.get());
```
I ... I hate this, I think?

We're modifying the input `ReferenceAVPacket& packet`'s internal buffers in place. It means that when you call `interface_->sendPacket(avPacket)` (and thus `applyBSF(avPacket)`), the `avPacket` is modified after the call.

I think the alternative is for `applyBSF()` to return the new filtered packet. But that's not trivial either: this means we need to declare the new `AutoAVPacket` in the right place, to prevent it from being freed before it's used. I think we'd need to do that either:

- in `sendPacket()`, and then pass it down to `applyBSF()` - meh?
- or at construction of the interface, by storing the `AutoAVPacket` as a field - meh?

CC @scotts
The tension, as I understand it, is:

- The member function `applyBSF(ReferenceAVPacket&)` takes a reference to a packet, and the point of that is for the reference to be immutable.
- The FFmpeg function `av_bsf_receive_packet()` needs a "clean" packet that is freshly allocated. That means we can't give it the reference we have, and we need to somehow override the reference we have with the result.

I think this breaks the assumptions we have with `AutoAVPacket` and `ReferenceAVPacket`. Roughly, we assumed:

- `AutoAVPacket` would be defined outside of the decoding loop. It managed allocation (`av_packet_alloc()`) and deallocation (`av_packet_free()`), but had no ability to access the underlying packet.
- `ReferenceAVPacket` would be defined inside of the decoding loop. It did not manage allocations, but was the only way to access the underlying packet. It does nothing on creation, but on destruction it calls `av_packet_unref()`.

The combination of the two made sure we only allocated one packet for all iterations of the loop, memory was always freed, and we always had a valid reference to a packet. But `applyBSF()`, due to `av_bsf_receive_packet()`, is the first time we need to actually replace a packet. Replacing a packet breaks the model we set up above. We can break the model because we're using C++, and you can always smash the guardrails you established for yourself. :)
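For concreteness, a schematic of that model (illustrative only; `decoding` and `formatContext` are placeholders, not actual decoder code):

```cpp
AutoAVPacket autoPacket; // outside the loop: one av_packet_alloc() total
while (decoding) {
  ReferenceAVPacket packet(autoPacket); // inside the loop: grants access
  if (av_read_frame(formatContext, packet.get()) < 0) {
    break;
  }
  // ... use packet ...
} // ~ReferenceAVPacket() calls av_packet_unref(); the allocation survives,
  // and ~AutoAVPacket() eventually calls av_packet_free()
```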
I had briefly (well, not so briefly; I just deleted a decent-sized paragraph) considered just replacing all of that with a `std::unique_ptr`. But that's too simple; we need to manage allocations and do the unref thing. I think the best solution here is to officially bless your hack as `ReferenceAVPacket::reset(ReferenceAVPacket& other)`, inspired by `std::unique_ptr::reset()`. Then in this function, we'll just do:

```cpp
packet.reset(filteredPacket);
```
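A minimal sketch of what that member could look like (assuming `ReferenceAVPacket` stores a raw `AVPacket*` member; the field name here is a guess, not the actual implementation):

```cpp
void ReferenceAVPacket::reset(ReferenceAVPacket& other) {
  // Inspired by std::unique_ptr::reset(): drop our current payload, then
  // steal other's fields. av_packet_move_ref() re-sets other's packet, so
  // other's destructor unref becomes a harmless no-op.
  av_packet_unref(avPacket_); // avPacket_: assumed AVPacket* member
  av_packet_move_ref(avPacket_, other.avPacket_);
}
```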
Thank you for the feedback. Done - and added a P0 TODO to re-think that as well.
src/torchcodec/_core/NVDECCache.h
```cpp
unsigned height;
cudaVideoChromaFormat chromaFormat;
unsigned int bitDepthLumaMinus8;
unsigned char numDecodeSurfaces;
```
Let's be explicit about integer size. I know we're getting these values from the CUDA code, but we can be explicit about what we're storing them in. For example, I'm not actually sure what integer size `width` will be here - I think that will become a `uint32_t`, but I'm not actually sure what the C++ rules are.
That's an oversight on my part!
I defined them as `unsigned int` now - is that what you wanted, or do you prefer types like `uint32_t`?
I prefer types like `uint32_t`. :)
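For example, a fixed-width version of those fields (struct name hypothetical; field set taken from the diff above):

```cpp
#include <cstdint>

struct NVDECCacheKey { // hypothetical name
  uint32_t width;
  uint32_t height;
  cudaVideoChromaFormat chromaFormat; // NVCUVID enum, left as-is
  uint32_t bitDepthLumaMinus8;
  uint8_t numDecodeSurfaces;
};
```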
@NicolasHug, forgot to say yesterday: amazing work! I think we're on the right path, this is very promising.
```cpp
AutoAVPacket eofAutoPacket;
ReferenceAVPacket eofPacket(eofAutoPacket);
eofPacket->data = nullptr;
eofPacket->size = 0;
```
We should be able to define a global, constant EOF packet that is local to this cpp file. Or if we can't do that (because maybe we need to imperatively set the fields `data` and `size`, and we can't do it on object construction?), we can define a `getEoFPacket()` function which does this.
I'll add a P0 TODO to address this. Defining a global isn't possible as you mentioned, and having `getEoFPacket()` isn't easy either because of the relationship between `AutoAVPacket` and `ReferenceAVPacket`. I think we'll be able to clean that all up by adding a new `sendEOFPacket()` to the interface, eventually.
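Purely as a sketch of where that could land (speculative; it assumes `sendPacket` accepts a `ReferenceAVPacket&`, which is not confirmed here):

```cpp
// Speculative: the interface builds the flush packet internally, so callers
// never construct the EOF AutoAVPacket/ReferenceAVPacket pair themselves.
void BetaCudaDeviceInterface::sendEOFPacket() {
  AutoAVPacket eofAutoPacket;
  ReferenceAVPacket eofPacket(eofAutoPacket);
  eofPacket->data = nullptr; // data == nullptr && size == 0 signals flush
  eofPacket->size = 0;
  sendPacket(eofPacket);
}
```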
🚀 Let's goooooooo!
This is the first of a long series of PRs that implements our own NVCUVID decoder backend via a new BETA CUDA interface.
High-level context
Currently in `main`, we support CUDA decoding by going through the NVCUVID implementation of FFmpeg. This makes it very easy for us, but we have limited control over the underlying resources. Specifically, caching the underlying `CUDecoder` can lead to massive performance gains, but relying on FFmpeg's NVCUVID implementation doesn't allow us to control that.

So this PR implements a separate, independent CUDA decoding path (the BETA CUDA interface), where we directly rely on NVCUVID. A lot of our implementation was inspired by DALI's.
Design principles
As much as possible, I tried for this new decoder to require minimal changes to our existing code-base, and mostly just treat it purely as a new extension by creating a new `DeviceInterface`.

Crucially, we're still demuxing through FFmpeg. That is, we still rely on FFmpeg to generate AVPackets from a stream. This is key: it means we can leave the `SingleStreamDecoder` largely untouched. All of the new decoding logic is encapsulated in a new `BetaCudaDeviceInterface` which exposes new methods, like:

- `sendPacket(AVPacket)`, which is the moral equivalent of `avcodec_send_packet(AVPacket)`
- `receiveFrame(AVFrame)`, which is the moral equivalent of `avcodec_receive_frame(AVFrame)`

This wasn't trivial, because NVCUVID isn't designed to work well with the non-blocking send/receive API of FFmpeg (see design alternatives section below).
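Schematically, the new extension points look something like this (signatures approximated, not the exact ones in the PR):

```cpp
class BetaCudaDeviceInterface : public DeviceInterface {
 public:
  // Moral equivalent of avcodec_send_packet(): feeds one demuxed packet
  // (still produced by FFmpeg) into the NVCUVID parser/decoder.
  int sendPacket(ReferenceAVPacket& packet);

  // Moral equivalent of avcodec_receive_frame(): hands back a decoded
  // frame if one is ready, without requiring another packet first.
  int receiveFrame(UniqueAVFrame& frame);
};
```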
What is and isn't supported right now
Support is very limited ATM, and definitely buggy (don't worry, everything is private).

- `seek_mode="exact"` only

In terms of features, adding more codec support as well as approximate seek_mode will be top priority. Supporting time-based and backwards seeks is lower-pri, and may never be supported; it will depend on how hard that is.

There are plenty of other things that aren't supported, and a million TODOs. Also, most guards preventing you from doing bad things aren't yet in place: if you look at the new decoder the wrong way it will be angry at you and deadlock (at best). I will be working through all the features, bugs, and guards in follow-ups.
How to review this

- Start with the files in `nvcuvid_include`. They're the headers from NVCUVID. We have to vendor them because they're not part of the normal CUDA toolkit.
- Note the new device string: `device="cuda:0:beta"`. That's not a legible pytorch device string, so we can consider this API to be private. It is subject to change anyway. For now we just need a convenient way to expose this in Python, for testing.
- See `_video_decoder.py` to further explore how the Python API is exposed.
- Check the new `DeviceInterface` registration in `DeviceInterface.[h, cpp]`: previously, we could only register one interface per device type (one interface for CPU, one for CUDA, etc.). Now, we can register multiple interfaces per device type with the "device_variant" key. This is how we can enable both `device="cuda"` and `device="cuda:0:beta"` and have both the default and the beta CUDA interfaces (see the sketch after this list).
- Read `SingleStreamDecoder` and pay attention to the new extension points added to the `DeviceInterface`. You'll see `sendPacket`, `receiveFrame`, and a few other things. They mirror the existing FFmpeg APIs. Don't get too scared by the new bit-stream filtering (BSF) logic: it's just something that converts an AVPacket into another AVPacket, with a different binary format. It's needed for some codecs. Eventually, we'll move that away from the `SingleStreamDecoder`, and encapsulate it within the interface (this is a TODO).
- Dive into the `BetaCudaDeviceInterface`. You're now in a whole new rabbit hole with a lot of new concepts. We're literally writing our own decoder here, and that's not something we've done yet in TorchCodec (we were always relying on FFmpeg so far). There's too much to write to describe how it works, and anything I write now will likely be obsolete within a few days, so let's go over it in our sync.
Previous design considerations (rejected)

I discussed a lot of that during meetings already, but just for ref:
I thought about registering ourselves as a HWAccel object. The HWAccel registration is what FFmpeg does in its own NVCUVID backend (which is what we use as our current CUDA interface). There would be a major upside to registering ourselves as a HWAccel: we would be able to rely on FFmpeg's native parser and FFmpeg's frame-reordering logic, instead of relying on the NVCUVID parser and having to implement the frame-reordering logic ourselves. See this? This is converting FFmpeg's parser info (right side) to NVCUVID's built-in frame info (left side). Among other things, this frame metadata is needed for the NVDEC hardware decoder to know which frame it should decode first when there are frame dependencies (like B-frames). If we were to implement our own HWAccel, we'd have to implement this metadata mapping ourselves, for all codecs. That's obviously far from trivial in terms of required knowledge, but more importantly, it's technically impossible: all of the FFmpeg parser info is private. It's not ABI stable and it relies on private headers anyway. So NVCUVID's claim that users can rely on a third-party parser instead of their own isn't really true for us. No one, except for FFmpeg themselves, can use the FFmpeg parser. We have to build our own parser (not happening), or rely on NVCUVID's parser.
NVCUVID is designed around callbacks which are triggered when a packet is being parsed. Crucially, there is the `pfnDisplayPicture` callback which is triggered when a frame is fully decoded and ready to be displayed, in display order (which is great). Unfortunately, we cannot rely on this callback: it is triggered by the parser within a call to `cuvidParseVideoData(packet)`, which means that the only way to know that a frame is ready is to send a packet. I guess that probably makes sense in streaming applications? But that's incompatible with FFmpeg's send/receive non-blocking APIs, where we should be able to query whether a frame is ready without having to send a packet. So if we were to rely on this callback, we'd likely need massive changes to our `SingleStreamDecoder` architecture - which is something we don't want to do. So, we can't rely on this callback, and we have to figure out the frame display order ourselves.

EDIT: I might be wrong about this. We may be able to rely on the `pfnDisplayPicture` callback after all.
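For reference, the callback wiring described above looks roughly like this (simplified; the real setup configures more fields, and the callback names here are placeholders):

```cpp
// All callbacks fire synchronously from inside cuvidParseVideoData().
CUVIDPARSERPARAMS parserParams = {};
parserParams.CodecType = cudaVideoCodec_H264; // chosen per stream
parserParams.pUserData = this;
parserParams.pfnSequenceCallback = onSequence;     // stream format known
parserParams.pfnDecodePicture = onDecodePicture;   // picture ready to decode
parserParams.pfnDisplayPicture = onDisplayPicture; // frame ready, in display order

CUvideoparser parser = nullptr;
cuvidCreateVideoParser(&parser, &parserParams);

// Feed one FFmpeg-demuxed packet to the parser:
CUVIDSOURCEDATAPACKET cuPacket = {};
cuPacket.payload = avPacket->data;
cuPacket.payload_size = avPacket->size;
cuvidParseVideoData(parser, &cuPacket); // may invoke any of the callbacks
```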