BETA CUDA interface: NVCUVID decoder implementation 1/N #910
Conversation
@NicolasHug has imported this pull request. If you are a Meta employee, you can view this in D83333067.
```cpp
unsigned int frameRateNum = videoFormat_.frame_rate.numerator;
unsigned int frameRateDen = videoFormat_.frame_rate.denominator;
int64_t duration = static_cast<int64_t>((frameRateDen * timeBase_.den)) /
    (frameRateNum * timeBase_.num);
```
We should probably pull the time utilities out of `SingleStreamDecoder` and put them in `FFMPEGCommon`. It's reasonable to do since they're really time utilities for FFmpeg.
I don't think we have similar logic within `SingleStreamDecoder` where we compute a duration from the frame rate and the timebase. Maybe I misunderstand?
We don't do this same calculation in `SingleStreamDecoder`? My thinking was that we should try to keep all calculations with `AVRational`s in convenience functions to make sure we're doing the right thing.
Hm, the only uses of `AVRational` I can find in `SingleStreamDecoder` are:

```cpp
double ptsToSeconds(int64_t pts, const AVRational& timeBase) {
  // To perform the multiplication before the division, av_q2d is not used
  return static_cast<double>(pts) * timeBase.num / timeBase.den;
}

int64_t secondsToClosestPts(double seconds, const AVRational& timeBase) {
  return static_cast<int64_t>(
      std::round(seconds * timeBase.den / timeBase.num));
}
```

but that's converting ints to and from floats. Here, we're computing a duration while keeping all pts as `int`.
BTW, the code above will fail with a zero-division error if `videoFormat_.frame_rate` is not set, which happens in our AV1 video. I'll fix that ASAP when adding AV1 support.
That the only use of `AVRational` is in one place means we successfully localized it all to those functions. :) At a higher level, what I'm advocating for is that we not have to do dimensional analysis when reading the decoder code. I can do it, of course, but I always have to stop and think carefully about whether we're multiplying and dividing correctly to get the right dimensions after the computation. Abstracting that into a function, I think, makes a mistake less likely and is easier to read.
Makes sense - I will add a P0 TODO to move all these into FFmpegCommon. I just want to slightly delay that if that's OK, because that part will slightly change soon when I address the potential zero division issue.
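For illustration, such a convenience function could look roughly like this (a minimal sketch, with a hypothetical name and signature, including a guard for the zero-division case mentioned above; not the code in the PR):

```cpp
// Hypothetical FFmpegCommon-style helper: duration of one frame expressed
// in timeBase units. AVRational comes from libavutil/rational.h.
// Guards against an unset frame rate (e.g. the AV1 case above).
int64_t frameDurationInTimeBaseUnits(
    const AVRational& frameRate,
    const AVRational& timeBase) {
  if (frameRate.num == 0 || timeBase.num == 0) {
    return 0; // placeholder fallback; the real handling is TBD
  }
  // One frame lasts (frameRate.den / frameRate.num) seconds; convert
  // that to timeBase units by dividing by (timeBase.num / timeBase.den).
  return static_cast<int64_t>(frameRate.den) * timeBase.den /
      (static_cast<int64_t>(frameRate.num) * timeBase.num);
}
```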
```cpp
// Free the original packet's data which isn't needed anymore, and move the
// fields of the filtered packet into the original packet. The filtered packet
// fields are re-set by av_packet_move_ref, so when it goes out of scope and
// gets destructed, it's not going to affect the original packet.
av_packet_unref(packet.get());
av_packet_move_ref(packet.get(), filteredPacket.get());
```
I ... I hate this, I think?

We're modifying the input `ReferenceAVPacket& packet`'s internal buffers in place. It means that when you call `interface_->sendPacket(avPacket)` (and thus `applyBSF(avPacket)`), the `avPacket` is modified after the call.

I think the alternative is for `applyBSF()` to return the new filtered packet. But that's not trivial either: this means we need to declare the new `AutoAVPacket` in the right place, to prevent it from being freed before it's used. I think we'd need to do that either:

- in `sendPacket()`, and then pass it down to `applyBSF()` - meh?
- or at construction of the interface, by storing the `AutoAVPacket` as a field - meh?

CC @scotts
The tension, as I understand it, is:

- The member function `applyBSF(ReferenceAVPacket&)` takes a reference to a packet, and the point of that is for the reference to be immutable.
- The FFmpeg function `av_bsf_receive_packet()` needs a "clean" packet that is freshly allocated. That means we can't give it the reference we have, and we need to somehow override the reference we have with the result.

I think this breaks the assumptions we have with `AutoAVPacket` and `ReferenceAVPacket`. Roughly, we assumed:

- `AutoAVPacket` would be defined outside of the decoding loop. It managed allocation (`av_packet_alloc()`) and deallocation (`av_packet_free()`), but had no ability to access the underlying packet.
- `ReferenceAVPacket` would be defined inside of the decoding loop. It did not manage allocations, but was the only way to access the underlying packet. It does nothing on creation, but on destruction it calls `av_packet_unref()`.

The combination of the two made sure we only allocated one packet for all iterations of the loop, memory was always freed, and we always had a valid reference to a packet. But `applyBSF()`, due to `av_bsf_receive_packet()`, is the first time we need to actually replace a packet. Replacing a packet breaks the model we set up above. We can break the model because we're using C++, and you can always smash the guardrails you established for yourself. :)
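For concreteness, a schematic of that model (illustrative only; `decoding` and `formatContext` are placeholders, not actual decoder code):

```cpp
AutoAVPacket autoPacket; // outside the loop: one av_packet_alloc() total
while (decoding) {
  ReferenceAVPacket packet(autoPacket); // inside the loop: grants access
  if (av_read_frame(formatContext, packet.get()) < 0) {
    break;
  }
  // ... use packet ...
} // ~ReferenceAVPacket() calls av_packet_unref(); the allocation survives,
  // and ~AutoAVPacket() eventually calls av_packet_free()
```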
I had briefly (well, not so briefly; I just deleted a decent-sized paragraph) considered just replacing all of that with a `std::unique_ptr`. But that's too simple; we need to manage allocations and do the unref thing. I think the best solution here is to officially bless your hack as `ReferenceAVPacket::reset(ReferenceAVPacket& other)`, inspired by `std::unique_ptr::reset()`. Then in this function, we'll just do:

```cpp
packet.reset(filteredPacket);
```
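A minimal sketch of what that member could look like (assuming `ReferenceAVPacket` stores a raw `AVPacket*` member; the field name here is a guess, not the actual implementation):

```cpp
void ReferenceAVPacket::reset(ReferenceAVPacket& other) {
  // Inspired by std::unique_ptr::reset(): drop our current payload, then
  // steal other's fields. av_packet_move_ref() re-sets other's packet, so
  // other's destructor unref becomes a harmless no-op.
  av_packet_unref(avPacket_); // avPacket_: assumed AVPacket* member
  av_packet_move_ref(avPacket_, other.avPacket_);
}
```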
Thank you for the feedback. Done - and added a P0 TODO to re-think that as well.
src/torchcodec/_core/NVDECCache.h
```cpp
unsigned height;
cudaVideoChromaFormat chromaFormat;
unsigned int bitDepthLumaMinus8;
unsigned char numDecodeSurfaces;
```
Let's be explicit about integer size. I know we're getting these values from the CUDA code, but we can be explicit about what we're storing them in. For example, I'm not actually sure what integer size `width` will be here - I think that will become a `uint32_t`, but I'm not actually sure what the C++ rules are.
That's an oversight on my part!
I defined them as `unsigned int` now - is that what you wanted, or do you prefer types like `uint32_t`?
I prefer types like `uint32_t`. :)
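For example, a fixed-width version of those fields (struct name hypothetical; field set taken from the diff above):

```cpp
#include <cstdint>

struct NVDECCacheKey { // hypothetical name
  uint32_t width;
  uint32_t height;
  cudaVideoChromaFormat chromaFormat; // NVCUVID enum, left as-is
  uint32_t bitDepthLumaMinus8;
  uint8_t numDecodeSurfaces;
};
```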
@NicolasHug, forgot to say yesterday: amazing work! I think we're on the right path, this is very promising.
```cpp
AutoAVPacket eofAutoPacket;
ReferenceAVPacket eofPacket(eofAutoPacket);
eofPacket->data = nullptr;
eofPacket->size = 0;
```
We should be able to define a global, constant EOF packet that is local to this cpp file. Or if we can't do that (because maybe we need to imperatively set the fields `data` and `size`, and we can't do it on object construction?), we can define a `getEoFPacket()` function which does this.
I'll add a P0 TODO to address this. Defining a global isn't possible as you mentioned, and having `getEoFPacket()` isn't easy either because of the relationship between `AutoAVPacket` and `ReferenceAVPacket`. I think we'll be able to clean that all up by adding a new `sendEOFPacket()` to the interface, eventually.
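Purely as a sketch of where that could land (speculative; it assumes `sendPacket` accepts a `ReferenceAVPacket&`, which is not confirmed here):

```cpp
// Speculative: the interface builds the flush packet internally, so callers
// never construct the EOF AutoAVPacket/ReferenceAVPacket pair themselves.
void BetaCudaDeviceInterface::sendEOFPacket() {
  AutoAVPacket eofAutoPacket;
  ReferenceAVPacket eofPacket(eofAutoPacket);
  eofPacket->data = nullptr; // data == nullptr && size == 0 signals flush
  eofPacket->size = 0;
  sendPacket(eofPacket);
}
```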
🚀 Let's goooooooo!
This is the first of a long series of PRs that implements our own NVCUVID decoder backend via a new BETA CUDA interface.
High-level context
Currently in `main`, we support CUDA decoding by going through the NVCUVID implementation of FFmpeg. This makes it very easy for us, but we have limited control over the underlying resources. Specifically, caching the underlying `CUDecoder` can lead to massive performance gains, but relying on FFmpeg's NVCUVID implementation doesn't allow us to control that.

So this PR implements a separate, independent CUDA decoding path (the BETA CUDA interface), where we directly rely on NVCUVID. A lot of our implementation was inspired by DALI's.
Design principles
As much as possible, I tried for this new decoder to require minimal changes to our existing code-base, and mostly just treat it purely as a new extension by creating a new `DeviceInterface`.

Crucially, we're still demuxing through FFmpeg. That is, we still rely on FFmpeg to generate AVPackets from a stream. This is key: it means we can leave the `SingleStreamDecoder` largely untouched. All of the new decoding logic is encapsulated in a new `BetaCudaDeviceInterface` which exposes new methods, like:

- `sendPacket(AVPacket)`, which is the moral equivalent of `avcodec_send_packet(AVPacket)`
- `receiveFrame(AVFrame)`, which is the moral equivalent of `avcodec_receive_frame(AVFrame)`

This wasn't trivial, because NVCUVID isn't designed to work well with the non-blocking send/receive API of FFmpeg (see design alternatives section below).
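Schematically, the new extension points look something like this (signatures approximated, not the exact ones in the PR):

```cpp
class BetaCudaDeviceInterface : public DeviceInterface {
 public:
  // Moral equivalent of avcodec_send_packet(): feeds one demuxed packet
  // (still produced by FFmpeg) into the NVCUVID parser/decoder.
  int sendPacket(ReferenceAVPacket& packet);

  // Moral equivalent of avcodec_receive_frame(): hands back a decoded
  // frame if one is ready, without requiring another packet first.
  int receiveFrame(UniqueAVFrame& frame);
};
```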
What is and isn't supported right now
Support is very limited ATM, and definitely buggy (don't worry, everything is private).

- `seek_mode="exact"` only

In terms of features, adding more codec support as well as approximate seek_mode will be top priority. Supporting time-based and backwards seeks is lower-pri, and may never be supported; it will depend on how hard that is.

There are plenty of other things that aren't supported, and a million TODOs. Also, most guards preventing you from doing bad things aren't yet in place: if you look at the new decoder the wrong way it will be angry at you and deadlock (at best). I will be working through all the features, bugs, and guards in follow-ups.
How to review this

- Start with the files in `nvcuvid_include`. They're the headers from NVCUVID. We have to vendor them because they're not part of the normal CUDA toolkit.
- Note the new device string: `device="cuda:0:beta"`. That's not a legible pytorch device string, so we can consider this API to be private. It is subject to change anyway. For now we just need a convenient way to expose this in Python, for testing.
- See `_video_decoder.py` to further explore how the Python API is exposed.
- Check the new `DeviceInterface` registration in `DeviceInterface.[h, cpp]`: previously, we could only register one interface per device type (one interface for CPU, one for CUDA, etc.). Now, we can register multiple interfaces per device type with the "device_variant" key. This is how we can enable both `device="cuda"` and `device="cuda:0:beta"` and have both the default and the beta CUDA interfaces (see the sketch after this list).
- Read `SingleStreamDecoder` and pay attention to the new extension points added to the `DeviceInterface`. You'll see `sendPacket`, `receiveFrame`, and a few other things. They mirror the existing FFmpeg APIs. Don't get too scared by the new bit-stream filtering (BSF) logic: it's just something that converts an AVPacket into another AVPacket, with a different binary format. It's needed for some codecs. Eventually, we'll move that away from the `SingleStreamDecoder`, and encapsulate it within the interface (this is a TODO).
- Dive into the `BetaCudaDeviceInterface`. You're now in a whole new rabbit hole with a lot of new concepts. We're literally writing our own decoder here, and that's not something we've done yet in TorchCodec (we were always relying on FFmpeg so far). There's too much to write to describe how it works, and anything I write now will likely be obsolete within a few days, so let's go over it in our sync.
Previous design considerations (rejected)

I discussed a lot of that during meetings already, but just for ref:
I thought about registering ourselves as a HWAccel object. The HWAccel registration is what FFmpeg does in its own NVCUVID backend (which is what we use as our current CUDA interface). There would be a major upside to registering ourselves as a HWAccel: we would be able to rely on FFmpeg's native parser and FFmpeg's frame-reordering logic, instead of relying on the NVCUVID parser and having to implement the frame-reordering logic ourselves. See this? This is converting FFmpeg's parser info (right side) to NVCUVID's built-in frame info (left side). Among other things, this frame metadata is needed for the NVDEC hardware decoder to know which frame it should decode first when there are frame dependencies (like B-frames). If we were to implement our own HWAccel, we'd have to implement this metadata mapping ourselves, for all codecs. That's obviously far from trivial in terms of required knowledge, but more importantly, it's technically impossible: all of the FFmpeg parser info is private. It's not ABI stable and it relies on private headers anyway. So NVCUVID's claim that users can rely on a third-party parser instead of their own isn't really true for us. No one, except for FFmpeg themselves, can use the FFmpeg parser. We have to build our own parser (not happening), or rely on NVCUVID's parser.
NVCUVID is designed around callbacks which are triggered when a packet is being parsed. Crucially, there is the `pfnDisplayPicture` callback which is triggered when a frame is fully decoded and ready to be displayed, in display order (which is great). Unfortunately, we cannot rely on this callback: it is triggered by the parser within a call to `cuvidParseVideoData(packet)`, which means that the only way to know that a frame is ready is to send a packet. I guess that probably makes sense in streaming applications? But that's incompatible with FFmpeg's send/receive non-blocking APIs, where we should be able to query whether a frame is ready without having to send a packet. So if we were to rely on this callback, we'd likely need massive changes to our `SingleStreamDecoder` architecture - which is something we don't want to do. So, we can't rely on this callback, and we have to figure out the frame display order ourselves.

EDIT: I might be wrong about this. We may be able to rely on the `pfnDisplayPicture` callback after all.
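For reference, the callback wiring described above looks roughly like this (simplified; the real setup configures more fields, and the callback names here are placeholders):

```cpp
// All callbacks fire synchronously from inside cuvidParseVideoData().
CUVIDPARSERPARAMS parserParams = {};
parserParams.CodecType = cudaVideoCodec_H264; // chosen per stream
parserParams.pUserData = this;
parserParams.pfnSequenceCallback = onSequence;     // stream format known
parserParams.pfnDecodePicture = onDecodePicture;   // picture ready to decode
parserParams.pfnDisplayPicture = onDisplayPicture; // frame ready, in display order

CUvideoparser parser = nullptr;
cuvidCreateVideoParser(&parser, &parserParams);

// Feed one FFmpeg-demuxed packet to the parser:
CUVIDSOURCEDATAPACKET cuPacket = {};
cuPacket.payload = avPacket->data;
cuPacket.payload_size = avPacket->size;
cuvidParseVideoData(parser, &cuPacket); // may invoke any of the callbacks
```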