Improving performance of video inference and increasing GPU utilization #178
Conversation
…and threading CPU video encoding

Two improvements are made in this fork:
1. Removed repeated copying of the background to GPU memory.
2. Minimized idle GPU time by passing video encoding work to child threads as soon as each result is copied to CPU memory, allowing for higher GPU utilization (see the sketch below).
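A minimal sketch of both changes, assuming hypothetical stand-ins (`model`, `dataloader`, `bgr_frame`, `writer`) for the repo's actual objects; the `model(src, bgr)` call and the compositing step are illustrative only:

```python
import threading
from queue import Queue

import torch

device = torch.device('cuda')

# Improvement 1: copy the static background to GPU memory once,
# instead of re-uploading it on every loop iteration.
bgr = bgr_frame.to(device, non_blocking=True)

# Improvement 2: hand encoding work to a worker thread as soon as a
# result lands in CPU memory, so the GPU never waits on the encoder.
encode_queue = Queue(maxsize=4)

def encode_worker():
    while True:
        frame = encode_queue.get()
        if frame is None:        # sentinel: no more frames
            break
        writer.write(frame)      # CPU-side video encoding (hypothetical writer)

worker = threading.Thread(target=encode_worker)
worker.start()

with torch.no_grad():
    for src in dataloader:
        src = src.to(device, non_blocking=True)
        pha, fgr = model(src, bgr)[:2]   # matting inference on GPU (sketch)
        com = fgr * pha                  # composite (sketch only)
        encode_queue.put(com.cpu())      # copy to CPU, then enqueue

encode_queue.put(None)  # signal the worker to finish
worker.join()
```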
Breaks the loop instead of exiting directly
Fixed: replaced `type(int)` with `(int)`
Fixed imports
I found out that the time the CPU spends in the DataLoader accounts for another 30–40% of the execution time. I added a thread for loading data and reserved the main thread for controlling the GPU (a sketch of this pattern follows).
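A minimal sketch of that producer/consumer split, with `dataloader` and `run_inference` as hypothetical stand-ins for the repo's actual loader and GPU-side work:

```python
import threading
from queue import Queue

# A small bounded queue keeps a couple of batches prefetched without
# letting the loader run far ahead of the GPU.
batch_queue = Queue(maxsize=2)

def load_worker():
    # Producer: pulls batches off the (CPU-bound) dataloader.
    for batch in dataloader:
        batch_queue.put(batch)
    batch_queue.put(None)  # sentinel: end of stream

threading.Thread(target=load_worker, daemon=True).start()

# Consumer: the main thread stays dedicated to driving the GPU.
while True:
    batch = batch_queue.get()
    if batch is None:
        break
    run_inference(batch)  # hypothetical: the GPU-side work
```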
Added an `if __name__ == '__main__'` guard so that Windows recognizes `Process`
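For context, a self-contained sketch of why the guard matters; the worker function here is illustrative, not the repo's:

```python
import multiprocessing as mp

def worker(n):
    print(f'worker got {n}')

# On Windows, multiprocessing uses the "spawn" start method: a fresh
# interpreter re-imports this module, so creating a Process at module
# level would recurse. The guard restricts it to the parent process.
if __name__ == '__main__':
    p = mp.Process(target=worker, args=(42,))
    p.start()
    p.join()
```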
Although this is faster, one major bottleneck remains in VideoDataset: when inferring on a 4K HEVC video, around 80% of the execution time is spent on VideoDataset decoding. Future work could focus on using NVDEC or other GPU-accelerated video loaders.
I have made a version that works with NVIDIA's VPF library, which takes advantage of the NVENC and NVDEC hardware accelerators for video and creates GPU tensors directly without involving the CPU. It runs inside a Docker container under WSL. However, I don't plan to publish the code, since I don't think I can redistribute NVENC/NVDEC/x264 binaries, and my glue code only worked with a self-compiled build at the time I wrote it. One thing I can verify is that the claimed inference speed is achievable on consumer-grade GPUs, and a GeForce RTX series GPU can be faster than a Quadro RTX simply because of its NVENC/NVDEC performance. (A rough sketch of the decode loop is below.)
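Not the commenter's code, just a minimal sketch of what a VPF-based NVDEC decode loop might look like; the `'input.mp4'` path is illustrative, and VPF's Python API has changed across releases, so treat the names as approximate:

```python
import PyNvCodec as nvc  # NVIDIA VideoProcessingFramework (VPF)

gpu_id = 0
# Hardware (NVDEC) decoder bound to one video file.
nv_dec = nvc.PyNvDecoder('input.mp4', gpu_id)

while True:
    surface = nv_dec.DecodeSingleSurface()
    if surface.Empty():
        break  # end of stream
    # `surface` stays in GPU memory; VPF's PytorchNvCodec extension can
    # wrap it as a torch.Tensor with no round trip through the CPU.
```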
The two improvements in this contribution gave roughly a 3× speedup on my system (Ryzen 7 5800H, mobile RTX 3060): on the same 4K video, with both the resnet50 and resnet101 models, the original version ran at 2.20 it/s, whereas this version averages 7.5 it/s.