
Improving performance of video inference and increasing GPU utilization #178

Open
h9419 wants to merge 8 commits into master
Conversation

@h9419 commented Apr 6, 2022

Three improvements are made in this contribution:

  1. Removed repeated copying of the background image to GPU memory, minimizing the effect of a memory-bandwidth bottleneck
  2. Increased GPU utilization by offloading CPU video encoding to child threads as soon as each output frame is copied back to CPU memory, freeing the parent process to begin processing the next frame
  3. Further increased GPU utilization by offloading CPU video decoding to another thread so that the main thread can focus on feeding the GPU (a sketch of these changes follows the timing results below)

These modifications gave roughly three times the performance on my system with an R7 5800H and an RTX 3060 mobile. Using the same 4K video on both the resnet50 and resnet101 models, the original version ran at 2.20 it/s, whereas this one runs at an average of 7.5 it/s.
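For illustration, here is a minimal sketch of the first two changes. It is not the actual diff from this PR; `VideoDataset`, `model`, `writer`, and `load_background_image` are placeholder names assumed for the sketch.

```python
# Sketch only: placeholder names, not the PR's actual code.
from concurrent.futures import ThreadPoolExecutor

import torch
from torch.utils.data import DataLoader

device = torch.device('cuda')

# 1. Copy the background to GPU memory once, outside the frame loop,
#    instead of re-uploading it for every frame.
bgr = load_background_image()                   # placeholder: HxWx3 uint8 array
bgr = torch.as_tensor(bgr).permute(2, 0, 1).float().div(255)
bgr = bgr.unsqueeze(0).to(device, non_blocking=True)

# 2. Hand CPU-side encoding to a worker thread as soon as the result is back
#    in CPU memory, so the main loop can start on the next frame.
encoder_pool = ThreadPoolExecutor(max_workers=1)

def write_frame(frame_cpu):
    writer.write(frame_cpu.numpy())             # placeholder video writer

for src in DataLoader(VideoDataset('input.mp4'), batch_size=1, pin_memory=True):
    src = src.to(device, non_blocking=True)
    with torch.no_grad():
        pha, fgr = model(src, bgr)[:2]          # placeholder matting model: alpha + foreground
    com = fgr * pha                             # placeholder post-processing on GPU
    encoder_pool.submit(write_frame, com.squeeze(0).cpu())

encoder_pool.shutdown(wait=True)                # flush pending writes
```

A single writer thread is used in this sketch so frames stay in encode order; the PR itself fans the encoding work out to multiple child threads.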

h9419 and others added 8 commits April 6, 2022 21:47
…and threading CPU video encoding

Two improvements are made in this fork:

1. Removed repeated copying of background to GPU memory
2. Minimized idle GPU time by passing video encoding work to child threads as soon as each frame is copied to CPU memory, allowing for higher GPU utilization.
Breaks the loop instead of exiting directly
Fixed: replaced type(int) with (int)
I found that the time the CPU spends in the DataLoader is another 30-40% of the execution time. I added a thread for loading data and reserved the main thread for controlling the GPU.
Added an if __name__ == '__main__' guard so that Windows recognizes Process
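Roughly, the decode-thread commit follows the standard producer/consumer pattern sketched below (placeholder names, not the PR's actual code):

```python
# Sketch only: illustrates the decode-thread idea with placeholder names.
import threading
from queue import Queue

import torch

def decode_worker(dataset, frame_queue):
    """Runs on a CPU thread: decodes frames and queues them for the GPU loop."""
    for frame in dataset:                       # placeholder: iterable VideoDataset yielding tensors
        frame_queue.put(frame)
    frame_queue.put(None)                       # sentinel: end of stream

def main():
    frame_queue = Queue(maxsize=4)              # small buffer bounds memory use
    dataset = VideoDataset('input.mp4')         # placeholder dataset
    threading.Thread(target=decode_worker,
                     args=(dataset, frame_queue),
                     daemon=True).start()

    device = torch.device('cuda')
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        src = frame.to(device, non_blocking=True)
        # ... run the model and hand the result to the encoder threads ...

# The guard matters on Windows, where multiprocessing (e.g. DataLoader workers
# or Process) re-imports the module in each child; without it the script would
# spawn workers recursively.
if __name__ == '__main__':
    main()
```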
@h9419 (Author) commented Apr 17, 2022

Although this is faster, one major bottleneck remains in VideoDataset: when inferring on a 4K HEVC video, around 80% of the execution time is spent in VideoDataset decode. Future work could focus on using NVDEC or other GPU-accelerated video loaders.

@h9419 (Author) commented Jan 12, 2023

> Although this is faster, one major bottleneck remains in VideoDataset: when inferring on a 4K HEVC video, around 80% of the execution time is spent in VideoDataset decode. Future work could focus on using NVDEC or other GPU-accelerated video loaders.

I have made a version that works with NVIDIA's VPF library, which uses the NVENC and NVDEC hardware video accelerators and creates GPU tensors directly without involving the CPU. It runs inside a Docker container under WSL.

However, I don't plan to publish the code, since I don't think I can redistribute the NVENC/NVDEC/x264 binaries, and my glue code only works against the version of VPF I compiled myself at the time.

One thing I can verify is that the claimed inference speed is achievable on consumer-grade GPUs, and a GeForce RTX series GPU can be faster than a Quadro RTX simply because of its NVENC/NVDEC performance.
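For anyone who wants to try this route, the usual VPF decode loop looks roughly like the sketch below. It is written from memory of the PyNvCodec/PytorchNvCodec samples, so class names and signatures may differ between VPF releases, and it is not the unpublished code described above.

```python
# Hedged sketch of NVDEC decoding into a CUDA tensor with VPF (PyNvCodec).
# API details vary between VPF releases; treat names/signatures as approximate.
import PyNvCodec as nvc
import PytorchNvCodec as pnvc
import torch

gpu_id = 0
decoder = nvc.PyNvDecoder('input.mp4', gpu_id)            # NVDEC-backed decoder
w, h = decoder.Width(), decoder.Height()

# Convert decoded NV12 surfaces to planar RGB, still in GPU memory.
to_rgb = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
to_planar = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.RGB, nvc.PixelFormat.RGB_PLANAR, gpu_id)
cc = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601, nvc.ColorRange.MPEG)

while True:
    surface = decoder.DecodeSingleSurface()
    if surface.Empty():
        break                                             # end of stream
    rgb = to_planar.Execute(to_rgb.Execute(surface, cc), cc)
    plane = rgb.PlanePtr()
    # Wrap the device pointer as a torch uint8 tensor without a CPU round trip.
    frame = pnvc.makefromDevicePtrUint8(plane.GpuMem(), plane.Width(),
                                        plane.Height(), plane.Pitch(),
                                        plane.ElemSize())
    frame = frame.view(3, h, w).unsqueeze(0).float().div(255)
    # frame is now a 1x3xHxW CUDA tensor ready for the matting model.
```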
