Add an option to use pytorch for GPU-based image preprocessing #941
Merged
Conversation
`gpu_speedups` code review
PawelPeczek-Roboflow previously approved these changes Feb 17, 2025
grzegorz-roboflow approved these changes Feb 17, 2025
Description
This PR includes three meaningful changes. Apologies that they are all bundled into one.
The first is that I modify the preprocessing pipeline to enable running operations such as resize and pad on GPU instead of CPU, using pytorch instead of opencv to power them. This works by dynamically routing images to opencv or pytorch analogues depending on the type of the image object; choosing GPU or CPU processing then reduces to placing the image on GPU via torch or on CPU via numpy at the start of the preprocessing block. This also makes it easy to run the pipeline without a torch dependency: if torch isn't installed, all the routing stays within numpy via opencv.
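A minimal sketch of that type-based routing, assuming an NCHW float tensor on the torch path (the function name and signature are illustrative, not the PR's actual API):

```python
import cv2
import numpy as np

try:
    import torch
    import torch.nn.functional as F
except ImportError:
    torch = None  # torch is optional; the numpy/opencv path still works


def resize_image(image, width: int, height: int):
    """Route resizing to torch or opencv based on the image's type."""
    if torch is not None and isinstance(image, torch.Tensor):
        # Assumed NCHW float tensor; the op runs on whatever device
        # (CPU or GPU) the tensor already lives on.
        return F.interpolate(
            image, size=(height, width), mode="bilinear", align_corners=False
        )
    # numpy arrays stay on CPU and go through opencv as before.
    return cv2.resize(image, (width, height), interpolation=cv2.INTER_LINEAR)
```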
The second is to modify how we execute our ONNX graph. Doing the preprocessing for an image on GPU and then moving it to CPU to pipe it into an ONNX graph that lives on GPU adds an extra round trip from VRAM to RAM and back to VRAM. We can avoid this by pointing ONNX directly at the memory location of the tensor, regardless of whether the tensor lives on CPU or GPU, via IO binding. So for all ONNX-based models, we replace the session.run approach with an IO-binding-based approach.
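A minimal sketch of the IO-binding pattern using onnxruntime's public API (the model path and tensor shape are placeholders; the actual integration in inference differs):

```python
import numpy as np
import onnxruntime as ort
import torch

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Preprocessed image already living in VRAM as a contiguous tensor.
tensor = torch.rand(1, 3, 640, 640, device="cuda").contiguous()

binding = session.io_binding()
binding.bind_input(
    name=input_name,
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(tensor.shape),
    buffer_ptr=tensor.data_ptr(),  # ONNX reads directly from this memory
)
# Let onnxruntime allocate the output on the same device (no host round trip).
binding.bind_output(output_name, device_type="cuda", device_id=0)
session.run_with_iobinding(binding)
outputs = binding.get_outputs()  # OrtValues; .numpy() copies to host only if called
```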
The third is to modify our integration and unit testing so that these changes can be robustly tested. The existing implementation for unit testing an instance segmentation model assumed that the polygon representations of masks produced by slightly different implementations of a model would be similar. However, small changes to a mask can lead to drastically different polygon representations. I found a case where the existing unit testing failed for masks that had 0.9998 IOU, which I would consider a pass. I therefore replaced our polygon-based testing with testing based on the IOU of the predicted mask vs. a reference mask. Some tests did not have a reference mask and instead only counted the number of points in the polygon; for those, I generated new reference predictions via the existing numpy pipeline running entirely on CPU (including CPU-based ONNX). I used this as an excuse to refactor many of the unit tests to use a common testing function, significantly decreasing redundant code.
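A minimal sketch of the mask-IOU comparison described above (the helper name and tolerance are illustrative):

```python
import numpy as np


def mask_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """IOU between two boolean masks of the same shape."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0  # two empty masks are identical
    return float(np.logical_and(pred, ref).sum()) / float(union)


# Two masks differing by a single pixel still score near-perfect IOU,
# even though their polygon representations could diverge sharply.
a = np.zeros((100, 100), dtype=bool)
a[10:90, 10:90] = True
b = a.copy()
b[10, 10] = False
assert mask_iou(a, b) > 0.999
```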
I also found that testing of the CUDAExecutionProvider for ONNX graphs was disabled via an environment variable in the highest-level conftest. With the CUDAExecutionProvider added back in, unit tests for models failed on the main branch of inference. After making the above change to mask testing and loosening some error tolerances in other tests, unit tests passed again when executed via the ONNX CUDA runtime.
These changes were motivated by a Roboflow client running small models on very large images (2048x2048). In this case we would expect the runtime to be dominated by the CPU-based image resize operation, as opposed to the GPU-based ONNX graph. I tested latency using the server-benchmark repo. By setting USE_PYTORCH_FOR_PREPROCESSING to True, latency dropped from 130-140ms down to 50-60ms.
This change should in general decrease model latency. However, for models running on small to medium images, the decrease will likely be marginal. Additionally, latency can increase when the image is very large and the target size is very small: the overhead of transferring the large image to GPU VRAM up front, rather than transferring the smaller resized image later, may outweigh the expected gains in resize speed. In a well-optimized pipeline that parallelizes CPU-based preprocessing of one batch of images with GPU-based inference on another, this change may also decrease throughput. Finally, existing users are not required to install pytorch when they install inference via CLI, so not all users will necessarily have pytorch available, and pytorch is a heavy enough dependency that it may not make sense to force everyone to install it. I've therefore opted to introduce an environment variable to enable or disable this behavior. The code raises a warning when GPU-based preprocessing is requested but torch is not found.
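A minimal sketch of that flag-plus-warning behavior (only the environment variable name comes from this PR; the parsing and fallback logic here are illustrative):

```python
import logging
import os

logger = logging.getLogger(__name__)

USE_PYTORCH_FOR_PREPROCESSING = os.getenv(
    "USE_PYTORCH_FOR_PREPROCESSING", "False"
).lower() in ("true", "1")

try:
    import torch
except ImportError:
    torch = None

if USE_PYTORCH_FOR_PREPROCESSING and torch is None:
    # GPU preprocessing was requested but torch isn't installed;
    # warn and fall back to the numpy/opencv path.
    logger.warning(
        "USE_PYTORCH_FOR_PREPROCESSING is set but torch is not available; "
        "falling back to CPU-based preprocessing."
    )
```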
How has this change been tested? Please provide a testcase or example of how you tested the change.
I tested tests/inference/integration_tests/regression_test.py on my development machine, which has an L4 GPU, with and without the environment variable set on docker run. I also tested tests/inference/model_predictions_tests with and without the environment variable, and with pytorch GPU access enabled and disabled via the CUDA_VISIBLE_DEVICES variable. I tested the decrease in latency via the server-benchmark repository.
I called @grzegorz-roboflow to discuss CI for changes like this: feature flags that require a GPU for testing. I showed him some proposed changes to our github actions workflows, which I have committed here. However, it is currently unclear whether our GPU-based CI is running; the Jetson-based workflows do not appear to have run in months. Before merging this PR, I would like some way to confirm that these tests pass on the target hardware, so I do not yet feel comfortable merging this change.
Any specific deployment considerations
We can add to the docs that there is now an environment variable that can be set to decrease latency for users running models at high resolutions.
Docs