
Add an option to use pytorch for GPU-based image preprocessing #941

Merged
merged 35 commits into main from gpu_speedups
Feb 17, 2025

Conversation

@isaacrob-roboflow (Contributor) commented Jan 14, 2025

Description

This PR includes three meaningful changes. Apologies that they are all bundled into one.

The first is that I modify the preprocessing pipeline to enable running operations such as resize and pad on GPU instead of CPU, by using pytorch instead of opencv to power them. Images are dynamically routed to opencv or pytorch analogues depending on the type of the image object, so choosing GPU or CPU processing reduces to placing the image on GPU via torch, or keeping it on CPU via numpy, at the start of the preprocessing block. This also makes it easy to run the pipeline without a torch dependency; if torch isn't installed, all the routing stays within numpy via opencv. A minimal sketch of the routing idea follows.
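This is an illustrative sketch of the type-based dispatch, not the actual code in this PR; the function name and the (C, H, W) tensor layout are assumptions:

```python
import cv2

try:
    import torch
    import torch.nn.functional as F

    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False


def resize_image(image, width: int, height: int):
    """Route to a torch or opencv resize based on the image's type."""
    if TORCH_AVAILABLE and isinstance(image, torch.Tensor):
        # (C, H, W) tensor; if it already lives on GPU, the resize runs on GPU.
        return F.interpolate(
            image.unsqueeze(0).float(), size=(height, width), mode="bilinear"
        ).squeeze(0)
    # numpy array: stay on CPU via opencv.
    return cv2.resize(image, (width, height))
```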

The second is to modify how we execute our ONNX graph. Doing the preprocessing for an image on GPU and then moving it to CPU to pipe it into an ONNX graph that lives on GPU adds an extra round trip from VRAM to RAM and back to VRAM. We can avoid this by pointing ONNX directly at the memory location of the tensor, regardless of whether the tensor lives on CPU or GPU, via IO binding. So for all ONNX-based models, we replace the session.run approach with an IO-binding-based approach (sketched below).
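A hedged sketch of the IO-binding pattern using the public onnxruntime API; the helper name is illustrative and the actual implementation in this PR may differ. It assumes a session created with CUDAExecutionProvider and an already-preprocessed float32 torch tensor:

```python
import numpy as np
import onnxruntime as ort
import torch


def run_with_io_binding(session: ort.InferenceSession, tensor: torch.Tensor):
    """Bind a (CPU or GPU) torch tensor directly as the model input."""
    tensor = tensor.contiguous()  # ORT reads the raw buffer, so it must be contiguous
    binding = session.io_binding()
    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type=tensor.device.type,   # "cuda" or "cpu"
        device_id=tensor.device.index or 0,
        element_type=np.float32,
        shape=tuple(tensor.shape),
        buffer_ptr=tensor.data_ptr(),
    )
    for output in session.get_outputs():
        binding.bind_output(output.name)  # let ORT allocate outputs on CPU
    session.run_with_iobinding(binding)
    return [out.numpy() for out in binding.get_outputs()]
```

Because the input is bound by device pointer, a GPU-resident tensor is consumed in place and the VRAM-to-RAM-to-VRAM round trip disappears.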

The third is to modify our integration and unit testing so that these changes can be robustly tested. The existing implementation for unit testing an instance segmentation model assumed that the polygon representations of masks produced by slightly different implementations of a model would be similar. However, small changes to a mask can lead to drastically different polygon representations; I found a case where the existing unit testing failed for masks that had 0.9998 IOU, which I would consider a pass. I therefore replaced the polygon-comparison testing with testing based on the IOU of the predicted mask against a reference mask (sketched below).

Some tests did not have a reference mask and instead only counted the number of points in the polygon. For those, I generated new reference predictions via the existing numpy pipeline running entirely on CPU (including CPU-based ONNX). I used this as an excuse to refactor many of the unit tests to use a common testing function, significantly decreasing redundant code.
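The mask-IOU check amounts to something like the following; the threshold value here is illustrative, not necessarily the one used in the test suite:

```python
import numpy as np


def mask_iou(predicted: np.ndarray, reference: np.ndarray) -> float:
    """IOU between two boolean masks of the same shape."""
    predicted, reference = predicted.astype(bool), reference.astype(bool)
    union = np.logical_or(predicted, reference).sum()
    if union == 0:
        return 1.0  # two empty masks are a perfect match
    return np.logical_and(predicted, reference).sum() / union


def assert_masks_match(predicted, reference, min_iou: float = 0.999):
    assert mask_iou(predicted, reference) >= min_iou
```

Unlike polygon comparison, this metric degrades smoothly with small mask perturbations, so near-identical masks (e.g. 0.9998 IOU) pass.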

I also found that testing of the CUDAExecutionProvider for ONNX graphs was disabled via an environmental variable in the highest-level conftest. When I added the CUDAExecutionProvider back in, unit tests for models failed on the main branch of inference. After making the above change to mask testing and loosening some error tolerances in other tests, unit tests again began passing when executed via the ONNX CUDA runtime.

These changes were motivated by a Roboflow client running small models on very large images (2048x2048). In this case we would expect the runtime to be dominated by the CPU-based image resize operation rather than by the GPU-based ONNX graph. I tested latency using the server-benchmark repo: with USE_PYTORCH_FOR_PREPROCESSING set to True, latency dropped from 130-140ms down to 50-60ms.

This change should in general decrease model latency. However, for models running on small to medium images, the decrease will likely be marginal. There is also potential for a latency increase when the image is very large and the target size is very small: the overhead of first transferring the large image to GPU VRAM, as opposed to transferring the smaller resized image later, may overshadow the expected gains in resize speed. In a well-optimized pipeline that parallelizes CPU-based preprocessing of one batch of images with GPU-based inference on another, this change may also decrease throughput. Finally, existing users are not required to install pytorch when they install inference via CLI, so not all users will necessarily have pytorch available; pytorch can be a heavy dependency, so it may not make sense to force everyone to install it going forward. I've therefore opted to introduce an environmental variable to enable or disable this behavior. The code is intended to raise a warning in the case where GPU-based preprocessing is requested but torch is not found (sketched below).
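A hedged sketch of the environment-variable gate; the variable name comes from the PR, but the helper name, warning text, and fallback logic are illustrative:

```python
import os
import warnings

USE_PYTORCH_FOR_PREPROCESSING = (
    os.getenv("USE_PYTORCH_FOR_PREPROCESSING", "False").lower() == "true"
)


def torch_preprocessing_enabled() -> bool:
    """True only when the flag is set AND torch is importable."""
    if not USE_PYTORCH_FOR_PREPROCESSING:
        return False
    try:
        import torch  # noqa: F401

        return True
    except ImportError:
        warnings.warn(
            "USE_PYTORCH_FOR_PREPROCESSING is set but torch is not installed; "
            "falling back to CPU-based opencv preprocessing."
        )
        return False
```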

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a test case or example of how you tested the change.

I ran tests/inference/integration_tests/regression_test.py on my development GPU machine, which has an L4 GPU, with and without the environmental variable set on docker run. I also ran tests/inference/model_predictions_tests with and without the environmental variable, and with pytorch GPU access enabled and disabled via the CUDA_VISIBLE_DEVICES variable. I measured the decrease in latency via the server-benchmark repository.

I called @grzegorz-roboflow to discuss CI for these kinds of feature-flagged changes that require a GPU for testing. I showed him some proposed changes to our github actions workflows that I have committed here. However, it is currently unclear whether our GPU-based CI is running; the Jetson-based workflows do not appear to have been run in months. Before merging this PR, I would like to have some way to confirm that these tests pass on the target hardware, so I do not yet feel comfortable merging this change.

Any specific deployment considerations

We can add to the docs that there is now an environmental variable that, when set, should result in decreased latency for people running models at high resolutions.

Docs

  • Docs updated? What were the changes:

@isaacrob-roboflow changed the title from "poc 2x speedup via torch-based preprocessing" to "Add an option to use pytorch for GPU-based image preprocessing" on Feb 12, 2025
@grzegorz-roboflow mentioned this pull request on Feb 17, 2025
@grzegorz-roboflow merged commit b9f473a into main on Feb 17, 2025 (42 checks passed)
@grzegorz-roboflow deleted the gpu_speedups branch on February 17, 2025 at 16:16