
failure of TensorRT 10.8.0.43 when running Unimatch Fp32 to Fp16 conversion on GPU Jetson Orin 8GB and NVIDIA RTX 4500 #4355

Open
danielmimimi opened this issue Feb 10, 2025 · 2 comments
Assignees
Labels
internal-bug-tracked Tracked internally, will be fixed in a future release. Investigating Issue needs further investigation Module:Accuracy Output mismatch between TensorRT and other frameworks triaged Issue has been triaged by maintainers

Comments

@danielmimimi

Description

I tried to convert the Unimatch FP32 gmflow-scale1 model to a float16 engine. However, the float16 model is not really usable; this conclusion comes from visually inspecting the results as well as from the polygraphy tool. To convert the model I used the following commands:

  • trtexec --onnx=exportedOnnxModel --saveEngine=trexec_fp16_model.engine --fp16
  • polygraphy run exportedOnnxModel --trt --fp16 --save-engine polygraphy_fp16_model.engine

Both produce the same warning output:

[02/10/2025-11:52:28] [W] [TRT] Running layernorm after self-attention with FP16 Reduce or Pow may cause overflow. Forcing Reduce or Pow Layers in FP32 precision, or exporting the model to use INormalizationLayer (available with ONNX opset >= 17) can help preserving accuracy.

I then compared the fp32 and fp16 engines with polygraphy:

polygraphy run --trt trexec_fp32_model.engine \
   --save-inputs inputs.json --save-outputs outputs_fp32.json

polygraphy run --trt trexec_fp16_model.engine \
   --load-inputs inputs.json --load-outputs outputs_fp32.json \
   --atol 0.001 --rtol 0

Applying the last command reveals the following:

[I] Accuracy Comparison | trt-runner-N0-02/10/25-11:56:38 vs. trt-runner-N0-02/10/25-11:46:36
[I]     Comparing Output: '5242' (dtype=float32, shape=(2, 512, 512)) with '5242' (dtype=float32, shape=(2, 512, 512))
[I]         Tolerance: [abs=0.001, rel=0] | Checking elemwise error
[I]         trt-runner-N0-02/10/25-11:56:38: 5242 | Stats: mean=-6.6426, std-dev=3.1498, var=9.9214, median=-6.0293, min=-11.602 at (0, 198, 279), max=-1.5293 at (1, 511, 7), avg-magnitude=6.6426, p90=-2.8984, p95=-2.6699, p99=-2.207
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-11.6, -10.6) |      42189 | ###########
                (-10.6, -9.51) |     113680 | ##############################
                (-9.51, -8.46) |      94404 | #########################
                (-8.46, -7.41) |      11855 | ###
                (-7.41, -6.37) |         16 | 
                (-6.37, -5.32) |          0 | 
                (-5.32, -4.27) |      31276 | ########
                (-4.27, -3.22) |     149044 | ########################################
                (-3.22, -2.18) |      77238 | ####################
                (-2.18, -1.13) |       4586 | #
[I]         trt-runner-N0-02/10/25-11:46:36: 5242 | Stats: mean=-4.1858, std-dev=1.7002, var=2.8905, median=-3.7579, min=-6.9407 at (1, 16, 55), max=-1.1289 at (0, 7, 151), avg-magnitude=4.1858, p90=-2.0205, p95=-1.7581, p99=-1.4932
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-11.6, -10.6) |          0 | 
                (-10.6, -9.51) |          0 | 
                (-9.51, -8.46) |          0 | 
                (-8.46, -7.41) |          0 | 
                (-7.41, -6.37) |      42160 | #########
                (-6.37, -5.32) |     152490 | ###################################
                (-5.32, -4.27) |      65357 | ###############
                (-4.27, -3.22) |      26894 | ######
                (-3.22, -2.18) |     171081 | ########################################
                (-2.18, -1.13) |      66306 | ###############
[I]         Error Metrics: 5242
[I]             Minimum Required Tolerance: elemwise error | [abs=9.6093] OR [rel=7.0797] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=4.6614, std-dev=2.5247, var=6.374, median=3.6656, min=1.7329 at (1, 407, 496), max=9.6093 at (0, 1, 280), avg-magnitude=4.6614, p90=7.722, p95=8.1335, p99=8.8531
[I]                 ---- Histogram ----
                    Bin Range    |  Num Elems | Visualization
                    (1.73, 2.52) |     249836 | ########################################
                    (2.52, 3.31) |      12308 | #
                    (3.31, 4.1 ) |          0 | 
                    (4.1 , 4.88) |        148 | 
                    (4.88, 5.67) |       8790 | #
                    (5.67, 6.46) |      46781 | #######
                    (6.46, 7.25) |      89001 | ##############
                    (7.25, 8.03) |      86080 | #############
                    (8.03, 8.82) |      25450 | ####
                    (8.82, 9.61) |       5894 | 
[I]             Relative Difference | Stats: mean=1.6571, std-dev=1.4322, var=2.0513, median=1.0151, min=0.26921 at (1, 7, 272), max=7.0797 at (0, 15, 39), avg-magnitude=1.6571, p90=3.7972, p95=4.2712, p99=5.3135
[I]                 ---- Histogram ----
                    Bin Range     |  Num Elems | Visualization
                    (0.269, 0.95) |     262144 | ########################################
                    (0.95 , 1.63) |        288 | 
                    (1.63 , 2.31) |      81557 | ############
                    (2.31 , 2.99) |      84821 | ############
                    (2.99 , 3.67) |      36171 | #####
                    (3.67 , 4.36) |      36182 | #####
                    (4.36 , 5.04) |      15488 | ##
                    (5.04 , 5.72) |       4978 | 
                    (5.72 , 6.4 ) |       2027 | 
                    (6.4  , 7.08) |        632 | 
[E]         FAILED | Output: '5242' | Difference exceeds tolerance (rel=0, abs=0.001)
[E]     FAILED | Mismatched outputs: ['5242']
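For reference, the pass/fail criterion applied here can be sketched in plain Python. This is a simplified elementwise check over flattened outputs (the "OR" semantics match the "abs=... OR rel=..." wording in the log above); it is a sketch, not polygraphy's actual implementation:

```python
def within_tolerance(expected, actual, atol=1e-3, rtol=0.0):
    """Simplified elementwise check: each element passes if it is within
    EITHER the absolute tolerance OR the relative tolerance."""
    return all(
        abs(a - e) <= atol or abs(a - e) <= rtol * abs(e)
        for e, a in zip(expected, actual)
    )

# With atol=0.001 and rtol=0, even a deviation of 0.01 already fails:
within_tolerance([1.0, 2.0], [1.0005, 2.0003])  # passes
within_tolerance([1.0, 2.0], [1.0, 2.01])       # fails
```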

The output is completely different. Following the layernorm warning, I tried to actively exclude the affected layers from the FP16 conversion using the TensorRT Python API. The goal was to let the entire attention block run in FP32. However, it yielded the same result.

    # Note: layer.precision is only honoured when the builder config also sets
    # trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS (or PREFER_PRECISION_CONSTRAINTS).
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type in [trt.LayerType.NORMALIZATION, trt.LayerType.REDUCE, trt.LayerType.MATRIX_MULTIPLY, trt.LayerType.SOFTMAX, trt.LayerType.ACTIVATION]:
            layer.precision = trt.float32
            for output_idx in range(layer.num_outputs):
                layer.set_output_type(output_idx, trt.float32)

I observed the same behaviour when converting this network on the Jetson Orin 8GB (TensorRT 8.6 and 10.0); for extended tests I switched to another device.

Environment

I am using the nvcr.io/nvidia/tensorrt:25.01-py3 Docker image.

TensorRT Version:
10.8.0.43-1
NVIDIA GPU:
NVIDIA RTX 4500
NVIDIA Driver Version:
535.183.01
CUDA Version:
12.8
CUDNN Version:

Operating System:
Ubuntu 24.04.1 LTS
Python Version (if applicable):
3.12.3
Tensorflow Version (if applicable):
No
PyTorch Version (if applicable):
No
Baremetal or Container (if so, version):
nvcr.io/nvidia/tensorrt:25.01-py3

Relevant Files

Model link:
Unimatch gmflow-scale1

Steps To Reproduce

  • trtexec --onnx=gmflow-scale1_simplified.onnx --saveEngine=trexec_fp32_model.engine
  • trtexec --onnx=gmflow-scale1_simplified.onnx --saveEngine=trexec_fp16_model.engine --fp16
  • polygraphy run --trt trexec_fp32_model.engine --save-inputs inputs.json --save-outputs outputs_fp32.json
  • polygraphy run --trt trexec_fp16_model.engine --load-inputs inputs.json --load-outputs outputs_fp32.json --atol 0.001 --rtol 0

Commands or scripts:

Have you tried the latest release?:
Yes, the Docker image is quite recent.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):
Yes, though presumably not in FP16.

@danielmimimi danielmimimi changed the title XXX failure of TensorRT X.Y when running XXX on GPU XXX failure of TensorRT 10.8.0.43 when running Unimatch Fp32 to Fp16 conversion on GPU Jetson Orin 8GB and NVIDIA RTX 4500 Feb 10, 2025
@LeoZDong LeoZDong self-assigned this Feb 10, 2025
@LeoZDong LeoZDong added triaged Issue has been triaged by maintainers Module:Polygraphy Issues with Polygraphy Investigating Issue needs further investigation labels Feb 10, 2025
@brnguyen2 brnguyen2 added Module:Accuracy Output mismatch between TensorRT and other frameworks and removed Module:Polygraphy Issues with Polygraphy labels Feb 12, 2025
@galagam

galagam commented Feb 16, 2025

As a workaround, please consider using strongly typed mode: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/advanced.html#strongly-typed-networks
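For anyone trying this workaround: strongly typed mode is also exposed on the trtexec command line. A sketch, reusing the model filename from the reproduction steps above (in this mode, precisions come from the ONNX model's own types, so TensorRT will not autonomously demote the layernorm math to FP16):

```shell
# Build a strongly typed engine; note there is no --fp16 flag here,
# since layer precisions are taken from the network's declared types.
trtexec --onnx=gmflow-scale1_simplified.onnx \
        --saveEngine=strongly_typed_model.engine \
        --stronglyTyped
```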

@LeoZDong LeoZDong added the internal-bug-tracked Tracked internally, will be fixed in a future release. label Feb 18, 2025
@galagam

galagam commented Feb 18, 2025

@danielmimimi Can you provide sample inputs for this model? Using random inputs is not always a good measurement of the network's accuracy.
