
failure of TensorRT 10.8.0.43 when running Unimatch Fp32 to Fp16 conversion on GPU Jetson Orin 8GB and NVIDIA RTX 4500 #4355

Open
danielmimimi opened this issue Feb 10, 2025 · 2 comments
Assignees
Labels
internal-bug-tracked Tracked internally, will be fixed in a future release. Investigating Issue needs further investigation Module:Accuracy Output mismatch between TensorRT and other frameworks triaged Issue has been triaged by maintainers

Comments

@danielmimimi

Description

I tried to convert the Unimatch FP32 gmflow-scale1 model to a float16 engine. However, the float16 model is not really usable; this conclusion comes from visually inspecting the results as well as from the polygraphy tool. To convert the model I used the following commands:

  • trtexec --onnx=exportedOnnxModel --saveEngine=trexec_fp16_model.engine --fp16
  • polygraphy run exportedOnnxModel --trt --fp16 --save-engine polygraphy_fp16_model.engine

Both produce the same warning output:

[02/10/2025-11:52:28] [W] [TRT] Running layernorm after self-attention with FP16 Reduce or Pow may cause overflow. Forcing Reduce or Pow Layers in FP32 precision, or exporting the model to use INormalizationLayer (available with ONNX opset >= 17) can help preserving accuracy.

I then compared the fp32 and fp16 engines with polygraphy:

polygraphy run --trt trexec_fp32_model.engine \
   --save-inputs inputs.json --save-outputs outputs_fp32.json

polygraphy run --trt trexec_fp16_model.engine \
   --load-inputs inputs.json --load-outputs outputs_fp32.json \
   --atol 0.001 --rtol 0

Applying the last command reveals the following:

[I] Accuracy Comparison | trt-runner-N0-02/10/25-11:56:38 vs. trt-runner-N0-02/10/25-11:46:36
[I]     Comparing Output: '5242' (dtype=float32, shape=(2, 512, 512)) with '5242' (dtype=float32, shape=(2, 512, 512))
[I]         Tolerance: [abs=0.001, rel=0] | Checking elemwise error
[I]         trt-runner-N0-02/10/25-11:56:38: 5242 | Stats: mean=-6.6426, std-dev=3.1498, var=9.9214, median=-6.0293, min=-11.602 at (0, 198, 279), max=-1.5293 at (1, 511, 7), avg-magnitude=6.6426, p90=-2.8984, p95=-2.6699, p99=-2.207
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-11.6, -10.6) |      42189 | ###########
                (-10.6, -9.51) |     113680 | ##############################
                (-9.51, -8.46) |      94404 | #########################
                (-8.46, -7.41) |      11855 | ###
                (-7.41, -6.37) |         16 | 
                (-6.37, -5.32) |          0 | 
                (-5.32, -4.27) |      31276 | ########
                (-4.27, -3.22) |     149044 | ########################################
                (-3.22, -2.18) |      77238 | ####################
                (-2.18, -1.13) |       4586 | #
[I]         trt-runner-N0-02/10/25-11:46:36: 5242 | Stats: mean=-4.1858, std-dev=1.7002, var=2.8905, median=-3.7579, min=-6.9407 at (1, 16, 55), max=-1.1289 at (0, 7, 151), avg-magnitude=4.1858, p90=-2.0205, p95=-1.7581, p99=-1.4932
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-11.6, -10.6) |          0 | 
                (-10.6, -9.51) |          0 | 
                (-9.51, -8.46) |          0 | 
                (-8.46, -7.41) |          0 | 
                (-7.41, -6.37) |      42160 | #########
                (-6.37, -5.32) |     152490 | ###################################
                (-5.32, -4.27) |      65357 | ###############
                (-4.27, -3.22) |      26894 | ######
                (-3.22, -2.18) |     171081 | ########################################
                (-2.18, -1.13) |      66306 | ###############
[I]         Error Metrics: 5242
[I]             Minimum Required Tolerance: elemwise error | [abs=9.6093] OR [rel=7.0797] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=4.6614, std-dev=2.5247, var=6.374, median=3.6656, min=1.7329 at (1, 407, 496), max=9.6093 at (0, 1, 280), avg-magnitude=4.6614, p90=7.722, p95=8.1335, p99=8.8531
[I]                 ---- Histogram ----
                    Bin Range    |  Num Elems | Visualization
                    (1.73, 2.52) |     249836 | ########################################
                    (2.52, 3.31) |      12308 | #
                    (3.31, 4.1 ) |          0 | 
                    (4.1 , 4.88) |        148 | 
                    (4.88, 5.67) |       8790 | #
                    (5.67, 6.46) |      46781 | #######
                    (6.46, 7.25) |      89001 | ##############
                    (7.25, 8.03) |      86080 | #############
                    (8.03, 8.82) |      25450 | ####
                    (8.82, 9.61) |       5894 | 
[I]             Relative Difference | Stats: mean=1.6571, std-dev=1.4322, var=2.0513, median=1.0151, min=0.26921 at (1, 7, 272), max=7.0797 at (0, 15, 39), avg-magnitude=1.6571, p90=3.7972, p95=4.2712, p99=5.3135
[I]                 ---- Histogram ----
                    Bin Range     |  Num Elems | Visualization
                    (0.269, 0.95) |     262144 | ########################################
                    (0.95 , 1.63) |        288 | 
                    (1.63 , 2.31) |      81557 | ############
                    (2.31 , 2.99) |      84821 | ############
                    (2.99 , 3.67) |      36171 | #####
                    (3.67 , 4.36) |      36182 | #####
                    (4.36 , 5.04) |      15488 | ##
                    (5.04 , 5.72) |       4978 | 
                    (5.72 , 6.4 ) |       2027 | 
                    (6.4  , 7.08) |        632 | 
[E]         FAILED | Output: '5242' | Difference exceeds tolerance (rel=0, abs=0.001)
[E]     FAILED | Mismatched outputs: ['5242']
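For reference, the pass/fail criterion applied here can be sketched in plain Python. This is a simplified elementwise check over flattened outputs (the "OR" semantics match the "abs=... OR rel=..." wording in the log above); it is a sketch, not polygraphy's actual implementation:

```python
def within_tolerance(expected, actual, atol=1e-3, rtol=0.0):
    """Simplified elementwise check: each element passes if it is within
    EITHER the absolute tolerance OR the relative tolerance."""
    return all(
        abs(a - e) <= atol or abs(a - e) <= rtol * abs(e)
        for e, a in zip(expected, actual)
    )

# With atol=0.001 and rtol=0, even a deviation of 0.01 already fails:
within_tolerance([1.0, 2.0], [1.0005, 2.0003])  # passes
within_tolerance([1.0, 2.0], [1.0, 2.01])       # fails
```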

The output is completely different. Following the layernorm warning, I tried to actively exclude the affected layers from the FP16 conversion using the TensorRT Python API. The goal was to let the entire attention block run in FP32. However, it yielded the same result.

    # Note: layer.precision is only honoured when the builder config also sets
    # trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS (or PREFER_PRECISION_CONSTRAINTS).
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type in [trt.LayerType.NORMALIZATION, trt.LayerType.REDUCE, trt.LayerType.MATRIX_MULTIPLY, trt.LayerType.SOFTMAX, trt.LayerType.ACTIVATION]:
            layer.precision = trt.float32
            for output_idx in range(layer.num_outputs):
                layer.set_output_type(output_idx, trt.float32)

I observed the same behaviour when converting this network on the Jetson Orin 8GB (TensorRT 8.6 and 10.0); for extended tests I switched to another device.

Environment

I am using the nvcr.io/nvidia/tensorrt:25.01-py3 Docker image.

TensorRT Version:
10.8.0.43-1
NVIDIA GPU:
NVIDIA RTX 4500
NVIDIA Driver Version:
535.183.01
CUDA Version:
12.8
CUDNN Version:

Operating System:
Ubuntu 24.04.1 LTS
Python Version (if applicable):
3.12.3
Tensorflow Version (if applicable):
No
PyTorch Version (if applicable):
No
Baremetal or Container (if so, version):
nvcr.io/nvidia/tensorrt:25.01-py3

Relevant Files

Model link:
Unimatch gmflow-scale1

Steps To Reproduce

  • trtexec --onnx=gmflow-scale1_simplified.onnx --saveEngine=trexec_fp32_model.engine
  • trtexec --onnx=gmflow-scale1_simplified.onnx --saveEngine=trexec_fp16_model.engine --fp16
  • polygraphy run --trt trexec_fp32_model.engine --save-inputs inputs.json --save-outputs outputs_fp32.json
  • polygraphy run --trt trexec_fp16_model.engine --load-inputs inputs.json --load-outputs outputs_fp32.json --atol 0.001 --rtol 0

Commands or scripts:

Have you tried the latest release?:
Yes, the Docker image is quite recent.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):
Yes, though presumably not in FP16.

@danielmimimi danielmimimi changed the title XXX failure of TensorRT X.Y when running XXX on GPU XXX failure of TensorRT 10.8.0.43 when running Unimatch Fp32 to Fp16 conversion on GPU Jetson Orin 8GB and NVIDIA RTX 4500 Feb 10, 2025
@LeoZDong LeoZDong self-assigned this Feb 10, 2025
@LeoZDong LeoZDong added triaged Issue has been triaged by maintainers Module:Polygraphy Issues with Polygraphy Investigating Issue needs further investigation labels Feb 10, 2025
@brnguyen2 brnguyen2 added Module:Accuracy Output mismatch between TensorRT and other frameworks and removed Module:Polygraphy Issues with Polygraphy labels Feb 12, 2025
@galagam

galagam commented Feb 16, 2025

As a workaround, please consider using strongly typed mode: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/advanced.html#strongly-typed-networks
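For anyone trying this workaround: strongly typed mode is also exposed on the trtexec command line. A sketch, reusing the model filename from the reproduction steps above (in this mode, precisions come from the ONNX model's own types, so TensorRT will not autonomously demote the layernorm math to FP16):

```shell
# Build a strongly typed engine; note there is no --fp16 flag here,
# since layer precisions are taken from the network's declared types.
trtexec --onnx=gmflow-scale1_simplified.onnx \
        --saveEngine=strongly_typed_model.engine \
        --stronglyTyped
```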

@LeoZDong LeoZDong added the internal-bug-tracked Tracked internally, will be fixed in a future release. label Feb 18, 2025
@galagam

galagam commented Feb 18, 2025

@danielmimimi Can you provide sample inputs for this model? Using random inputs is not always a good measurement of the network's accuracy.
