- enabled benchmarking FP16 vector arithmetic on Nvidia Pascal and newer GPUs with Nvidia driver 520 or newer
- removed `wait()` call at the end of the benchmark on Linux
 |----------------.------------------------------------------------------------|
 | Device ID      | 9                                                          |
 | Device Name    | NVIDIA GeForce RTX 2080 Ti                                 |
 | Device Vendor  | NVIDIA Corporation                                         |
 | Device Driver  | 525.89.02 (Linux)                                          |
 | OpenCL Version | OpenCL C 1.2                                               |
 | Compute Units  | 68 at 1545 MHz (4352 cores, 13.448 TFLOPs/s)               |
 | Memory, Cache  | 11011 MB, 2176 KB global / 48 KB local                     |
 | Buffer Limits  | 2752 MB global, 64 KB constant                             |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                         0.517 TFLOPs/s (1/24) |
 | FP32  compute                                        16.597 TFLOPs/s ( 1x ) |
-| FP16  compute                                          not supported        |
+| FP16  compute                                        33.054 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.563  TIOPs/s (1/4 ) |
 | INT32 compute                                        16.385  TIOPs/s ( 1x ) |
 | INT16 compute                                        13.286  TIOPs/s ( 1x ) |
 | INT8  compute                                        10.502  TIOPs/s (2/3 ) |
 | Memory Bandwidth ( coalesced read      )                        532.76 GB/s |
 | Memory Bandwidth ( coalesced      write)                        548.88 GB/s |
 | Memory Bandwidth (misaligned read      )                        534.43 GB/s |
 | Memory Bandwidth (misaligned      write)                        157.78 GB/s |
 | PCIe   Bandwidth (send                 )                         12.86 GB/s |
 | PCIe   Bandwidth (   receive           )                         12.99 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    6.30 GB/s |
 |-----------------------------------------------------------------------------|
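
As a rough cross-check of the numbers above, the following Python sketch reproduces the theoretical peak shown in the `Compute Units` row and the measured FP16-to-FP32 ratio. It assumes one fused multiply-add (2 FLOPs) per core per clock cycle and 64 FP32 cores per compute unit; the helper name `peak_tflops` is hypothetical and not part of the benchmark tool, and the tool's own rounding of the ratio column may differ.

```python
# Hypothetical helper for illustration; not taken from the benchmark's source.
def peak_tflops(cores: int, clock_mhz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in TFLOPs/s, assuming one FMA (2 FLOPs) per core per cycle."""
    return cores * clock_mhz * 1e6 * flops_per_cycle / 1e12

# 68 compute units, assuming 64 FP32 cores each (4352 cores in total)
cores = 68 * 64
peak = peak_tflops(cores, 1545)

# Measured values copied from the benchmark output above
fp16_measured = 33.054
fp32_measured = 16.597
ratio = fp16_measured / fp32_measured

print(f"theoretical FP32 peak: {peak:.3f} TFLOPs/s")  # 13.448
print(f"measured FP16 / FP32:  {ratio:.2f}x")          # 1.99
```

The measured FP32 throughput exceeds this theoretical figure, which is plausible if the GPU boosts above the reported 1545 MHz under load.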