This project demonstrates the performance advantages of GPU computing using CUDA, specifically optimized for compute-intensive workloads on modern NVIDIA GPUs.
- Multi-GPU support with automatic workload distribution
- Compute-intensive benchmark focused on transcendental functions
- OpenMP parallel implementation for CPU comparison
- Peer-to-peer GPU communication when supported
- Performance metrics including GFLOP/s calculations
- NVIDIA GPU(s) with CUDA support (optimized for RTX 30-series)
- CUDA Toolkit 11.0+
- GCC with OpenMP support
- C++14 compatible compiler
```bash
# Compile with optimizations for RTX 3080 Ti
make

# Run the benchmark
./cuda_benchmark

# Clean build artifacts
make clean
```
This benchmark is specifically designed to highlight GPU advantages by:
- Using transcendental functions (sin, cos, sqrt) that benefit from GPU's special function units
- Maintaining high arithmetic intensity with multiple operations per memory access
- Minimizing thread divergence with uniform workloads
- Utilizing asynchronous operations and non-blocking streams
The benchmark reports:
- Execution time for both GPU and CPU implementations
- Speedup ratio comparing GPU vs CPU performance
- Computational throughput in GFLOP/s
- Summary of results from each processing unit
On a system with dual NVIDIA RTX 3080 Ti GPUs and a 64-thread CPU:
```
Found 2 CUDA-capable device(s)
CPU has 64 threads available for OpenMP

Running computation on 2 GPU(s)...
Device 0: NVIDIA GeForce RTX 3080 Ti
- Launching with 640 blocks, 256 threads per block
- Processing 50000000 elements
Device 1: NVIDIA GeForce RTX 3080 Ti
- Launching with 640 blocks, 256 threads per block
- Processing 50000000 elements
- Kernel execution time: 0.0172 seconds
- Kernel execution time: 0.0044 seconds
GPU 0 sum: 18000024.267578
GPU 1 sum: 18000024.267578
Total GPU execution time: 0.3693 seconds
GPU compute throughput: 108.30 GFLOP/s

Running CPU version with 64 threads...
CPU execution time: 1.5310 seconds

--- Results ---
Multi-GPU execution time: 0.3693 seconds
CPU execution time: 1.5310 seconds
Speedup: 4.15x
GPU throughput: 108.30 GFLOP/s
CPU throughput: 26.13 GFLOP/s
```
These results show a significant performance advantage for GPU computation of transcendental functions. Note that per-kernel execution time (4–17 ms) is far lower than the total GPU execution time, which indicates room for further optimization by reducing setup and transfer overhead.
This type of GPU-accelerated compute would thrive in several domains where high computational intensity and parallelism are required:
**Scientific computing**
- Molecular dynamics simulations
- Computational fluid dynamics
- Weather and climate modeling
- Quantum physics simulations

**Finance**
- Monte Carlo simulations for risk assessment
- Options pricing and derivatives calculations
- High-frequency trading algorithms
- Portfolio optimization

**Machine learning**
- Neural network training and inference
- Deep learning model optimization
- Reinforcement learning environments
- Large language model inference

**Cryptography**
- Cryptocurrency mining
- Hash calculations
- Cryptographic algorithm benchmarking
- Zero-knowledge proof computations

**Media processing**
- Image and video rendering
- Real-time video transcoding
- Ray tracing and path tracing
- Audio signal processing

**Data processing**
- Large-scale data transformation
- Parallel database operations
- Pattern recognition in massive datasets
- Signal processing and feature extraction

**Engineering**
- Finite element analysis
- Structural stress simulations
- Computational geometry algorithms
- Electronic design automation (EDA)
These workloads share the characteristics that make this benchmark GPU-friendly:
- Operations can be done independently in parallel
- High arithmetic intensity (many calculations per memory access)
- Regular computational patterns with minimal branching
- Utilization of operations that GPUs have specialized hardware for
- Large enough problem size to amortize the overhead of data transfer
You can modify the following constants in the source code to adapt to your hardware:
- `TOTAL_ITERATIONS`: Overall problem size
- `THREADS_PER_BLOCK`: Number of threads per CUDA block
- `ELEMENTS_PER_THREAD`: Workload per thread
- `COMPUTE_INTENSITY`: Operations per element (higher = more compute-bound)
The benchmark automatically distributes workload across all available GPUs, with special attention to:
- Balancing work based on GPU compute capabilities
- Enabling peer-to-peer transfers when supported
- Concurrent execution across all devices
- Aggregating results for final throughput calculation