This project demonstrates the performance advantages of GPU computing using CUDA, specifically optimized for compute-intensive workloads on modern NVIDIA GPUs.
- Multi-GPU support with automatic workload distribution
- Compute-intensive benchmark focused on transcendental functions
- OpenMP parallel implementation for CPU comparison
- Peer-to-peer GPU communication when supported
- Performance metrics including GFLOP/s calculations
- NVIDIA GPU(s) with CUDA support (optimized for RTX 30-series)
- CUDA Toolkit 11.0+
- GCC with OpenMP support
- C++14 compatible compiler
```bash
# Compile with optimizations for RTX 3080 Ti
make

# Run the benchmark
./cuda_benchmark

# Clean build artifacts
make clean
```
This benchmark is specifically designed to highlight GPU advantages by:
- Using transcendental functions (sin, cos, sqrt) that benefit from GPU's special function units
- Maintaining high arithmetic intensity with multiple operations per memory access
- Minimizing thread divergence with uniform workloads
- Utilizing asynchronous operations and non-blocking streams
The benchmark reports:
- Execution time for both GPU and CPU implementations
- Speedup ratio comparing GPU vs CPU performance
- Computational throughput in GFLOP/s
- Summary of results from each processing unit
On a system with dual NVIDIA RTX 3080 Ti GPUs and a 64-thread CPU:
```
Found 2 CUDA-capable device(s)
CPU has 64 threads available for OpenMP

Running computation on 2 GPU(s)...
Device 0: NVIDIA GeForce RTX 3080 Ti
- Launching with 640 blocks, 256 threads per block
- Processing 50000000 elements
Device 1: NVIDIA GeForce RTX 3080 Ti
- Launching with 640 blocks, 256 threads per block
- Processing 50000000 elements
- Kernel execution time: 0.0172 seconds
- Kernel execution time: 0.0044 seconds
GPU 0 sum: 18000024.267578
GPU 1 sum: 18000024.267578
Total GPU execution time: 0.3693 seconds
GPU compute throughput: 108.30 GFLOP/s

Running CPU version with 64 threads...
CPU execution time: 1.5310 seconds

--- Results ---
Multi-GPU execution time: 0.3693 seconds
CPU execution time: 1.5310 seconds
Speedup: 4.15x
GPU throughput: 108.30 GFLOP/s
CPU throughput: 26.13 GFLOP/s
```
These results show a significant performance advantage for GPU computation of transcendental functions. Note that per-kernel execution time (4–17 ms) is far lower than the total GPU execution time, which indicates room for further optimization by reducing setup and transfer overhead.
This type of GPU-accelerated compute would thrive in several domains where high computational intensity and parallelism are required:
**Scientific computing**
- Molecular dynamics simulations
- Computational fluid dynamics
- Weather and climate modeling
- Quantum physics simulations

**Finance**
- Monte Carlo simulations for risk assessment
- Options pricing and derivatives calculations
- High-frequency trading algorithms
- Portfolio optimization

**Machine learning**
- Neural network training and inference
- Deep learning model optimization
- Reinforcement learning environments
- Large language model inference

**Cryptography**
- Cryptocurrency mining
- Hash calculations
- Cryptographic algorithm benchmarking
- Zero-knowledge proof computations

**Media processing**
- Image and video rendering
- Real-time video transcoding
- Ray tracing and path tracing
- Audio signal processing

**Data processing**
- Large-scale data transformation
- Parallel database operations
- Pattern recognition in massive datasets
- Signal processing and feature extraction

**Engineering**
- Finite element analysis
- Structural stress simulations
- Computational geometry algorithms
- Electronic design automation (EDA)
These workloads share the characteristics that make this benchmark GPU-friendly:
- Operations can be done independently in parallel
- High arithmetic intensity (many calculations per memory access)
- Regular computational patterns with minimal branching
- Utilization of operations that GPUs have specialized hardware for
- Large enough problem size to amortize the overhead of data transfer
You can modify the following constants in the source code to adapt to your hardware:
- `TOTAL_ITERATIONS`: Overall problem size
- `THREADS_PER_BLOCK`: Number of threads per CUDA block
- `ELEMENTS_PER_THREAD`: Workload per thread
- `COMPUTE_INTENSITY`: Operations per element (higher = more compute-bound)
The benchmark automatically distributes workload across all available GPUs, with special attention to:
- Balancing work based on GPU compute capabilities
- Enabling peer-to-peer transfers when supported
- Concurrent execution across all devices
- Aggregating results for final throughput calculation