High-Performance CUDA Benchmark

This project demonstrates the performance advantages of GPU computing with CUDA, using a compute-intensive benchmark tuned for modern NVIDIA GPUs.

Key Features

  • Multi-GPU support with automatic workload distribution
  • Compute-intensive benchmark focused on transcendental functions
  • OpenMP parallel implementation for CPU comparison
  • Peer-to-peer GPU communication when supported
  • Performance metrics including GFLOP/s calculations

Requirements

  • NVIDIA GPU(s) with CUDA support (optimized for RTX 30-series)
  • CUDA Toolkit 11.0+
  • GCC with OpenMP support
  • C++14 compatible compiler

Building the Project

# Compile with optimizations for RTX 3080 Ti
make

# Run the benchmark
./cuda_benchmark

# Clean build artifacts
make clean

Performance Optimization

This benchmark is specifically designed to highlight GPU advantages by:

  1. Using transcendental functions (sin, cos, sqrt) that benefit from the GPU's special function units
  2. Maintaining high arithmetic intensity with multiple operations per memory access
  3. Minimizing thread divergence with uniform workloads
  4. Utilizing asynchronous operations and non-blocking streams
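
As a concrete illustration of these points, here is a minimal kernel sketch in the same spirit (the names, constants, and exact math are placeholders, not the project's actual code):

#include <cuda_runtime.h>
#include <math.h>

// Illustrative compute-bound kernel: each thread performs many transcendental
// operations per element it touches, so arithmetic intensity stays high and
// the single bounds check is the only branch.
__global__ void computeIntensiveKernel(float *out, int n, int computeIntensity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = (float)i * 1e-6f;
    float acc = 0.0f;
    for (int k = 0; k < computeIntensity; ++k) {
        // with -use_fast_math, sinf/cosf map onto the hardware special function units
        acc += sinf(x + (float)k) * cosf(x) + sqrtf(x + 1.0f);
    }
    out[i] = acc;  // one global store after many FLOPs per element
}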

Understanding Results

The benchmark reports:

  • Execution time for both GPU and CPU implementations
  • Speedup ratio comparing GPU vs CPU performance
  • Computational throughput in GFLOP/s
  • Summary of results from each processing unit
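
The throughput figure is a straightforward FLOP count divided by wall-clock time. A hedged sketch of the arithmetic (variable names are illustrative; the exact FLOPs-per-element constant lives in the source):

// Illustrative only: GFLOP/s = total floating-point operations / (seconds * 1e9)
double throughputGflops(long long elements, int flopsPerElement, double seconds)
{
    return (double)elements * (double)flopsPerElement / (seconds * 1e9);
}

For reference, the sample run below reports 108.30 GFLOP/s over 0.3693 seconds, which is consistent with 100 million elements at roughly 400 FLOPs each.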

Sample Results

On a system with dual NVIDIA RTX 3080 Ti GPUs and a 64-thread CPU:

Found 2 CUDA-capable device(s)
CPU has 64 threads available for OpenMP

Running computation on 2 GPU(s)...
Device 0: NVIDIA GeForce RTX 3080 Ti
  - Launching with 640 blocks, 256 threads per block
  - Processing 50000000 elements
Device 1: NVIDIA GeForce RTX 3080 Ti
  - Launching with 640 blocks, 256 threads per block
  - Processing 50000000 elements
  - Kernel execution time: 0.0172 seconds
  - Kernel execution time: 0.0044 seconds
GPU 0 sum: 18000024.267578
GPU 1 sum: 18000024.267578
Total GPU execution time: 0.3693 seconds
GPU compute throughput: 108.30 GFLOP/s

Running CPU version with 64 threads...
CPU execution time: 1.5310 seconds

--- Results ---
Multi-GPU execution time: 0.3693 seconds
CPU execution time: 1.5310 seconds
Speedup: 4.15x
GPU throughput: 108.30 GFLOP/s
CPU throughput: 26.13 GFLOP/s

These results demonstrate a significant performance advantage for GPU computation with transcendental functions. Note that the kernel execution times (4–17 ms) are far lower than the total GPU execution time (0.37 s), which indicates room for further optimization by reducing setup and transfer overhead.
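
One common way to close that gap is to keep host buffers pinned and transfers asynchronous. A minimal sketch of the pattern (illustrative code, not taken from this repository):

#include <cuda_runtime.h>

__global__ void computeKernel(float *out, int n) { /* compute-bound work */ }

int main(void)
{
    const int n = 50 * 1000 * 1000;
    float *h_out, *d_out;
    cudaMallocHost(&h_out, n * sizeof(float));  // pinned memory: needed for truly async copies
    cudaMalloc(&d_out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    computeKernel<<<640, 256, 0, stream>>>(d_out, n);  // launch returns immediately
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);   // copy overlaps other host work
    cudaStreamSynchronize(stream);  // block only when the result is actually needed

    cudaFreeHost(h_out);
    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return 0;
}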

Ideal Use Cases

This type of GPU-accelerated compute would thrive in several domains where high computational intensity and parallelism are required:

Scientific Simulations

  • Molecular dynamics simulations
  • Computational fluid dynamics
  • Weather and climate modeling
  • Quantum physics simulations

Financial Modeling

  • Monte Carlo simulations for risk assessment
  • Options pricing and derivatives calculations
  • High-frequency trading algorithms
  • Portfolio optimization

Machine Learning & AI

  • Neural network training and inference
  • Deep learning model optimization
  • Reinforcement learning environments
  • Large language model inference

Cryptography & Blockchain

  • Cryptocurrency mining
  • Hash calculations
  • Cryptographic algorithm benchmarking
  • Zero-knowledge proof computations

Media Processing

  • Image and video rendering
  • Real-time video transcoding
  • Ray tracing and path tracing
  • Audio signal processing

Data Analytics

  • Large-scale data transformation
  • Parallel database operations
  • Pattern recognition in massive datasets
  • Signal processing and feature extraction

Engineering Applications

  • Finite element analysis
  • Structural stress simulations
  • Computational geometry algorithms
  • Electronic design automation (EDA)

Key Characteristics for GPU-Suitable Workloads

  • Operations can be done independently in parallel
  • High arithmetic intensity (many calculations per memory access)
  • Regular computational patterns with minimal branching
  • Heavy use of operations for which GPUs have specialized hardware
  • Large enough problem size to amortize the overhead of data transfer

Tuning Parameters

You can modify the following constants in the source code to adapt to your hardware:

  • TOTAL_ITERATIONS: Overall problem size
  • THREADS_PER_BLOCK: Number of threads per CUDA block
  • ELEMENTS_PER_THREAD: Workload per thread
  • COMPUTE_INTENSITY: Operations per element (higher = more compute-bound)
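
For orientation, a hedged sketch of how these constants might look (the values here are placeholders; check the source for the actual defaults):

// Illustrative values only; tune for your GPU and problem size.
#define TOTAL_ITERATIONS    100000000LL  // overall problem size (elements)
#define THREADS_PER_BLOCK   256          // CUDA block size
#define ELEMENTS_PER_THREAD 64           // work assigned to each thread
#define COMPUTE_INTENSITY   100          // inner-loop operations per element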

Multi-GPU Scaling

The benchmark automatically distributes workload across all available GPUs, with special attention to:

  1. Balancing work based on GPU compute capabilities
  2. Enabling peer-to-peer transfers when supported
  3. Concurrent execution across all devices
  4. Aggregating results for final throughput calculation
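
A hedged sketch of the overall pattern (illustrative names, not the project's code): each device gets a slice of the problem, peer access is enabled where the hardware allows it, and kernel launches return immediately so all devices run concurrently before a final synchronization pass.

#include <cuda_runtime.h>

__global__ void computeKernel(float *out, long long n) { /* compute-bound work */ }

void runOnAllGpus(long long totalElements)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    long long perDevice = totalElements / deviceCount;  // even split; a real balancer
                                                        // could weight by device capability
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        for (int peer = 0; peer < deviceCount; ++peer) {  // peer-to-peer where supported
            int canAccess = 0;
            if (peer != d &&
                cudaDeviceCanAccessPeer(&canAccess, d, peer) == cudaSuccess &&
                canAccess)
                cudaDeviceEnablePeerAccess(peer, 0);
        }
        float *d_out;
        cudaMalloc(&d_out, perDevice * sizeof(float));
        computeKernel<<<640, 256>>>(d_out, perDevice);  // async launch: devices overlap
    }
    for (int d = 0; d < deviceCount; ++d) {  // wait for every device to finish
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    // (error checking and cleanup omitted for brevity)
}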
