
πŸš€ CUDA Programming Guide: From Basics to Advanced


🎯 Introduction

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. This guide will help you understand and implement CUDA kernels efficiently.


πŸ’» Prerequisites

  • NVIDIA GPU (Compute Capability 3.0 or higher; CUDA 12.x toolkits require 5.0+)
  • CUDA Toolkit installed
  • Basic C/C++ knowledge
  • Understanding of parallel computing concepts

πŸ” Core Concepts

Thread Hierarchy

Grid
└── Blocks
    └── Threads
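
Threads in the same block can cooperate through shared memory and barrier synchronization, while blocks execute independently across the GPU's multiprocessors. The launch configuration maps work onto this hierarchy; a minimal sketch (myKernel and N are placeholder names, not part of this guide) rounds the grid size up so every element gets a thread:

// Sketch of a 1D launch configuration; myKernel is a hypothetical kernel.
int N = 1 << 20;                    // number of elements to process
int threadsPerBlock = 256;          // a multiple of the 32-thread warp size
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up
myKernel<<<blocksPerGrid, threadsPerBlock>>>(/* kernel arguments */);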

πŸš€ Getting Started

First CUDA Program: Vector Addition

// Each thread computes one element of c = a + b.
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n) {                                    // guard: grid may be larger than n
        c[idx] = a[idx] + b[idx];
    }
}
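
The kernel needs a host-side driver to allocate device buffers, copy the inputs over, launch it, and copy the result back. A minimal sketch, with error checking omitted for brevity:

// Host-side sketch for the vectorAdd kernel above; error checks omitted.
int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_a = (float *)malloc(bytes);   // host inputs and output
float *h_b = (float *)malloc(bytes);
float *h_c = (float *)malloc(bytes);
// ... fill h_a and h_b ...

float *d_a, *d_b, *d_c;                // device buffers
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

int threads = 256;
int blocks = (n + threads - 1) / threads;
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(h_a); free(h_b); free(h_c);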

Compilation & Execution

nvcc -o vector_add vector_add.cu   # add -arch=sm_XX to target your GPU's architecture
./vector_add

🧠 Memory Hierarchy

Memory Type       Scope    Lifetime      Speed
Registers         Thread   Thread        Fastest
Shared Memory     Block    Block         Very fast
Global Memory     Grid     Application   Slow
Constant Memory   Grid     Application   Fast (cached)
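
Shared memory is the usual tool when threads in a block reuse each other's data. As one sketch (assuming the kernel is launched with exactly 256 threads per block), this block-level sum stages inputs in shared memory and reduces them in a tree:

// Sketch: each block sums 256 inputs through shared memory.
// Assumes blockDim.x == 256 (a power of two).
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];            // visible to all threads in the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;  // stage global -> shared
    __syncthreads();                       // wait until every thread has written

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];         // one partial sum per block
}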

🎯 Best Practices

  1. Memory Coalescing

    • Have consecutive threads in a warp access consecutive memory addresses
    • Use appropriately sized, aligned data types
  2. Occupancy Optimization

    • Balance register and shared-memory usage per block
    • Tune block sizes (multiples of the 32-thread warp size)
  3. Warp Efficiency

    • Minimize divergent branching within a warp
    • Utilize warp-level primitives such as shuffles (see the sketch after this list)
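
As an example of the warp-level primitives mentioned above, this sketch sums a value across the 32 lanes of a warp with __shfl_down_sync (CUDA 9 or later); warpReduceSum is an illustrative helper name:

// Sketch: warp-wide sum without shared memory, using register shuffles.
__inline__ __device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the active lanes onto the lower half.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up holding the warp's total
}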

πŸ”¬ Advanced Topics

Matrix Operations

__global__ void matrixMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // output row for this thread
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // output column for this thread

    if (row < N && col < N) {
        float sum = 0.0f;
        // Dot product of row `row` of A with column `col` of B.
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
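
The kernel above reads every element of A and B from global memory N times. A common refinement stages TILE Γ— TILE sub-blocks in shared memory so each value is fetched from global memory only once per tile pass. A sketch, assuming for brevity that N is a multiple of TILE:

#define TILE 16

__global__ void matrixMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk tiles along the shared dimension; assumes N % TILE == 0.
    for (int t = 0; t < N / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // tiles fully loaded before use

        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // finish with tiles before overwriting
    }
    C[row * N + col] = sum;
}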

Performance Monitoring

nvprof ./your_program          # legacy profiler; not supported on the newest GPUs
nsys profile ./your_program    # Nsight Systems, the current replacement

For per-kernel metrics, use Nsight Compute (ncu ./your_program).

πŸ“ˆ Optimization Tips

  1. Memory Transfer

    • Minimize host-device transfers; batch small copies where possible
    • Use pinned (page-locked) memory for higher bandwidth (see the sketch after this list)
  2. Kernel Configuration

    • Choose block sizes that keep the multiprocessors fully occupied
    • Respect hardware limits on threads, registers, and shared memory per block
  3. Algorithm Design

    • Expose as much parallelism as possible
    • Reduce sequential dependencies between steps
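
For item 1, a minimal sketch of a transfer through pinned (page-locked) host memory; the buffer and stream names are illustrative:

// Sketch: pinned host memory gives higher transfer bandwidth and lets
// cudaMemcpyAsync overlap with kernel execution on other streams.
size_t bytes = (1 << 20) * sizeof(float);
float *h_buf, *d_buf;
cudaMallocHost(&h_buf, bytes);            // pinned host allocation (vs. malloc)
cudaMalloc(&d_buf, bytes);
cudaStream_t stream;
cudaStreamCreate(&stream);
// ... fill h_buf on the host ...
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);            // wait for the async copy to finish
cudaFreeHost(h_buf);
cudaFree(d_buf);
cudaStreamDestroy(stream);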
