# CUDA Programming Guide: From Basics to Advanced
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. This guide walks you through understanding and writing efficient CUDA kernels.
## Prerequisites

- NVIDIA GPU (Compute Capability 3.0 or higher)
- CUDA Toolkit installed
- Basic C/C++ knowledge
- Understanding of parallel computing concepts
## Thread Hierarchy

```text
Grid
└── Blocks
    └── Threads
```
Each thread computes one output element, using its block and thread indices to find its position:

```cuda
// Vector addition: one thread per element
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    // Global index of this thread across the whole grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // guard: the grid may be larger than n
        c[idx] = a[idx] + b[idx];
    }
}
```
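For context, here is a minimal host-side sketch that allocates device memory, copies the inputs over, launches `vectorAdd`, and copies the result back. The array size and block size are illustrative choices, not prescribed by this guide.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;              // 1M elements (illustrative)
    const size_t bytes = n * sizeof(float);

    // Host buffers
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device buffers
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Round the grid up so every element gets a thread
    const int blockSize = 256;          // a common starting point
    const int gridSize = (n + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);      // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```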
Compile and run with `nvcc`, the NVIDIA CUDA compiler:

```bash
nvcc -o vector_add vector_add.cu
./vector_add
```
## Memory Hierarchy

| Memory Type     | Scope  | Lifetime    | Speed         |
|-----------------|--------|-------------|---------------|
| Registers       | Thread | Thread      | Fastest       |
| Shared Memory   | Block  | Block       | Very fast     |
| Global Memory   | Grid   | Application | Slow          |
| Constant Memory | Grid   | Application | Fast (cached) |
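To show where each of these memory spaces appears in code, here is a hedged sketch; the kernel is contrived and exists only to illustrate the declarations. It assumes a block size of at most 256 threads.

```cuda
#include <cuda_runtime.h>

__constant__ float scale;  // constant memory: written by the host, cached on device

__global__ void memorySpaces(const float *in, float *out, int n) {
    __shared__ float tile[256];  // shared memory: one copy per block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Every thread reaches __syncthreads(), even those past the end of the array
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;  // in/out point into global memory
    __syncthreads();  // make shared-memory writes visible block-wide

    float x = tile[threadIdx.x] * scale;  // local scalar: normally held in a register
    if (idx < n) {
        out[idx] = x;
    }
}
```

On the host side, `scale` would be set with `cudaMemcpyToSymbol(scale, &value, sizeof(float))`.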
## Optimization Techniques

- **Memory Coalescing**
  - Align memory accesses (the tiled matrix-multiply sketch later in this guide shows coalesced loads)
  - Use appropriate data types
- **Occupancy Optimization**
  - Balance resource usage
  - Optimize block sizes
- **Warp Efficiency**
  - Minimize divergent branching
  - Utilize warp-level primitives (see the warp-shuffle sketch after this list)
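To make the warp-level point concrete, here is a hedged sketch of a warp-wide sum using the `__shfl_down_sync` primitive (CUDA 9+). The reduction pattern is standard; the surrounding kernel and the assumption of full warps (block size a multiple of 32) are illustrative.

```cuda
#include <cuda_runtime.h>

// Sum a value across the 32 lanes of a warp without touching shared memory.
__inline__ __device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the active lanes onto the lower half.
    // 0xffffffff: all 32 lanes participate (assumes full warps).
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;  // lane 0 ends up holding the warp's total
}

__global__ void sumKernel(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range lanes contribute 0 instead of branching around the
    // shuffle, so the warp stays convergent.
    float v = (idx < n) ? in[idx] : 0.0f;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0) {  // one atomic per warp rather than per thread
        atomicAdd(out, v);
    }
}
```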
## Example: Matrix Multiplication

A naive kernel assigns one thread to each element of the output matrix:

```cuda
// C = A * B for square N x N matrices, one thread per output element
__global__ void matrixMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
```
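The naive kernel re-reads each element of A and B from global memory N times. A common optimization stages tiles of both matrices in shared memory; the sketch below assumes N is a multiple of the tile width and a `TILE x TILE` thread block, and omits bounds checks for brevity.

```cuda
#define TILE 16  // illustrative tile width

__global__ void matrixMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // March across A's row of tiles and down B's column of tiles
    for (int t = 0; t < N / TILE; t++) {
        // Adjacent threads read adjacent addresses: coalesced global loads
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // done with this tile before it is overwritten
    }
    C[row * N + col] = sum;
}
```

Launch with `dim3 block(TILE, TILE)` and `dim3 grid(N / TILE, N / TILE)`.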
## Profiling

```bash
nvprof ./your_program        # profile CUDA applications (legacy tool)
nsys profile ./your_program  # Nsight Systems, nvprof's successor on newer GPUs
```
## Best Practices

- **Memory Transfer**
  - Minimize host-device transfers
  - Use pinned memory for better bandwidth (see the pinned-memory sketch after this list)
- **Kernel Configuration**
  - Choose optimal block sizes (the occupancy sketch below shows one way to pick them)
  - Consider hardware limitations
- **Algorithm Design**
  - Design for parallelism
  - Reduce sequential dependencies
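As a concrete case of the transfer advice, here is a hedged sketch of pinned (page-locked) host memory, which typically raises host-device bandwidth and enables asynchronous copies. The buffer size and stream usage are illustrative.

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;  // 16 MiB (illustrative)
    float *h_buf;                  // pinned host buffer
    float *d_buf;                  // device buffer

    // cudaMallocHost returns page-locked memory the GPU can DMA into directly
    cudaMallocHost(&h_buf, bytes);
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With pinned memory this copy can genuinely overlap other work
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);  // pinned allocations are released with cudaFreeHost
    return 0;
}
```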
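For the block-size advice, the CUDA runtime can suggest a launch configuration. This is a minimal sketch of `cudaOccupancyMaxPotentialBlockSize`, reusing the `vectorAdd` kernel from earlier in the guide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime which block size maximizes occupancy for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vectorAdd, 0, 0);
    printf("suggested block size: %d (minimum grid size: %d)\n",
           blockSize, minGridSize);
    return 0;
}
```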