GPU kernel implementations in CUDA and Triton, plus operations assembled from PyTorch primitives. This is a learning project exploring different optimization techniques for common GPU operations.
popcorn/
├── cuda/ # CUDA kernels
├── tl/ # Triton kernels
├── torch_op/ # PyTorch implementations
└── validation/ # Kernel correctness validation scripts
CUDA kernels (in cuda/kernels/):
- Vector addition
- Matrix multiplication (+ SGEMM)
- 1D convolution
- 2D convolution
- Sum reduction
- Softmax
- Fused QKV projection
- RoPE
Each operation has multiple implementations demonstrating different optimization techniques: a naive baseline, shared memory usage, memory coalescing, warp-level primitives, cooperative groups, and so on.
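As a concrete illustration of two of these techniques, here is a minimal, self-contained sketch (not taken from this repo; kernel and variable names are illustrative) that contrasts a shared-memory tree reduction with a variant that reduces each warp in registers via the warp-level primitive `__shfl_down_sync`:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Baseline: tree reduction done entirely in shared memory.
__global__ void reduce_shared(const float* in, float* out, int n) {
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    smem[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, smem[0]);
}

// Warp-level variant: each warp reduces in registers with __shfl_down_sync,
// then the first warp combines the per-warp partial sums.
__global__ void reduce_warp_shuffle(const float* in, float* out, int n) {
    extern __shared__ float warp_sums[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    float val = (i < n) ? in[i] : 0.0f;
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if ((tid & 31) == 0) warp_sums[tid >> 5] = val;  // lane 0 stores its warp's sum
    __syncthreads();
    if (tid < 32) {  // first warp reduces the per-warp partials
        val = (tid < blockDim.x / 32) ? warp_sums[tid] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        if (tid == 0) atomicAdd(out, val);
    }
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;  // must be a power of two for the tree reduction
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> host(n, 1.0f);  // all ones, so the expected sum is n
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    float result = 0.0f;

    cudaMemset(d_out, 0, sizeof(float));
    reduce_shared<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("shared-memory reduction: %.0f (expected %d)\n", result, n);

    cudaMemset(d_out, 0, sizeof(float));
    reduce_warp_shuffle<<<blocks, threads, (threads / 32) * sizeof(float)>>>(d_in, d_out, n);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp-shuffle reduction:  %.0f (expected %d)\n", result, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The shuffle variant keeps most of the reduction in registers and needs far fewer `__syncthreads()` barriers, which is the usual motivation for warp-level primitives. Compile with something like `nvcc -o reduce_sketch reduce_sketch.cu` (hypothetical filename); the actual kernels in cuda/kernels/ may differ.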
Triton kernels (in tl/kernels/):
- Vector addition
- Softmax
- Layer normalization
- Matrix multiplication
PyTorch implementations (in torch_op/):
- Conv1d
- Conv2d
- Self-Attention
- Layer Normalization
- RMS Normalization
- RoPE
See cuda/README.md for detailed instructions on building and running CUDA benchmarks.
Quick start:
cd cuda
make # compile all benchmarks
./benchmarks/bench_matmul 2 1024 # run tiled matmul on 1024x1024 matrices
./benchmarks/bench_reduction 7 1048576 # run cooperative groups reduction
See tl/README.md for detailed instructions on building and running Triton benchmarks.
Quick start:
cd tl
python -m benchmarks.bench_softmax
To run tests:
cd torch_op
python -m pytest __tests__/test_rope.py # run RoPE tests
Project goals:
- Learn GPU programming and optimization techniques
- Compare custom implementations against optimized libraries (cuBLAS, cuDNN); see the baseline sketch after this list
- Implement the same operations in different frameworks (CUDA, Triton, PyTorch)
- Document performance characteristics and optimization strategies
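As a sketch of how such a comparison against cuBLAS might look (this is not part of the repo; matrix size, iteration count, and file name are assumptions), a baseline SGEMM can be timed with CUDA events and set against a custom matmul kernel's timing:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1024;                  // assumed square problem size
    const int iters = 10;                // assumed iteration count
    const float alpha = 1.0f, beta = 0.0f;

    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dB, N * N * sizeof(float));
    cudaMalloc(&dC, N * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so the timed loop excludes one-time setup costs.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        // Column-major C = alpha * A * B + beta * C
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cuBLAS SGEMM average: %.3f ms over %d runs\n", ms / iters, iters);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

Build with something like `nvcc baseline_sgemm.cu -lcublas` (hypothetical file name) and compare the reported time against `./benchmarks/bench_matmul` at the same size.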