From custom GPU kernels to the Llama architecture – understanding ML at every layer
Quick Start • Features • Architecture • Benchmarks • Examples
Rusty is a high-performance machine learning framework built entirely from scratch in Rust. It provides GPU-accelerated inference and training with a focus on transformer architectures like Llama.
What makes this project unique:
- Custom GPU Kernels – Hand-written WGSL compute shaders, not relying on cuBLAS or external libraries
- Complete Llama Architecture – Multi-head attention with RoPE, SwiGLU MLP, RMSNorm, KV cache
- LoRA Fine-tuning – Parameter-efficient training on consumer hardware
- Cross-Platform GPU – Runs on Metal (Apple Silicon) and Vulkan (Windows/Linux)
Run a GPU demo in 30 seconds:
```bash
git clone https://github.com/puranikyashaswin/rusty.git
cd rusty
cargo run --example basic_tensor --release -p rusty
```

Expected output:

```
Rusty ML - Basic Tensor Example
[GPU] Apple M2 (Metal)
[INIT] Creating tensors...
Tensor A: [32, 32]
Tensor B: [32, 32]
[COMPUTE] Performing matrix multiplication...
Result shape: [32, 32]
First few values: [10.65, 10.72, 10.79, 10.87, 10.94]
[DONE] All operations completed successfully!
```
Custom WGSL compute shaders optimized for ML workloads:
| Category | Kernels |
|---|---|
| Linear Algebra | Tiled MatMul, RoPE, RMSNorm |
| Activations | SiLU, Softmax, ReLU |
| Training | AdamW, SGD, Gradient Clipping |
| Quantization | Int8 Dequantization, FP16 Casting |
| Attention | Flash Attention, Scaled Dot-Product |
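For a taste of the kernel style, here is a minimal elementwise SiLU compute shader in WGSL, embedded as a Rust string the way wgpu-based pipelines commonly ship shaders. This is an illustrative sketch, not the framework's actual kernel source:

```rust
// Illustrative sketch only -- not rusty-backend's actual kernel source.
// A minimal elementwise SiLU compute shader in WGSL, embedded as a Rust
// string constant as wgpu-based backends commonly do.
const SILU_SHADER: &str = r#"
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i < arrayLength(&input)) {
        let x = input[i];
        // SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
        output[i] = x / (1.0 + exp(-x));
    }
}
"#;
```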
Production-ready building blocks:
- Embedding – Token embedding with vocabulary lookup
- Linear – Dense layers with optional LoRA adapters
- Attention – Multi-head attention with rotary embeddings
- MLP – SwiGLU feedforward network (see the sketch after this list)
- LlamaBlock – Complete transformer block
- LlamaModel – Full model with generation support
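As referenced above, a CPU sketch of what the SwiGLU feedforward computes, shown elementwise on already-projected activations (illustrative only; the framework runs this as WGSL kernels, and the down projection is omitted here):

```rust
// CPU reference sketch of the SwiGLU feedforward inside the MLP block.
// Illustrative only: `gate` and `up` are assumed to be the gate- and
// up-projected activations (x*W_gate and x*W_up); W_down is omitted.
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    // SwiGLU: SiLU(x*W_gate) multiplied elementwise with (x*W_up)
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```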
Training features:
- Automatic differentiation with gradient tape
- GPU-accelerated AdamW optimizer
- Mixed precision training (FP16)
- Gradient accumulation and clipping
- LoRA for parameter-efficient fine-tuning (sketched below)
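On the last point: LoRA freezes the base weight W and learns a low-rank update, y = W·x + (alpha/r)·B·(A·x), where B is d_out×r, A is r×d_in, and r is small. A minimal CPU sketch of the idea (names and layout are illustrative, not rusty-graph's actual API):

```rust
// Minimal CPU sketch of a LoRA-adapted linear layer.
// Illustrative only; rusty-graph's Linear-with-LoRA API may differ.
struct LoraLinear {
    w: Vec<Vec<f32>>, // frozen base weight, d_out x d_in
    a: Vec<Vec<f32>>, // trainable down-projection, r x d_in
    b: Vec<Vec<f32>>, // trainable up-projection, d_out x r
    alpha: f32,       // LoRA scaling factor
}

fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum())
        .collect()
}

impl LoraLinear {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let base = matvec(&self.w, x);                       // W * x (frozen)
        let low_rank = matvec(&self.b, &matvec(&self.a, x)); // B * (A * x)
        let scale = self.alpha / self.a.len() as f32;        // alpha / r
        base.iter().zip(&low_rank).map(|(y, d)| y + scale * d).collect()
    }
}
```

Since r is much smaller than the layer dimensions, only A and B receive gradients, which is what lets fine-tuning fit on consumer hardware.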
Workspace architecture:

```
┌─────────────────────────────────────────────────────────────────┐
│                            rusty-cli                            │
│                     Command Line Interface                      │
├────────────────────────────────┬────────────────────────────────┤
│         rusty-trainer          │          rusty-loader          │
│         Training Loops         │    Safetensors + Tokenizer     │
├────────────────────────────────┴────────────────────────────────┤
│                           rusty-graph                           │
│             Neural Networks: Attention, MLP, Llama              │
├─────────────────────────────────────────────────────────────────┤
│                         rusty-autograd                          │
│             Automatic Differentiation + Optimizers              │
├─────────────────────────────────────────────────────────────────┤
│                          rusty-backend                          │
│                GPU Compute Engine + WGSL Kernels                │
├─────────────────────────────────────────────────────────────────┤
│                         Metal / Vulkan                          │
│               Apple M1/M2/M3, Vulkan-capable GPUs               │
└─────────────────────────────────────────────────────────────────┘
```
| Crate | Description |
|---|---|
| rusty-backend | GPU compute engine with custom WGSL shaders |
| rusty-graph | Neural network layers (Attention, MLP, LlamaBlock) |
| rusty-autograd | Automatic differentiation and optimizers |
| rusty-loader | Safetensors and tokenizer loading |
| rusty-trainer | Training loops with mixed precision |
| rusty-cli | Command-line interface |
Benchmarked on Apple M2 (Metal backend):
| Operation | Size | Throughput |
|---|---|---|
| MatMul | 4096×4096 | 121 GFLOPS |
| MatMul | 2048×2048 | 114 GFLOPS |
| MatMul | 1024×1024 | 109 GFLOPS |
| Softmax | 2048×2048 | 850M elem/s |
| RMSNorm | 4096×2048 | 920M elem/s |
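A note on the units (an assumed convention; the benchmarks crate may measure differently): an N×N matmul performs about 2·N³ floating-point operations, so the 4096×4096 case is roughly 137 GFLOP of work, or about 1.1 s per multiply at 121 GFLOPS.

```rust
// Conventional GFLOPS accounting for an N x N matrix multiply:
// 2 * N^3 FLOPs (one multiply + one add per inner-product term).
// Assumed convention; not necessarily how the benchmarks crate measures.
fn matmul_gflops(n: u64, elapsed_secs: f64) -> f64 {
    (2 * n * n * n) as f64 / elapsed_secs / 1e9
}
```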
Run benchmarks:

```bash
cargo run -p benchmarks --release
```

Basic tensor operations:

```bash
cargo run --example basic_tensor --release -p rusty
```

Matrix multiplication, element-wise operations, and activations on GPU.
Flash attention:

```bash
cargo run --example flash_attention --release -p rusty
```

Memory-efficient attention with O(N) memory instead of O(N²).
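The O(N) memory comes from the online-softmax trick: attention scores are consumed block by block while a running maximum and normalizer are updated, so the full N×N score matrix is never materialized. A scalar CPU sketch of the update rule (illustrative; the actual WGSL kernel is tiled and batched):

```rust
// Online (streaming) softmax: computes sum_i softmax(s)_i * v_i over a
// stream of (score, value) pairs without storing all scores at once.
// Illustrative sketch of the idea behind Flash Attention's O(N) memory.
fn online_softmax_weighted_sum(scores: impl Iterator<Item = (f32, f32)>) -> f32 {
    let (mut m, mut denom, mut acc) = (f32::NEG_INFINITY, 0.0_f32, 0.0_f32);
    for (s, v) in scores {
        let m_new = m.max(s);
        let correction = (m - m_new).exp(); // rescale earlier partial sums
        denom = denom * correction + (s - m_new).exp();
        acc = acc * correction + (s - m_new).exp() * v;
        m = m_new;
    }
    acc / denom
}
```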
LoRA fine-tuning:

```bash
cargo run --example lora_finetune --release -p rusty
```

Parameter-efficient training with low-rank adapters.
Training demo:

```bash
cargo run -p rusty-cli --release -- --demo
```

Complete training loop with loss computation.
Download a model:

```bash
./scripts/download_model.sh tinyllama
```

Or manually:

```bash
pip install huggingface-hub
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./models/tinyllama
```

Prepare training data as a JSON array of prompt/response pairs:

```json
[
  {"prompt": "Who are you?", "response": "I am an AI assistant."},
  {"prompt": "What can you do?", "response": "I can answer questions and assist with tasks."}
]
```
Then fine-tune:

```bash
cargo run -p rusty-cli --release -- ./models/tinyllama ./data/train.json
```

Supported models:

| Model | Status |
|---|---|
| LLaMA / LLaMA-2 / LLaMA-3 | ✅ Supported |
| TinyLlama | ✅ Supported |
| Mistral | ✅ Supported |
| Phi / Phi-2 / Phi-3 | ✅ Supported |
| Qwen / Qwen-2 | ✅ Supported |
| Gemma / Gemma-2 | ✅ Supported |
- Rust 1.75 or later
- GPU: Apple Silicon (M1/M2/M3) or Vulkan-capable GPU
- OS: macOS, Linux, or Windows
| Component | Status |
|---|---|
| GPU Backend | ✅ Complete |
| Custom WGSL Kernels | ✅ Complete |
| Llama Architecture | ✅ Complete |
| LoRA Fine-tuning | ✅ Complete |
| Autograd + Optimizers | ✅ Complete |
| Safetensors Loading | ✅ Complete |
| Mixed Precision (FP16) | ✅ Complete |
| Flash Attention | ✅ Complete |
| CUDA Backend | Planned |
| Distributed Training | Planned |
Contributions are welcome. See CONTRIBUTING.md for guidelines.
```bash
git clone https://github.com/YOUR_USERNAME/rusty.git
git checkout -b feature/your-feature
cargo test --workspace
git commit -m "feat: description"
git push origin feature/your-feature
```

MIT License – see LICENSE.
- wgpu – Cross-platform GPU API
- safetensors – Safe tensor serialization
- Flash Attention – Memory-efficient attention
Built with Rust