From custom GPU kernels to the Llama architecture – understanding ML at every layer
Quick Start • Features • Architecture • Benchmarks • Examples
Rusty is a high-performance machine learning framework built entirely from scratch in Rust. It provides GPU-accelerated inference and training with a focus on transformer architectures like Llama.
What makes this project unique:
- Custom GPU Kernels – Hand-written WGSL compute shaders, not relying on cuBLAS or external libraries
- Complete Llama Architecture – Multi-head attention with RoPE, SwiGLU MLP, RMSNorm, KV cache
- LoRA Fine-tuning – Parameter-efficient training on consumer hardware
- Cross-Platform GPU – Runs on Metal (Apple Silicon) and Vulkan (Windows/Linux)
Run a GPU demo in 30 seconds:
```bash
git clone https://github.com/puranikyashaswin/rusty.git
cd rusty
cargo run --example basic_tensor --release -p rusty
```

Expected output:

```
Rusty ML - Basic Tensor Example
[GPU] Apple M2 (Metal)
[INIT] Creating tensors...
Tensor A: [32, 32]
Tensor B: [32, 32]
[COMPUTE] Performing matrix multiplication...
Result shape: [32, 32]
First few values: [10.65, 10.72, 10.79, 10.87, 10.94]
[DONE] All operations completed successfully!
```
Custom WGSL compute shaders optimized for ML workloads:
| Category | Kernels |
|---|---|
| Linear Algebra | Tiled MatMul, RoPE, RMSNorm |
| Activations | SiLU, Softmax, ReLU |
| Training | AdamW, SGD, Gradient Clipping |
| Quantization | Int8 Dequantization, FP16 Casting |
| Attention | Flash Attention, Scaled Dot-Product |
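For a taste of the kernel style, here is a minimal elementwise SiLU compute shader in WGSL, embedded as a Rust string the way wgpu-based pipelines commonly ship shaders. This is an illustrative sketch, not the framework's actual kernel source:

```rust
// Illustrative sketch only -- not rusty-backend's actual kernel source.
// A minimal elementwise SiLU compute shader in WGSL, embedded as a Rust
// string constant as wgpu-based backends commonly do.
const SILU_SHADER: &str = r#"
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i < arrayLength(&input)) {
        let x = input[i];
        // SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
        output[i] = x / (1.0 + exp(-x));
    }
}
"#;
```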
Production-ready building blocks:
- Embedding – Token embedding with vocabulary lookup
- Linear – Dense layers with optional LoRA adapters
- Attention – Multi-head attention with rotary embeddings
- MLP – SwiGLU feedforward network (see the sketch after this list)
- LlamaBlock – Complete transformer block
- LlamaModel – Full model with generation support
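As referenced above, a CPU sketch of what the SwiGLU feedforward computes, shown elementwise on already-projected activations (illustrative only; the framework runs this as WGSL kernels, and the down projection is omitted here):

```rust
// CPU reference sketch of the SwiGLU feedforward inside the MLP block.
// Illustrative only: `gate` and `up` are assumed to be the gate- and
// up-projected activations (x*W_gate and x*W_up); W_down is omitted.
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    // SwiGLU: SiLU(x*W_gate) multiplied elementwise with (x*W_up)
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```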
Training features:
- Automatic differentiation with gradient tape
- GPU-accelerated AdamW optimizer
- Mixed precision training (FP16)
- Gradient accumulation and clipping
- LoRA for parameter-efficient fine-tuning (sketched below)
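On the last point: LoRA freezes the base weight W and learns a low-rank update, y = W·x + (alpha/r)·B·(A·x), where B is d_out×r, A is r×d_in, and r is small. A minimal CPU sketch of the idea (names and layout are illustrative, not rusty-graph's actual API):

```rust
// Minimal CPU sketch of a LoRA-adapted linear layer.
// Illustrative only; rusty-graph's Linear-with-LoRA API may differ.
struct LoraLinear {
    w: Vec<Vec<f32>>, // frozen base weight, d_out x d_in
    a: Vec<Vec<f32>>, // trainable down-projection, r x d_in
    b: Vec<Vec<f32>>, // trainable up-projection, d_out x r
    alpha: f32,       // LoRA scaling factor
}

fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum())
        .collect()
}

impl LoraLinear {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let base = matvec(&self.w, x);                       // W * x (frozen)
        let low_rank = matvec(&self.b, &matvec(&self.a, x)); // B * (A * x)
        let scale = self.alpha / self.a.len() as f32;        // alpha / r
        base.iter().zip(&low_rank).map(|(y, d)| y + scale * d).collect()
    }
}
```

Since r is much smaller than the layer dimensions, only A and B receive gradients, which is what lets fine-tuning fit on consumer hardware.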
Workspace architecture:

```
┌─────────────────────────────────────────────────────────────────┐
│                            rusty-cli                            │
│                     Command Line Interface                      │
├────────────────────────────────┬────────────────────────────────┤
│         rusty-trainer          │          rusty-loader          │
│         Training Loops         │    Safetensors + Tokenizer     │
├────────────────────────────────┴────────────────────────────────┤
│                           rusty-graph                           │
│             Neural Networks: Attention, MLP, Llama              │
├─────────────────────────────────────────────────────────────────┤
│                         rusty-autograd                          │
│             Automatic Differentiation + Optimizers              │
├─────────────────────────────────────────────────────────────────┤
│                          rusty-backend                          │
│                GPU Compute Engine + WGSL Kernels                │
├─────────────────────────────────────────────────────────────────┤
│                         Metal / Vulkan                          │
│               Apple M1/M2/M3, Vulkan-capable GPUs               │
└─────────────────────────────────────────────────────────────────┘
```
| Crate | Description |
|---|---|
| rusty-backend | GPU compute engine with custom WGSL shaders |
| rusty-graph | Neural network layers (Attention, MLP, LlamaBlock) |
| rusty-autograd | Automatic differentiation and optimizers |
| rusty-loader | Safetensors and tokenizer loading |
| rusty-trainer | Training loops with mixed precision |
| rusty-cli | Command-line interface |
Benchmarked on Apple M2 (Metal backend):
| Operation | Size | Throughput |
|---|---|---|
| MatMul | 4096×4096 | 121 GFLOPS |
| MatMul | 2048×2048 | 114 GFLOPS |
| MatMul | 1024×1024 | 109 GFLOPS |
| Softmax | 2048×2048 | 850M elem/s |
| RMSNorm | 4096×2048 | 920M elem/s |
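A note on the units (an assumed convention; the benchmarks crate may measure differently): an N×N matmul performs about 2·N³ floating-point operations, so the 4096×4096 case is roughly 137 GFLOP of work, or about 1.1 s per multiply at 121 GFLOPS.

```rust
// Conventional GFLOPS accounting for an N x N matrix multiply:
// 2 * N^3 FLOPs (one multiply + one add per inner-product term).
// Assumed convention; not necessarily how the benchmarks crate measures.
fn matmul_gflops(n: u64, elapsed_secs: f64) -> f64 {
    (2 * n * n * n) as f64 / elapsed_secs / 1e9
}
```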
Run benchmarks:

```bash
cargo run -p benchmarks --release
```

Basic tensor operations:

```bash
cargo run --example basic_tensor --release -p rusty
```

Matrix multiplication, element-wise operations, and activations on GPU.
Flash attention:

```bash
cargo run --example flash_attention --release -p rusty
```

Memory-efficient attention with O(N) memory instead of O(N²).
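The O(N) memory comes from the online-softmax trick: attention scores are consumed block by block while a running maximum and normalizer are updated, so the full N×N score matrix is never materialized. A scalar CPU sketch of the update rule (illustrative; the actual WGSL kernel is tiled and batched):

```rust
// Online (streaming) softmax: computes sum_i softmax(s)_i * v_i over a
// stream of (score, value) pairs without storing all scores at once.
// Illustrative sketch of the idea behind Flash Attention's O(N) memory.
fn online_softmax_weighted_sum(scores: impl Iterator<Item = (f32, f32)>) -> f32 {
    let (mut m, mut denom, mut acc) = (f32::NEG_INFINITY, 0.0_f32, 0.0_f32);
    for (s, v) in scores {
        let m_new = m.max(s);
        let correction = (m - m_new).exp(); // rescale earlier partial sums
        denom = denom * correction + (s - m_new).exp();
        acc = acc * correction + (s - m_new).exp() * v;
        m = m_new;
    }
    acc / denom
}
```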
LoRA fine-tuning:

```bash
cargo run --example lora_finetune --release -p rusty
```

Parameter-efficient training with low-rank adapters.
Training demo:

```bash
cargo run -p rusty-cli --release -- --demo
```

Complete training loop with loss computation.
Download a model:

```bash
./scripts/download_model.sh tinyllama
```

Or manually:

```bash
pip install huggingface-hub
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./models/tinyllama
```

Prepare training data as a JSON array of prompt/response pairs:

```json
[
  {"prompt": "Who are you?", "response": "I am an AI assistant."},
  {"prompt": "What can you do?", "response": "I can answer questions and assist with tasks."}
]
```
Then fine-tune:

```bash
cargo run -p rusty-cli --release -- ./models/tinyllama ./data/train.json
```

Supported models:

| Model | Status |
|---|---|
| LLaMA / LLaMA-2 / LLaMA-3 | ✅ Supported |
| TinyLlama | ✅ Supported |
| Mistral | ✅ Supported |
| Phi / Phi-2 / Phi-3 | ✅ Supported |
| Qwen / Qwen-2 | ✅ Supported |
| Gemma / Gemma-2 | ✅ Supported |
- Rust 1.75 or later
- GPU: Apple Silicon (M1/M2/M3) or Vulkan-capable GPU
- OS: macOS, Linux, or Windows
| Component | Status |
|---|---|
| GPU Backend | ✅ Complete |
| Custom WGSL Kernels | ✅ Complete |
| Llama Architecture | ✅ Complete |
| LoRA Fine-tuning | ✅ Complete |
| Autograd + Optimizers | ✅ Complete |
| Safetensors Loading | ✅ Complete |
| Mixed Precision (FP16) | ✅ Complete |
| Flash Attention | ✅ Complete |
| CUDA Backend | Planned |
| Distributed Training | Planned |
Contributions are welcome. See CONTRIBUTING.md for guidelines.
```bash
git clone https://github.com/YOUR_USERNAME/rusty.git
git checkout -b feature/your-feature
cargo test --workspace
git commit -m "feat: description"
git push origin feature/your-feature
```

MIT License – see LICENSE.
- wgpu – Cross-platform GPU API
- safetensors – Safe tensor serialization
- Flash Attention – Memory-efficient attention
Built with Rust