Complete PyTorch implementations of GPT- and LLaMA-style language models, including all sub-components.
```
torch/
├── gpt/    # GPT-1 style implementation
└── llama/  # LLaMA-1/2 implementation
```
GPT (torch/gpt/):
- Multi-head self-attention with causal masking (see the sketch after this list)
- Learned positional embeddings
- LayerNorm, feedforward blocks
- Training loop with loss estimation
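The core of the GPT block is the causally masked multi-head self-attention listed above. Below is a minimal sketch of how such a layer can be written in PyTorch; the module and argument names (`CausalSelfAttention`, `n_embd`, `n_head`, `block_size`) are illustrative and may not match the names used in `torch/gpt`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size, dropout=0.2):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # joint query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        self.dropout = nn.Dropout(dropout)
        # lower-triangular mask: each position attends only to itself and earlier positions
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split channels into heads: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```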
LLaMA (torch/llama/):
- Multi-head attention with Rotary Position Embeddings (RoPE)
- RMSNorm (instead of LayerNorm; sketched after this list alongside SwiGLU)
- SwiGLU feedforward network
- Top-p sampling for generation
- SentencePiece tokenizer
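Two of the components above are the main architectural departures from GPT: RMSNorm and the SwiGLU feedforward. A minimal sketch of both follows, assuming the conventional formulation (no mean subtraction or bias in the norm; gate, up, and down projections without biases in the feedforward). The layer names `w1`/`w2`/`w3` are illustrative and may not match the module names in `torch/llama`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scales by the root-mean-square of the features; unlike LayerNorm,
    there is no mean subtraction and no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feedforward: silu(w1 x) gates (w3 x), then w2 projects back down."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```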
GPT:

```bash
cd torch/gpt
python train.py
```

LLaMA:

```bash
cd torch/llama
python generate.py
```

| Parameter | GPT | LLaMA |
|---|---|---|
| Embedding dim | 384 | 4096 |
| Hidden dim | - | 11008 |
| Heads | 6 | 32 |
| Layers | 6 | 32 |
| Context length | 256 | 2048 |
| Dropout | 0.2 | 0.0 |
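For reference, the table above corresponds to configuration objects roughly like the following sketch; the field names here are assumptions and are not necessarily the attribute names used in this repo.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_embd: int = 384        # embedding dim
    n_head: int = 6
    n_layer: int = 6
    block_size: int = 256    # context length (tokens)
    dropout: float = 0.2     # no separate feedforward hidden dim is listed for GPT

@dataclass
class LLaMAConfig:
    dim: int = 4096          # embedding dim
    hidden_dim: int = 11008  # SwiGLU hidden dim
    n_heads: int = 32
    n_layers: int = 32
    max_seq_len: int = 2048  # context length (tokens)
    dropout: float = 0.0
```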
- Attention Is All You Need - Vaswani et al., 2017
- Layer Normalization - Ba et al., 2016
- Root Mean Square Layer Normalization - Zhang & Sennrich, 2019
- RoFormer: Enhanced Transformer with Rotary Position Embedding - Su et al., 2021
- LLaMA: Open and Efficient Foundation Language Models - Touvron et al., 2023
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models - Touvron et al., 2023