An educational, from-scratch implementation of OpenAI's GPT-OSS model in Python. This project demonstrates how large language models work under the hood. Check out the accompanying blog post at ProjektJoe.
This repository contains a complete implementation of the GPT-OSS transformer architecture in Python, including:
- Custom BFloat16 implementation in C++ for numerical precision
- Mixture of Experts (MoE)
- Rotary Position Embeddings (RoPE) with NTK-aware scaling
- Grouped Query Attention (GQA) with attention sinks and sliding-window attention
- Functional SwiGLU, RMSNorm, Softmax, and Linear layers
- Educational Focus: Clear, commented code designed for learning
- Numerical Accuracy: Matches PyTorch reference implementation
- Comprehensive Tests: Token-by-token validation against reference model
- Modular Design: Easy to understand and modify
- Flexible Installation: Core functionality without PyTorch dependency
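To make the RoPE bullet above concrete, here is a minimal NumPy sketch of rotary embeddings with an NTK-aware base adjustment. The scaling formula and constants are illustrative of the general technique, not the exact values this repository uses.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0, scaling_factor=1.0):
    """Inverse frequencies for RoPE. NTK-aware scaling stretches the
    base so positions rotate more slowly at long context lengths
    (illustrative formula, not this repo's exact one)."""
    base = base * scaling_factor ** (head_dim / (head_dim - 2))
    return 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, position, inv_freq):
    """Rotate each even/odd pair of dimensions of a head vector by a
    position-dependent angle."""
    angles = position * inv_freq              # (head_dim/2,)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                 # paired dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8, dtype=np.float32)
q_rot = apply_rope(q, position=3, inv_freq=rope_frequencies(8))
```

Because each pair is rotated by an orthogonal 2D rotation, the vector's norm is preserved, and position 0 is the identity.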
- Ubuntu 22.04 or Ubuntu 24.04
📚 Detailed installation guide: See INSTALL.md for comprehensive installation instructions and troubleshooting.
- Clone the repository

  ```bash
  git clone https://github.com/projektjoe/gptoss.git
  cd gptoss
  ```
- Install system dependencies

  ```bash
  sudo apt update
  sudo apt install -y \
      python3-dev \
      libopenblas-dev \
      build-essential \
      libdnnl-dev \
      cmake
  ```
- Set up the Python environment and install

  - Install uv

    ```bash
    # Install uv (fast Python package installer)
    curl -LsSf https://astral.sh/uv/install.sh | sh
    ```

  - Restart your terminal

  - Create a venv and install the project

    ```bash
    # Create and activate virtual environment
    uv venv .venv
    source .venv/bin/activate

    # Install package (this will automatically build C++ extensions)
    uv pip install -e .
    ```
- Download the GPT-OSS-20B model weights and place them in the root folder. You can get them from the Hugging Face Hub, for example via the Hugging Face CLI:

  ```bash
  hf download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
  ```
- Run the main script

  ```bash
  python main.py
  ```
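Conceptually, the main script performs standard autoregressive generation. The sketch below shows that loop with a toy stand-in model and greedy decoding; the function and variable names are hypothetical, not this repository's actual API.

```python
import numpy as np

def generate(model, token_ids, n_new_tokens):
    """Greedy autoregressive decoding: repeatedly pick the most likely
    next token and append it to the sequence."""
    for _ in range(n_new_tokens):
        logits = model(token_ids)           # (seq_len, vocab_size)
        next_id = int(np.argmax(logits[-1]))  # most likely next token
        token_ids = token_ids + [next_id]
    return token_ids

# Toy "model": always assigns all probability mass to token 7.
toy = lambda ids: np.eye(10)[[7] * len(ids)]
out = generate(toy, [1, 2], n_new_tokens=3)  # → [1, 2, 7, 7, 7]
```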
If you want to use the PyTorch layers to match the official OpenAI implementation for exact numerical accuracy:
- Do all of the steps above
- Install torch, then reinstall the project with build isolation disabled

  ```bash
  # Method 1: Install torch first, then use --no-build-isolation
  uv pip install torch
  uv pip install scikit_build_core
  uv pip install --no-build-isolation -e .

  # Method 2: Set an environment variable to disable build isolation
  export UV_NO_BUILD_ISOLATION=1
  uv pip install -e ".[torch]"
  ```
- Run the main script, which will now use the PyTorch linear layer instead of ours.

  ```bash
  python main.py
  ```
- You can also run the tests, which verify numerical consistency between our implementation and the official OpenAI implementation via PyTorch.

The tests run in two modes. With VERIFY_LAYER_BY_LAYER = True, each of our layers receives the official implementation's output as its input, so every layer is tested in isolation. With VERIFY_LAYER_BY_LAYER = False, the entire model runs end to end, so any error propagates to the layers that follow.

```bash
python test/test.py
```
The test suite performs token-by-token comparison with PyTorch's reference implementation, validating:
- Embedding lookup
- RMSNorm computations
- QKV projections
- RoPE application
- Attention mechanisms
- MoE routing and expert computation
- Final logits
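The per-layer checks follow a simple pattern: compare our output against the reference within a tolerance and report the deviation on failure. This is a simplified sketch with illustrative tolerances, not the test suite's actual thresholds.

```python
import numpy as np

def check_layer(name, ours, reference, token, rtol=1e-3, atol=1e-5):
    """Compare one layer's output for a single token against the
    reference and report the maximum absolute error on failure."""
    if np.allclose(ours, reference, rtol=rtol, atol=atol):
        print(f"[OK] {name} token {token} passed.")
        return True
    max_err = np.max(np.abs(ours - reference))
    print(f"[FAIL] {name} token {token}: max abs error {max_err:.3e}")
    return False

ref = np.array([0.1, -0.2, 0.3])
check_layer("block[0].attn.norm", ref + 1e-7, ref, token=0)
```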
GPT-OSS is a 20 billion parameter transformer language model featuring:
- Architecture: Decoder-only transformer
- Layers: 36 transformer blocks
- Hidden Size: 2880
- Attention: Grouped-query attention with sliding window
- FFN: Mixture of 32 experts with top-4 routing
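The top-4 expert routing named above can be sketched as follows. This is a minimal NumPy illustration of top-k routing in general; the real weight shapes and normalization convention live in this repo's MoE code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(x, router_w, experts, top_k=4):
    """Route a hidden vector to its top-k experts and combine their
    outputs, weighted by a softmax over the selected router logits."""
    logits = router_w @ x                    # (num_experts,) scores
    top = np.argsort(logits)[-top_k:]        # indices of top-k experts
    weights = softmax(logits[top])           # renormalize over the top-k
    return sum(w * experts[i](x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
hidden, n_experts = 8, 32
router_w = rng.normal(size=(n_experts, hidden))
# Toy experts: each is just a fixed random linear map.
mats = [rng.normal(size=(hidden, hidden)) for _ in range(n_experts)]
experts = [lambda v, m=m: m @ v for m in mats]
y = moe_forward(rng.normal(size=hidden), router_w, experts)
```

Only 4 of the 32 experts run per token, which is why MoE layers add parameters without a proportional increase in compute.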
Input Token
↓
Embedding (vocab_size → hidden_size)
↓
┌─────────────────────────────────────┐
│ Transformer Block (×36) │
│ ┌───────────────────────────────┐ │
│ │ Attention │ │
│ │ • RMSNorm │ │
│ │ • QKV Projection │ │
│ │ • RoPE │ │
│ │ • Scaled Dot-Product │ │
│ │ • Output Projection │ │
│ │ • Residual Connection │ │
│ └───────────────────────────────┘ │
│ ┌───────────────────────────────┐ │
│ │ Mixture of Experts │ │
│ │ • RMSNorm │ │
│ │ • Expert Routing (top-4) │ │
│ │ • Expert Computation │ │
│ │ • Weighted Combination │ │
│ │ • Residual Connection │ │
│ └───────────────────────────────┘ │
└─────────────────────────────────────┘
↓
Final RMSNorm
↓
Unembedding (hidden_size → vocab_size)
↓
Logits
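The diagram above corresponds to a forward pass of roughly this shape: each block applies pre-norm attention with a residual connection, then a pre-norm MoE feed-forward with a residual. This is a schematic with identity placeholders for the sub-layers, not this repo's actual code.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def transformer_block(x, attention, moe):
    """One block, mirroring the diagram: pre-norm attention plus
    residual, then pre-norm MoE plus residual."""
    x = x + attention(rmsnorm(x))   # attention sub-block + residual
    x = x + moe(rmsnorm(x))         # MoE sub-block + residual
    return x

# Identity placeholders stand in for the real attention/MoE layers.
h = transformer_block(np.ones((4, 8)), lambda t: t, lambda t: t)
```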
gptoss/
├── main.py # Main model implementation and generation
├── load.py # Checkpoint loading and MXFP4 dequantization
├── dtypes/ # Custom data type implementations
│ ├── bfloat16.cpp # BFloat16 array operations
│ ├── bfloat16.hpp # BFloat16 header
│ ├── linear.cpp # Optimized linear layers (oneDNN)
│ ├── linear_torch.cpp # Optional PyTorch-based linear layer
│ └── CMakeLists.txt # Build configuration
├── test/
│ └── test.py # Validation tests vs reference
├── official_implementation.py # PyTorch reference (for testing)
├── pyproject.toml # Project metadata and dependencies
└── README.md # This file
The project includes tests that validate numerical correctness.
To run the tests:

```bash
python3 test/test.py
```

Output:
```text
# Example test output
[OK] block[0].attn.norm token 0 passed.
[OK] qkv layer 0 token 0 passed.
[OK] rope q layer 0 token 0 passed.
[OK] rope k layer 0 token 0 passed.
[OK] att layer 0 token 0 passed.
[OK] linear & residual layer 0 token 0 passed.
[OK] gate layer 0 token 0 passed.
[OK] moe layer 0 token 0 passed.
```

Contributions are welcome!
Areas for contribution:
- Performance optimizations
- Additional documentation and tutorials
- Support for other platforms (macOS, Windows)
- Jupyter notebook tutorials
- Visualization tools
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for releasing GPT-OSS
