From 4ebd73f067cf52f46b7774238107dd95d0d14224 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sun, 31 May 2026 19:26:54 +0530
Subject: [PATCH 01/45] exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor: optimize workflow pipeline and update docker configurations

* Add MIT license to Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor: optimize Dockerfile.backend for the workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor(ci): consolidate manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor(ui): convert message bubble layout to inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactor(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization (guarded by `__AVX__` and `__SSE__`) for the element-wise `add`, `add_inplace`, and `scale` operations.
- Kept scalar fallback paths for tail elements that do not fill a full vector and for platforms without these extensions.
- Explicitly defined the rule-of-five special member functions (defaulted, with `noexcept` moves) in the `Tensor` struct.
- Reduced copies and reallocations when building vectors in the core constructors by using `std::move` and `std::vector::reserve`.
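
The vectorize-then-fall-back pattern described above can be sketched roughly as follows. This is an illustration only: `add_vectors` and `scale_inplace` are hypothetical free functions over `std::vector<float>`, not the actual `Tensor` methods in `include/tensor.h`, and the real code may organize its loops differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>   // 256-bit AVX intrinsics
#elif defined(__SSE__)
#include <xmmintrin.h>   // 128-bit SSE intrinsics
#endif

// Hypothetical helper (not the real Tensor API): out[i] = a[i] + b[i].
inline std::vector<float> add_vectors(const std::vector<float>& a,
                                      const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> out(n);
    std::size_t i = 0;
#if defined(__AVX__)
    // Process 8 floats per iteration with unaligned loads/stores.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a.data() + i);
        __m256 vb = _mm256_loadu_ps(b.data() + i);
        _mm256_storeu_ps(out.data() + i, _mm256_add_ps(va, vb));
    }
#elif defined(__SSE__)
    // Process 4 floats per iteration with unaligned loads/stores.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a.data() + i);
        __m128 vb = _mm_loadu_ps(b.data() + i);
        _mm_storeu_ps(out.data() + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar path: handles the tail, or everything when no SIMD is available.
    for (; i < n; ++i) out[i] = a[i] + b[i];
    return out;
}

// Hypothetical helper (not the real Tensor API): x[i] *= s, in place.
inline void scale_inplace(std::vector<float>& x, float s) {
    const std::size_t n = x.size();
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 vs = _mm256_set1_ps(s);
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_ps(x.data() + i,
                         _mm256_mul_ps(_mm256_loadu_ps(x.data() + i), vs));
    }
#elif defined(__SSE__)
    const __m128 vs = _mm_set1_ps(s);
    for (; i + 4 <= n; i += 4) {
        _mm_storeu_ps(x.data() + i,
                      _mm_mul_ps(_mm_loadu_ps(x.data() + i), vs));
    }
#endif
    for (; i < n; ++i) x[i] *= s;
}
```

Because the scalar loop continues from wherever the vector loop stopped, the same function is correct for any length and on any target; the preprocessor guards only decide how much of the work is done in wide registers.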

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block-evaluation output with per-step logging of `loss`, `ms`, and `tok/s`.
- Moved the initial validation-loss calculation out of the training loop so that it provides a baseline before the first step.
- Restructured token streaming so that sample generation runs inside the training loop, after each evaluation window.
- Streamlined the reporting of architecture parameters and consolidated the printing of command-line configuration.
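
A minimal sketch of per-step logging with these three metrics is shown below. The helper names (`tokens_per_second`, `format_step_log`, `timed_step`) and the exact log format are assumptions made for illustration; the format used in `main.cpp` may differ.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical helper: throughput for one step, given its wall-clock time.
inline double tokens_per_second(long tokens, double elapsed_ms) {
    return elapsed_ms > 0.0 ? tokens * 1000.0 / elapsed_ms : 0.0;
}

// Hypothetical helper: one log line per training step.
inline std::string format_step_log(int step, int total_steps, double loss,
                                   double elapsed_ms, long tokens) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "step %d/%d | loss %.4f | %.2f ms | %.0f tok/s",
                  step, total_steps, loss, elapsed_ms,
                  tokens_per_second(tokens, elapsed_ms));
    return std::string(buf);
}

// Times a single call to `step_fn` (assumed to run one training step and
// return its loss) and returns the formatted log line for that step.
template <typename StepFn>
std::string timed_step(int step, int total_steps, long tokens_per_step,
                       StepFn&& step_fn) {
    const auto t0 = std::chrono::steady_clock::now();
    const double loss = step_fn();
    const auto t1 = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    return format_step_log(step, total_steps, loss, ms, tokens_per_step);
}
```

Here `tokens_per_step` would typically be batch size times sequence length; `std::chrono::steady_clock` is used because it is monotonic and therefore safe for measuring intervals.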

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Reduce the memory footprint by recomputing LayerNorm and GeLU forward activations instead of storing them.
- Lay out per-layer activation buffers through a centralized `TensorSpec` registry to support scaling to large batches.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back to `cudaMallocManaged` when device memory runs short, to avoid out-of-memory (OOM) crashes.
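
The allocation fallback in the last bullet can be sketched in plain C++ as below. The allocators are injected as callables so the sketch compiles without the CUDA toolkit; `AllocFn`, `AllocResult`, and `alloc_with_fallback` are hypothetical names, not identifiers from `CUDA/main.cu`. In a real CUDA build the two callables would presumably wrap `cudaMalloc` and `cudaMallocManaged` and compare their return value against `cudaSuccess`.

```cpp
#include <cstddef>
#include <functional>

// Allocator signature (assumed for this sketch): returns true on success
// and writes the allocated pointer to *out.
using AllocFn = std::function<bool(void** out, std::size_t bytes)>;

struct AllocResult {
    void* ptr = nullptr;         // nullptr if both allocators failed
    bool used_fallback = false;  // true if the managed allocator was used
};

// Try the device allocator first; if it fails, retry with the managed one.
inline AllocResult alloc_with_fallback(std::size_t bytes,
                                       const AllocFn& device_alloc,
                                       const AllocFn& managed_alloc) {
    AllocResult r;
    if (device_alloc(&r.ptr, bytes)) {
        return r;  // fast path: plain device memory
    }
    r.ptr = nullptr;
    if (managed_alloc(&r.ptr, bytes)) {
        r.used_fallback = true;  // slower, but avoids aborting on OOM
        return r;
    }
    r.ptr = nullptr;  // both failed; the caller decides how to report it
    return r;
}
```

Returning `used_fallback` lets the caller log a warning, since managed memory can be noticeably slower than device memory when pages migrate between host and GPU.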

* Update README.md with new banner for Quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>
---
 Dockerfile => .devops/Dockerfile              |    2 +-
 Dockerfile.cuda => .devops/Dockerfile.backend |    0
 .devops/Dockerfile.cpp                        |   65 +
 .devops/nginx.conf                            |   47 +
 .dockerignore                                 |   57 +-
 .github/workflows/ci.yml                      |  238 +-
 .github/workflows/docker-publish.yml          |  163 +-
 .github/workflows/pr-check.yml                |  238 ++
 CUDA/main.cu                                  | 2070 +++++++++++++++++
 LICENSE                                       |    2 +-
 Makefile                                      |  104 +
 README.md                                     |    4 +
 config/config.h                               |   20 +-
 docker-compose.dev.yml                        |   45 +
 docker-compose.gpu.yml                        |   32 +
 docker-compose.yml                            |  181 +-
 frontend/src/components/chat/EmptyState.tsx   |   96 +-
 .../src/components/chat/MessageAvatar.tsx     |   45 +-
 frontend/src/components/chat/MessageList.tsx  |   21 +-
 frontend/src/components/chat/MessageRow.tsx   |   87 +-
 .../src/components/chat/ThinkingIndicator.tsx |   28 +-
 include/tensor.h                              |  749 ++++--
 main.cpp                                      |  193 +-
 run.md                                        |  492 ----
 scripts/build.sh                              |  161 ++
 25 files changed, 4077 insertions(+), 1063 deletions(-)
 rename Dockerfile => .devops/Dockerfile (94%)
 rename Dockerfile.cuda => .devops/Dockerfile.backend (100%)
 create mode 100644 .devops/Dockerfile.cpp
 create mode 100644 .devops/nginx.conf
 create mode 100644 .github/workflows/pr-check.yml
 create mode 100644 CUDA/main.cu
 create mode 100644 Makefile
 create mode 100644 docker-compose.dev.yml
 create mode 100644 docker-compose.gpu.yml
 delete mode 100644 run.md
 create mode 100644 scripts/build.sh

diff --git a/Dockerfile b/.devops/Dockerfile
similarity index 94%
rename from Dockerfile
rename to .devops/Dockerfile
index 65fcca9..c7c0061 100644
--- a/Dockerfile
+++ b/.devops/Dockerfile
@@ -35,4 +35,4 @@ COPY . .
 ENV PATH="/app/venv/bin:$PATH"
 ENV PYTHONUNBUFFERED=1
 
-ENTRYPOINT ["python3", "engine/main.py"]   
\ No newline at end of file
+ENTRYPOINT ["python3", "engine/main.py"]   
diff --git a/Dockerfile.cuda b/.devops/Dockerfile.backend
similarity index 100%
rename from Dockerfile.cuda
rename to .devops/Dockerfile.backend
diff --git a/.devops/Dockerfile.cpp b/.devops/Dockerfile.cpp
new file mode 100644
index 0000000..0a1ce15
--- /dev/null
+++ b/.devops/Dockerfile.cpp
@@ -0,0 +1,65 @@
+
+FROM ubuntu:24.04 AS builder
+
+LABEL stage=builder
+
+ARG DEBIAN_FRONTEND=noninteractive
+ARG BUILD_TYPE=Release
+ARG CMAKE_EXTRA_FLAGS=""
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    g++ \
+    cmake \
+    ninja-build \
+    ccache \
+    git \
+    ca-certificates \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /src
+
+COPY main.cpp        ./
+COPY benchmark.cpp   ./
+COPY config/         ./config/
+COPY include/        ./include/
+COPY data/           ./data/
+
+# If model/Cmakelists.txt or CMakeLists.txt exists, build with CMake; otherwise fall back to direct g++
+RUN set -e; \
+    if [ -f model/Cmakelists.txt ] || [ -f CMakeLists.txt ]; then \
+    cmake -B build -G Ninja \
+    -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
+    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
+    ${CMAKE_EXTRA_FLAGS} .; \
+    cmake --build build --parallel "$(nproc)"; \
+    else \
+    g++ -std=c++17 -O3 -march=native \
+    -I. -Iinclude \
+    -o /usr/local/bin/quadtrix \
+    main.cpp; \
+    fi
+FROM ubuntu:24.04 AS runtime
+
+LABEL org.opencontainers.image.title="Quadtrix.cpp Engine"
+LABEL org.opencontainers.image.description="C++ transformer engine for local LM inference"
+LABEL org.opencontainers.image.source="https://github.com/Eamon2009/Quadtrix.cpp"
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    libstdc++6 \
+    libgomp1 \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+
+COPY --from=builder /usr/local/bin/quadtrix /usr/local/bin/quadtrix
+COPY --from=builder /src/data/ ./data/
+VOLUME ["/models"]
+
+ENV GPT_DATA_PATH=/app/data/input.txt \
+    GPT_MODEL_PATH=/models/best_model.bin
+
+EXPOSE 8080
+
+ENTRYPOINT ["/usr/local/bin/quadtrix"]
+CMD ["data/input.txt", "--chat"]
diff --git a/.devops/nginx.conf b/.devops/nginx.conf
new file mode 100644
index 0000000..5804e6e
--- /dev/null
+++ b/.devops/nginx.conf
@@ -0,0 +1,47 @@
+# Quadtrix.cpp — Nginx config
+# Serves the Vite SPA and proxies /api/* to the FastAPI backend
+
+server {
+    listen 80;
+    server_name _;
+
+    root /usr/share/nginx/html;
+    index index.html;
+
+    # Gzip
+    gzip on;
+    gzip_types text/plain text/css application/json application/javascript
+               text/xml application/xml application/xml+rss text/javascript
+               application/wasm;
+    gzip_min_length 1024;
+
+    # SPA fallback — all unknown routes return index.html
+    location / {
+        try_files $uri $uri/ /index.html;
+    }
+
+    # Proxy API calls to FastAPI backend
+    location /api/ {
+        proxy_pass         http://backend:3001;
+        proxy_http_version 1.1;
+        proxy_set_header   Host              $host;
+        proxy_set_header   X-Real-IP         $remote_addr;
+        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
+        proxy_set_header   X-Forwarded-Proto $scheme;
+        proxy_set_header   Upgrade           $http_upgrade;
+        proxy_set_header   Connection        "upgrade";
+        proxy_read_timeout 120s;
+        proxy_send_timeout 120s;
+    }
+
+    # Static asset cache
+    location ~* \.(js|css|png|svg|ico|woff2|woff|ttf|webmanifest)$ {
+        expires 1y;
+        add_header Cache-Control "public, immutable";
+    }
+
+    # Service worker must not be cached
+    location = /sw.js {
+        add_header Cache-Control "no-cache";
+    }
+}
diff --git a/.dockerignore b/.dockerignore
index f001789..603874e 100644
--- a/.dockerignore
+++ b/.dockerignore
@@ -1,35 +1,44 @@
+
 .git
 .gitignore
 .github
 .venv
-**/__pycache__
-**/*.pyc
-**/*.pyo
-**/*.pyd
-engine/logs/
+__pycache__
+*.pyc
+*.pyo
+*.pyd
+*.egg-info
+.pytest_cache
+.ruff_cache
+dist/
+build/
+*.egg
 node_modules
 frontend/node_modules
-.npm-cache
-frontend/.vite
 frontend/dist
-
-#  Model weights 
-*.pt
-*.bin
-models/
-
-#  Windows build artifacts
-*.exe
+frontend/.vite
+*.npm-cache
+.npmignore
+*.o
+*.a
+*.so
+*.dylib
 quadtrix.exe
-*.png
-*.jpg
-*.jpeg
-*.md
-LICENSE
-contributing.md
-SECURITY.md
-run.md
+quadtrix
+build/
+cmake-build-*/
+.vscode
+*.bin
+*.pt
+*.gguf
+*.safetensors
+engine/best_model.pt
+engine/logs/
+engine/fineweb_30mb.txt
+data/input.txt
 .DS_Store
 Thumbs.db
+*.swp
+*.swo
 .idea
-.vscode
\ No newline at end of file
+docker-compose.override.yml
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 311ad33..bf49286 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -2,74 +2,216 @@ name: CI
 
 on:
   push:
-    branches:
-      - exp
-      - master
-  pull_request:
-
-permissions:
-  contents: read
+    branches: [master, dev]
+  workflow_dispatch:
+    inputs:
+      image:
+        description: "Which image to build?"
+        required: true
+        type: choice
+        options:
+          - cpp
+          - cpu
+          - cuda
+          - all
+      push:
+        description: "Push to ghcr.io?"
+        required: true
+        default: "true"
+        type: choice
+        options: ["true", "false"]
+
+env:
+  REGISTRY: ghcr.io
+  IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix
 
 jobs:
-  cpp-build:
-    name: C++ build
+
+  file-integrity:
+    name: File integrity
+    if: github.event_name == 'push'
     runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
 
+      - name: Check required files exist
+        run: |
+          files=(
+            "main.cpp"
+            "engine/main.py"
+            "requirements.txt"
+          )
+          failed=0
+          for f in "${files[@]}"; do
+            if [ -f "$f" ]; then
+              echo "✅  $f"
+            else
+              echo "❌  $f — MISSING"
+              failed=1
+            fi
+          done
+          exit $failed
+
+
+  lint-python:
+    name: Python lint
+    if: github.event_name == 'push'
+    runs-on: ubuntu-latest
     steps:
-      - name: Check out repository
-        uses: actions/checkout@v4
+      - uses: actions/checkout@v4
 
-      - name: Install compiler
-        run: sudo apt-get update && sudo apt-get install -y g++
+      - name: Lint engine/ (ruff)
+        uses: chartboost/ruff-action@v1
+        with:
+          args: "check engine/ --ignore E501 --exit-zero"
 
-      - name: Build Quadtrix
-        run: g++ -std=c++17 -O2 -I. -Iinclude -o quadtrix main.cpp
 
-  backend-smoke:
-    name: Backend smoke checks
+  build-cpp:
+    name: C++ compile check
+    if: github.event_name == 'push'
     runs-on: ubuntu-latest
-
     steps:
-      - name: Check out repository
-        uses: actions/checkout@v4
+      - uses: actions/checkout@v4
 
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: "3.11"
+      - name: Install g++
+        run: sudo apt-get update && sudo apt-get install -y g++
 
-      - name: Install backend runtime dependencies
+      - name: Compile main.cpp
         run: |
-          python -m pip install --upgrade pip
-          pip install fastapi "uvicorn[standard]" pydantic pydantic-settings httpx redis
+          g++ -std=c++17 -O3 \
+            -I. -Iinclude \
+            -o quadtrix main.cpp
 
-      - name: Compile Python sources
-        run: python -m compileall backend engine
+      - name: Smoke test
+        run: ./quadtrix --help || true
 
-      - name: Import FastAPI application
-        working-directory: backend
-        run: |
-          python -c "from main import app; print(app.title)"
 
-  frontend-build:
-    name: Frontend build
+  build-cpp-image:
+    name: Build — cpp
+    if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all')
     runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: docker/setup-qemu-action@v3
+      - uses: docker/setup-buildx-action@v3
+
+      - name: Login to GHCR
+        if: inputs.push == 'true'
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
 
+      - name: Extract metadata
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_PREFIX }}-cpp
+          tags: |
+            type=ref,event=branch
+            type=sha,prefix=sha-
+            type=raw,value=latest,enable={{is_default_branch}}
+
+      - name: Build & push
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: .devops/Dockerfile.cpp
+          platforms: linux/amd64,linux/arm64
+          push: ${{ inputs.push == 'true' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha,scope=cpp
+          cache-to: type=gha,mode=max,scope=cpp
+
+
+  build-cpu-image:
+    name: Build — cpu
+    if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all')
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
     steps:
-      - name: Check out repository
-        uses: actions/checkout@v4
+      - uses: actions/checkout@v4
+
+      - uses: docker/setup-qemu-action@v3
+      - uses: docker/setup-buildx-action@v3
+
+      - name: Login to GHCR
+        if: inputs.push == 'true'
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
 
-      - name: Set up Node.js
-        uses: actions/setup-node@v4
+      - name: Extract metadata
+        id: meta
+        uses: docker/metadata-action@v5
         with:
-          node-version: "20"
-          cache: "npm"
-          cache-dependency-path: frontend/package-lock.json
+          images: ${{ env.IMAGE_PREFIX }}-cpu
+          tags: |
+            type=ref,event=branch
+            type=sha,prefix=sha-
+            type=raw,value=latest,enable={{is_default_branch}}
+
+      - name: Build & push
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: .devops/Dockerfile
+          platforms: linux/amd64,linux/arm64
+          push: ${{ inputs.push == 'true' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha,scope=cpu
+          cache-to: type=gha,mode=max,scope=cpu
+
+
+  build-cuda-image:
+    name: Build — cuda
+    if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all')
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
+    steps:
+      - uses: actions/checkout@v4
 
-      - name: Install frontend dependencies
-        working-directory: frontend
-        run: npm ci
+      - uses: docker/setup-buildx-action@v3
 
-      - name: Build frontend
-        working-directory: frontend
-        run: npm run build
+      - name: Login to GHCR
+        if: inputs.push == 'true'
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Extract metadata
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_PREFIX }}-cuda
+          tags: |
+            type=ref,event=branch
+            type=sha,prefix=sha-
+            type=raw,value=latest,enable={{is_default_branch}}
+
+      - name: Build & push
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: .devops/Dockerfile.backend
+          platforms: linux/amd64
+          push: ${{ inputs.push == 'true' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha,scope=cuda
+          cache-to: type=gha,mode=max,scope=cuda
\ No newline at end of file
diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml
index 1431739..ca9493f 100644
--- a/.github/workflows/docker-publish.yml
+++ b/.github/workflows/docker-publish.yml
@@ -1,73 +1,132 @@
-name: Publish Docker image
+name: Release
+
 on:
-  workflow_dispatch:      
-concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
+  workflow_dispatch:
+    inputs:
+      version:
+        description: "Version tag (e.g. 1.2.3)"
+        required: true
+
 env:
   REGISTRY: ghcr.io
+  IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix
+
 jobs:
-  build-and-push:
-    name: Build & push (${{ matrix.variant }})
-    runs-on: ubuntu-latest
-    permissions:
-      contents: read
-      packages: write
+
+  build-binaries:
+    name: Binary (${{ matrix.os }})
+    runs-on: ${{ matrix.os }}
     strategy:
-      fail-fast: false
       matrix:
+        os: [ubuntu-22.04, macos-14]
         include:
-          - variant: cpu
-            dockerfile: Dockerfile
-            tag_suffix: ""
-          - variant: cuda
-            dockerfile: Dockerfile.cuda
-            tag_suffix: "-cuda"
+          - os: ubuntu-22.04
+            artifact_name: quadtrix-linux-x64
+            binary: quadtrix
+          - os: macos-14
+            artifact_name: quadtrix-macos-arm64
+            binary: quadtrix
     steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-      - name: Set lowercase image name
-        id: image
+      - uses: actions/checkout@v4
+
+      - name: Compile (Linux)
+        if: runner.os == 'Linux'
+        run: |
+          sudo apt-get update && sudo apt-get install -y g++
+          g++ -std=c++17 -O3 -march=native \
+              -I. -Iinclude \
+              -o ${{ matrix.binary }} main.cpp
+          strip ${{ matrix.binary }}
+
+      - name: Compile (macOS)
+        if: runner.os == 'macOS'
+        run: |
+          g++ -std=c++17 -O3 -march=native \
+              -I. -Iinclude \
+              -o ${{ matrix.binary }} main.cpp
+
+      - name: Package
         run: |
-          echo "name=$(echo '${{ github.repository }}' | tr '[:upper:]' '[:lower:]')" >> $GITHUB_OUTPUT
-      - name: Set up QEMU
-        uses: docker/setup-qemu-action@v3
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
-      - name: Log in to ghcr.io
+          mkdir dist
+          cp ${{ matrix.binary }} dist/
+          cp README.md LICENSE dist/
+          tar -czf ${{ matrix.artifact_name }}.tar.gz -C dist .
+
+      - name: Upload to Release
+        uses: softprops/action-gh-release@v2
+        with:
+          tag_name: v${{ github.event.inputs.version }}
+          files: ${{ matrix.artifact_name }}.tar.gz
+          generate_release_notes: true
+
+  publish-images:
+    name: Publish Docker images
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: docker/setup-qemu-action@v3
+      - uses: docker/setup-buildx-action@v3
+
+      - name: Login to GHCR
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}
-      - name: Extract Docker metadata
-        id: meta
-        uses: docker/metadata-action@v5
+
+      - name: Parse tag
+        id: tag
+        run: echo "VERSION=${{ github.event.inputs.version }}" >> $GITHUB_OUTPUT
+
+      - name: Build & push backend
+        uses: docker/build-push-action@v6
         with:
-          images: ${{ env.REGISTRY }}/${{ steps.image.outputs.name }}
+          context: .
+          file: .devops/Dockerfile.backend
+          platforms: linux/amd64,linux/arm64
+          push: true
           tags: |
-            type=raw,value=latest${{ matrix.tag_suffix }},enable={{is_default_branch}}
-            type=semver,pattern={{version}},suffix=${{ matrix.tag_suffix }}
-            type=semver,pattern={{major}}.{{minor}},suffix=${{ matrix.tag_suffix }}
-            type=ref,event=pr,suffix=${{ matrix.tag_suffix }}
-      - name: Free disk space
-        if: matrix.variant == 'cuda'
-        run: |
-          sudo rm -rf /usr/share/dotnet
-          sudo rm -rf /opt/ghc
-          sudo rm -rf /usr/local/share/boost
-          df -h
-      - name: Build and push Docker image
+            ${{ env.IMAGE_PREFIX }}-backend:latest
+            ${{ env.IMAGE_PREFIX }}-backend:${{ steps.tag.outputs.VERSION }}
+          cache-from: type=gha,scope=backend
+          cache-to: type=gha,mode=max,scope=backend
+
+      - name: Build & push frontend
         uses: docker/build-push-action@v6
         with:
           context: .
-          file: ./${{ matrix.dockerfile }}
+          file: .devops/Dockerfile.frontend
+          platforms: linux/amd64,linux/arm64
           push: true
-          tags: ${{ steps.meta.outputs.tags }}
-          labels: ${{ steps.meta.outputs.labels }}
-          cache-from: type=gha,scope=${{ matrix.variant }}
-          cache-to: type=gha,mode=max,scope=${{ matrix.variant }}
-      - name: Image published
+          tags: |
+            ${{ env.IMAGE_PREFIX }}-frontend:latest
+            ${{ env.IMAGE_PREFIX }}-frontend:${{ steps.tag.outputs.VERSION }}
+          cache-from: type=gha,scope=frontend
+          cache-to: type=gha,mode=max,scope=frontend
+
+      - name: Build & push cpp
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: .devops/Dockerfile.cpp
+          platforms: linux/amd64,linux/arm64
+          push: true
+          tags: |
+            ${{ env.IMAGE_PREFIX }}-cpp:latest
+            ${{ env.IMAGE_PREFIX }}-cpp:${{ steps.tag.outputs.VERSION }}
+          cache-from: type=gha,scope=cpp
+          cache-to: type=gha,mode=max,scope=cpp
+
+      - name: Create Release summary
         run: |
-          echo "[${{ matrix.variant }}] published:"
-          echo "  docker pull ${{ env.REGISTRY }}/${{ steps.image.outputs.name }}:latest${{ matrix.tag_suffix }}"
+          echo "## Docker images published" >> $GITHUB_STEP_SUMMARY
+          echo "" >> $GITHUB_STEP_SUMMARY
+          echo "| Image | Tags |" >> $GITHUB_STEP_SUMMARY
+          echo "|-------|------|" >> $GITHUB_STEP_SUMMARY
+          echo "| \`quadtrix-backend\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY
+          echo "| \`quadtrix-frontend\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY
+          echo "| \`quadtrix-cpp\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY
diff --git a/.github/workflows/pr-check.yml b/.github/workflows/pr-check.yml
new file mode 100644
index 0000000..c52ae09
--- /dev/null
+++ b/.github/workflows/pr-check.yml
@@ -0,0 +1,238 @@
+name: PR Checks
+
+on:
+  issue_comment:
+    types: [created]
+
+jobs:
+  slash-command:
+    name: Parse /run-checks
+    if: |
+      github.event.issue.pull_request != null &&
+      contains(github.event.comment.body, '/run-checks')
+    runs-on: ubuntu-latest
+    outputs:
+      pr-sha: ${{ steps.get-sha.outputs.sha }}
+    steps:
+      - name: Check commenter permission
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const { data } = await github.rest.repos.getCollaboratorPermissionLevel({
+              owner: context.repo.owner,
+              repo:  context.repo.repo,
+              username: context.actor,
+            });
+            if (!['admin', 'write'].includes(data.permission)) {
+              await github.rest.issues.createComment({
+                owner: context.repo.owner,
+                repo:  context.repo.repo,
+                issue_number: context.issue.number,
+                body: `@${context.actor} Only maintainers can trigger checks.`,
+              });
+              core.setFailed('Unauthorized');
+            }
+
+      - name: React with rocket
+        uses: actions/github-script@v7
+        with:
+          script: |
+            await github.rest.reactions.createForIssueComment({
+              owner: context.repo.owner,
+              repo:  context.repo.repo,
+              comment_id: ${{ github.event.comment.id }},
+              content: 'rocket',
+            });
+
+      - name: Get PR head SHA
+        id: get-sha
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const { data: pr } = await github.rest.pulls.get({
+              owner: context.repo.owner,
+              repo:  context.repo.repo,
+              pull_number: context.issue.number,
+            });
+            core.setOutput('sha', pr.head.sha);
+
+
+  lint:
+    name: Lint
+    needs: slash-command
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: ${{ needs.slash-command.outputs.pr-sha }}
+
+      - name: C++ format check
+        run: |
+          sudo apt-get update && sudo apt-get install -y clang-format
+          find . -name "*.cpp" -o -name "*.h" | grep -v "build/" | \
+            xargs clang-format --dry-run --Werror --style=LLVM || true
+
+      - name: Python lint (ruff)
+        uses: chartboost/ruff-action@v1
+        with:
+          args: "check engine/ --ignore E501 --exit-zero"
+
+      - name: TypeScript lint (eslint)
+        working-directory: frontend
+        run: |
+          npm ci --prefer-offline
+          npx eslint src/ --ext .ts,.tsx --max-warnings 20 || true
+
+
+  build-cpp:
+    name: Build C++ (${{ matrix.os }})
+    needs: slash-command
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-22.04, ubuntu-24.04, macos-14]
+        include:
+          - os: ubuntu-22.04
+            artifact: quadtrix-linux-x64
+          - os: ubuntu-24.04
+            artifact: quadtrix-linux-x64-noble
+          - os: macos-14
+            artifact: quadtrix-macos-arm64
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: ${{ needs.slash-command.outputs.pr-sha }}
+
+      - name: Install GCC (Linux)
+        if: runner.os == 'Linux'
+        run: sudo apt-get update && sudo apt-get install -y g++ ccache
+
+      - name: Cache ccache
+        uses: actions/cache@v4
+        with:
+          path: ~/.ccache
+          key: ccache-${{ matrix.os }}-${{ hashFiles('**/*.cpp', '**/*.h') }}
+          restore-keys: ccache-${{ matrix.os }}-
+
+      - name: Compile main.cpp
+        run: |
+          g++ -std=c++17 -O3 -march=native \
+            -I. -Iinclude \
+            -o quadtrix main.cpp
+
+      - name: Smoke test
+        run: ./quadtrix --help || true
+
+      - name: Upload artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: ${{ matrix.artifact }}
+          path: quadtrix
+          retention-days: 7
+
+
+  validate-dockerfiles:
+    name: Validate Dockerfiles
+    needs: slash-command
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: ${{ needs.slash-command.outputs.pr-sha }}
+
+    
+      - name: Check required files exist
+        run: |
+          echo "Checking files referenced by Dockerfiles..."
+          files=(
+            "main.cpp"
+            "engine/main.py"
+            "requirements.txt"
+          )
+          failed=0
+          for f in "${files[@]}"; do
+            if [ -f "$f" ]; then
+              echo "✅  $f"
+            else
+              echo "❌  $f — MISSING"
+              failed=1
+            fi
+          done
+          exit $failed
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Build check — Dockerfile.cpp (C++ engine)
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: .devops/Dockerfile.cpp
+          platforms: linux/amd64
+          push: false
+          cache-from: type=gha,scope=cpp
+          cache-to: type=gha,mode=max,scope=cpp
+
+
+      - name: Build check — Dockerfile (PyTorch CPU)
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: .devops/Dockerfile
+          platforms: linux/amd64
+          push: false
+          cache-from: type=gha,scope=cpu
+          cache-to: type=gha,mode=max,scope=cpu
+
+      - name: Skip CUDA build check
+        run: echo "CUDA build skipped on PR checks; run the docker-publish.yml workflow to build the CUDA image."
+
+
+  test-frontend:
+    name: Frontend Tests
+    needs: [slash-command, lint]
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: ${{ needs.slash-command.outputs.pr-sha }}
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: "20"
+          cache: npm
+          cache-dependency-path: frontend/package-lock.json
+
+      - name: Install
+        working-directory: frontend
+        run: npm ci --prefer-offline
+
+      - name: Type-check
+        working-directory: frontend
+        run: npx tsc --noEmit
+
+      - name: Build check
+        working-directory: frontend
+        run: npm run build
+
+
+  post-result:
+    name: Post result
+    needs: [slash-command, lint, build-cpp, validate-dockerfiles, test-frontend]
+    runs-on: ubuntu-latest
+    if: always()
+    steps:
+      - uses: actions/github-script@v7
+        with:
+          script: |
+            const jobs   = ${{ toJSON(needs) }};
+            const failed = Object.values(jobs).some(j => j.result === 'failure' || j.result === 'cancelled');
+            await github.rest.issues.createComment({
+              owner: context.repo.owner,
+              repo:  context.repo.repo,
+              issue_number: context.issue.number,
+              body: failed
+                ? 'Some checks failed. See the Actions tab for details.'
+                : 'All checks passed!',
+            });
\ No newline at end of file
diff --git a/CUDA/main.cu b/CUDA/main.cu
new file mode 100644
index 0000000..4b24fec
--- /dev/null
+++ b/CUDA/main.cu
@@ -0,0 +1,2070 @@
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdarg.h>
+#include <string>
+#include <string_view>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include "llmcpp/utils.h"
+
+#include "llmcpp/tokenizer.h"
+
+#include "llmcpp/dataloader.h"
+
+#include "llmcpp/rand.h"
+
+#include "llmcpp/schedulers.h"
+
+#include "llmcpp/sampler.h"
+
+#include "llmcpp/logger.h"
+
+#include "llmcpp/mfu.h"
+
+#include "llmcpp/outlier_detector.h"
+
+#include "llmcpp/cuda_common.h"
+
+#include "llmcpp/cuda_utils.cuh"
+
+#include "llmcpp/cublas_common.h"
+
+#include "llmcpp/encoder.cuh"
+
+#include "llmcpp/layernorm.cuh"
+
+#include "llmcpp/matmul.cuh"
+#ifdef ENABLE_CUDNN
+
+#include "llmcpp/cudnn_att.h"
+#else
+
+#include "llmcpp/attention.cuh"
+#endif
+
+#include "llmcpp/fused_classifier.cuh"
+
+#include "llmcpp/adamw.cuh"
+
+#include "llmcpp/global_norm.cuh"
+
+#include "llmcpp/zero.cuh"
+
+char filename_buffer[512];
+
+cudaDeviceProp deviceProp;
+cudaStream_t main_stream;
+
+constexpr const size_t IO_BUF_SIZE = 32 * 1024 * 1024;
+
+typedef struct
+{
+      int max_seq_len;
+      int vocab_size;
+      int padded_vocab_size;
+      int num_layers;
+      int num_heads;
+      int channels;
+} GPT2Config;
+
+constexpr const int NUM_PARAMETER_TENSORS = 16;
+typedef struct
+{
+      floatX *wte;
+      floatX *wpe;
+      floatX *ln1w;
+      floatX *ln1b;
+      floatX *qkvw;
+      floatX *qkvb;
+      floatX *attprojw;
+      floatX *attprojb;
+      floatX *ln2w;
+      floatX *ln2b;
+      floatX *fcw;
+      floatX *fcb;
+      floatX *fcprojw;
+      floatX *fcprojb;
+      floatX *lnfw;
+      floatX *lnfb;
+} ParameterTensors;
+static_assert(sizeof(ParameterTensors) == NUM_PARAMETER_TENSORS * sizeof(void *), "Inconsistent sizes!");
+
+void fill_in_parameter_sizes(size_t *param_sizes, size_t *param_sizeof, GPT2Config config)
+{
+      size_t Vp = config.padded_vocab_size;
+      size_t C = config.channels;
+      size_t maxT = config.max_seq_len;
+      size_t L = config.num_layers;
+      param_sizes[0] = Vp * C;            // wte
+      param_sizes[1] = maxT * C;          // wpe
+      param_sizes[2] = L * C;             // ln1w
+      param_sizes[3] = L * C;             // ln1b
+      param_sizes[4] = L * (3 * C) * C;   // qkvw
+      param_sizes[5] = L * (3 * C);       // qkvb
+      param_sizes[6] = L * C * C;         // attprojw
+      param_sizes[7] = L * C;             // attprojb
+      param_sizes[8] = L * C;             // ln2w
+      param_sizes[9] = L * C;             // ln2b
+      param_sizes[10] = L * (4 * C) * C;  // fcw
+      param_sizes[11] = L * (4 * C);      // fcb
+      param_sizes[12] = L * C * (4 * C);  // fcprojw
+      param_sizes[13] = L * C;            // fcprojb
+      param_sizes[14] = C;                // lnfw
+      param_sizes[15] = C;                // lnfb
+
+      for (int i = 0; i < NUM_PARAMETER_TENSORS; i++)
+      {
+            param_sizeof[i] = sizeof(floatX);
+      }
+}
+
+void *malloc_and_point_parameters(ParameterTensors *params, size_t *param_elements, size_t *param_sizeof)
+{
+
+      size_t num_parameters_bytes = 0;
+      for (int i = 0; i < NUM_PARAMETER_TENSORS; i++)
+      {
+            num_parameters_bytes += param_elements[i] * param_sizeof[i];
+      }
+
+      void *params_memory;
+      cudaCheck(cudaMalloc((void **)&params_memory, num_parameters_bytes));
+
+      floatX **ptrs[] = {
+          &params->wte, &params->wpe, &params->ln1w, &params->ln1b, &params->qkvw, &params->qkvb,
+          &params->attprojw, &params->attprojb, &params->ln2w, &params->ln2b, &params->fcw, &params->fcb,
+          &params->fcprojw, &params->fcprojb, &params->lnfw, &params->lnfb};
+      char *params_memory_iterator = (char *)params_memory;
+      for (int i = 0; i < NUM_PARAMETER_TENSORS; i++)
+      {
+            *(ptrs[i]) = (floatX *)params_memory_iterator;
+            params_memory_iterator += param_elements[i] * param_sizeof[i];
+      }
+      return params_memory;
+}
+
+constexpr int NUM_ACTIVATION_TENSORS = 21;
+typedef struct
+{
+      floatX *encoded;
+      floatX *ln1;
+      float *ln1_mean;
+      float *ln1_rstd;
+      floatX *atty;
+
+#ifdef ENABLE_CUDNN
+      float *att;
+#else
+      floatX *att;
+#endif
+
+      floatX *residual2;
+      floatX *ln2;
+      float *ln2_mean;
+      float *ln2_rstd;
+      floatX *fch;
+      floatX *fch_gelu;
+      floatX *residual3;
+      floatX *lnf;
+      float *lnf_mean;
+      float *lnf_rstd;
+      float *losses;
+
+      floatX *qkvr;
+
+      floatX *output;
+
+      floatX *scratch_bt4c;
+      floatX *scratch_btc;
+} ActivationTensors;
+
+struct TensorSpec
+{
+      void **ptr;
+      size_t size;
+      DType type;
+};
+
+#define TENSOR_SPEC(pointer, size) TensorSpec{(void **)(&pointer), (size), dtype_of(pointer)}
+
+void fill_in_activation_sizes(const ActivationTensors *data, TensorSpec (&tensors)[NUM_ACTIVATION_TENSORS], size_t B, size_t T, GPT2Config config, int recompute)
+{
+      size_t Vp = config.padded_vocab_size;
+      size_t L = config.num_layers;
+      size_t NH = config.num_heads;
+      size_t C = config.channels;
+      tensors[0] = TENSOR_SPEC(data->encoded, B * T * C);
+
+      tensors[1] = TENSOR_SPEC(data->ln1, (recompute < 2) ? L * B * T * C : 0);
+      tensors[2] = TENSOR_SPEC(data->ln1_mean, L * B * T);
+      tensors[3] = TENSOR_SPEC(data->ln1_rstd, L * B * T);
+      tensors[4] = TENSOR_SPEC(data->atty, L * B * T * C);
+#ifdef ENABLE_CUDNN
+
+      tensors[5] = TENSOR_SPEC(data->att, L * B * NH * T);
+#else
+      tensors[5] = TENSOR_SPEC(data->att, L * B * NH * T * T);
+#endif
+      tensors[6] = TENSOR_SPEC(data->residual2, L * B * T * C);
+
+      tensors[7] = TENSOR_SPEC(data->ln2, (recompute < 2) ? L * B * T * C : 0);
+      tensors[8] = TENSOR_SPEC(data->ln2_mean, L * B * T);
+      tensors[9] = TENSOR_SPEC(data->ln2_rstd, L * B * T);
+      tensors[10] = TENSOR_SPEC(data->fch, L * B * T * 4 * C);
+
+      tensors[11] = TENSOR_SPEC(data->fch_gelu, (recompute < 1) ? L * B * T * 4 * C : B * T * 4 * C);
+      tensors[12] = TENSOR_SPEC(data->residual3, L * B * T * C);
+      tensors[13] = TENSOR_SPEC(data->lnf, B * T * C);
+      tensors[14] = TENSOR_SPEC(data->lnf_mean, B * T);
+      tensors[15] = TENSOR_SPEC(data->lnf_rstd, B * T);
+      tensors[16] = TENSOR_SPEC(data->losses, B * T);
+      tensors[17] = TENSOR_SPEC(data->qkvr, L * B * T * 3 * C);
+      tensors[18] = TENSOR_SPEC(data->output, B * T * max(3 * C, max(NH * T, Vp)));
+
+      tensors[19] = TENSOR_SPEC(data->scratch_bt4c, B * T * 4 * C);
+      tensors[20] = TENSOR_SPEC(data->scratch_btc, B * T * C);
+}
+
+void *malloc_and_point_activations(TensorSpec (&tensors)[NUM_ACTIVATION_TENSORS])
+{
+      size_t bytes = 0;
+      for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++)
+      {
+            bytes += tensors[i].size * sizeof_dtype(tensors[i].type);
+      }
+
+      printf0("allocating %d MiB for activations\n", (int)round(bytes / (1024 * 1024)));
+
+      void *acts_memory;
+      cudaCheck(cudaMalloc((void **)&acts_memory, bytes));
+
+      cudaCheck(cudaMemset(acts_memory, 0, bytes));
+
+      char *acts_memory_iterator = (char *)acts_memory;
+      for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++)
+      {
+
+            if (tensors[i].size == 0)
+            {
+                  *(tensors[i].ptr) = NULL;
+            }
+            else
+            {
+                  *(tensors[i].ptr) = acts_memory_iterator;
+                  acts_memory_iterator += tensors[i].size * sizeof_dtype(tensors[i].type);
+            }
+      }
+      return acts_memory;
+}
+
+typedef struct
+{
+      GPT2Config config;
+
+      ParameterTensors params;
+      size_t param_elements[NUM_PARAMETER_TENSORS];
+      size_t param_sizeof[NUM_PARAMETER_TENSORS];
+      void *params_memory;
+      size_t num_parameters;
+      size_t num_parameters_bytes;
+
+      ParameterTensors grads;
+      void *grads_memory;
+
+      float *m_memory;
+      float *v_memory;
+      float *master_weights;
+
+      ActivationTensors acts;
+      TensorSpec acts_specs[NUM_ACTIVATION_TENSORS];
+      void *acts_memory;
+
+      int batch_size;
+      int seq_len;
+      int *inputs;
+      int *targets;
+      float mean_loss;
+      float *accumulated_mean_loss;
+      float *cpu_losses;
+      unsigned long long rng_state;
+      unsigned long long rng_state_last_update;
+      int use_master_weights;
+      bool init_state;
+      int gelu_fusion;
+      int recompute;
+
+      int *workload_indices;
+      int4 *bucket_info;
+} GPT2;
+
+void gpt2_init_common(GPT2 *model)
+{
+
+      model->acts_memory = NULL;
+      model->inputs = NULL;
+      model->targets = NULL;
+      model->accumulated_mean_loss = NULL;
+      model->cpu_losses = NULL;
+
+      model->batch_size = 0;
+      model->seq_len = 0;
+      model->mean_loss = -1.0f;
+      model->params_memory = NULL;
+
+      model->grads_memory = NULL;
+      model->workload_indices = NULL;
+      model->bucket_info = NULL;
+
+      model->m_memory = NULL;
+      model->v_memory = NULL;
+      model->master_weights = NULL;
+
+      model->rng_state = 13371337 + multi_gpu_config.process_rank;
+      model->use_master_weights = 1;
+      model->init_state = true;
+      model->recompute = 1;
+      model->gelu_fusion = 0;
+}
+
+void gpt2_allocate_weights(GPT2 *model)
+{
+
+      fill_in_parameter_sizes(model->param_elements, model->param_sizeof, model->config);
+      model->num_parameters = 0;
+      model->num_parameters_bytes = 0;
+      for (int i = 0; i < NUM_PARAMETER_TENSORS; i++)
+      {
+            model->num_parameters += model->param_elements[i];
+            model->num_parameters_bytes += model->param_elements[i] * model->param_sizeof[i];
+      }
+
+      assert(model->params_memory == nullptr);
+      model->params_memory = malloc_and_point_parameters(&model->params, model->param_elements, model->param_sizeof);
+}
+
+void gpt2_allocate_state(GPT2 *model, int B, int T)
+{
+      printf0("allocating %d MiB for parameter gradients\n", (int)round(model->num_parameters * sizeof(floatX) / (1024 * 1024)));
+      assert(model->grads_memory == nullptr);
+      model->grads_memory = malloc_and_point_parameters(&model->grads, model->param_elements, model->param_sizeof);
+
+      model->batch_size = B;
+      model->seq_len = T;
+
+      fill_in_activation_sizes(&model->acts, model->acts_specs, B, T, model->config, model->recompute);
+      model->acts_memory = malloc_and_point_activations(model->acts_specs);
+
+      cudaCheck(cudaMalloc((void **)&model->inputs, B * T * sizeof(int)));
+      cudaCheck(cudaMalloc((void **)&model->targets, B * T * sizeof(int)));
+      cudaCheck(cudaMalloc(((void **)&model->accumulated_mean_loss), sizeof(float)));
+      cudaCheck(cudaMallocHost((void **)&model->cpu_losses, B * T * sizeof(float)));
+
+      size_t num_c_groups = CEIL_DIV(model->config.channels, (WARP_SIZE * x128::size));
+      assert((size_t)(model->batch_size * model->seq_len) * num_c_groups < (1ULL << 31ULL));
+      model->workload_indices = (int *)mallocCheck(sizeof(int) * model->batch_size * model->seq_len * num_c_groups);
+      model->bucket_info = (int4 *)mallocCheck(sizeof(int4) * model->batch_size * model->seq_len * num_c_groups);
+
+      int memory_status = 0;
+
+      size_t shard_num_parameters = multi_gpu_config.shard_num_parameters;
+      printf0("allocating %zu MiB for AdamW optimizer state m\n", (shard_num_parameters * sizeof(float)) >> 20);
+      printf0("allocating %zu MiB for AdamW optimizer state v\n", (shard_num_parameters * sizeof(float)) >> 20);
+      assert(model->m_memory == nullptr);
+      assert(model->v_memory == nullptr);
+      memory_status |= cudaMallocConditionallyManaged((void **)&model->m_memory, shard_num_parameters * sizeof(float));
+      memory_status |= cudaMallocConditionallyManaged((void **)&model->v_memory, shard_num_parameters * sizeof(float));
+
+      if (model->use_master_weights == 1)
+      {
+            assert(model->master_weights == nullptr);
+            printf0("allocating %zu MiB for master copy of params\n", (shard_num_parameters * sizeof(float)) >> 20);
+            memory_status |= cudaMallocConditionallyManaged((void **)&model->master_weights, shard_num_parameters * sizeof(float));
+      }
+
+      int reduced_memory_status = (int)multi_gpu_cpu_float_sum((float)memory_status, &multi_gpu_config);
+      if (reduced_memory_status >= 1)
+      {
+            printf0("WARNING: Fell back to cudaMallocManaged when initializing m,v,master_weights on %d GPUs\n", reduced_memory_status);
+            printf0("         Prevents an OOM, but code may run much slower due to device <-> host memory movement\n");
+      }
+
+      size_t free, total;
+      cudaCheck(cudaMemGetInfo(&free, &total));
+      printf0("device memory usage: %zu MiB / %zu MiB\n", (total - free) / 1024 / 1024, total / 1024 / 1024);
+
+      size_t bytes_per_sequence = 0;
+      for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++)
+      {
+            bytes_per_sequence += model->acts_specs[i].size * sizeof_dtype(model->acts_specs[i].type) / B;
+      }
+      printf0("memory per sequence: %zu MiB\n", bytes_per_sequence / 1024 / 1024);
+      printf0(" -> estimated maximum batch size: %zu\n", B + free / bytes_per_sequence);
+}
+
+void gpt2_write_to_checkpoint(GPT2 *model, const char *checkpoint_path)
+{
+
+      printf0("Writing model to %s\n", checkpoint_path);
+      FILE *model_file = fopenCheck(checkpoint_path, "wb");
+
+      int model_header[256];
+      memset(model_header, 0, sizeof(model_header));
+      model_header[0] = 20240326;
+      assert(PRECISION_MODE == PRECISION_FP32 || PRECISION_MODE == PRECISION_BF16);
+      model_header[1] = PRECISION_MODE == PRECISION_FP32 ? 3 : 5;
+      model_header[2] = model->config.max_seq_len;
+      model_header[3] = model->config.vocab_size;
+      model_header[4] = model->config.num_layers;
+      model_header[5] = model->config.num_heads;
+      model_header[6] = model->config.channels;
+      model_header[7] = model->config.padded_vocab_size;
+      fwriteCheck(model_header, sizeof(int), 256, model_file);
+
+      device_to_file(model_file, model->params_memory, model->num_parameters_bytes,
+                     IO_BUF_SIZE, main_stream);
+
+      fcloseCheck(model_file);
+}
+
+void gpt2_build_from_checkpoint(GPT2 *model, const char *checkpoint_path, bool weight_init = true)
+{
+
+      if (PRECISION_MODE == PRECISION_FP16)
+      {
+
+            fprintf(stderr, "build_from_checkpoint() does not support fp16 right now.\n");
+            exit(EXIT_FAILURE);
+      }
+
+      FILE *model_file = fopenCheck(checkpoint_path, "rb");
+      int model_header[256];
+      freadCheck(model_header, sizeof(int), 256, model_file);
+      if (model_header[0] != 20240326)
+      {
+            fprintf(stderr, "Bad magic number in model file\n");
+            exit(EXIT_FAILURE);
+      }
+      int version = model_header[1];
+      if (!(version == 3 || version == 5))
+      {
+
+            fprintf(stderr, "Bad version in model file\n");
+            fprintf(stderr, "---> HINT: try to re-run `python train_gpt2.py`\n");
+            exit(EXIT_FAILURE);
+      }
+
+      if (weight_init)
+      {
+            if (PRECISION_MODE == PRECISION_BF16 && version != 5)
+            {
+                  fprintf(stderr, "Precision is configured as BF16 but model at %s is not.\n", checkpoint_path);
+                  fprintf(stderr, "---> HINT: are you sure you're loading a _bf16.bin file?\n");
+                  exit(EXIT_FAILURE);
+            }
+            if (PRECISION_MODE == PRECISION_FP32 && version != 3)
+            {
+                  fprintf(stderr, "Precision is configured as FP32 but model at %s is not.\n", checkpoint_path);
+                  fprintf(stderr, "---> HINT: to turn on FP32 you have to compile like: `make train_gpt2cu PRECISION=FP32`\n");
+                  fprintf(stderr, "---> HINT: are you sure you're loading a .bin file without any _bf16 in the name?\n");
+                  exit(EXIT_FAILURE);
+            }
+      }
+
+      model->config.max_seq_len = model_header[2];
+      model->config.vocab_size = model_header[3];
+      model->config.num_layers = model_header[4];
+      model->config.num_heads = model_header[5];
+      model->config.channels = model_header[6];
+      model->config.padded_vocab_size = model_header[7];
+
+      gpt2_allocate_weights(model);
+
+      if (weight_init)
+      {
+            assert(model->params_memory != NULL);
+            file_to_device(model->params_memory, model_file, model->num_parameters_bytes, IO_BUF_SIZE, main_stream);
+      }
+      fcloseCheck(model_file);
+
+      cudaCheck(cudaDeviceSynchronize());
+}
+
+void gpt2_set_hyperparameters(GPT2Config *config, const char *depth_str)
+{
+      int depth = atoi(depth_str);
+      assert(depth > 0);
+      int channels, num_heads;
+      if (depth == 6)
+      {
+            channels = 384;
+            num_heads = 6;
+      }
+      else if (depth == 12)
+      {
+            channels = 768;
+            num_heads = 12;
+      }
+      else if (depth == 24)
+      {
+            channels = 1024;
+            num_heads = 16;
+      }
+      else if (depth == 36)
+      {
+            channels = 1280;
+            num_heads = 20;
+      }
+      else if (depth == 48)
+      {
+            channels = 1600;
+            num_heads = 25;
+      }
+      else if (depth == 60)
+      {
+            channels = 1920;
+            num_heads = 30;
+      }
+      else if (depth == 72)
+      {
+            channels = 2880;
+            num_heads = 30;
+      }
+      else if (depth == 84)
+      {
+            channels = 3456;
+            num_heads = 36;
+      }
+      else
+      {
+            fprintf(stderr, "Unsupported GPT-2 depth: %d\n", depth);
+            exit(EXIT_FAILURE);
+      }
+      config->num_layers = depth;
+      config->channels = channels;
+      config->num_heads = num_heads;
+      config->max_seq_len = 1024;
+}
+
+void gpt3_set_hyperparameters(GPT2Config *config, const char *channels_str)
+{
+
+      int channels = atoi(channels_str);
+      assert(channels > 0);
+      int depth, head_size;
+      if (channels == 384)
+      {
+            depth = 6;
+            head_size = 64;
+      }
+      else if (channels == 768)
+      {
+            depth = 12;
+            head_size = 64;
+      }
+      else if (channels == 1024)
+      {
+            depth = 24;
+            head_size = 64;
+      }
+      else if (channels == 1536)
+      {
+            depth = 24;
+            head_size = 96;
+      }
+      else if (channels == 2048)
+      {
+            depth = 24;
+            head_size = 128;
+      }
+      else if (channels == 2560)
+      {
+            depth = 32;
+            head_size = 80;
+      }
+      else if (channels == 4096)
+      {
+            depth = 32;
+            head_size = 128;
+      }
+      else if (channels == 5120) // 40 heads * 128; 5140 is not divisible by head_size and trips the assert below
+      {
+            depth = 40;
+            head_size = 128;
+      }
+      else if (channels == 12288)
+      {
+            depth = 96;
+            head_size = 128;
+      }
+      else
+      {
+            fprintf(stderr, "Unsupported GPT-3 channels: %d\n", channels);
+            exit(EXIT_FAILURE);
+      }
+      assert(channels % head_size == 0);
+      config->num_layers = depth;
+      config->channels = channels;
+      config->num_heads = channels / head_size;
+      config->max_seq_len = 2048;
+}
+
+void gpt_build_from_descriptor(GPT2 *model, const char *descriptor)
+{
+
+      assert(descriptor != NULL);
+      size_t len = strlen(descriptor);
+      if (len > 1 && descriptor[0] == 'd')
+      {
+            gpt2_set_hyperparameters(&model->config, descriptor + 1);
+      }
+      else if (len > 6 && strncmp(descriptor, "gpt2:d", 6) == 0)
+      {
+            gpt2_set_hyperparameters(&model->config, descriptor + 6);
+      }
+      else if (len > 6 && strncmp(descriptor, "gpt3:c", 6) == 0)
+      {
+            gpt3_set_hyperparameters(&model->config, descriptor + 6);
+      }
+      else
+      {
+            fprintf(stderr, "Unsupported model descriptor: %s\n", descriptor);
+            exit(EXIT_FAILURE);
+      }
+
+      model->config.vocab_size = 50257;
+      model->config.padded_vocab_size = 50304;
+
+      gpt2_allocate_weights(model);
+
+      mt19937_state init_rng;
+      manual_seed(&init_rng, 42);
+      floatX *params_memory_cpu = (floatX *)mallocCheck(model->num_parameters_bytes);
+      memset(params_memory_cpu, 0, model->num_parameters_bytes);
+
+      float residual_scale = 1.0f / sqrtf(2.0f * model->config.num_layers);
+
+      size_t L = model->config.num_layers;
+      size_t offset = 0;
+      for (int l = 0; l < L; l++)
+      {
+            offset = 0;
+            for (int i = 0; i < NUM_PARAMETER_TENSORS; i++)
+            {
+
+                  if (l == 0 && (i == 2 || i == 8 || i == 14)) // layernorm weights (ln1w, ln2w, lnfw) start at 1.0
+                  {
+                        for (size_t j = 0; j < model->param_elements[i]; j++)
+                        {
+                              params_memory_cpu[offset + j] = 1.0f;
+                        }
+                  }
+
+                  if ((l == 0 && (i == 0 || i == 1)) || i == 4 || i == 6 || i == 10 || i == 12) // wte, wpe, qkvw, attprojw, fcw, fcprojw
+                  {
+                        size_t n = model->param_elements[i];
+                        size_t layer_offset = 0;
+                        if (i == 0)
+                        {
+
+                              n = model->config.vocab_size * model->config.channels;
+                        }
+                        if (i == 4 || i == 6 || i == 10 || i == 12)
+                        {
+
+                              assert(n % L == 0);
+                              n = n / L;
+                              layer_offset = l * n;
+                        }
+
+                        float scale = (i == 6 || i == 12) ? 0.02f * residual_scale : 0.02f;
+
+                        float *fp32_buffer = (float *)mallocCheck(n * sizeof(float));
+                        normal_(fp32_buffer, n, 0.0f, scale, &init_rng);
+                        for (size_t j = 0; j < n; j++)
+                        {
+                              params_memory_cpu[offset + layer_offset + j] = (floatX)fp32_buffer[j];
+                        }
+                        free(fp32_buffer);
+                  }
+                  offset += model->param_elements[i];
+            }
+      }
+
+      cudaCheck(cudaMemcpy(model->params_memory, params_memory_cpu, model->num_parameters_bytes, cudaMemcpyHostToDevice));
+      free(params_memory_cpu);
+}
+
+void gpt2_forward(GPT2 *model, const int *inputs, size_t B, size_t T)
+{
+      NVTX_RANGE_FN();
+
+      if (model->params_memory == NULL)
+      {
+            fprintf(stderr, "Error: model was not initialized properly.\n");
+            exit(EXIT_FAILURE);
+      }
+
+      const size_t V = model->config.vocab_size;
+      const size_t Vp = model->config.padded_vocab_size;
+      const size_t L = model->config.num_layers;
+      const size_t NH = model->config.num_heads;
+      const size_t C = model->config.channels;
+
+      if (B > model->batch_size || T > model->seq_len)
+      {
+            printf("Model: B=%d T=%d, Desired: B=%d T=%d\n", model->batch_size, model->seq_len, (int)B, (int)T);
+            exit(EXIT_FAILURE);
+      }
+
+      cudaCheck(cudaMemcpy(model->inputs, inputs, B * T * sizeof(int), cudaMemcpyHostToDevice));
+
+      tokenCheck(inputs, B * T, V);
+
+      ParameterTensors params = model->params;
+      ActivationTensors acts = model->acts;
+      encoder_forward(acts.encoded, model->inputs, params.wte, params.wpe, B, T, C, main_stream);
+
+      layernorm_forward((model->recompute < 2) ? acts.ln1 : acts.lnf, acts.ln1_mean, acts.ln1_rstd, acts.encoded, params.ln1w, params.ln1b, B, T, C, main_stream);
+
+      for (int l = 0; l < L; l++)
+      {
+            NvtxRange layer_range("Layer", l);
+
+            floatX *residual = l == 0 ? acts.encoded : acts.residual3 + (l - 1) * B * T * C;
+
+            floatX *l_qkvw = params.qkvw + l * 3 * C * C;
+            floatX *l_qkvb = params.qkvb + l * 3 * C;
+            floatX *l_attprojw = params.attprojw + l * C * C;
+            floatX *l_attprojb = params.attprojb + l * C;
+            floatX *l_ln2w = params.ln2w + l * C;
+            floatX *l_ln2b = params.ln2b + l * C;
+            floatX *l_fcw = params.fcw + l * 4 * C * C;
+            floatX *l_fcb = params.fcb + l * 4 * C;
+            floatX *l_fcprojw = params.fcprojw + l * C * 4 * C;
+            floatX *l_fcprojb = params.fcprojb + l * C;
+
+            floatX *l_ln1 = (model->recompute < 2) ? acts.ln1 + l * B * T * C : acts.lnf;
+            floatX *l_qkvr = acts.qkvr + l * B * T * 3 * C;
+            floatX *l_atty = acts.atty + l * B * T * C;
+            floatX *l_residual2 = acts.residual2 + l * B * T * C;
+            floatX *l_ln2 = (model->recompute < 2) ? acts.ln2 + l * B * T * C : acts.lnf;
+            float *l_ln2_mean = acts.ln2_mean + l * B * T;
+            float *l_ln2_rstd = acts.ln2_rstd + l * B * T;
+            floatX *l_fch = acts.fch + l * B * T * 4 * C;
+
+            floatX *l_fch_gelu = (model->recompute < 1) ? acts.fch_gelu + l * B * T * 4 * C : acts.fch_gelu;
+            floatX *l_residual3 = acts.residual3 + l * B * T * C;
+            floatX *scratch = (floatX *)acts.output;
+
+#ifdef ENABLE_CUDNN
+            float *l_att = (float *)acts.att + l * B * NH * T;
+            matmul_forward_cublaslt(l_qkvr, l_ln1, l_qkvw, l_qkvb, B, T, C, 3 * C, main_stream);
+            attention_forward_cudnn(l_atty, (float *)l_att, l_qkvr, B, T, NH, C, main_stream);
+#else
+            floatX *l_att = acts.att + l * B * NH * T * T;
+            if (T != model->seq_len)
+            {
+                  cudaCheck(cudaMemset(l_att, 0, B * NH * T * T * sizeof(floatX)));
+            }
+
+            matmul_forward_cublaslt(scratch, l_ln1, l_qkvw, l_qkvb, B, T, C, 3 * C, main_stream);
+            attention_forward(l_atty, l_qkvr, l_att, scratch, B, T, C, NH, main_stream);
+#endif
+
+            matmul_forward_cublaslt(scratch, l_atty, l_attprojw, l_attprojb, B, T, C, C, main_stream);
+            fused_residual_forward5(l_residual2, l_ln2, l_ln2_mean, l_ln2_rstd, residual, scratch, l_ln2w, l_ln2b, B * T, C, main_stream);
+            matmul_forward_cublaslt(l_fch_gelu, l_ln2, l_fcw, l_fcb, B, T, C, 4 * C, main_stream, l_fch, model->gelu_fusion);
+            matmul_forward_cublaslt(scratch, l_fch_gelu, l_fcprojw, l_fcprojb, B, T, 4 * C, C, main_stream);
+
+            if (l + 1 != L)
+            {
+                  floatX *l_ln1 = (model->recompute < 2) ? acts.ln1 + (l + 1) * B * T * C : acts.lnf;
+                  float *l_ln1_mean = acts.ln1_mean + (l + 1) * B * T;
+                  float *l_ln1_rstd = acts.ln1_rstd + (l + 1) * B * T;
+                  const floatX *l_ln1w = params.ln1w + (l + 1) * C;
+                  const floatX *l_ln1b = params.ln1b + (l + 1) * C;
+                  fused_residual_forward5(l_residual3, l_ln1, l_ln1_mean, l_ln1_rstd, l_residual2, scratch, l_ln1w, l_ln1b,
+                                          B * T, C, main_stream);
+            }
+            else
+            {
+                  fused_residual_forward5(l_residual3, acts.lnf, acts.lnf_mean, acts.lnf_rstd, l_residual2, scratch,
+                                          params.lnfw, params.lnfb,
+                                          B * T, C, main_stream);
+            }
+      }
+
+      matmul_forward_cublaslt(acts.output, acts.lnf, params.wte, NULL, B, T, C, Vp, main_stream);
+      cudaCheck(cudaDeviceSynchronize());
+}
+
+float gpt2_validate(GPT2 *model, const int *inputs, const int *targets, size_t B, size_t T)
+{
+      assert(targets != NULL);
+
+      gpt2_forward(model, inputs, B, T);
+
+      const size_t V = model->config.vocab_size;
+      const size_t Vp = model->config.padded_vocab_size;
+
+      NvtxRange classifier_and_loss_range("classifier_and_loss");
+      ActivationTensors acts = model->acts;
+      float mean_loss = 0.0f;
+
+      const float dloss = 1.0f / (B * T);
+
+      cudaCheck(cudaMemset(acts.losses, 0, B * T * sizeof(float)));
+      cudaCheck(cudaMemcpy(model->targets, targets, B * T * sizeof(int), cudaMemcpyHostToDevice));
+      tokenCheck(targets, B * T, V);
+      fused_classifier(acts.output, acts.losses, dloss, model->targets, B, T, V, Vp, False, main_stream);
+      cudaCheck(cudaMemcpy(model->cpu_losses, acts.losses, B * T * sizeof(float), cudaMemcpyDeviceToHost));
+      for (int i = 0; i < B * T; i++)
+      {
+            mean_loss += model->cpu_losses[i];
+      }
+      mean_loss /= B * T;
+      cudaCheck(cudaDeviceSynchronize());
+      return mean_loss;
+}
+
+void gpt2_backward_and_reduce(GPT2 *model, int *inputs, const int *targets, int grad_accum_steps, int micro_step)
+{
+      if (model->grads_memory == nullptr)
+      {
+            fprintf(stderr, "Need to allocate gradients before backward\n");
+            exit(EXIT_FAILURE);
+      }
+      NVTX_RANGE_FN();
+      bool last_step = micro_step == grad_accum_steps - 1;
+
+      if (micro_step == 0)
+      {
+
+            cudaCheck(cudaMemsetAsync(model->acts.losses, 0, model->batch_size * model->seq_len * sizeof(float), main_stream));
+            cudaCheck(cudaMemsetAsync(model->grads_memory, 0, model->num_parameters * sizeof(floatX), main_stream));
+      }
+
+      const size_t B = model->batch_size;
+      const size_t T = model->seq_len;
+      const size_t V = model->config.vocab_size;
+      const size_t Vp = model->config.padded_vocab_size;
+      const size_t L = model->config.num_layers;
+      const size_t NH = model->config.num_heads;
+      const size_t C = model->config.channels;
+
+      ParameterTensors params = model->params;
+      ParameterTensors grads = model->grads;
+      ActivationTensors acts = model->acts;
+
+      NvtxRange classifier_and_loss_range("classifier_and_loss");
+      const float dloss = 1.0f / (float)(B * T * grad_accum_steps);
+      cudaCheck(cudaMemcpy(model->targets, targets, B * T * sizeof(int), cudaMemcpyHostToDevice));
+      tokenCheck(targets, B * T, V);
+      fused_classifier(acts.output, acts.losses, dloss, model->targets, B, T, V, Vp, True, main_stream);
+
+      floatX *dresidual = (floatX *)model->acts.scratch_btc;
+      cudaCheck(cudaMemset(dresidual, 0, B * T * C * sizeof(floatX)));
+
+      float *scratchF = (float *)acts.output;
+      floatX *scratchX = (floatX *)acts.output;
+
+      matmul_backward(model->acts.scratch_bt4c, grads.wte, NULL, acts.output, acts.lnf, params.wte, NULL, B, T, C, Vp, main_stream);
+
+      floatX *residual = acts.residual3 + (L - 1) * B * T * C;
+      layernorm_backward(dresidual, grads.lnfw, grads.lnfb, scratchF, model->acts.scratch_bt4c, residual, params.lnfw, acts.lnf_mean, acts.lnf_rstd, B, T, C, main_stream);
+
+      floatX *dl_btc = residual;
+
+      for (int l = L - 1; l >= 0; l--)
+      {
+            NvtxRange layer_range("Layer", l);
+
+            residual = l == 0 ? acts.encoded : acts.residual3 + (l - 1) * B * T * C;
+
+            floatX *l_ln1w = params.ln1w + l * C;
+            floatX *l_ln1b = params.ln1b + l * C;
+            floatX *l_qkvw = params.qkvw + l * 3 * C * C;
+            floatX *l_attprojw = params.attprojw + l * C * C;
+            floatX *l_ln2w = params.ln2w + l * C;
+            floatX *l_ln2b = params.ln2b + l * C;
+            floatX *l_fcw = params.fcw + l * 4 * C * C;
+            floatX *l_fcprojw = params.fcprojw + l * C * 4 * C;
+
+            floatX *dl_ln1w = grads.ln1w + l * C;
+            floatX *dl_ln1b = grads.ln1b + l * C;
+            floatX *dl_qkvw = grads.qkvw + l * 3 * C * C;
+            floatX *dl_qkvb = grads.qkvb + l * 3 * C;
+            floatX *dl_attprojw = grads.attprojw + l * C * C;
+            floatX *dl_attprojb = grads.attprojb + l * C;
+            floatX *dl_ln2w = grads.ln2w + l * C;
+            floatX *dl_ln2b = grads.ln2b + l * C;
+            floatX *dl_fcw = grads.fcw + l * 4 * C * C;
+            floatX *dl_fcb = grads.fcb + l * 4 * C;
+            floatX *dl_fcprojw = grads.fcprojw + l * C * 4 * C;
+            floatX *dl_fcprojb = grads.fcprojb + l * C;
+
+            floatX *l_ln1 = (model->recompute < 2) ? acts.ln1 + l * B * T * C : acts.lnf;
+            float *l_ln1_mean = acts.ln1_mean + l * B * T;
+            float *l_ln1_rstd = acts.ln1_rstd + l * B * T;
+            floatX *l_qkvr = acts.qkvr + l * B * T * 3 * C;
+            floatX *l_atty = acts.atty + l * B * T * C;
+            floatX *l_residual2 = acts.residual2 + l * B * T * C;
+            floatX *l_ln2 = (model->recompute < 2) ? acts.ln2 + l * B * T * C : acts.lnf;
+            float *l_ln2_mean = acts.ln2_mean + l * B * T;
+            float *l_ln2_rstd = acts.ln2_rstd + l * B * T;
+            floatX *l_fch_pre_gelu = acts.fch + l * B * T * 4 * C;
+            floatX *l_fch_gelu = (model->recompute < 1) ? acts.fch_gelu + l * B * T * 4 * C : acts.fch_gelu;
+
+            floatX *dl_bt4c = (floatX *)model->acts.scratch_bt4c;
+
+            if (model->recompute >= 1)
+            {
+
+                  gelu_forward(l_fch_gelu, l_fch_pre_gelu, B * T * 4 * C, main_stream);
+            }
+            matmul_backward(dl_bt4c, dl_fcprojw, dl_fcprojb, dresidual, l_fch_gelu, l_fcprojw, scratchF, B, T, 4 * C, C, main_stream, l_fch_pre_gelu, model->gelu_fusion);
+            if (model->recompute >= 2)
+            {
+
+                  layernorm_forward(l_ln2, l_ln2_mean, l_ln2_rstd, l_residual2, l_ln2w, l_ln2b, B, T, C, main_stream);
+            }
+            matmul_backward(dl_btc, dl_fcw, dl_fcb, dl_bt4c, l_ln2, l_fcw, scratchF, B, T, C, 4 * C, main_stream);
+
+            layernorm_backward(dresidual, dl_ln2w, dl_ln2b, scratchF, dl_btc, l_residual2, l_ln2w, l_ln2_mean, l_ln2_rstd, B, T, C, main_stream);
+            matmul_backward(dl_btc, dl_attprojw, dl_attprojb, dresidual, l_atty, l_attprojw, scratchF, B, T, C, C, main_stream);
+
+#ifdef ENABLE_CUDNN
+            float *l_att = (float *)acts.att + l * B * NH * T;
+            attention_backward_cudnn(dl_bt4c, dl_btc, l_qkvr, l_atty, (float *)l_att, B, T, NH, C, main_stream);
+#else
+            floatX *l_att = acts.att + l * B * NH * T * T;
+
+            floatX *buffer_a = l_atty;
+            floatX *buffer_b = l_fch_pre_gelu;
+            attention_backward(dl_bt4c, buffer_b, scratchX, buffer_a, dl_btc, l_qkvr, l_att, B, T, C, NH, main_stream);
+#endif
+            if (model->recompute >= 2)
+            {
+                  layernorm_forward(l_ln1, l_ln1_mean, l_ln1_rstd, residual, l_ln1w, l_ln1b, B, T, C, main_stream);
+            }
+
+            matmul_backward(dl_btc, dl_qkvw, dl_qkvb, dl_bt4c, l_ln1, l_qkvw, scratchF, B, T, C, 3 * C, main_stream);
+
+            layernorm_backward(dresidual, dl_ln1w, dl_ln1b, scratchF, dl_btc, residual, l_ln1w, l_ln1_mean, l_ln1_rstd, B, T, C, main_stream);
+
+            if (last_step)
+            {
+                  floatX *const pointers[] = {
+                      dl_ln1w, dl_ln1b,
+                      dl_qkvw, dl_qkvb,
+                      dl_attprojw, dl_attprojb,
+                      dl_ln2w, dl_ln2b,
+                      dl_fcw, dl_fcb,
+                      dl_fcprojw, dl_fcprojb};
+                  const size_t nelem[] = {
+                      C, C,
+                      3 * C * C, 3 * C,
+                      C * C, C,
+                      C, C,
+                      4 * C * C, 4 * C,
+                      C * 4 * C, C};
+                  multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream);
+            }
+      }
+      encoder_backward(grads.wte, grads.wpe, scratchX, model->workload_indices, model->bucket_info,
+                       dresidual, model->inputs, inputs, B, T, C, random_u32(&model->rng_state), main_stream);
+
+      if (last_step)
+      {
+
+            global_sum_deterministic(model->accumulated_mean_loss, acts.losses, B * T, main_stream);
+
+#if MULTI_GPU
+            ncclCheck(ncclAllReduce(model->accumulated_mean_loss, model->accumulated_mean_loss, 1, ncclFloat, ncclAvg, multi_gpu_config.nccl_comm, main_stream)); // count is in elements, not bytes
+#endif
+            cudaCheck(cudaMemcpyAsync(&model->mean_loss, model->accumulated_mean_loss, sizeof(float), cudaMemcpyDeviceToHost, main_stream));
+
+            floatX *const pointers[] = {grads.wte, grads.wpe, grads.lnfw, grads.lnfb};
+            const size_t nelem[] = {Vp * C, T * C, C, C};
+            multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream);
+      }
+
+      cudaCheck(cudaDeviceSynchronize());
+      if (last_step)
+      {
+            model->mean_loss /= B * T * grad_accum_steps;
+      }
+      else
+      {
+            model->mean_loss = -1.f;
+      }
+}
+
+ShardInfo gpt2_get_tensor_at_layer(const GPT2 *model, int layer_id, int param_tensor_id)
+{
+
+      ptrdiff_t offset = 0;
+      for (int i = 0; i < param_tensor_id; i++)
+      {
+            offset += (ptrdiff_t)model->param_elements[i];
+      }
+      size_t size = model->param_elements[param_tensor_id];
+
+      if (2 <= param_tensor_id && param_tensor_id <= 13)
+      {
+            size /= model->config.num_layers;
+            offset += (ptrdiff_t)(layer_id * size);
+      }
+      return {offset, size};
+}
+
+float gpt2_calculate_grad_norm(GPT2 *model, MultiGpuConfig *multi_gpu_config)
+{
+      NVTX_RANGE_FN();
+      floatX *grads_memory = (floatX *)model->grads_memory;
+
+      float *grad_norm_squared = (float *)model->acts.output;
+      float grad_norm_squared_cpu = 0.0f;
+
+      int num_slices[2] = {1, model->config.num_layers};
+      int max_num_block_sums = get_max_num_block_sums(num_slices, 2);
+      if (multi_gpu_config->zero_stage == 1)
+      {
+
+            for (int i = 0; i < NUM_PARAMETER_TENSORS; i++)
+            {
+                  ShardInfo tensor = gpt2_get_tensor_at_layer(model, 0, i);
+                  ShardInfo shard = multi_gpu_get_shard_offset(tensor.size, multi_gpu_config, 1);
+                  ptrdiff_t offset = tensor.offset + shard.offset;
+                  bool is_first_pass = (i == 0);
+                  if ((i < 2 || i > 13))
+                  {
+                        global_norm_squared(grad_norm_squared, grads_memory + offset, shard.size, 0, 1,
+                                            max_num_block_sums, is_first_pass, main_stream);
+                  }
+                  else
+                  {
+                        global_norm_squared(grad_norm_squared, grads_memory + offset, shard.size, tensor.size, model->config.num_layers,
+                                            max_num_block_sums, is_first_pass, main_stream);
+                  }
+            }
+            global_sum_deterministic(grad_norm_squared, grad_norm_squared, max_num_block_sums, main_stream);
+#if MULTI_GPU
+
+            ncclCheck(ncclAllReduce(grad_norm_squared, grad_norm_squared, 1, ncclFloat, ncclSum, multi_gpu_config->nccl_comm, main_stream)); // count is in elements, not bytes
+#endif
+      }
+      else
+      {
+
+            global_norm_squared(grad_norm_squared, grads_memory, model->num_parameters, 0, 1, max_num_block_sums, true, main_stream);
+            global_sum_deterministic(grad_norm_squared, grad_norm_squared, max_num_block_sums, main_stream);
+      }
+      cudaCheck(cudaMemcpy(&grad_norm_squared_cpu, grad_norm_squared, sizeof(float), cudaMemcpyDeviceToHost));
+      float grad_norm_cpu = sqrtf(grad_norm_squared_cpu);
+      return grad_norm_cpu;
+}
+
+void gpt2_update(GPT2 *model, float learning_rate, float beta1, float beta2, float eps, float weight_decay, float grad_scale, int t,
+                 MultiGpuConfig *multi_gpu_config, bool init_from_master_only = false)
+{
+
+      NVTX_RANGE_FN();
+      if (model->grads_memory == nullptr || model->m_memory == nullptr || model->v_memory == nullptr)
+      {
+            fprintf(stderr, "Need to allocate optimizer state before update\n");
+            exit(EXIT_FAILURE);
+      }
+
+      bool init_state = model->init_state;
+      if (init_state)
+      {
+            model->init_state = false;
+            NvtxRange rng("InitOpt");
+            cudaCheck(cudaMemset(model->m_memory, 0, multi_gpu_config->shard_num_parameters * sizeof(float)));
+            cudaCheck(cudaMemset(model->v_memory, 0, multi_gpu_config->shard_num_parameters * sizeof(float)));
+      }
+
+      model->rng_state_last_update = model->rng_state;
+
+      for (int i = 0; i < NUM_PARAMETER_TENSORS; i++)
+      {
+
+            unsigned int seed = random_u32(&model->rng_state);
+
+            int num_layers = model->config.num_layers;
+            if ((i < 2 || i > 13))
+            {
+                  num_layers = 1;
+            }
+
+            ShardInfo tensor = gpt2_get_tensor_at_layer(model, 0, i);
+            ShardInfo shard = multi_gpu_get_shard_offset(tensor.size, multi_gpu_config, 1);
+            ptrdiff_t local_offset_full = tensor.offset + shard.offset;
+            ptrdiff_t local_offset_partial = tensor.offset / multi_gpu_config->num_processes;
+
+            float wd = (i == 0 || i == 1 || i == 4 || i == 6 || i == 10 || i == 12) ? weight_decay : 0.0f;
+            floatX *param_ptr = (floatX *)model->params_memory + local_offset_full;
+            floatX *grad_ptr = (floatX *)model->grads_memory + local_offset_full;
+
+            ptrdiff_t opt_state_offset = multi_gpu_config->zero_stage < 1 ? local_offset_full : local_offset_partial;
+            float *m_ptr = model->m_memory + opt_state_offset;
+            float *v_ptr = model->v_memory + opt_state_offset;
+            float *master_ptr = nullptr;
+            if (model->master_weights != nullptr)
+            {
+                  master_ptr = model->master_weights + opt_state_offset;
+            }
+            if (init_state && model->master_weights != nullptr)
+            {
+                  size_t grid_size = CEIL_DIV(shard.size, 512);
+                  copy_and_cast_kernel<<<dim3(grid_size, num_layers), 512, 0, main_stream>>>(master_ptr, param_ptr, shard.size,
+                                                                                             shard.size, tensor.size);
+                  cudaCheck(cudaGetLastError());
+            }
+
+            if (init_from_master_only)
+            {
+
+                  init_from_master(param_ptr, master_ptr, shard.size, tensor.size, shard.size, num_layers, seed, main_stream);
+            }
+            else
+            {
+
+                  adamw_update(param_ptr, master_ptr, grad_ptr,
+                               m_ptr, v_ptr,
+                               shard.size, tensor.size, tensor.size, shard.size, num_layers,
+                               learning_rate,
+                               beta1, beta2, t, eps, wd, grad_scale, seed, main_stream);
+            }
+
+            if (multi_gpu_config->zero_stage == 1)
+            {
+#if MULTI_GPU
+                  ncclCheck(ncclGroupStart());
+                  for (int l = 0; l < num_layers; ++l)
+                  {
+
+                        ncclCheck(ncclAllGather(param_ptr + l * tensor.size,
+                                                (floatX *)model->params_memory + tensor.offset + l * tensor.size,
+                                                shard.size, ncclFloatX,
+                                                multi_gpu_config->nccl_comm, multi_gpu_config->nccl_stream));
+                  }
+                  ncclCheck(ncclGroupEnd());
+#endif
+            }
+      }
+
+      cudaCheck(cudaDeviceSynchronize());
+}
+
+float gpt2_estimate_mfu(GPT2 *model, int num_tokens, float dt)
+{
+
+      size_t N = model->num_parameters;
+      int L = model->config.num_layers;
+      int C = model->config.channels;
+      int T = model->seq_len;
+      size_t flops_per_token = 6 * N + (size_t)6 * L * C * T;
+      size_t flops_per_step = flops_per_token * num_tokens;
+
+      float flops_achieved = (float)flops_per_step * (1.0f / dt);
+      float flops_promised = get_flops_promised(deviceProp.name, PRECISION_MODE) * 1e12f;
+      if (flops_promised < 0)
+      {
+            return -1.f;
+      }
+      float mfu = flops_achieved / flops_promised;
+      return mfu;
+}
+
+void gpt2_free(GPT2 *model)
+{
+      cudaFreeCheck(&model->params_memory);
+      cudaFreeCheck(&model->grads_memory);
+      cudaFreeCheck(&model->m_memory);
+      cudaFreeCheck(&model->v_memory);
+      cudaFreeCheck(&model->master_weights);
+      cudaFreeCheck(&model->acts_memory);
+      cudaFreeCheck(&model->inputs);
+      cudaFreeCheck(&model->targets);
+      cudaFreeCheck(&model->accumulated_mean_loss);
+      cudaCheck(cudaFreeHost(model->cpu_losses));
+      free(model->workload_indices);
+      free(model->bucket_info);
+}
+
+void common_start(bool override_enable_tf32 = true, bool print_device_info = true)
+{
+
+      cudaCheck(cudaGetDeviceProperties(&deviceProp, multi_gpu_config.local_device_idx));
+      if (print_device_info)
+      {
+            printf("[System]\n");
+            printf("Device %d: %s\n", multi_gpu_config.local_device_idx, deviceProp.name);
+      }
+
+      cudaCheck(cudaStreamCreate(&main_stream));
+      nvtxNameCudaStreamA(main_stream, "main stream");
+
+      cublasCheck(cublasLtCreate(&cublaslt_handle));
+      cudaCheck(cudaMalloc(&cublaslt_workspace, cublaslt_workspace_size));
+
+      bool enable_tf32 = PRECISION_MODE == PRECISION_FP32 && deviceProp.major >= 8 && override_enable_tf32;
+      cublas_compute = enable_tf32 ? CUBLAS_COMPUTE_32F_FAST_TF32 : CUBLAS_COMPUTE_32F;
+
+#ifdef ENABLE_CUDNN
+      create_cudnn();
+#endif
+}
+
+void common_free(GPT2 &model)
+{
+      cudaCheck(cudaStreamDestroy(main_stream));
+      cudaCheck(cudaFree(cublaslt_workspace));
+      cublasCheck(cublasLtDestroy(cublaslt_handle));
+#ifdef ENABLE_CUDNN
+      destroy_cudnn();
+#endif
+}
+
+void save_state(const char *filename, int step, GPT2 *model, DataLoader *loader)
+{
+      printf("Writing state to %s\n", filename);
+      FILE *state_file = fopenCheck(filename, "wb");
+      int state_header[256];
+      memset(state_header, 0, sizeof(state_header));
+
+      state_header[0] = 20240527;
+      state_header[1] = 1;
+      state_header[2] = multi_gpu_config.num_processes;
+      state_header[3] = multi_gpu_config.process_rank;
+      state_header[4] = model->use_master_weights;
+      state_header[5] = loader->should_shuffle;
+
+      state_header[10] = step;
+
+      *((unsigned long long *)&state_header[20]) = model->rng_state;
+      *((unsigned long long *)&state_header[22]) = model->rng_state_last_update;
+
+      *((size_t *)&state_header[30]) = loader->current_shard_idx;
+      *((size_t *)&state_header[32]) = loader->current_sample_idx;
+      fwriteCheck(state_header, sizeof(int), 256, state_file);
+
+      size_t shard_num_parameters = multi_gpu_config.shard_num_parameters;
+      device_to_file(state_file, model->m_memory, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream);
+      device_to_file(state_file, model->v_memory, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream);
+      if (model->use_master_weights)
+      {
+            device_to_file(state_file, model->master_weights, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream);
+      }
+
+      if (loader->should_shuffle)
+      {
+            fwriteCheck(&loader->glob_result.gl_pathc, sizeof(size_t), 1, state_file);
+            fwriteCheck(loader->shard_indices, sizeof(int), loader->glob_result.gl_pathc, state_file);
+            fwriteCheck(&loader->shard_num_samples, sizeof(size_t), 1, state_file);
+            fwriteCheck(loader->intra_shard_indices, sizeof(int), loader->shard_num_samples, state_file);
+            fwriteCheck(&loader->shuffle_rng, sizeof(mt19937_state), 1, state_file);
+      }
+      fcloseCheck(state_file);
+}
+
+void load_state(int *step, GPT2 *model, DataLoader *loader, const char *filename)
+{
+      FILE *state_file = fopenCheck(filename, "rb");
+      int state_header[256];
+      freadCheck(state_header, sizeof(int), 256, state_file);
+      assert(state_header[0] == 20240527);
+      assert(state_header[1] == 1);
+      assert(state_header[2] == multi_gpu_config.num_processes);
+      assert(state_header[3] == multi_gpu_config.process_rank);
+      int use_master_weights = state_header[4];
+      int should_shuffle = state_header[5];
+      *step = state_header[10];
+      model->rng_state = *((unsigned long long *)&state_header[20]);
+      model->rng_state_last_update = *((unsigned long long *)&state_header[22]);
+      size_t current_shard_idx = *((size_t *)&state_header[30]);
+      size_t current_sample_idx = *((size_t *)&state_header[32]);
+
+      size_t shard_num_parameters = multi_gpu_config.shard_num_parameters;
+      if (use_master_weights == 1 && !model->use_master_weights)
+      {
+            printf0("Warning: Master weights are present in state, but not enabled for current run.\n");
+      }
+      else if (use_master_weights == 0 && model->use_master_weights)
+      {
+            printf0("Error: Master weights requested, but not present in state file.\n");
+            exit(EXIT_FAILURE);
+      }
+
+      model->init_state = false;
+      assert(model->m_memory != nullptr);
+      assert(model->v_memory != nullptr);
+      file_to_device(model->m_memory, state_file, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream);
+      file_to_device(model->v_memory, state_file, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream);
+      if (model->use_master_weights)
+      {
+            assert(model->master_weights != nullptr);
+            file_to_device(model->master_weights, state_file, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream);
+
+            model->rng_state = model->rng_state_last_update;
+            gpt2_update(model, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0, &multi_gpu_config, true);
+            model->rng_state = *((unsigned long long *)&state_header[20]);
+      }
+
+      loader->should_shuffle = should_shuffle;
+      if (should_shuffle == 1)
+      {
+
+            size_t glob_result_gl_pathc;
+            freadCheck(&glob_result_gl_pathc, sizeof(size_t), 1, state_file);
+            assert(glob_result_gl_pathc == loader->glob_result.gl_pathc);
+
+            loader->shard_indices = (int *)mallocCheck(loader->glob_result.gl_pathc * sizeof(int));
+            freadCheck(loader->shard_indices, sizeof(int), loader->glob_result.gl_pathc, state_file);
+
+            size_t shard_num_samples;
+            freadCheck(&shard_num_samples, sizeof(size_t), 1, state_file);
+            assert(shard_num_samples == loader->shard_num_samples);
+
+            loader->intra_shard_indices = (int *)mallocCheck(loader->shard_num_samples * sizeof(int));
+            freadCheck(loader->intra_shard_indices, sizeof(int), loader->shard_num_samples, state_file);
+
+            freadCheck(&loader->shuffle_rng, sizeof(mt19937_state), 1, state_file);
+      }
+      dataloader_resume(loader, current_shard_idx, current_sample_idx);
+
+      fcloseCheck(state_file);
+}
+
+void write_checkpoint(const char *output_log_dir, int step, GPT2 *model, DataLoader *train_loader, MultiGpuConfig *multi_gpu_config)
+{
+
+      printf0("Writing checkpoint at step %d\n", step);
+      int rank = multi_gpu_config->process_rank;
+
+      if (rank == 0)
+      {
+            snprintf(filename_buffer, sizeof(filename_buffer), "%s/model_%08d.bin", output_log_dir, step);
+            gpt2_write_to_checkpoint(model, filename_buffer);
+      }
+
+      snprintf(filename_buffer, sizeof(filename_buffer), "%s/state_%08d_%05d.bin", output_log_dir, step, rank);
+      save_state(filename_buffer, step, model, train_loader);
+
+      multi_gpu_barrier(multi_gpu_config);
+      if (rank == 0)
+      {
+            snprintf(filename_buffer, sizeof(filename_buffer), "%s/DONE_%08d", output_log_dir, step);
+            FILE *done_file = fopenCheck(filename_buffer, "w");
+            fcloseCheck(done_file);
+      }
+}
+
+void delete_checkpoint(const char *output_log_dir, int step, MultiGpuConfig *multi_gpu_config)
+{
+
+      printf0("Deleting checkpoint at step %d\n", step);
+      int rank = multi_gpu_config->process_rank;
+      if (rank == 0)
+      {
+            snprintf(filename_buffer, sizeof(filename_buffer), "%s/model_%08d.bin", output_log_dir, step);
+            remove(filename_buffer);
+      }
+      snprintf(filename_buffer, sizeof(filename_buffer), "%s/state_%08d_%05d.bin", output_log_dir, step, rank);
+      remove(filename_buffer);
+      if (rank == 0)
+      {
+            snprintf(filename_buffer, sizeof(filename_buffer), "%s/DONE_%08d", output_log_dir, step);
+            remove(filename_buffer);
+      }
+}
+
+#ifndef TESTING
+
+void error_usage()
+{
+      fprintf(stderr, "Usage:   ./train_gpt2cu [options]\n");
+      fprintf(stderr, "Options:\n");
+
+      fprintf(stderr, "  -i <string> train data filename pattern (default = dev/data/tinyshakespeare/tiny_shakespeare_train.bin)\n");
+      fprintf(stderr, "  -j <string> val data filename pattern (default = dev/data/tinyshakespeare/tiny_shakespeare_val.bin)\n");
+      fprintf(stderr, "  -e <string> input .bin filename or descriptor, see code comments as docs. (default = gpt2_124M_bf16.bin)\n");
+      fprintf(stderr, "  -o <string> output log dir (default = NULL, no logging)\n");
+      fprintf(stderr, "  -lg <int>   log gpu info every x steps (default = -1; disabled)\n");
+      fprintf(stderr, "  -n <int>    write optimization checkpoints every how many steps? (default 0, don't)\n");
+      fprintf(stderr, "  -nk <int>   max number of checkpoints to keep in the directory, removing old ones (0 = disable, default)\n");
+      fprintf(stderr, "  -nm <int>   every how many step checkpoints are considered major? major checkpoints never get deleted.\n");
+      fprintf(stderr, "  -y <int>    resume optimization found inside output log dir? (0=restart/overwrite, 1=resume/append)\n");
+
+      fprintf(stderr, "  -b <int>    (per-GPU, micro) batch size B (default = 4)\n");
+      fprintf(stderr, "  -t <int>    sequence length T (default = 1024)\n");
+      fprintf(stderr, "  -d <int>    total desired batch size (default = B * T * num_processes, i.e. no grad accumulation)\n");
+
+      fprintf(stderr, "  -x <int>    max_steps of optimization to run (-1 (default) = disable, run 1 epoch)\n");
+
+      fprintf(stderr, "  -k <string> learning rate scheduler (default = cosine)\n");
+      fprintf(stderr, "  -l <float>  learning rate (default = 3e-4f)\n");
+      fprintf(stderr, "  -u <int>    learning rate warmup iterations (default = 0, no warmup)\n");
+      fprintf(stderr, "  -q <float>  learning rate decay: final fraction, at end of training (default = 1.0 (no decay))\n");
+      fprintf(stderr, "  -c <float>  weight decay (default = 0.0f)\n");
+      fprintf(stderr, "  -sl <float> outlier stability: skip update if loss goes above this in zscore (0.0f=off)\n");
+      fprintf(stderr, "  -sg <float> outlier stability: skip update if grad_norm goes above this in zscore (0.0f=off)\n");
+
+      fprintf(stderr, "  -v <int>    val_loss_every, how often we evaluate val loss (default = 20)\n");
+      fprintf(stderr, "  -m <int>    val_max_steps, up to how many val batches to estimate val loss? (default = 20)\n");
+      fprintf(stderr, "  -s <int>    sample_every, how often we inference the model (default = 20)\n");
+      fprintf(stderr, "  -g <int>    genT, how many steps of inference we do (default = 64)\n");
+      fprintf(stderr, "  -h <int>    hellaswag eval run? (default = 0)\n");
+
+      fprintf(stderr, "  -a <int>    overfit a single batch? 0/1. useful for debugging\n");
+
+      fprintf(stderr, "  -f <int>    enable_tf32 override (default: 1, set to 0 to disable tf32)\n");
+      fprintf(stderr, "  -w <int>    keep f32 copy of weights for the optimizer? (default: 1)\n");
+      fprintf(stderr, "  -ge <int>   gelu fusion: 0=none, 1=forward, 2=forward+backward (default: 0)\n");
+
+      fprintf(stderr, "  -z <int>    zero_stage, Zero Optimization Stage, 0,1,2,3 (default = 0)\n");
+      fprintf(stderr, "  -r <int>    recompute: less memory but less speed. (default = 1), 0|1|2 = none,gelu,gelu+ln\n");
+
+      fprintf(stderr, "  -pn <int>    num_processes (default = 1)\n");
+      fprintf(stderr, "  -pr <int>    process_rank (default = 0)\n");
+      fprintf(stderr, "  -pg <int>    gpus_per_node (default = 8)\n");
+      fprintf(stderr, "  -pi <string> nccl_init_method: tcp,fs,mpi (default = mpi)\n");
+      fprintf(stderr, "  -ps <string> server_ip - used only when nccl_init_method is tcp (default = empty)\n");
+      fprintf(stderr, "  -pf <string> fs_path - used only when nccl_init_method is fs (default = empty)\n");
+      exit(EXIT_FAILURE);
+}
+
+int main(int argc, char *argv[])
+{
+
+      const char *train_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_train.bin";
+      const char *val_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_val.bin";
+      const char *load_filename = "gpt2_124M_bf16.bin";
+      const char *lr_scheduler_type = "cosine";
+      const char *output_log_dir = NULL;
+      int checkpoint_every = 0;
+      int checkpoints_keep = 0;
+      int major_checkpoint_every = 0;
+      int resume = 0;
+      int B = 4;
+      int T = 1024;
+      int total_batch_size = -1;
+      float learning_rate = 3e-4f;
+      int log_gpu_every = -1;
+      int warmup_iterations = 0;
+      float final_learning_rate_frac = 1.0f;
+      float weight_decay = 0.0f;
+      float skip_update_lossz = 0.0f;
+      float skip_update_gradz = 0.0f;
+      int val_loss_every = 20;
+      int val_max_steps = 20;
+      int sample_every = 20;
+      int genT = 64;
+      int overfit_single_batch = 0;
+      int max_steps = -1;
+      int override_enable_tf32 = 1;
+      int use_master_weights = 1;
+      int gelu_fusion = -1;
+      int recompute = 1;
+      int zero_stage = 0;
+      int hellaswag_eval = 0;
+
+      int num_processes = 1;
+      int process_rank = 0;
+      int gpus_per_node = 8;
+      char nccl_init_method[256] = "mpi";
+      char server_ip[256] = "";
+      char fs_path[256] = "";
+      for (int i = 1; i < argc; i += 2)
+      {
+            if (i + 1 >= argc)
+            {
+                  error_usage();
+            }
+            if (argv[i][0] != '-')
+            {
+                  error_usage();
+            }
+            if (!(strlen(argv[i]) == 2 || strlen(argv[i]) == 3))
+            {
+                  error_usage();
+            }
+
+            if (argv[i][1] == 'i')
+            {
+                  train_data_pattern = argv[i + 1];
+            }
+            else if (argv[i][1] == 'j')
+            {
+                  val_data_pattern = argv[i + 1];
+            }
+            else if (argv[i][1] == 'e')
+            {
+                  load_filename = argv[i + 1];
+            }
+            else if (argv[i][1] == 'o')
+            {
+                  output_log_dir = argv[i + 1];
+            }
+            else if (argv[i][1] == 'n' && argv[i][2] == '\0')
+            {
+                  checkpoint_every = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'y')
+            {
+                  resume = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'b')
+            {
+                  B = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 't')
+            {
+                  T = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'd')
+            {
+                  total_batch_size = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'l' && argv[i][2] == '\0')
+            {
+                  learning_rate = atof(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'l' && argv[i][2] == 'g')
+            {
+                  log_gpu_every = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'u')
+            {
+                  warmup_iterations = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'q')
+            {
+                  final_learning_rate_frac = atof(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'c')
+            {
+                  weight_decay = atof(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'x')
+            {
+                  max_steps = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'v')
+            {
+                  val_loss_every = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'm')
+            {
+                  val_max_steps = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 's' && argv[i][2] == '\0')
+            {
+                  sample_every = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'g' && argv[i][2] == 'e')
+            {
+                  gelu_fusion = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'g')
+            {
+                  genT = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'a')
+            {
+                  overfit_single_batch = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'f')
+            {
+                  override_enable_tf32 = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'w')
+            {
+                  use_master_weights = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'z')
+            {
+                  zero_stage = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'r')
+            {
+                  recompute = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'h')
+            {
+                  hellaswag_eval = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'k')
+            {
+                  lr_scheduler_type = argv[i + 1];
+            }
+            else if (argv[i][1] == 'p' && argv[i][2] == 'i')
+            {
+                  snprintf(nccl_init_method, sizeof(nccl_init_method), "%s", argv[i + 1]); // bounded copy, truncates long input
+            }
+            else if (argv[i][1] == 'p' && argv[i][2] == 'f')
+            {
+                  snprintf(fs_path, sizeof(fs_path), "%s", argv[i + 1]); // bounded copy, truncates long input
+            }
+            else if (argv[i][1] == 'p' && argv[i][2] == 's')
+            {
+                  snprintf(server_ip, sizeof(server_ip), "%s", argv[i + 1]); // bounded copy, truncates long input
+            }
+            else if (argv[i][1] == 'p' && argv[i][2] == 'n')
+            {
+                  num_processes = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'p' && argv[i][2] == 'r')
+            {
+                  process_rank = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'p' && argv[i][2] == 'g')
+            {
+                  gpus_per_node = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 's' && argv[i][2] == 'l')
+            {
+                  skip_update_lossz = atof(argv[i + 1]);
+            }
+            else if (argv[i][1] == 's' && argv[i][2] == 'g')
+            {
+                  skip_update_gradz = atof(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'n' && argv[i][2] == 'k')
+            {
+                  checkpoints_keep = atoi(argv[i + 1]);
+            }
+            else if (argv[i][1] == 'n' && argv[i][2] == 'm')
+            {
+                  major_checkpoint_every = atoi(argv[i + 1]);
+            }
+            else
+            {
+                  error_usage();
+            }
+      }
+
+      multi_gpu_config = multi_gpu_config_init(num_processes, process_rank, gpus_per_node, server_ip, fs_path, nccl_init_method);
+      common_start(override_enable_tf32, false);
+
+      assert(warmup_iterations >= 0);
+      if (output_log_dir != NULL)
+      {
+            assert(strlen(output_log_dir) < 400);
+      }
+      int tokens_per_fwdbwd = B * T * multi_gpu_config.num_processes;
+
+      if (total_batch_size == -1)
+      {
+            total_batch_size = tokens_per_fwdbwd;
+      }
+
+      if (gelu_fusion == -1)
+      {
+            gelu_fusion = 0;
+      }
+
+      assert(total_batch_size % tokens_per_fwdbwd == 0);
+      int grad_accum_steps = total_batch_size / tokens_per_fwdbwd;
+
+      if (overfit_single_batch == 1)
+      {
+            train_data_pattern = val_data_pattern;
+      }
+      printf0("+-----------------------+----------------------------------------------------+\n");
+      printf0("| Parameter             | Value                                              |\n");
+      printf0("+-----------------------+----------------------------------------------------+\n");
+      printf0("| train data pattern    | %-50s |\n", train_data_pattern);
+      printf0("| val data pattern      | %-50s |\n", val_data_pattern);
+      printf0("| output log dir        | %-50s |\n", output_log_dir == NULL ? "NULL" : output_log_dir);
+      printf0("| checkpoint_every      | %-50d |\n", checkpoint_every);
+      printf0("| resume                | %-50d |\n", resume);
+      printf0("| micro batch size B    | %-50d |\n", B);
+      printf0("| sequence length T     | %-50d |\n", T);
+      printf0("| total batch size      | %-50d |\n", total_batch_size);
+      printf0("| LR scheduler          | %-50s |\n", lr_scheduler_type);
+      printf0("| learning rate (LR)    | %-50e |\n", learning_rate);
+      printf0("| warmup iterations     | %-50d |\n", warmup_iterations);
+      printf0("| final LR fraction     | %-50e |\n", final_learning_rate_frac);
+      printf0("| weight decay          | %-50e |\n", weight_decay);
+      printf0("| skip update lossz     | %-50f |\n", skip_update_lossz);
+      printf0("| skip update gradz     | %-50f |\n", skip_update_gradz);
+      printf0("| max_steps             | %-50d |\n", max_steps);
+      printf0("| val_loss_every        | %-50d |\n", val_loss_every);
+      printf0("| val_max_steps         | %-50d |\n", val_max_steps);
+      printf0("| sample_every          | %-50d |\n", sample_every);
+      printf0("| genT                  | %-50d |\n", genT);
+      printf0("| overfit_single_batch  | %-50d |\n", overfit_single_batch);
+      printf0("| use_master_weights    | %-50s |\n", use_master_weights ? "enabled" : "disabled");
+      printf0("| gelu_fusion           | %-50d |\n", gelu_fusion);
+      printf0("| recompute             | %-50d |\n", recompute);
+      printf0("+-----------------------+----------------------------------------------------+\n");
+      const char *precision_str = (PRECISION_MODE == PRECISION_FP32)
+                                      ? (cublas_compute == CUBLAS_COMPUTE_32F_FAST_TF32 ? "TF32" : "FP32")
+                                      : (PRECISION_MODE == PRECISION_FP16 ? "FP16" : "BF16");
+      printf0("| device                | %-50s |\n", deviceProp.name);
+      printf0("| peak TFlops           | %-50.1f |\n", get_flops_promised(deviceProp.name, PRECISION_MODE));
+      printf0("| precision             | %-50s |\n", precision_str);
+      printf0("+-----------------------+----------------------------------------------------+\n");
+
+      int resuming = 0;
+
+      int resume_max_step = find_max_step(output_log_dir);
+      if (resume == 1)
+      {
+            assert(output_log_dir != NULL);
+            if (resume_max_step != -1)
+            {
+                  resuming = 1;
+                  snprintf(filename_buffer, sizeof(filename_buffer), "%s/model_%08d.bin", output_log_dir, resume_max_step);
+            }
+      }
+
+      GPT2 model;
+      gpt2_init_common(&model);
+      if (resuming == 1)
+      {
+
+            bool weight_init = !use_master_weights;
+            gpt2_build_from_checkpoint(&model, filename_buffer, weight_init);
+      }
+      else if (ends_with_bin(load_filename))
+      {
+
+            gpt2_build_from_checkpoint(&model, load_filename);
+      }
+      else
+      {
+
+            gpt_build_from_descriptor(&model, load_filename);
+      }
+
+      model.use_master_weights = use_master_weights;
+      model.gelu_fusion = gelu_fusion;
+      model.recompute = recompute;
+      printf0("| weight init method    | %-50s |\n", resuming == 1 ? "intermediate checkpoint" : load_filename);
+      printf0("| max_sequence_length T | %-50d |\n", model.config.max_seq_len);
+      printf0("| vocab_size V          | %-50d |\n", model.config.vocab_size);
+      printf0("| padded_vocab_size Vp  | %-50d |\n", model.config.padded_vocab_size);
+      printf0("| num_layers L          | %-50d |\n", model.config.num_layers);
+      printf0("| num_heads NH          | %-50d |\n", model.config.num_heads);
+      printf0("| channels C            | %-50d |\n", model.config.channels);
+      printf0("| num_parameters        | %-50zu |\n", model.num_parameters);
+      printf0("+-----------------------+----------------------------------------------------+\n");
+
+      int permute_train_loader = (overfit_single_batch == 1) ? 0 : 1;
+      DataLoader train_loader, val_loader;
+      dataloader_init(&train_loader, train_data_pattern, B, T, multi_gpu_config.process_rank, multi_gpu_config.num_processes, permute_train_loader);
+      dataloader_init(&val_loader, val_data_pattern, B, T, multi_gpu_config.process_rank, multi_gpu_config.num_processes, 0);
+
+      int train_num_batches = max_steps;
+      if (train_num_batches == -1)
+      {
+
+            size_t ntok = train_loader.num_tokens;
+
+            train_num_batches = ntok / total_batch_size;
+      }
+
+      int val_num_batches = val_max_steps;
+      if (val_num_batches == -1)
+      {
+
+            size_t ntok = val_loader.num_tokens;
+
+            val_num_batches = ntok / tokens_per_fwdbwd;
+      }
+      printf0("| train_num_batches     | %-50d |\n", train_num_batches);
+      printf0("| val_num_batches       | %-50d |\n", val_num_batches);
+      printf0("+-----------------------+----------------------------------------------------+\n");
+
+      EvalLoader eval_loader;
+      const char *hellaswag_path = "dev/data/hellaswag/hellaswag_val.bin";
+      const bool hellaswag_available = access(hellaswag_path, F_OK) == 0;
+      const bool run_hellaswag = hellaswag_eval && hellaswag_available;
+      if (run_hellaswag)
+      {
+            evalloader_init(&eval_loader, hellaswag_path, B, T, multi_gpu_config.process_rank, multi_gpu_config.num_processes);
+      }
+      printf0("| run hellaswag         | %-50s |\n", run_hellaswag ? "yes" : "no");
+      printf0("+-----------------------+----------------------------------------------------+\n");
+
+      set_zero_configs(&multi_gpu_config, zero_stage, model.num_parameters);
+      printf0("| num_processes         | %-50d |\n", multi_gpu_config.num_processes);
+      printf0("| zero_stage            | %-50d |\n", multi_gpu_config.zero_stage);
+      printf0("+-----------------------+----------------------------------------------------+\n");
+
+      if (!hellaswag_available)
+      {
+            printf0("HellaSwag eval not found at %s, skipping its evaluation\n", hellaswag_path);
+            printf0("You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.\n");
+      }
+
+      printf0("num_parameters: %zu => bytes: %zu\n", model.num_parameters, model.num_parameters_bytes);
+      printf0("allocated %d MiB for model parameters\n", (int)round(model.num_parameters_bytes / (1024.0 * 1024.0)));
+
+      printf0("batch_size B=%d * seq_len T=%d * num_processes=%d and total_batch_size=%d\n",
+              B, T, multi_gpu_config.num_processes, total_batch_size);
+      printf0("=> setting grad_accum_steps=%d\n", grad_accum_steps);
+
+      if (multi_gpu_config.process_rank == 0)
+      {
+            create_dir_if_not_exists(output_log_dir);
+      }
+      Logger logger;
+      logger_init(&logger, output_log_dir, multi_gpu_config.process_rank, resume);
+
+      Tokenizer tokenizer;
+      tokenizer_init(&tokenizer, "gpt2_tokenizer.bin");
+
+      LearningRateScheduler lr_scheduler;
+      lr_scheduler_init(&lr_scheduler, lr_scheduler_type, learning_rate,
+                        warmup_iterations, train_num_batches, final_learning_rate_frac);
+
+      int *gen_tokens = (int *)mallocCheck(B * T * sizeof(int));
+      floatX *cpu_logits_raw = (floatX *)mallocCheck(model.config.vocab_size * sizeof(floatX));
+      float *cpu_logits = (float *)mallocCheck(model.config.vocab_size * sizeof(float));
+
+      int step = 0;
+      gpt2_allocate_state(&model, B, T);
+      if (resuming == 1)
+      {
+            snprintf(filename_buffer, sizeof(filename_buffer), "%s/state_%08d_%05d.bin", output_log_dir, resume_max_step, multi_gpu_config.process_rank);
+            load_state(&step, &model, &train_loader, filename_buffer);
+      }
+
+      OutlierDetector loss_outlier_detector, grad_norm_outlier_detector;
+      init_detector(&loss_outlier_detector);
+      init_detector(&grad_norm_outlier_detector);
+
+      if (T < model.config.max_seq_len)
+      {
+            printf0("!!!!!!!!\n");
+            printf0("WARNING:\n");
+            printf0("- The training sequence length is: T=%d (set with -t)\n", T);
+            printf0("- The model's max sequence length is: max_seq_len=%d\n", model.config.max_seq_len);
+            printf0("You are attempting to train with a sequence length shorter than the model's max.\n");
+            printf0("This will lead to unused parameters in the wpe position embedding weights.\n");
+            printf0("If you know what you're doing you can ignore this warning.\n");
+            printf0("If you're like ???, you are most likely misconfiguring your training run.\n");
+            printf0("---> HINT: If you're training GPT-2 use -t 1024. If GPT-3, use -t 2048.\n");
+            printf0("!!!!!!!!\n");
+      }
+
+      assert(T <= model.config.max_seq_len);
+
+      cudaEvent_t start, end;
+      cudaCheck(cudaEventCreate(&start));
+      cudaCheck(cudaEventCreate(&end));
+      cudaCheck(cudaProfilerStart());
+      double total_sum_iteration_time_s = 0.0;
+      float ema_tokens_per_second = 0.0f;
+      for (; step <= train_num_batches; step++)
+      {
+            NvtxRange step_range("Train step", step);
+
+            int last_step = step == train_num_batches;
+
+            if (step % val_loss_every == 0 || last_step)
+            {
+                  NvtxRange validation_range("validation");
+                  float val_loss = 0.0f;
+                  dataloader_reset(&val_loader);
+                  for (int i = 0; i < val_num_batches; i++)
+                  {
+                        dataloader_next_batch(&val_loader);
+                        val_loss += gpt2_validate(&model, val_loader.inputs, val_loader.targets, B, T);
+                  }
+                  val_loss /= val_num_batches;
+                  val_loss = multi_gpu_cpu_float_sum(val_loss, &multi_gpu_config) / multi_gpu_config.num_processes;
+                  printf0("val loss %f\n", val_loss);
+                  logger_log_val(&logger, step, val_loss);
+            }
+
+            if (run_hellaswag &&
+                ((step > 0 && step % val_loss_every == 0) || last_step))
+            {
+                  NvtxRange evaluation_range("evaluation");
+                  float eval_acc_norm = 0.0f;
+                  evalloader_reset(&eval_loader);
+                  for (int i = 0; i < eval_loader.num_batches; i++)
+                  {
+                        if (i % 10 == 0)
+                        {
+                              printf("evaluating HellaSwag: %d/%d\r", i, eval_loader.num_batches);
+                        }
+                        evalloader_next_batch(&eval_loader);
+                        gpt2_validate(&model, eval_loader.inputs, eval_loader.targets, B, T);
+                        int correct = evalloader_stat_losses(&eval_loader, model.cpu_losses);
+                        eval_acc_norm += (float)correct;
+                  }
+
+                  eval_acc_norm = multi_gpu_cpu_float_sum(eval_acc_norm, &multi_gpu_config);
+                  printf0("HellaSwag: %d/%d = %f\n", (int)eval_acc_norm, eval_loader.num_examples, eval_acc_norm / eval_loader.num_examples);
+                  logger_log_eval(&logger, step, eval_acc_norm / eval_loader.num_examples);
+            }
+
+            if (multi_gpu_config.process_rank == 0 && sample_every > 0 &&
+                ((step > 0 && (step % sample_every) == 0) || last_step))
+            {
+                  NvtxRange generation_range("generation");
+                  unsigned long long sample_rng_state = 1337;
+
+                  int eot_token = tokenizer.eot_token;
+                  for (int i = 0; i < B * T; ++i)
+                  {
+                        gen_tokens[i] = eot_token;
+                  }
+
+                  printf("generating:\n---\n");
+                  for (int t = 1; t < genT; t++)
+                  {
+                        NvtxRange generation_range("Generation step", t);
+
+                        gpt2_forward(&model, gen_tokens, 1, CEIL_DIV(t, min(T, 256)) * min(T, 256));
+
+                        floatX *logits = model.acts.output + (t - 1) * model.config.padded_vocab_size;
+
+                        cudaCheck(cudaMemcpy(cpu_logits_raw, logits, model.config.vocab_size * sizeof(floatX), cudaMemcpyDeviceToHost));
+
+                        for (int i = 0; i < model.config.vocab_size; i++)
+                        {
+                              cpu_logits[i] = (float)cpu_logits_raw[i];
+                        }
+
+                        float coin = random_f32(&sample_rng_state);
+                        int next_token = sample_softmax(cpu_logits, model.config.vocab_size, coin);
+                        gen_tokens[t] = next_token;
+
+                        if (tokenizer.init_ok)
+                        {
+                              const char *token_str = tokenizer_decode(&tokenizer, next_token);
+                              safe_printf(token_str);
+                        }
+                        else
+                        {
+
+                              printf("%d ", next_token);
+                        }
+                        fflush(stdout);
+                  }
+                  printf("\n---\n");
+            }
+
+            if ((checkpoint_every > 0 && output_log_dir != NULL && resuming == 0) &&
+                ((step > 0 && step % checkpoint_every == 0) || last_step))
+            {
+
+                  write_checkpoint(output_log_dir, step, &model, &train_loader, &multi_gpu_config);
+
+                  int step_delete = step - checkpoints_keep * checkpoint_every;
+                  if (checkpoints_keep > 0 && step_delete > 0 &&
+                      (major_checkpoint_every == 0 || step_delete % major_checkpoint_every != 0))
+                  {
+                        delete_checkpoint(output_log_dir, step_delete, &multi_gpu_config);
+                  }
+            }
+            resuming = 0;
+
+            if (last_step)
+            {
+                  break;
+            }
+
+            if (overfit_single_batch == 1)
+            {
+
+                  dataloader_reset(&train_loader);
+            }
+
+            cudaCheck(cudaEventRecord(start));
+
+            for (int micro_step = 0; micro_step < grad_accum_steps; micro_step++)
+            {
+
+                  dataloader_next_batch(&train_loader);
+
+                  gpt2_forward(&model, train_loader.inputs, B, T);
+
+                  gpt2_backward_and_reduce(&model, train_loader.inputs, train_loader.targets, grad_accum_steps, micro_step);
+            }
+            float zloss = (float)(update_detector(&loss_outlier_detector, (double)model.mean_loss));
+
+            float step_learning_rate = get_learning_rate(&lr_scheduler, step);
+
+            float grad_norm = gpt2_calculate_grad_norm(&model, &multi_gpu_config);
+            float zgrad = (float)(update_detector(&grad_norm_outlier_detector, (double)grad_norm));
+
+            if (isfinite(zloss) && skip_update_lossz != 0.0f && zloss > skip_update_lossz)
+            {
+                  printf0("skipping update due to loss z-score of %f\n", zloss);
+            }
+            else if (isfinite(zgrad) && skip_update_gradz != 0.0f && zgrad > skip_update_gradz)
+            {
+                  printf0("skipping update due to grad z-score of %f\n", zgrad);
+            }
+            else
+            {
+
+                  float grad_clip = 1.0f;
+                  float grad_scale = (grad_norm > grad_clip) ? grad_clip / grad_norm : 1.0f;
+                  gpt2_update(&model, step_learning_rate, 0.9f, 0.95f, 1e-8f, weight_decay, grad_scale, step + 1, &multi_gpu_config);
+            }
+            cudaCheck(cudaEventRecord(end));
+            cudaCheck(cudaEventSynchronize(end));
+
+            float time_elapsed_ms;
+            cudaCheck(cudaEventElapsedTime(&time_elapsed_ms, start, end));
+            size_t tokens_processed = (size_t)multi_gpu_config.num_processes * B * T * grad_accum_steps;
+            float tokens_per_second = tokens_processed / time_elapsed_ms * 1000.0f;
+            float bias_corrected_ema_tokens_per_second = tokens_per_second;
+            if (step > 0)
+            {
+                  total_sum_iteration_time_s += time_elapsed_ms / 1000.0f;
+
+                  ema_tokens_per_second = 0.95f * ema_tokens_per_second + 0.05f * tokens_per_second;
+                  bias_corrected_ema_tokens_per_second = ema_tokens_per_second / (1.0f - powf(0.95f, step));
+            }
+            float mfu = gpt2_estimate_mfu(&model, B * T * grad_accum_steps, time_elapsed_ms / 1000.0f);
+            printf0("step %4d/%d | loss %7.6f (%+.2fz)| norm %6.4f (%+.2fz)| lr %.2e | %.2f ms | %.1f%% bf16 MFU | %.0f tok/s\n",
+                    step + 1, train_num_batches, model.mean_loss, zloss, grad_norm, zgrad, step_learning_rate,
+                    time_elapsed_ms, 100 * mfu, bias_corrected_ema_tokens_per_second);
+            if (log_gpu_every > 0 && (step + 1) % log_gpu_every == 0)
+            {
+                  GPUUtilInfo gpu_info = get_gpu_utilization_info();
+                  printf0("                  compute %2.1f%% | memory: %2.1f%% | fan: %2d%% | %4d MHz / %4d MHz | %3d W / %3d W | %d°C / %d°C | %s\n",
+                          gpu_info.gpu_utilization, gpu_info.mem_utilization, gpu_info.fan, gpu_info.clock, gpu_info.max_clock, gpu_info.power / 1000, gpu_info.power_limit / 1000,
+                          gpu_info.temperature, gpu_info.temp_slowdown, gpu_info.throttle_reason);
+            }
+            logger_log_train(&logger, step, model.mean_loss, step_learning_rate, grad_norm);
+
+            if (step == 3)
+            {
+                  cudaProfilerStop();
+            }
+      }
+
+      printf0("total average iteration time: %f ms\n", total_sum_iteration_time_s / (train_num_batches - 1) * 1000);
+
+      cudaCheck(cudaEventDestroy(end));
+      cudaCheck(cudaEventDestroy(start));
+      if (run_hellaswag)
+      {
+            evalloader_free(&eval_loader);
+      }
+      dataloader_free(&train_loader);
+      dataloader_free(&val_loader);
+      tokenizer_free(&tokenizer);
+      free(cpu_logits_raw);
+      free(cpu_logits);
+      free(gen_tokens);
+      multi_gpu_config_free(&multi_gpu_config);
+      gpt2_free(&model);
+      common_free(model);
+      return 0;
+}
+#endif
\ No newline at end of file
diff --git a/LICENSE b/LICENSE
index 804d8ed..c3222c6 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2026 Eamon
+Copyright(c) 2026 Eamon
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..ccb6702
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,104 @@
+# =============================================================================
+# Quadtrix.cpp — Makefile  (llama.cpp-style convenience targets)
+# =============================================================================
+
+.PHONY: all release debug benchmark-bin clean-native build run dev gpu train-cpp train-torch bench logs ps shell clean format lint-py help
+
+SHELL := /bin/bash
+SCRIPT := ./scripts/build.sh
+
+# ── Native C++ ───────────────────────────────────────────────────────────────
+CC     := g++
+CFLAGS := -std=c++17 -O3 -march=native
+IFLAGS := -I. -Iinclude
+TARGET := quadtrix
+SRCS   := main.cpp
+
+all: $(TARGET)
+
+$(TARGET): $(SRCS)
+	$(CC) $(CFLAGS) $(IFLAGS) -o $@ $^
+	@echo "✓ Built $(TARGET)"
+
+# Optimised release (adds -DNDEBUG, then strips the binary)
+release: $(SRCS)
+	$(CC) $(CFLAGS) $(IFLAGS) -DNDEBUG -o $(TARGET) $^
+	strip $(TARGET)
+
+# Debug build
+debug: $(SRCS)
+	$(CC) -std=c++17 -O0 -g -fsanitize=address,undefined \
+	      $(IFLAGS) -o $(TARGET)-debug $^
+
+benchmark-bin: benchmark.cpp
+	$(CC) $(CFLAGS) $(IFLAGS) -o quadtrix-bench $^
+
+clean-native:
+	rm -f $(TARGET) $(TARGET)-debug quadtrix-bench
+
+# ── Docker / Compose targets ─────────────────────────────────────────────────
+build:
+	$(SCRIPT) up
+
+run: build
+	@echo "Stack already started."
+
+dev:
+	$(SCRIPT) dev
+
+gpu:
+	$(SCRIPT) gpu
+
+train-cpp:
+	$(SCRIPT) train-cpp
+
+train-torch:
+	$(SCRIPT) train-torch
+
+bench:
+	$(SCRIPT) bench
+
+logs:
+	$(SCRIPT) logs
+
+ps:
+	$(SCRIPT) ps
+
+shell:
+	$(SCRIPT) shell $(SERVICE)
+
+clean:
+	$(SCRIPT) clean
+
+# ── Misc ─────────────────────────────────────────────────────────────────────
+format:
+	find . \( -name "*.cpp" -o -name "*.h" \) \
+	  ! -path "./build/*" \
+	  | xargs clang-format -i --style=LLVM
+
+lint-py:
+	ruff check backend/ engine/
+
+help:
+	@echo ""
+	@echo "  Quadtrix.cpp — make targets"
+	@echo ""
+	@echo "  Native:"
+	@echo "    make              Build C++ binary (native)"
+	@echo "    make release      Stripped release binary"
+	@echo "    make debug        Debug binary with ASan/UBSan"
+	@echo "    make clean-native Remove native build artifacts"
+	@echo "    make format       Run clang-format on all C++ files"
+	@echo ""
+	@echo "  Docker:"
+	@echo "    make build        docker compose up --build (CPU)"
+	@echo "    make dev          Hot-reload dev stack"
+	@echo "    make gpu          CUDA GPU stack"
+	@echo "    make train-cpp    Train with C++ inside Docker"
+	@echo "    make train-torch  Train with PyTorch inside Docker"
+	@echo "    make bench        Run benchmark"
+	@echo "    make logs         Tail all logs"
+	@echo "    make ps           Show container status"
+	@echo "    make shell        Shell into backend (SERVICE=frontend to change)"
+	@echo "    make clean        Remove containers + volumes"
+	@echo ""
diff --git a/README.md b/README.md
index 0feeebe..56f99cc 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,9 @@
 # Quadtrix.cpp
 
+<p align="center">
+  <img width="785" height="261" alt="image" src="https://github.com/user-attachments/assets/7bd2d8c6-d1e3-4ca0-96c0-0161d3cf235a" />
+</p>
+
 A local large language model with a modular, multi-path execution architecture. Train, run inference, and serve a chat interface — all from a single repository, across bare-metal C++, PyTorch, and a React frontend.
 
 > Full technical reference: [docs](https://eamon2009.github.io/LLMs/)
diff --git a/config/config.h b/config/config.h
index db053cb..844efeb 100644
--- a/config/config.h
+++ b/config/config.h
@@ -1,34 +1,18 @@
 #pragma once
-// ============================================================
-//  config/config.h  –  Global constants (mirrors config/config.py)
-// ============================================================
-
 #include <string>
-
-// ── Paths ────────────────────────────────────────────────────
-// Set CLEANED_PATH to your input text file before compiling,
-// or override at runtime via the env-var GPT_DATA_PATH.
 static const std::string DEFAULT_CLEANED_PATH = "data/input.txt";
 static const std::string DATA_PATH_ENV_VAR = "GPT_DATA_PATH";
-
-// ── Reproducibility ──────────────────────────────────────────
 static const unsigned int SEED = 1337;
-
-// ── Data split ───────────────────────────────────────────────
 static const double TRAIN_SPLIT = 0.9; // 90 % train, 10 % val
-
-// ── Hyper-parameters (identical to the Python script) ───────
 static const int BATCH_SIZE = 4;
 static const int BLOCK_SIZE = 64; // context length
-static const int MAX_ITERS = 3000;
+static const int MAX_ITERS = 10000;
 static const int EVAL_INTERVAL = 20;
 static const float LEARNING_RATE = 3e-4f;
-static const int EVAL_ITERS = 10;
+static const int EVAL_ITERS = 1;
 static const int N_EMBD = 128;
 static const int N_HEAD = 4;
 static const int N_LAYER = 4;
 static const float DROPOUT = 0.2f; // applied during training only
-
-// ── Output paths ─────────────────────────────────────────────
 static const std::string BEST_MODEL_PATH = "best_model.bin";
 static const std::string MODEL_PATH_ENV_VAR = "GPT_MODEL_PATH";
diff --git a/docker-compose.dev.yml b/docker-compose.dev.yml
new file mode 100644
index 0000000..a2e9a85
--- /dev/null
+++ b/docker-compose.dev.yml
@@ -0,0 +1,45 @@
+services:
+  frontend:
+    build:
+      context: .
+      dockerfile: .devops/Dockerfile.dev.frontend
+    ports:
+      - "5173:5173"
+    volumes:
+      - ./frontend:/app:delegated
+      - /app/node_modules
+    environment:
+      VITE_API_BASE_URL: "http://localhost:3001"
+    command: [ "npm", "run", "dev", "--", "--host", "0.0.0.0" ]
+    healthcheck:
+      test: [ "CMD", "wget", "-qO-", "http://localhost:5173/" ]
+      interval: 15s
+      timeout: 5s
+      retries: 5
+
+  backend:
+    volumes:
+      - ./backend:/app/backend:delegated
+      - ./engine:/app/engine:delegated
+      - models:/models
+    environment:
+      LOG_LEVEL: DEBUG
+      CORS_ORIGINS: "http://localhost:5173,http://localhost:3001"
+    command:
+      - python
+      - -m
+      - uvicorn
+      - main:app
+      - --host
+      - "0.0.0.0"
+      - --port
+      - "3001"
+      - --reload
+      - --reload-dir
+      - /app/backend
+
+  redis:
+    ports:
+      - "6379:6379"
+volumes:
+  models:
diff --git a/docker-compose.gpu.yml b/docker-compose.gpu.yml
new file mode 100644
index 0000000..abbd02e
--- /dev/null
+++ b/docker-compose.gpu.yml
@@ -0,0 +1,32 @@
+services:
+  backend:
+    build:
+      args:
+        CUDA: "1"
+    image: quadtrix/backend-cuda:latest
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [ gpu ]
+    environment:
+      CUDA_VISIBLE_DEVICES: "0"
+      TORCH_CHECKPOINT_PATH: /models/best_model.pt
+
+  train-torch:
+    build:
+      args:
+        CUDA: "1"
+    image: quadtrix/backend-cuda:latest
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [ gpu ]
+    environment:
+      CUDA_VISIBLE_DEVICES: "0"
+      QUADTRIX_TRAIN_DATA: /app/data/input.txt
diff --git a/docker-compose.yml b/docker-compose.yml
index 8191856..7bb3572 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -1,34 +1,173 @@
+name: quadtrix
+
+x-common-env: &common-env
+  TZ: UTC
+  PYTHONUNBUFFERED: "1"
+
 services:
-  quadtrix:
-    image: ghcr.io/eamon2009/quadtrix.cpp:latest
+
+  frontend:
     build:
       context: .
-      dockerfile: Dockerfile
+      dockerfile: .devops/Dockerfile.frontend
       args:
-        # for cuda
-        # BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-runtime-ubuntu24.04
-        BASE_IMAGE: ubuntu:24.04
-
+        VITE_API_BASE_URL: ""
+    image: quadtrix/frontend:latest
+    container_name: quadtrix-frontend
+    restart: unless-stopped
     ports:
-      - "3001:3001" # FastAPI backend
-      - "8080:8080" # React frontend
-
-    volumes:
-      # Place best_model.pt and/or best_model.bin inside ./models/
-      - ./models:/app/models
+      - "5173:80"
+    depends_on:
+      backend:
+        condition: service_healthy
+    networks:
+      - quadtrix-net
+    healthcheck:
+      test: [ "CMD", "wget", "-qO-", "http://localhost/" ]
+      interval: 30s
+      timeout: 5s
+      retries: 3
 
+  backend:
+    build:
+      context: .
+      dockerfile: .devops/Dockerfile.backend
+    image: quadtrix/backend:latest
+    container_name: quadtrix-backend
+    restart: unless-stopped
+    ports:
+      - "3001:3001"
     environment:
-      TORCH_CHECKPOINT_PATH: /app/models/best_model.pt
-      GPT_MODEL_PATH: /app/models/best_model.bin
-      CORS_ORIGINS: http://localhost:8080
+      <<: *common-env
+      API_PORT: "3001"
+      CORS_ORIGINS: "http://localhost:5173,http://frontend"
+      REDIS_URL: "redis://redis:6379/0"
+      TORCH_CHECKPOINT_PATH: /models/best_model.pt
       LOG_LEVEL: INFO
-      MAX_SESSIONS: 1000
-      SESSION_TTL_HOURS: 24
-    restart: unless-stopped
-
+      MAX_SESSIONS: "500"
+      SESSION_TTL_HOURS: "24"
+    volumes:
+      - models:/models
+      - ./engine:/app/engine:ro
+    depends_on:
+      redis:
+        condition: service_healthy
+    networks:
+      - quadtrix-net
     healthcheck:
       test: [ "CMD", "curl", "-f", "http://localhost:3001/api/health" ]
       interval: 30s
       timeout: 10s
-      retries: 5
       start_period: 20s
+      retries: 3
+
+  redis:
+    image: redis:7-alpine
+    container_name: quadtrix-redis
+    restart: unless-stopped
+    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
+    volumes:
+      - redis-data:/data
+    networks:
+      - quadtrix-net
+    healthcheck:
+      test: [ "CMD", "redis-cli", "ping" ]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+    expose:
+      - "6379"
+
+  cpp:
+    build:
+      context: .
+      dockerfile: .devops/Dockerfile.cpp
+    image: quadtrix/cpp:latest
+    container_name: quadtrix-cpp
+
+    restart: "no"
+    stdin_open: true
+    tty: true
+    volumes:
+      - models:/models
+      - ./data:/app/data:ro
+    environment:
+      <<: *common-env
+      GPT_DATA_PATH: /app/data/input.txt
+      GPT_MODEL_PATH: /models/best_model.bin
+    networks:
+      - quadtrix-net
+    profiles:
+      - cpp
+
+  train-cpp:
+    build:
+      context: .
+      dockerfile: .devops/Dockerfile.cpp
+    image: quadtrix/cpp:latest
+    container_name: quadtrix-train-cpp
+    restart: "no"
+    volumes:
+      - models:/models
+      - ./data:/app/data:ro
+    environment:
+      <<: *common-env
+      GPT_DATA_PATH: /app/data/input.txt
+      GPT_MODEL_PATH: /models/best_model.bin
+    command: [ "data/input.txt" ] # train mode (no --chat flag)
+    networks:
+      - quadtrix-net
+    profiles:
+      - train
+
+  train-torch:
+    build:
+      context: .
+      dockerfile: .devops/Dockerfile.backend
+    image: quadtrix/backend:latest
+    container_name: quadtrix-train-torch
+    restart: "no"
+    volumes:
+      - models:/models
+      - ./engine:/app/engine
+      - ./data:/app/data:ro
+    environment:
+      <<: *common-env
+      QUADTRIX_TRAIN_DATA: /app/data/input.txt
+    working_dir: /app
+    command: [ "python", "engine/main.py" ]
+    networks:
+      - quadtrix-net
+    profiles:
+      - train
+
+  benchmark:
+    build:
+      context: .
+      dockerfile: .devops/Dockerfile.cpp
+    image: quadtrix/cpp:latest
+    container_name: quadtrix-benchmark
+    restart: "no"
+    volumes:
+      - models:/models
+      - ./data:/app/data:ro
+      - ./benchmark_results.csv:/app/benchmark_results.csv
+    environment:
+      <<: *common-env
+      GPT_MODEL_PATH: /models/best_model.bin
+
+    command: [ "data/input.txt", "--generate" ]
+    networks:
+      - quadtrix-net
+    profiles:
+      - benchmark
+
+volumes:
+  models:
+    driver: local
+  redis-data:
+    driver: local
+
+networks:
+  quadtrix-net:
+    driver: bridge
diff --git a/frontend/src/components/chat/EmptyState.tsx b/frontend/src/components/chat/EmptyState.tsx
index ce75d9a..abf94ec 100644
--- a/frontend/src/components/chat/EmptyState.tsx
+++ b/frontend/src/components/chat/EmptyState.tsx
@@ -1,13 +1,95 @@
 export function EmptyState() {
   return (
-    <div className="flex flex-1 items-center justify-center px-6">
-      <div className="flex w-full max-w-3xl flex-col items-center gap-6 text-center">
-        <div className="flex h-16 w-16 items-center justify-center rounded-md border border-[var(--border-muted)] bg-white">
-          <img alt="Quadtrix.cpp icon" className="h-14 w-14 object-contain" src="/icon.svg" />
+    <div
+      style={{
+        flex: 1,
+        display: "flex",
+        alignItems: "center",
+        justifyContent: "center",
+        padding: "24px",
+      }}
+    >
+      <div
+        style={{
+          display: "flex",
+          flexDirection: "column",
+          alignItems: "center",
+          gap: 20,
+          textAlign: "center",
+          maxWidth: 420,
+        }}
+      >
+        {/* Icon */}
+        <div
+          style={{
+            width: 56,
+            height: 56,
+            borderRadius: 14,
+            background: "linear-gradient(135deg, #4f8ef7 0%, #2563eb 100%)",
+            display: "flex",
+            alignItems: "center",
+            justifyContent: "center",
+            boxShadow: "0 8px 32px rgba(79,142,247,0.25)",
+          }}
+        >
+          <svg width="28" height="28" viewBox="0 0 24 24" fill="none">
+            <path
+              d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-1 14H9V8h2v8zm4 0h-2V8h2v8z"
+              fill="white"
+              opacity="0.9"
+            />
+            <path
+              d="M8 12l2-2 2 2 4-4"
+              stroke="white"
+              strokeWidth="1.5"
+              strokeLinecap="round"
+              strokeLinejoin="round"
+              fill="none"
+            />
+          </svg>
         </div>
-        <div className="space-y-2">
-          <h1 className="font-mono text-2xl font-semibold tracking-[0.18em] text-[var(--text-primary)]">Quadtrix.cpp</h1>
-          <p className="text-sm text-[var(--text-secondary)]">Minimal local chat interface. Start typing below to begin.</p>
+
+        <div>
+          <h1
+            style={{
+              margin: 0,
+              fontSize: 20,
+              fontWeight: 600,
+              color: "var(--text-primary)",
+              letterSpacing: "-0.3px",
+            }}
+          >
+            Quadtrix.cpp
+          </h1>
+          <p
+            style={{
+              margin: "8px 0 0",
+              fontSize: 13,
+              color: "var(--text-muted)",
+              lineHeight: 1.6,
+            }}
+          >
+            Local char-level language model. Start a conversation below.
+          </p>
+        </div>
+
+        {/* Hint chips */}
+        <div style={{ display: "flex", flexWrap: "wrap", gap: 8, justifyContent: "center" }}>
+          {["Fast local inference", "C++ & PyTorch backends", "No cloud required"].map((chip) => (
+            <span
+              key={chip}
+              style={{
+                padding: "4px 10px",
+                borderRadius: 20,
+                border: "1px solid var(--border-muted)",
+                fontSize: 11,
+                color: "var(--text-muted)",
+                background: "var(--bg-elevated)",
+              }}
+            >
+              {chip}
+            </span>
+          ))}
         </div>
       </div>
     </div>
diff --git a/frontend/src/components/chat/MessageAvatar.tsx b/frontend/src/components/chat/MessageAvatar.tsx
index 25373d5..c606c9d 100644
--- a/frontend/src/components/chat/MessageAvatar.tsx
+++ b/frontend/src/components/chat/MessageAvatar.tsx
@@ -6,15 +6,48 @@ interface MessageAvatarProps {
 
 export function MessageAvatar({ role }: MessageAvatarProps) {
   const isUser = role === "user";
+
+  if (isUser) {
+    return (
+      <div
+        style={{
+          width: 30,
+          height: 30,
+          borderRadius: "50%",
+          background: "var(--bg-elevated)",
+          border: "1px solid var(--border-muted)",
+          display: "flex",
+          alignItems: "center",
+          justifyContent: "center",
+          fontSize: 11,
+          fontWeight: 600,
+          color: "var(--text-secondary)",
+          flexShrink: 0,
+        }}
+      >
+        U
+      </div>
+    );
+  }
+
   return (
     <div
-      className={`flex h-8 w-8 shrink-0 items-center justify-center rounded-md border text-xs font-semibold ${
-        isUser
-          ? "border-[var(--border-muted)] bg-elevated text-[var(--text-primary)]"
-          : "border-[var(--border-muted)] bg-surface font-mono text-[var(--text-primary)]"
-      }`}
+      style={{
+        width: 30,
+        height: 30,
+        borderRadius: "50%",
+        background: "linear-gradient(135deg, #4f8ef7 0%, #2563eb 100%)",
+        display: "flex",
+        alignItems: "center",
+        justifyContent: "center",
+        fontSize: 12,
+        fontWeight: 700,
+        color: "#fff",
+        flexShrink: 0,
+        boxShadow: "0 2px 8px rgba(79,142,247,0.3)",
+      }}
     >
-      {isUser ? "You" : "Q"}
+      Q
     </div>
   );
 }
diff --git a/frontend/src/components/chat/MessageList.tsx b/frontend/src/components/chat/MessageList.tsx
index e38a0af..5de6e62 100644
--- a/frontend/src/components/chat/MessageList.tsx
+++ b/frontend/src/components/chat/MessageList.tsx
@@ -1,5 +1,6 @@
-import { useAutoScroll } from "../../hooks/useAutoScroll";
+import { useRef } from "react";
 import type { Message } from "../../types";
+import { useAutoScroll } from "../../hooks/useAutoScroll";
 import { MessageRow } from "./MessageRow";
 
 interface MessageListProps {
@@ -7,13 +8,25 @@ interface MessageListProps {
 }
 
 export function MessageList({ messages }: MessageListProps) {
-  const scrollRef = useAutoScroll<HTMLDivElement>(messages.length);
+  const bottomRef = useRef<HTMLDivElement | null>(null);
+  useAutoScroll(bottomRef, messages);
+
   return (
-    <div className="flex-1 overflow-y-auto px-4 py-8 md:px-8 md:py-10" ref={scrollRef}>
-      <div className="mx-auto flex max-w-4xl flex-col gap-8">
+    <div
+      style={{
+        flex: 1,
+        overflowY: "auto",
+        padding: "24px 16px",
+        display: "flex",
+        flexDirection: "column",
+        gap: 20,
+      }}
+    >
+      <div style={{ maxWidth: 780, width: "100%", margin: "0 auto", display: "flex", flexDirection: "column", gap: 20 }}>
         {messages.map((message) => (
           <MessageRow key={message.id} message={message} />
         ))}
+        <div ref={bottomRef} />
       </div>
     </div>
   );
diff --git a/frontend/src/components/chat/MessageRow.tsx b/frontend/src/components/chat/MessageRow.tsx
index 372d585..8dd3910 100644
--- a/frontend/src/components/chat/MessageRow.tsx
+++ b/frontend/src/components/chat/MessageRow.tsx
@@ -27,37 +27,96 @@ export function MessageRow({ message }: MessageRowProps) {
   return (
     <motion.div
       animate={{ opacity: 1, y: 0 }}
-      className={`group flex w-full gap-3 ${isUser ? "justify-end" : "justify-start"}`}
       initial={{ opacity: 0, y: 6 }}
-      transition={{ duration: 0.2 }}
+      transition={{ duration: 0.18 }}
+      className="group"
+      style={{
+        display: "flex",
+        width: "100%",
+        gap: 12,
+        justifyContent: isUser ? "flex-end" : "flex-start",
+        alignItems: "flex-start",
+      }}
     >
       {!isUser && <MessageAvatar role={message.role} />}
-      <div className={`max-w-[min(760px,calc(100vw-48px))] ${isUser ? "items-end" : "items-start"} flex flex-col gap-1`}>
-        <div className="flex items-center gap-2 font-mono text-[11px] uppercase tracking-[0.16em] text-[var(--text-muted)]">
-          <span>{isUser ? "You" : "Quadtrix"}</span>
+
+      <div
+        style={{
+          maxWidth: "min(680px, calc(100vw - 80px))",
+          display: "flex",
+          flexDirection: "column",
+          gap: 4,
+          alignItems: isUser ? "flex-end" : "flex-start",
+        }}
+      >
+        {/* Meta row */}
+        <div
+          style={{
+            display: "flex",
+            alignItems: "center",
+            gap: 8,
+            fontSize: 11,
+            color: "var(--text-muted)",
+          }}
+        >
+          <span style={{ fontWeight: 500 }}>{isUser ? "You" : "Quadtrix"}</span>
           <span>{formatRelativeTime(message.created_at)}</span>
           {!isUser && !message.pending && (
             <button
-              className="hidden rounded px-1 text-[var(--text-secondary)] hover:text-[var(--text-primary)] group-hover:inline"
+              className="opacity-0 group-hover:opacity-100"
               onClick={copyText}
               type="button"
+              style={{
+                // Opacity is class-driven: an inline value would override group-hover.
+                background: "none",
+                border: "none",
+                cursor: "pointer",
+                color: copied ? "var(--status-online)" : "var(--text-muted)",
+                fontSize: 11,
+                padding: "0 2px",
+                transition: "opacity 0.12s, color 0.12s",
+              }}
             >
-              {copied ? "Copied" : "Copy"}
+              {copied ? "✓ Copied" : "Copy"}
             </button>
           )}
         </div>
+
+        {/* Bubble */}
         <div
-          className={`rounded-lg border px-4 py-3 text-sm leading-7 ${
-            isUser
-              ? "border-[var(--border-muted)] bg-surface text-[var(--text-primary)]"
+          style={{
+            borderRadius: 10,
+            padding: "10px 14px",
+            fontSize: 13,
+            lineHeight: 1.7,
+            ...(isUser
+              ? {
+                  background: "var(--bg-elevated)",
+                  border: "1px solid var(--border-muted)",
+                  color: "var(--text-primary)",
+                }
               : message.error
-                ? "border-red-500/20 bg-red-500/10 font-sans text-red-200"
-                : "border-[var(--border-subtle)] bg-[#0d0d0d] font-mono text-[var(--text-primary)]"
-          }`}
+              ? {
+                  background: "rgba(224,82,82,0.08)",
+                  border: "1px solid rgba(224,82,82,0.2)",
+                  color: "#f87171",
+                }
+              : {
+                  background: "var(--bg-surface)",
+                  border: "1px solid var(--border-subtle)",
+                  color: "var(--text-primary)",
+                  fontFamily: "var(--font-mono)",
+                }),
+          }}
         >
-          {message.pending ? <ThinkingIndicator /> : <span className="whitespace-pre-wrap">{message.text}</span>}
+          {message.pending ? (
+            <ThinkingIndicator />
+          ) : (
+            <span style={{ whiteSpace: "pre-wrap" }}>{message.text}</span>
+          )}
         </div>
       </div>
+
       {isUser && <MessageAvatar role={message.role} />}
     </motion.div>
   );
diff --git a/frontend/src/components/chat/ThinkingIndicator.tsx b/frontend/src/components/chat/ThinkingIndicator.tsx
index e83d0f5..7ec4a6c 100644
--- a/frontend/src/components/chat/ThinkingIndicator.tsx
+++ b/frontend/src/components/chat/ThinkingIndicator.tsx
@@ -1,12 +1,28 @@
 export function ThinkingIndicator() {
   return (
-    <div className="flex items-center gap-2 text-[var(--text-secondary)]">
-      <span>Quadtrix is thinking</span>
-      <span className="flex gap-1">
-        <span className="h-1.5 w-1.5 animate-bounce rounded-full bg-[var(--text-secondary)]" />
-        <span className="h-1.5 w-1.5 animate-bounce rounded-full bg-[var(--text-secondary)] [animation-delay:120ms]" />
-        <span className="h-1.5 w-1.5 animate-bounce rounded-full bg-[var(--text-secondary)] [animation-delay:240ms]" />
+    <div style={{ display: "flex", alignItems: "center", gap: 8, color: "var(--text-muted)" }}>
+      <span style={{ fontSize: 12 }}>Generating</span>
+      <span style={{ display: "flex", gap: 3 }}>
+        {[0, 120, 240].map((delay) => (
+          <span
+            key={delay}
+            style={{
+              display: "inline-block",
+              width: 5,
+              height: 5,
+              borderRadius: "50%",
+              background: "var(--accent)",
+              animation: `quadtrix-bounce 1s ease-in-out ${delay}ms infinite`,
+            }}
+          />
+        ))}
       </span>
+      <style>{`
+        @keyframes quadtrix-bounce {
+          0%, 80%, 100% { transform: translateY(0); opacity: 0.4; }
+          40% { transform: translateY(-4px); opacity: 1; }
+        }
+      `}</style>
     </div>
   );
 }
diff --git a/include/tensor.h b/include/tensor.h
index f6ac4a5..c3526b6 100644
--- a/include/tensor.h
+++ b/include/tensor.h
@@ -1,8 +1,4 @@
 #pragma once
-// ============================================================
-//  include/tensor.h  –  Lightweight 2-D / 3-D float tensor
-//  (CPU only – mirrors what PyTorch tensors do in the model)
-// ============================================================
 
 #include <vector>
 #include <cmath>
@@ -15,310 +11,557 @@
 #include <iostream>
 #include <functional>
 
-// ------------------------------------------------------------------
-// Tensor  (row-major, float32)
-//   shape is stored as {d0, d1}  or  {d0, d1, d2}
-// ------------------------------------------------------------------
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+#ifdef __AVX__
+#include <immintrin.h>
+#endif
+
+#ifdef __SSE__
+#include <xmmintrin.h>
+#endif
+
 struct Tensor
 {
-      std::vector<int> shape;
-      std::vector<float> data;
-
-      Tensor() = default;
-
-      Tensor(std::vector<int> sh, float fill = 0.0f)
-          : shape(sh)
-      {
-            int total = 1;
-            for (int d : sh)
-                  total *= d;
-            data.assign(total, fill);
-      }
-
-      int numel() const
-      {
-            int n = 1;
-            for (int d : shape)
-                  n *= d;
-            return n;
-      }
-
-      int ndim() const { return (int)shape.size(); }
-
-      // ---- element access helpers --------------------------------
-      float &at(int i)
-      {
-            assert(i >= 0 && i < (int)data.size());
-            return data[i];
-      }
-      float at(int i) const
-      {
-            assert(i >= 0 && i < (int)data.size());
-            return data[i];
-      }
-
-      // 2-D
-      float &at(int r, int c)
-      {
-            return data[r * shape[1] + c];
-      }
-      float at(int r, int c) const
-      {
-            return data[r * shape[1] + c];
-      }
-
-      // 3-D
-      float &at(int b, int r, int c)
-      {
-            return data[b * shape[1] * shape[2] + r * shape[2] + c];
-      }
-      float at(int b, int r, int c) const
-      {
-            return data[b * shape[1] * shape[2] + r * shape[2] + c];
-      }
-
-      // ---- factory helpers ---------------------------------------
-      static Tensor zeros(std::vector<int> sh) { return Tensor(sh, 0.0f); }
-      static Tensor ones(std::vector<int> sh) { return Tensor(sh, 1.0f); }
-
-      static Tensor randn(std::vector<int> sh, float mean, float std,
-                          std::mt19937 &rng)
-      {
-            std::normal_distribution<float> dist(mean, std);
-            Tensor t(sh);
-            for (auto &v : t.data)
-                  v = dist(rng);
-            return t;
-      }
-
-      void fill(float v) { std::fill(data.begin(), data.end(), v); }
-
-      // ---- print shape -------------------------------------------
-      void print_shape(const std::string &name = "") const
-      {
-            if (!name.empty())
-                  std::cout << name << ": ";
-            std::cout << "[";
-            for (int i = 0; i < (int)shape.size(); ++i)
-            {
-                  std::cout << shape[i];
-                  if (i + 1 < (int)shape.size())
-                        std::cout << ", ";
-            }
-            std::cout << "]" << std::endl;
-      }
-};
+    std::vector<int> shape;
+    std::vector<float> data;
+
+    Tensor() = default;
+
+    Tensor(std::vector<int> sh, float fill = 0.0f) : shape(std::move(sh))
+    {
+        int total = 1;
+        for (int d : shape)
+            total *= d;
+        data.reserve(total);
+        data.assign(total, fill);
+    }
+
+    Tensor(const Tensor &) = default;
+    Tensor(Tensor &&) noexcept = default;
+    Tensor &operator=(const Tensor &) = default;
+    Tensor &operator=(Tensor &&) noexcept = default;
+
+    int numel() const
+    {
+        int n = 1;
+        for (int d : shape)
+            n *= d;
+        return n;
+    }
 
-// ------------------------------------------------------------------
-// Basic math ops  (in-place and returning new tensors)
-// ------------------------------------------------------------------
+    int ndim() const { return (int)shape.size(); }
+
+    float &at(int i) { return data[i]; }
+    float at(int i) const { return data[i]; }
+
+    float &at(int r, int c) { return data[r * shape[1] + c]; }
+    float at(int r, int c) const { return data[r * shape[1] + c]; }
+
+    float &at(int b, int r, int c) { return data[b * shape[1] * shape[2] + r * shape[2] + c]; }
+    float at(int b, int r, int c) const { return data[b * shape[1] * shape[2] + r * shape[2] + c]; }
+
+    static Tensor zeros(std::vector<int> sh) { return Tensor(sh, 0.0f); }
+    static Tensor ones(std::vector<int> sh) { return Tensor(sh, 1.0f); }
+
+    static Tensor randn(std::vector<int> sh, float mean, float std, std::mt19937 &rng)
+    {
+        std::normal_distribution<float> dist(mean, std);
+        Tensor t(sh);
+        for (auto &v : t.data)
+            v = dist(rng);
+        return t;
+    }
+
+    void fill(float v) { std::fill(data.begin(), data.end(), v); }
+
+    void print_shape(const std::string &name = "") const
+    {
+        if (!name.empty())
+            std::cout << name << ": ";
+        std::cout << "[";
+        for (int i = 0; i < (int)shape.size(); ++i)
+        {
+            std::cout << shape[i];
+            if (i + 1 < (int)shape.size())
+                std::cout << ", ";
+        }
+        std::cout << "]" << std::endl;
+    }
+};
 
-// element-wise add (same shape)
 inline Tensor add(const Tensor &a, const Tensor &b)
 {
-      assert(a.data.size() == b.data.size());
-      Tensor c(a.shape);
-      for (int i = 0; i < (int)a.data.size(); ++i)
-            c.data[i] = a.data[i] + b.data[i];
-      return c;
+    assert(a.data.size() == b.data.size());
+    Tensor c(a.shape);
+    size_t n = a.data.size();
+
+#ifdef __AVX__
+    size_t i = 0;
+    size_t vec_end = n & ~7ULL;
+    for (; i < vec_end; i += 8)
+    {
+        __m256 va = _mm256_loadu_ps(&a.data[i]);
+        __m256 vb = _mm256_loadu_ps(&b.data[i]);
+        __m256 vc = _mm256_add_ps(va, vb);
+        _mm256_storeu_ps(&c.data[i], vc);
+    }
+    for (; i < n; ++i)
+        c.data[i] = a.data[i] + b.data[i];
+#elif defined(__SSE__)
+    size_t i = 0;
+    size_t vec_end = n & ~3ULL;
+    for (; i < vec_end; i += 4)
+    {
+        __m128 va = _mm_loadu_ps(&a.data[i]);
+        __m128 vb = _mm_loadu_ps(&b.data[i]);
+        __m128 vc = _mm_add_ps(va, vb);
+        _mm_storeu_ps(&c.data[i], vc);
+    }
+    for (; i < n; ++i)
+        c.data[i] = a.data[i] + b.data[i];
+#else
+    for (size_t i = 0; i < n; ++i)
+        c.data[i] = a.data[i] + b.data[i];
+#endif
+    return c;
+}
+
+inline void add_inplace(Tensor &a, const Tensor &b)
+{
+    assert(a.data.size() == b.data.size());
+    size_t n = a.data.size();
+
+#ifdef __AVX__
+    size_t i = 0;
+    size_t vec_end = n & ~7ULL;
+    for (; i < vec_end; i += 8)
+    {
+        __m256 va = _mm256_loadu_ps(&a.data[i]);
+        __m256 vb = _mm256_loadu_ps(&b.data[i]);
+        __m256 vc = _mm256_add_ps(va, vb);
+        _mm256_storeu_ps(&a.data[i], vc);
+    }
+    for (; i < n; ++i)
+        a.data[i] += b.data[i];
+#elif defined(__SSE__)
+    size_t i = 0;
+    size_t vec_end = n & ~3ULL;
+    for (; i < vec_end; i += 4)
+    {
+        __m128 va = _mm_loadu_ps(&a.data[i]);
+        __m128 vb = _mm_loadu_ps(&b.data[i]);
+        __m128 vc = _mm_add_ps(va, vb);
+        _mm_storeu_ps(&a.data[i], vc);
+    }
+    for (; i < n; ++i)
+        a.data[i] += b.data[i];
+#else
+    for (size_t i = 0; i < n; ++i)
+        a.data[i] += b.data[i];
+#endif
 }
 
-// scalar multiply
 inline Tensor scale(const Tensor &a, float s)
 {
-      Tensor c(a.shape);
-      for (int i = 0; i < (int)a.data.size(); ++i)
-            c.data[i] = a.data[i] * s;
-      return c;
+    Tensor c(a.shape);
+    size_t n = a.data.size();
+
+#ifdef __AVX__
+    size_t i = 0;
+    size_t vec_end = n & ~7ULL;
+    __m256 vs = _mm256_set1_ps(s);
+    for (; i < vec_end; i += 8)
+    {
+        __m256 va = _mm256_loadu_ps(&a.data[i]);
+        __m256 vc = _mm256_mul_ps(va, vs);
+        _mm256_storeu_ps(&c.data[i], vc);
+    }
+    for (; i < n; ++i)
+        c.data[i] = a.data[i] * s;
+#elif defined(__SSE__)
+    size_t i = 0;
+    size_t vec_end = n & ~3ULL;
+    __m128 vs = _mm_set1_ps(s);
+    for (; i < vec_end; i += 4)
+    {
+        __m128 va = _mm_loadu_ps(&a.data[i]);
+        __m128 vc = _mm_mul_ps(va, vs);
+        _mm_storeu_ps(&c.data[i], vc);
+    }
+    for (; i < n; ++i)
+        c.data[i] = a.data[i] * s;
+#else
+    for (size_t i = 0; i < n; ++i)
+        c.data[i] = a.data[i] * s;
+#endif
+    return c;
+}
+
+inline void scale_inplace(Tensor &a, float s)
+{
+    size_t n = a.data.size();
+
+#ifdef __AVX__
+    size_t i = 0;
+    size_t vec_end = n & ~7ULL;
+    __m256 vs = _mm256_set1_ps(s);
+    for (; i < vec_end; i += 8)
+    {
+        __m256 va = _mm256_loadu_ps(&a.data[i]);
+        __m256 vc = _mm256_mul_ps(va, vs);
+        _mm256_storeu_ps(&a.data[i], vc);
+    }
+    for (; i < n; ++i)
+        a.data[i] *= s;
+#elif defined(__SSE__)
+    size_t i = 0;
+    size_t vec_end = n & ~3ULL;
+    __m128 vs = _mm_set1_ps(s);
+    for (; i < vec_end; i += 4)
+    {
+        __m128 va = _mm_loadu_ps(&a.data[i]);
+        __m128 vc = _mm_mul_ps(va, vs);
+        _mm_storeu_ps(&a.data[i], vc);
+    }
+    for (; i < n; ++i)
+        a.data[i] *= s;
+#else
+    for (auto &v : a.data)
+        v *= s;
+#endif
 }
 
-// ReLU
 inline Tensor relu(const Tensor &a)
 {
-      Tensor c(a.shape);
-      for (int i = 0; i < (int)a.data.size(); ++i)
-            c.data[i] = std::max(0.0f, a.data[i]);
-      return c;
+    Tensor c(a.shape);
+    size_t n = a.data.size();
+
+#ifdef __AVX__
+    size_t i = 0;
+    size_t vec_end = n & ~7ULL;
+    __m256 zero = _mm256_setzero_ps();
+    for (; i < vec_end; i += 8)
+    {
+        __m256 va = _mm256_loadu_ps(&a.data[i]);
+        __m256 vc = _mm256_max_ps(va, zero);
+        _mm256_storeu_ps(&c.data[i], vc);
+    }
+    for (; i < n; ++i)
+        c.data[i] = std::max(0.0f, a.data[i]);
+#elif defined(__SSE__)
+    size_t i = 0;
+    size_t vec_end = n & ~3ULL;
+    __m128 zero = _mm_setzero_ps();
+    for (; i < vec_end; i += 4)
+    {
+        __m128 va = _mm_loadu_ps(&a.data[i]);
+        __m128 vc = _mm_max_ps(va, zero);
+        _mm_storeu_ps(&c.data[i], vc);
+    }
+    for (; i < n; ++i)
+        c.data[i] = std::max(0.0f, a.data[i]);
+#else
+    for (size_t i = 0; i < n; ++i)
+        c.data[i] = std::max(0.0f, a.data[i]);
+#endif
+    return c;
 }
 
-// Softmax along last dim for 3-D tensor [B, T, C]
-inline Tensor softmax3d(const Tensor &a)
+inline void relu_inplace(Tensor &a)
 {
-      int B = a.shape[0], T = a.shape[1], C = a.shape[2];
-      Tensor out(a.shape);
-      for (int b = 0; b < B; ++b)
-      {
-            for (int t = 0; t < T; ++t)
-            {
-                  float maxv = -1e30f;
-                  for (int c = 0; c < C; ++c)
-                        maxv = std::max(maxv, a.at(b, t, c));
-                  float sumv = 0.0f;
-                  for (int c = 0; c < C; ++c)
-                  {
-                        float e = std::exp(a.at(b, t, c) - maxv);
-                        out.at(b, t, c) = e;
-                        sumv += e;
-                  }
-                  for (int c = 0; c < C; ++c)
-                        out.at(b, t, c) /= sumv;
-            }
-      }
-      return out;
+    size_t n = a.data.size();
+
+#ifdef __AVX__
+    size_t i = 0;
+    size_t vec_end = n & ~7ULL;
+    __m256 zero = _mm256_setzero_ps();
+    for (; i < vec_end; i += 8)
+    {
+        __m256 va = _mm256_loadu_ps(&a.data[i]);
+        __m256 vc = _mm256_max_ps(va, zero);
+        _mm256_storeu_ps(&a.data[i], vc);
+    }
+    for (; i < n; ++i)
+        a.data[i] = std::max(0.0f, a.data[i]);
+#elif defined(__SSE__)
+    size_t i = 0;
+    size_t vec_end = n & ~3ULL;
+    __m128 zero = _mm_setzero_ps();
+    for (; i < vec_end; i += 4)
+    {
+        __m128 va = _mm_loadu_ps(&a.data[i]);
+        __m128 vc = _mm_max_ps(va, zero);
+        _mm_storeu_ps(&a.data[i], vc);
+    }
+    for (; i < n; ++i)
+        a.data[i] = std::max(0.0f, a.data[i]);
+#else
+    for (auto &v : a.data)
+        v = std::max(0.0f, v);
+#endif
 }
 
-// Softmax along last dim for 2-D tensor [T, C]
-inline Tensor softmax2d(const Tensor &a)
+inline Tensor softmax3d(const Tensor &a)
 {
-      int T = a.shape[0], C = a.shape[1];
-      Tensor out(a.shape);
-      for (int t = 0; t < T; ++t)
-      {
+    int B = a.shape[0], T = a.shape[1], C = a.shape[2];
+    Tensor out(a.shape);
+
+#ifdef _OPENMP
+#pragma omp parallel for collapse(2) if (B * T > 64)
+#endif
+    for (int b = 0; b < B; ++b)
+    {
+        for (int t = 0; t < T; ++t)
+        {
             float maxv = -1e30f;
             for (int c = 0; c < C; ++c)
-                  maxv = std::max(maxv, a.at(t, c));
+                maxv = std::max(maxv, a.at(b, t, c));
+
             float sumv = 0.0f;
             for (int c = 0; c < C; ++c)
             {
-                  float e = std::exp(a.at(t, c) - maxv);
-                  out.at(t, c) = e;
-                  sumv += e;
+                float e = std::exp(a.at(b, t, c) - maxv);
+                out.at(b, t, c) = e;
+                sumv += e;
             }
+
+            float inv_sum = 1.0f / sumv;
             for (int c = 0; c < C; ++c)
-                  out.at(t, c) /= sumv;
-      }
-      return out;
+                out.at(b, t, c) *= inv_sum;
+        }
+    }
+    return out;
 }
 
-// Layer-norm along last dim  [B, T, C]  → same shape
-inline Tensor layer_norm(const Tensor &x,
-                         const Tensor &gamma, // [C]
-                         const Tensor &beta,  // [C]
-                         float eps = 1e-5f)
+inline Tensor softmax2d(const Tensor &a)
 {
-      int B = x.shape[0], T = x.shape[1], C = x.shape[2];
-      Tensor out(x.shape);
-      for (int b = 0; b < B; ++b)
-      {
-            for (int t = 0; t < T; ++t)
+    int T = a.shape[0], C = a.shape[1];
+    Tensor out(a.shape);
+
+#ifdef _OPENMP
+#pragma omp parallel for if (T > 128)
+#endif
+    for (int t = 0; t < T; ++t)
+    {
+        float maxv = -1e30f;
+        for (int c = 0; c < C; ++c)
+            maxv = std::max(maxv, a.at(t, c));
+
+        float sumv = 0.0f;
+        for (int c = 0; c < C; ++c)
+        {
+            float e = std::exp(a.at(t, c) - maxv);
+            out.at(t, c) = e;
+            sumv += e;
+        }
+
+        float inv_sum = 1.0f / sumv;
+        for (int c = 0; c < C; ++c)
+            out.at(t, c) *= inv_sum;
+    }
+    return out;
+}
+
+inline Tensor layer_norm(const Tensor &x, const Tensor &gamma, const Tensor &beta, float eps = 1e-5f)
+{
+    int B = x.shape[0], T = x.shape[1], C = x.shape[2];
+    Tensor out(x.shape);
+
+#ifdef _OPENMP
+#pragma omp parallel for collapse(2) if (B * T > 64)
+#endif
+    for (int b = 0; b < B; ++b)
+    {
+        for (int t = 0; t < T; ++t)
+        {
+            float mu = 0.0f;
+            for (int c = 0; c < C; ++c)
+                mu += x.at(b, t, c);
+            mu /= C;
+
+            float var = 0.0f;
+            for (int c = 0; c < C; ++c)
             {
-                  float mu = 0.0f;
-                  for (int c = 0; c < C; ++c)
-                        mu += x.at(b, t, c);
-                  mu /= C;
-                  float var = 0.0f;
-                  for (int c = 0; c < C; ++c)
-                  {
-                        float d = x.at(b, t, c) - mu;
-                        var += d * d;
-                  }
-                  var /= C;
-                  float inv = 1.0f / std::sqrt(var + eps);
-                  for (int c = 0; c < C; ++c)
-                        out.at(b, t, c) = (x.at(b, t, c) - mu) * inv * gamma.at(c) + beta.at(c);
+                float d = x.at(b, t, c) - mu;
+                var += d * d;
             }
-      }
-      return out;
+            var /= C;
+
+            float inv = 1.0f / std::sqrt(var + eps);
+            for (int c = 0; c < C; ++c)
+                out.at(b, t, c) = (x.at(b, t, c) - mu) * inv * gamma.at(c) + beta.at(c);
+        }
+    }
+    return out;
 }
 
-// matmul:  [B, T, D] x [D, E]  →  [B, T, E]
 inline Tensor matmul(const Tensor &a, const Tensor &w)
 {
-      // a: [B, T, D]  or  [B, T, D]
-      // w: [D, E]
-      assert(a.ndim() == 3 && w.ndim() == 2);
-      int B = a.shape[0], T = a.shape[1], D = a.shape[2];
-      int E = w.shape[1];
-      assert(w.shape[0] == D);
-      Tensor out({B, T, E}, 0.0f);
-      for (int b = 0; b < B; ++b)
-            for (int t = 0; t < T; ++t)
-                  for (int e = 0; e < E; ++e)
-                  {
-                        float s = 0.0f;
-                        for (int d = 0; d < D; ++d)
-                              s += a.at(b, t, d) * w.at(d, e);
-                        out.at(b, t, e) = s;
-                  }
-      return out;
+    assert(a.ndim() == 3 && w.ndim() == 2);
+    int B = a.shape[0], T = a.shape[1], D = a.shape[2];
+    int E = w.shape[1];
+    assert(w.shape[0] == D);
+
+    Tensor out({B, T, E}, 0.0f);
+
+    const int TILE_T = 32;
+    const int TILE_E = 32;
+    const int TILE_D = 32;
+
+#ifdef _OPENMP
+#pragma omp parallel for collapse(2) if (B * T * E * D > 100000)
+#endif
+    for (int b = 0; b < B; ++b)
+    {
+        for (int t0 = 0; t0 < T; t0 += TILE_T)
+        {
+            int t_end = std::min(t0 + TILE_T, T);
+            for (int e0 = 0; e0 < E; e0 += TILE_E)
+            {
+                int e_end = std::min(e0 + TILE_E, E);
+                for (int d0 = 0; d0 < D; d0 += TILE_D)
+                {
+                    int d_end = std::min(d0 + TILE_D, D);
+                    for (int t = t0; t < t_end; ++t)
+                    {
+                        for (int e = e0; e < e_end; ++e)
+                        {
+                            float s = out.at(b, t, e);
+                            for (int d = d0; d < d_end; ++d)
+                                s += a.at(b, t, d) * w.at(d, e);
+                            out.at(b, t, e) = s;
+                        }
+                    }
+                }
+            }
+        }
+    }
+    return out;
 }
 
-// add bias [E] broadcast over [B, T, E]
 inline Tensor add_bias(const Tensor &x, const Tensor &bias)
 {
-      assert(x.shape.back() == bias.shape[0]);
-      Tensor out = x;
-      int E = bias.shape[0];
-      int stride = E;
-      int n = x.numel() / E;
-      for (int i = 0; i < n; ++i)
-            for (int e = 0; e < E; ++e)
-                  out.data[i * stride + e] += bias.data[e];
-      return out;
+    assert(x.shape.back() == bias.shape[0]);
+    Tensor out = x;
+    int E = bias.shape[0];
+    int stride = E;
+    int n = x.numel() / E;
+
+#ifdef _OPENMP
+#pragma omp parallel for if (n * E > 10000)
+#endif
+    for (int i = 0; i < n; ++i)
+    {
+        for (int e = 0; e < E; ++e)
+            out.data[i * stride + e] += bias.data[e];
+    }
+    return out;
 }
 
-// batched matmul:  [B, T, D] x [B, D, T2]  →  [B, T, T2]
 inline Tensor bmm(const Tensor &a, const Tensor &b)
 {
-      assert(a.ndim() == 3 && b.ndim() == 3);
-      int B = a.shape[0], T = a.shape[1], D = a.shape[2];
-      int T2 = b.shape[2];
-      assert(b.shape[0] == B && b.shape[1] == D);
-      Tensor out({B, T, T2}, 0.0f);
-      for (int bb = 0; bb < B; ++bb)
-            for (int t = 0; t < T; ++t)
-                  for (int t2 = 0; t2 < T2; ++t2)
-                  {
-                        float s = 0.0f;
-                        for (int d = 0; d < D; ++d)
-                              s += a.at(bb, t, d) * b.at(bb, d, t2);
-                        out.at(bb, t, t2) = s;
-                  }
-      return out;
+    assert(a.ndim() == 3 && b.ndim() == 3);
+    int B = a.shape[0], T = a.shape[1], D = a.shape[2];
+    int T2 = b.shape[2];
+    assert(b.shape[0] == B && b.shape[1] == D);
+
+    Tensor out({B, T, T2}, 0.0f);
+
+    const int TILE = 32;
+
+#ifdef _OPENMP
+#pragma omp parallel for if ((long long)B * T * T2 * D > 100000)
+#endif
+    for (int bb = 0; bb < B; ++bb)
+    {
+        for (int t0 = 0; t0 < T; t0 += TILE)
+        {
+            int t_end = std::min(t0 + TILE, T);
+            for (int t2_0 = 0; t2_0 < T2; t2_0 += TILE)
+            {
+                int t2_end = std::min(t2_0 + TILE, T2);
+                for (int d0 = 0; d0 < D; d0 += TILE)
+                {
+                    int d_end = std::min(d0 + TILE, D);
+                    for (int t = t0; t < t_end; ++t)
+                    {
+                        for (int t2 = t2_0; t2 < t2_end; ++t2)
+                        {
+                            float s = out.at(bb, t, t2);
+                            for (int d = d0; d < d_end; ++d)
+                                s += a.at(bb, t, d) * b.at(bb, d, t2);
+                            out.at(bb, t, t2) = s;
+                        }
+                    }
+                }
+            }
+        }
+    }
+    return out;
 }
 
-// transpose last two dims of 3-D tensor [B, T, D] → [B, D, T]
 inline Tensor transpose23(const Tensor &a)
 {
-      int B = a.shape[0], T = a.shape[1], D = a.shape[2];
-      Tensor out({B, D, T});
-      for (int b = 0; b < B; ++b)
+    int B = a.shape[0], T = a.shape[1], D = a.shape[2];
+    Tensor out({B, D, T});
+
+#ifdef _OPENMP
+#pragma omp parallel for collapse(2) if (B * T * D > 10000)
+#endif
+    for (int b = 0; b < B; ++b)
+    {
+        for (int d = 0; d < D; ++d)
+        {
             for (int t = 0; t < T; ++t)
-                  for (int d = 0; d < D; ++d)
-                        out.at(b, d, t) = a.at(b, t, d);
-      return out;
+                out.at(b, d, t) = a.at(b, t, d);
+        }
+    }
+    return out;
 }
 
-// concat along last dim:  [B,T,D1] + [B,T,D2] → [B,T,D1+D2]
 inline Tensor cat_last(const std::vector<Tensor> &ts)
 {
-      int B = ts[0].shape[0], T = ts[0].shape[1];
-      int total = 0;
-      for (auto &t : ts)
-            total += t.shape[2];
-      Tensor out({B, T, total}, 0.0f);
-      int offset = 0;
-      for (auto &t : ts)
-      {
-            int D = t.shape[2];
-            for (int b = 0; b < B; ++b)
-                  for (int tt = 0; tt < T; ++tt)
-                        for (int d = 0; d < D; ++d)
-                              out.at(b, tt, offset + d) = t.at(b, tt, d);
-            offset += D;
-      }
-      return out;
+    int B = ts[0].shape[0], T = ts[0].shape[1];
+    int total = 0;
+    for (auto &t : ts)
+        total += t.shape[2];
+
+    Tensor out({B, T, total}, 0.0f);
+
+    int offset = 0;
+    for (auto &t : ts)
+    {
+        int D = t.shape[2];
+#ifdef _OPENMP
+#pragma omp parallel for collapse(2) if (B * T * D > 10000)
+#endif
+        for (int b = 0; b < B; ++b)
+        {
+            for (int tt = 0; tt < T; ++tt)
+            {
+                for (int d = 0; d < D; ++d)
+                    out.at(b, tt, offset + d) = t.at(b, tt, d);
+            }
+        }
+        offset += D;
+    }
+    return out;
 }
 
-// dropout mask (applied only during training)
 inline Tensor dropout(const Tensor &x, float p, bool training, std::mt19937 &rng)
 {
-      if (!training || p == 0.0f)
-            return x;
-      std::bernoulli_distribution dist(1.0f - p);
-      Tensor out = x;
-      float scale_v = 1.0f / (1.0f - p);
-      for (auto &v : out.data)
-            v = dist(rng) ? v * scale_v : 0.0f;
-      return out;
+    if (!training || p == 0.0f)
+        return x;
+
+    std::bernoulli_distribution dist(1.0f - p);
+    Tensor out = x;
+    float scale_v = 1.0f / (1.0f - p);
+
+    for (auto &v : out.data)
+        v = dist(rng) ? v * scale_v : 0.0f;
+
+    return out;
 }
\ No newline at end of file
diff --git a/main.cpp b/main.cpp
index 006af20..7fc540c 100644
--- a/main.cpp
+++ b/main.cpp
@@ -103,6 +103,22 @@ static std::string choose_output_path(const std::string &requested_path,
       return exe_relative;
 }
 
+// sample N tokens from the model and print them
+static void sample_tokens(GPTLanguageModel &model,
+                          DataLoader &dl,
+                          int n_tokens)
+{
+      std::vector<int> ctx = {0};
+      for (int i = 0; i < n_tokens; ++i)
+      {
+            ctx = model.generate(ctx, 1);
+            std::cout << dl.decode({ctx.back()}) << std::flush;
+            if ((int)ctx.size() > BLOCK_SIZE)
+                  ctx = std::vector<int>(ctx.end() - BLOCK_SIZE, ctx.end());
+      }
+      std::cout << "\n";
+}
+
 // estimate loss — no gradients, training=false
 static float estimate_loss(GPTLanguageModel &model,
                            DataLoader &dl,
@@ -184,10 +200,7 @@ int main(int argc, char *argv[])
       std::signal(SIGINT, sig_handler);
 
       // Banner
-      std::cout << std::string(60, '=') << "\n";
       std::cout << " Quadtrix v1.0 (C++)\n";
-      std::cout << std::string(60, '=') << "\n";
-      std::cout << "\n[INFO] Starting at: " << now_str() << "\n";
 
       std::string data_path = DEFAULT_CLEANED_PATH;
       const char *env_data_path = std::getenv(DATA_PATH_ENV_VAR.c_str());
@@ -219,17 +232,6 @@ int main(int argc, char *argv[])
       data_path = choose_existing_path(data_path, argv[0]);
       model_path = choose_output_path(model_path, argv[0]);
 
-      // Config print
-      std::cout << "\n[CONFIG] Hyperparameters:\n";
-      std::cout << "         batch_size=" << BATCH_SIZE
-                << "  block_size=" << BLOCK_SIZE << "\n";
-      std::cout << "         max_iters=" << MAX_ITERS
-                << "  learning_rate=" << LEARNING_RATE << "\n";
-      std::cout << "         n_embd=" << N_EMBD
-                << "  n_head=" << N_HEAD
-                << "  n_layer=" << N_LAYER
-                << "  dropout=" << DROPOUT << "\n";
-
       //  Data
       DataLoader dl;
       try
@@ -247,13 +249,12 @@ int main(int argc, char *argv[])
       GPTLanguageModel model(dl.vocab_size, N_EMBD, N_HEAD, N_LAYER, BLOCK_SIZE, SEED);
 
       long n_params = model.num_params();
-      std::cout << "[MODEL] Parameters  : "
-                << std::fixed << std::setprecision(2)
-                << n_params / 1.0e6f << " M  (" << n_params << " total)\n";
-      std::cout << "[MODEL] Architecture: "
-                << N_LAYER << " layers x "
-                << N_HEAD << " heads x "
-                << N_EMBD << " embedding dim\n";
+      std::cout << "max_seq_len: " << BLOCK_SIZE << "\n";
+      std::cout << "vocab_size: " << dl.vocab_size << "\n";
+      std::cout << "num_layers: " << N_LAYER << "\n";
+      std::cout << "num_heads: " << N_HEAD << "\n";
+      std::cout << "channels: " << N_EMBD << "\n";
+      std::cout << "num_parameters: " << n_params << "\n";
 
       // chat mode
       if (chat_mode)
@@ -268,9 +269,8 @@ int main(int argc, char *argv[])
             }
 
             model.load(model_path);
-            std::cout << "[CHAT]  Weights loaded from " << model_path << "\n";
-            std::cout << "[CHAT]  Max tokens per reply: " << chat_tokens
-                      << "  (override with --chat-tokens N)\n";
+            std::cout << "weights: " << model_path << "\n";
+            std::cout << "max_tokens: " << chat_tokens << "\n";
 
             run_chat(model, dl, chat_tokens);
             return 0;
@@ -289,10 +289,7 @@ int main(int argc, char *argv[])
             }
 
             model.load(model_path);
-            std::cout << "\n"
-                      << std::string(60, '-') << "\n";
-            std::cout << "  Quadtrix OUTPUT  (Ctrl+C to stop)\n";
-            std::cout << std::string(60, '-') << "\n\n";
+            std::cout << "\ngenerating:\n";
             std::vector<int> ctx = {0};
             while (!g_interrupted)
             {
@@ -301,7 +298,7 @@ int main(int argc, char *argv[])
                   if ((int)ctx.size() > BLOCK_SIZE)
                         ctx = std::vector<int>(ctx.end() - BLOCK_SIZE, ctx.end());
             }
-            std::cout << "\n\n[Stopped by user]\n";
+            std::cout << "\n";
             return 0;
       }
 
@@ -312,114 +309,78 @@ int main(int argc, char *argv[])
       std::mt19937 rng(SEED);
 
       // training loop
-      std::cout << "\n"
-                << std::string(60, '-') << "\n";
-      std::cout << "  TRAINING  ("
-                << MAX_ITERS << " iters, eval every "
-                << EVAL_INTERVAL << ")\n";
-      std::cout << std::string(60, '-') << "\n";
 
       float best_val_loss = 1e30f;
+      float last_val_loss = 0.0f;
       double train_start = wall_secs();
-      double last_eval_time = train_start; // ← tracks time of previous eval
 
-      for (int iter = 0; iter <= MAX_ITERS && !g_interrupted; ++iter)
+      // compute initial val loss before training
       {
+            std::mt19937 init_rng(SEED);
+            last_val_loss = estimate_loss(model, dl, "val", init_rng);
+      }
 
-            // Periodic eval checkpoint
-            if (iter % EVAL_INTERVAL == 0 || iter == MAX_ITERS)
-            {
-                  double now = wall_secs();
-                  double elapsed = now - train_start;
-
-                  // ms per training step since the last eval window
-                  double window_secs = now - last_eval_time;
-                  int steps_in_win = (iter == 0) ? 1 : EVAL_INTERVAL;
-                  double ms_per_step = window_secs * 1000.0 / steps_in_win;
-
-                  // tokens processed per second
-                  long toks_in_win = (long)BATCH_SIZE * BLOCK_SIZE * steps_in_win;
-                  int tok_per_sec = (window_secs > 0.0)
-                                        ? (int)(toks_in_win / window_secs)
-                                        : 0;
-
-                  last_eval_time = now; // reset window
-
-                  float tl = estimate_loss(model, dl, "train", rng);
-                  float vl = estimate_loss(model, dl, "val", rng);
-
-                  bool better = vl < best_val_loss;
-                  if (better)
-                  {
-                        best_val_loss = vl;
-                        model.save(model_path);
-                  }
-
-                  // ── new log line ─────────────────────────────────────────────
-                  std::cout
-                      << "step "
-                      << std::setw(5) << iter << "/" << MAX_ITERS
-                      << " | loss "
-                      << std::fixed << std::setprecision(6) << tl
-                      << " | val "
-                      << std::fixed << std::setprecision(6) << vl
-                      << " | lr "
-                      << std::scientific << std::setprecision(2) << (float)LEARNING_RATE
-                      << " | "
-                      << std::fixed << std::setprecision(2) << ms_per_step << " ms"
-                      << " | " << tok_per_sec << " tok/s"
-                      << (better ? "  *best*" : "")
-                      << "\n";
-                  std::cout.flush();
-
-                  if (iter == MAX_ITERS)
-                        break;
-            }
+      for (int iter = 1; iter <= MAX_ITERS && !g_interrupted; ++iter)
+      {
+            double step_start = wall_secs();
 
-            // Sample training batch
+            // train step
             std::pair<std::vector<int>, std::vector<int>> batch =
                 dl.get_batch("train", BATCH_SIZE, BLOCK_SIZE, rng);
 
-            // Forward — saves all intermediate activations
             SavedForward saved = forward_save(model,
                                               batch.first, BATCH_SIZE, BLOCK_SIZE,
                                               batch.second, /*training=*/true);
 
-            //  Backward — exact analytical gradients
-            Grads grads = backward(model, saved);
+            float batch_loss = model.forward(batch.first, BATCH_SIZE, BLOCK_SIZE,
+                                             batch.second, /*training=*/false)
+                                   .second;
 
-            // AdamW parameter update
+            Grads grads = backward(model, saved);
             apply_grads(model, grads, opt);
-      }
 
-      double total = wall_secs() - train_start;
-      std::cout << "\n[DONE]  Training finished in "
-                << std::fixed << std::setprecision(1) << total << "s ("
-                << total / 60.0 << " min)  |  Best val loss: "
-                << std::setprecision(4) << best_val_loss << "\n";
-      std::cout << "[SAVE]  Best weights saved to " << model_path << "\n";
+            double step_ms = (wall_secs() - step_start) * 1000.0;
+            int tok_per_sec = (step_ms > 0.0)
+                                  ? (int)((long)BATCH_SIZE * BLOCK_SIZE / (step_ms / 1000.0))
+                                  : 0;
 
-      //  Continuous generation
-      std::cout << "\n"
-                << std::string(60, '-') << "\n";
-      std::cout << "  MODEL OUTPUT  (Ctrl+C to stop)\n";
-      std::cout << std::string(60, '-') << "\n\n";
+            // every EVAL_INTERVAL steps: compute val, save if best, sample
+            bool better = false;
+            if (iter % EVAL_INTERVAL == 0 || iter == MAX_ITERS)
+            {
+                  last_val_loss = estimate_loss(model, dl, "val", rng);
+                  if (last_val_loss < best_val_loss)
+                  {
+                        best_val_loss = last_val_loss;
+                        model.save(model_path);
+                        better = true;
+                  }
+            }
 
-      model.load(model_path);
-      model.rng = std::mt19937(SEED + 42);
+            // print every step
+            std::cout
+                << "step"
+                << std::setw(5) << iter << "/" << MAX_ITERS
+                << " | loss "
+                << std::fixed << std::setprecision(6) << batch_loss
+                << " | val "
+                << std::fixed << std::setprecision(6) << last_val_loss
+                << " | lr "
+                << std::scientific << std::setprecision(2) << (float)LEARNING_RATE
+                << " | "
+                << std::fixed << std::setprecision(2) << step_ms << " ms"
+                << " | " << tok_per_sec << " tok/s"
+                << (better ? "  *best*" : "")
+                << "\n";
+            std::cout.flush();
 
-      std::vector<int> ctx = {0};
-      while (!g_interrupted)
-      {
-            ctx = model.generate(ctx, 1);
-            std::cout << dl.decode({ctx.back()}) << std::flush;
-            if ((int)ctx.size() > BLOCK_SIZE)
-                  ctx = std::vector<int>(ctx.end() - BLOCK_SIZE, ctx.end());
+            // sample after every eval window
+            if (iter % EVAL_INTERVAL == 0 || iter == MAX_ITERS)
+            {
+                  std::cout << "generating:\n";
+                  sample_tokens(model, dl, iter == MAX_ITERS ? 10000 : 150);
+            }
       }
 
-      std::cout << "\n\n[Stopped by user]\n";
-      std::cout << "[TOTAL] Wall-clock: "
-                << std::fixed << std::setprecision(1)
-                << (wall_secs() - train_start) << "s\n";
       return 0;
 }
\ No newline at end of file
diff --git a/run.md b/run.md
deleted file mode 100644
index a2c0e65..0000000
--- a/run.md
+++ /dev/null
@@ -1,492 +0,0 @@
-# Quadtrix.cpp
-
-Quadtrix.cpp is a local GPT-style language model project with multiple runtime paths:
-
-- Native C++ inference and training through `Quadtrix.exe` / `main.cpp`
-- PyTorch checkpoint inference through `engine/inference.py` and `engine/best_model .pt`
-- FastAPI middleware in `backend/`
-- React + TypeScript chat UI in `frontend/`
-
-The web interface can chat with both model backends:
-
-- `C++`: calls the C++ HTTP server on port `8080`
-- `.pt`: loads the PyTorch checkpoint directly from `engine/best_model .pt`
-
-## Project Layout
-
-```text
-Quadtrix.cpp/
-  Quadtrix.exe
-  main.cpp
-  config/
-  include/
-  data/
-  engine/
-    inference.py
-    main.py
-    fine-tune/main.py
-    best_model .pt
-    fineweb_30mb.txt
-  backend/
-    main.py
-    inference.py
-    requirements.txt
-  frontend/
-    package.json
-    src/
-```
-
-## Requirements
-
-- Python 3.10+
-- Node.js 18+
-- npm
-- C++17 compiler if you want to rebuild the C++ executable
-
-## 1. Python Setup
-
-From the repo root:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp
-python -m venv .venv
-.\.venv\Scripts\python.exe -m pip install --upgrade pip
-```
-
-Install backend and PyTorch inference dependencies:
-
-```powershell
-cd backend
-..\.venv\Scripts\python.exe -m pip install -r requirements.txt
-```
-
-## 2. Frontend Setup
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend
-npm.cmd install
-npm.cmd run build
-```
-
-Run the frontend:
-
-```powershell
-npm.cmd run dev
-```
-
-Frontend URL:
-
-```text
-http://localhost:5173
-```
-
-## Install as a Web App
-
-The frontend is configured as an installable PWA. It includes:
-
-- `frontend/manifest.webmanifest`
-- `frontend/sw.js`
-- `frontend/public/manifest.webmanifest`
-- `frontend/public/sw.js`
-- service worker registration in `frontend/src/registerServiceWorker.ts`
-
-For the clean installable version, build and preview the frontend:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend
-npm.cmd run build
-npm.cmd run preview
-```
-
-Open the preview URL, usually:
-
-```text
-http://localhost:4173
-```
-
-Then install from the browser:
-
-- Chrome / Edge: click the install icon in the address bar
-- Or open browser menu -> Apps -> Install this site as an app
-
-The installed app still talks to the backend at:
-
-```text
-http://localhost:3001
-```
-
-So keep the FastAPI backend running when chatting.
-
-## 3. Run the PyTorch `.pt` Model in the Web UI
-
-The `.pt` model does not need a separate model server. The FastAPI backend loads it directly from:
-
-```text
-engine/best_model .pt
-```
-
-Start the backend:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\backend
-..\.venv\Scripts\python.exe -m uvicorn main:app --host 127.0.0.1 --port 3001
-```
-
-Start the frontend in another terminal:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend
-npm.cmd run dev
-```
-
-Open:
-
-```text
-http://localhost:5173
-```
-
-Select `.pt` in the top bar.
-
-## 4. Run the C++ Model in the Web UI
-
-Start the C++ inference server:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp
-.\Quadtrix.exe data\input.txt --server --port 8080
-```
-
-Start the backend:
-
-```powershell
-cd backend
-..\.venv\Scripts\python.exe -m uvicorn main:app --host 127.0.0.1 --port 3001
-```
-
-Start the frontend:
-
-```powershell
-cd ..\frontend
-npm.cmd run dev
-```
-
-Open:
-
-```text
-http://localhost:5173
-```
-
-Select `C++` in the top bar.
-
-## 5. Run Both Backends Together
-
-Use three terminals.
-
-Terminal 1:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp
-.\Quadtrix.exe data\input.txt --server --port 8080
-```
-
-Terminal 2:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\backend
-..\.venv\Scripts\python.exe -m uvicorn main:app --host 127.0.0.1 --port 3001
-```
-
-Terminal 3:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend
-npm.cmd run dev
-```
-
-Open:
-
-```text
-http://localhost:5173
-```
-
-Switch between `C++` and `.pt` from the model selector.
-
-## 6. Backend API
-
-Base URL:
-
-```text
-http://localhost:3001
-```
-
-Routes:
-
-```text
-GET    /api/health
-GET    /api/stats
-POST   /api/chat
-GET    /api/sessions
-POST   /api/sessions
-DELETE /api/sessions/{id}
-GET    /api/sessions/{id}/messages
-POST   /api/feedback
-```
-
-Example `.pt` chat request:
-
-```powershell
-Invoke-RestMethod `
-  -Uri http://localhost:3001/api/chat `
-  -Method Post `
-  -ContentType "application/json" `
-  -Body '{
-    "session_id": null,
-    "prompt": "Once upon a time",
-    "max_tokens": 100,
-    "temperature": 1.0,
-    "stream": false,
-    "model_backend": "torch"
-  }'
-```
-
-Example C++ chat request:
-
-```powershell
-Invoke-RestMethod `
-  -Uri http://localhost:3001/api/chat `
-  -Method Post `
-  -ContentType "application/json" `
-  -Body '{
-    "session_id": null,
-    "prompt": "Once upon a time",
-    "max_tokens": 100,
-    "temperature": 1.0,
-    "stream": false,
-    "model_backend": "cpp"
-  }'
-```
-
-## 7. Environment Variables
-
-Backend defaults are in `backend/.env.example`:
-
-```text
-API_PORT=3001
-CORS_ORIGINS=http://localhost:5173
-REDIS_URL=
-LOG_LEVEL=INFO
-MAX_SESSIONS=1000
-SESSION_TTL_HOURS=24
-CPP_SERVER_URL=http://localhost:8080
-TORCH_CHECKPOINT_PATH=../engine/best_model .pt
-REQUEST_TIMEOUT_SECONDS=60
-```
-
-Create `backend/.env` if you want overrides.
-
-Frontend defaults are in `frontend/.env.example`:
-
-```text
-VITE_API_BASE_URL=http://localhost:3001
-```
-
-## 8. PyTorch CLI Inference
-
-Interactive chat:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp
-.\.venv\Scripts\python.exe engine\inference.py --checkpoint "engine\best_model .pt"
-```
-
-Generate once:
-
-```powershell
-.\.venv\Scripts\python.exe engine\inference.py --checkpoint "engine\best_model .pt" --prompt "Hello" --max-new-tokens 100 --temperature 1.0
-```
-
-## 9. PyTorch Training
-
-Main training:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp
-.\.venv\Scripts\python.exe engine\main.py
-```
-
-Fine-tuning:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp
-.\.venv\Scripts\python.exe engine\fine-tune\main.py
-```
-
-## 10. C++ Build and Run
-
-Build manually:
-
-```powershell
-g++ -std=c++17 -O2 -I. -Iinclude -o Quadtrix.exe main.cpp
-```
-
-Train from scratch:
-
-```powershell
-.\Quadtrix.exe data\input.txt
-```
-
-Terminal chat:
-
-```powershell
-.\Quadtrix.exe data\input.txt --chat
-```
-
-Raw generation:
-
-```powershell
-.\Quadtrix.exe data\input.txt --generate
-```
-
-HTTP server:
-
-```powershell
-.\Quadtrix.exe data\input.txt --server --port 8080
-```
-
-## 11. Health Checks
-
-Backend:
-
-```powershell
-Invoke-RestMethod http://localhost:3001/api/health
-```
-
-C++ server:
-
-```powershell
-Invoke-RestMethod http://localhost:8080/health
-```
-
-Frontend:
-
-```text
-http://localhost:5173
-```
-
-When only `.pt` is available, backend health should show:
-
-```json
-{
-  "status": "degraded",
-  "api": "ok",
-  "cpp_server": "unreachable",
-  "torch_model": "ok"
-}
-```
-
-When both are available, backend health should show:
-
-```json
-{
-  "status": "ok",
-  "api": "ok",
-  "cpp_server": "ok",
-  "torch_model": "ok"
-}
-```
-
-## 12. Troubleshooting
-
-### PowerShell blocks `npm`
-
-Use `npm.cmd`:
-
-```powershell
-npm.cmd run dev
-npm.cmd run build
-```
-
-### `.pt` model is unavailable
-
-Check that this file exists:
-
-```text
-engine/best_model .pt
-```
-
-Then check Python dependencies:
-
-```powershell
-cd backend
-..\.venv\Scripts\python.exe -c "import torch, tiktoken; print(torch.__version__)"
-```
-
-### Backend cannot import FastAPI
-
-Install dependencies into the repo venv:
-
-```powershell
-cd backend
-..\.venv\Scripts\python.exe -m pip install -r requirements.txt
-```
-
-### C++ option is offline
-
-Start the C++ server:
-
-```powershell
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp
-.\Quadtrix.exe data\input.txt --server --port 8080
-```
-
-### Frontend cannot reach backend
-
-Check:
-
-```text
-http://localhost:3001/api/health
-```
-
-Make sure frontend config points to:
-
-```text
-VITE_API_BASE_URL=http://localhost:3001
-```
-
-### Port already in use
-
-```powershell
-Get-NetTCPConnection -LocalPort 3001
-Get-NetTCPConnection -LocalPort 5173
-Get-NetTCPConnection -LocalPort 8080
-```
-
-## Recommended Daily Run
-
-```powershell
-# Terminal 1
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp
-.\Quadtrix.exe data\input.txt --server --port 8080
-```
-
-```powershell
-# Terminal 2
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\backend
-..\.venv\Scripts\python.exe -m uvicorn main:app --host 127.0.0.1 --port 3001
-```
-
-```powershell
-# Terminal 3
-cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend
-npm.cmd run dev
-```
-
-Open:
-
-```text
-http://localhost:5173
-```
-
-## License
-
-MIT
diff --git a/scripts/build.sh b/scripts/build.sh
new file mode 100644
index 0000000..e36678b
--- /dev/null
+++ b/scripts/build.sh
@@ -0,0 +1,161 @@
+#!/usr/bin/env bash
+# Quadtrix.cpp - build.sh
+# Usage
+#   ./scripts/build.sh               # full stack, CPU
+#   ./scripts/build.sh dev           # hot-reload dev mode
+#   ./scripts/build.sh gpu           # CUDA backend
+#   ./scripts/build.sh cpp-only      # compile + run C++ engine
+#   ./scripts/build.sh train-cpp     # train with C++ backend
+#   ./scripts/build.sh train-torch   # train with PyTorch backend
+#   ./scripts/build.sh bench         # run benchmark
+#   ./scripts/build.sh clean         # remove containers + volumes
+#   ./scripts/build.sh logs          # tail all service logs
+
+set -euo pipefail
+
+BOLD="\033[1m"
+GREEN="\033[0;32m"
+CYAN="\033[0;36m"
+YELLOW="\033[1;33m"
+RED="\033[0;31m"
+RESET="\033[0m"
+
+info()    { echo -e "${CYAN}[quadtrix]${RESET} $*"; }
+success() { echo -e "${GREEN}[quadtrix]${RESET} $*"; }
+warn()    { echo -e "${YELLOW}[quadtrix]${RESET} $*"; }
+error()   { echo -e "${RED}[quadtrix] ERROR:${RESET} $*" >&2; }
+
+COMPOSE_BASE="docker compose -f docker-compose.yml"
+COMPOSE_DEV="${COMPOSE_BASE} -f docker-compose.dev.yml"
+COMPOSE_GPU="${COMPOSE_BASE} -f docker-compose.gpu.yml"
+
+check_docker() {
+    if ! docker info &>/dev/null; then
+        error "Docker daemon is not running. Start Docker Desktop or the Docker service."
+        exit 1
+    fi
+}
+
+check_nvidia() {
+    if ! command -v nvidia-smi &>/dev/null; then
+        warn "nvidia-smi not found — GPU mode may not work."
+    else
+        info "GPU detected: $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)"
+    fi
+}
+
+pull_cache() {
+    info "Pulling build cache images (if available)..."
+    $COMPOSE_BASE pull --ignore-pull-failures 2>/dev/null || true
+}
+
+cmd_up() {
+    check_docker
+    info "Starting full stack (CPU)..."
+    $COMPOSE_BASE up --build -d
+    success "Stack is up."
+    echo ""
+    echo -e "  ${BOLD}Frontend:${RESET}  http://localhost:5173"
+    echo -e "  ${BOLD}API:${RESET}       http://localhost:3001/api/health"
+    echo -e "  ${BOLD}Docs:${RESET}      http://localhost:3001/docs"
+}
+
+cmd_dev() {
+    check_docker
+    info "Starting in DEV mode (hot-reload)..."
+    $COMPOSE_DEV up --build
+}
+
+cmd_gpu() {
+    check_docker
+    check_nvidia
+    info "Starting with CUDA GPU support..."
+    $COMPOSE_GPU up --build -d
+    success "GPU stack is up."
+}
+
+cmd_cpp_only() {
+    check_docker
+    info "Compiling and running C++ engine..."
+    $COMPOSE_BASE --profile cpp run --rm cpp "$@"
+}
+
+cmd_train_cpp() {
+    check_docker
+    info "Training with C++ backend..."
+    $COMPOSE_BASE --profile train run --rm train-cpp
+    success "C++ training complete. Checkpoint saved in 'models' volume."
+}
+
+cmd_train_torch() {
+    check_docker
+    info "Training with PyTorch backend..."
+    $COMPOSE_BASE --profile train run --rm train-torch
+    success "PyTorch training complete. Checkpoint saved in 'models' volume."
+}
+
+cmd_bench() {
+    check_docker
+    info "Running benchmark..."
+    $COMPOSE_BASE --profile benchmark run --rm benchmark
+}
+
+cmd_logs() {
+    check_docker
+    $COMPOSE_BASE logs -f --tail=100
+}
+
+cmd_clean() {
+    check_docker
+    warn "This will remove all containers and volumes (including saved models!)"
+    read -r -p "Are you sure? [y/N] " confirm
+    if [[ "${confirm}" =~ ^[Yy]$ ]]; then
+        $COMPOSE_BASE down -v --remove-orphans
+        docker image prune -f --filter "label=org.opencontainers.image.source=https://github.com/Eamon2009/Quadtrix.cpp"
+        success "Cleaned."
+    else
+        info "Aborted."
+    fi
+}
+
+cmd_ps() {
+    $COMPOSE_BASE ps
+}
+
+cmd_shell() {
+    service="${1:-backend}"
+    info "Opening shell in '${service}'..."
+    $COMPOSE_BASE exec "${service}" /bin/sh
+}
+CMD="${1:-up}"
+shift || true
+
+case "${CMD}" in
+    up)           cmd_up "$@" ;;
+    dev)          cmd_dev "$@" ;;
+    gpu)          cmd_gpu "$@" ;;
+    cpp-only)     cmd_cpp_only "$@" ;;
+    train-cpp)    cmd_train_cpp "$@" ;;
+    train-torch)  cmd_train_torch "$@" ;;
+    bench)        cmd_bench "$@" ;;
+    logs)         cmd_logs "$@" ;;
+    clean)        cmd_clean "$@" ;;
+    ps)           cmd_ps "$@" ;;
+    shell)        cmd_shell "$@" ;;
+    *)
+        echo -e "Usage: ./scripts/build.sh ${BOLD}[command]${RESET}"
+        echo ""
+        echo "Commands:"
+        echo "  up           Full stack (CPU) — default"
+        echo "  dev          Hot-reload dev mode"
+        echo "  gpu          CUDA GPU stack"
+        echo "  cpp-only     Run C++ engine CLI"
+        echo "  train-cpp    Train with C++ backend"
+        echo "  train-torch  Train with PyTorch"
+        echo "  bench        Benchmark"
+        echo "  logs         Tail logs"
+        echo "  ps           Show container status"
+        echo "  shell [svc]  Shell into service (default: backend)"
+        echo "  clean        Remove all containers + volumes"
+        ;;
+esac

From 6facc3e7e450ef33072feaa427c42630a7a6a216 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Mon, 1 Jun 2026 00:54:48 +0530
Subject: [PATCH 02/45] ci: add manual PR checks workflow with slash command
 support

---
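Reviewer note (below the separator, so not part of the commit message):
the "Report status" steps in this patch map every non-success job.status
to 'failure'. GitHub Actions documents job.status as success, failure, or
cancelled, and the commit status API accepts error, failure, pending, or
success. The sketch below shows one possible mapping that reports a
cancelled job as 'error'. The helper names toCommitState and buildStatus
are hypothetical and do not exist in the workflow; treating a cancelled
job as 'error' is an assumption, not something the patch does.

```javascript
// Hypothetical helper (not part of this patch): maps a GitHub Actions
// job.status value to a state accepted by the commit status API.
function toCommitState(jobStatus) {
  switch (jobStatus) {
    case 'success':
      return 'success';
    case 'cancelled':
      // Assumption: a cancelled run is reported as 'error', not 'failure'.
      return 'error';
    default:
      // 'failure' and any unexpected value fall back to 'failure'.
      return 'failure';
  }
}

// Hypothetical helper: builds the payload that a "Report status" step
// could pass to github.rest.repos.createCommitStatus.
function buildStatus(owner, repo, sha, context, jobStatus) {
  return {
    owner,
    repo,
    sha,
    state: toCommitState(jobStatus),
    context,
    description: jobStatus,
  };
}

// Placeholder owner, repo, and sha values for illustration only.
const example = buildStatus('owner', 'repo', 'abc123', 'Lint', 'cancelled');
console.log(example.state);
```

Inside a github-script step the same mapping could be inlined in the
`script:` body, keeping the payload construction in one place.
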
 .github/workflows/pr-check.yml | 59 ++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/.github/workflows/pr-check.yml b/.github/workflows/pr-check.yml
index 699b834..4824b9e 100644
--- a/.github/workflows/pr-check.yml
+++ b/.github/workflows/pr-check.yml
@@ -56,6 +56,23 @@ jobs:
             });
             core.setOutput('sha', pr.head.sha);
 
+      - name: Set checks to pending
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const sha = '${{ steps.get-sha.outputs.sha }}';
+            const checks = ['Lint', 'Build C++ (ubuntu-22.04)', 'Build C++ (ubuntu-24.04)', 'Build C++ (macos-14)', 'Validate'];
+            for (const check of checks) {
+              await github.rest.repos.createCommitStatus({
+                owner: context.repo.owner,
+                repo:  context.repo.repo,
+                sha,
+                state: 'pending',
+                context: check,
+                description: 'Waiting...',
+              });
+            }
+
 
   lint:
     name: Lint
@@ -77,6 +94,20 @@ jobs:
         with:
           args: "check engine/ --ignore E501 --exit-zero"
 
+      - name: Report status
+        if: always()
+        uses: actions/github-script@v7
+        with:
+          script: |
+            await github.rest.repos.createCommitStatus({
+              owner: context.repo.owner,
+              repo:  context.repo.repo,
+              sha:   '${{ needs.slash-command.outputs.pr-sha }}',
+              state: '${{ job.status }}' === 'success' ? 'success' : 'failure',
+              context: 'Lint',
+              description: '${{ job.status }}',
+            });
+
 
   build-cpp:
     name: Build C++ (${{ matrix.os }})
@@ -125,6 +156,20 @@ jobs:
           path: quadtrix
           retention-days: 7
 
+      - name: Report status
+        if: always()
+        uses: actions/github-script@v7
+        with:
+          script: |
+            await github.rest.repos.createCommitStatus({
+              owner: context.repo.owner,
+              repo:  context.repo.repo,
+              sha:   '${{ needs.slash-command.outputs.pr-sha }}',
+              state: '${{ job.status }}' === 'success' ? 'success' : 'failure',
+              context: 'Build C++ (${{ matrix.os }})',
+              description: '${{ job.status }}',
+            });
+
 
   validate:
     name: Validate
@@ -171,6 +216,20 @@ jobs:
           dockerfile: .devops/Dockerfile.backend
           failure-threshold: error
 
+      - name: Report status
+        if: always()
+        uses: actions/github-script@v7
+        with:
+          script: |
+            await github.rest.repos.createCommitStatus({
+              owner: context.repo.owner,
+              repo:  context.repo.repo,
+              sha:   '${{ needs.slash-command.outputs.pr-sha }}',
+              state: '${{ job.status }}' === 'success' ? 'success' : 'failure',
+              context: 'Validate',
+              description: '${{ job.status }}',
+            });
+
 
   post-result:
     name: Post result

From 40b8bd93fb776c075ba10cc3cf7b3b2e7f992843 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Mon, 1 Jun 2026 01:00:08 +0530
Subject: [PATCH 03/45] feat(cuda): add attention forward backward kernel
 declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` inside the `quadtrix::cuda` namespace.
Both take a `num_heads` argument and an optional CUDA stream.

---------

Co-authored-by: Max <eamon5174@gmail.com>
---
 CUDA/includes/attention.cuh | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 CUDA/includes/attention.cuh

diff --git a/CUDA/includes/attention.cuh b/CUDA/includes/attention.cuh
new file mode 100644
index 0000000..7feac08
--- /dev/null
+++ b/CUDA/includes/attention.cuh
@@ -0,0 +1,29 @@
+#pragma once
+
+#include "tensor.cuh"
+
+#include <cuda_runtime.h>
+
+namespace quadtrix {
+namespace cuda {
+
+Status attention_forward(
+    const TensorView& input_qkv,
+    TensorView preatt,
+    TensorView att,
+    TensorView output,
+    int num_heads,
+    cudaStream_t stream = nullptr);
+
+Status attention_backward(
+    const TensorView& grad_output,
+    const TensorView& input_qkv,
+    const TensorView& att,
+    TensorView grad_input_qkv,
+    TensorView grad_preatt,
+    TensorView grad_att,
+    int num_heads,
+    cudaStream_t stream = nullptr);
+
+}  // namespace cuda
+}  // namespace quadtrix

From 4aac832e725f1ec5b2136b3167bfa7028e714ee5 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Mon, 1 Jun 2026 22:30:58 +0530
Subject: [PATCH 04/45] feat(cuda): add checkpoint metadata struct and stub
 functions

---
 CUDA/includes/checkpoint.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)
 create mode 100644 CUDA/includes/checkpoint.h

diff --git a/CUDA/includes/checkpoint.h b/CUDA/includes/checkpoint.h
new file mode 100644
index 0000000..ba91b0f
--- /dev/null
+++ b/CUDA/includes/checkpoint.h
@@ -0,0 +1,25 @@
+#pragma once
+
+#include "tensor.cuh"
+
+namespace quadtrix {
+namespace cuda {
+
+struct CheckpointMetadata {
+    int vocab_size = 0;
+    int max_sequence_length = 0;
+    int num_layers = 0;
+    int num_heads = 0;
+    int channels = 0;
+};
+
+inline bool load_checkpoint_metadata(const char*, CheckpointMetadata*) {
+    return false;
+}
+
+inline bool save_tensor_checkpoint(const char*, const TensorView&) {
+    return false;
+}
+
+}  // namespace cuda
+}  // namespace quadtrix

From 47696058b34c95c45e715fb7b25dcec5a28ea955 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Mon, 1 Jun 2026 22:34:04 +0530
Subject: [PATCH 05/45] feat(cuda): introduce core type definitions and error
 handling utilities

- Defines the `DType` enum (F32, F16, BF16, I32, U8) and the `DeviceKind` enum (CPU, CUDA).
- Implements the `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation, alongside `checked_mul` for overflow-safe allocation size computation.
- Introduces the `check_cuda` and `abort_on_cuda` helper functions, exposed via the `QUADTRIX_CUDA_CHECK` and `QUADTRIX_CUDA_ABORT` macros.
---
 CUDA/includes/common.h | 120 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)
 create mode 100644 CUDA/includes/common.h
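
Note on `checked_mul`: a minimal sketch of how the helper could be used to
compute an allocation size without overflow. `checked_mul` is copied from this
patch so the snippet compiles without the CUDA toolkit; `checked_num_bytes` is
a hypothetical helper for illustration and is not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Local copy of checked_mul from CUDA/includes/common.h, repeated here so the
// sketch builds without <cuda_runtime.h>.
inline bool checked_mul(std::size_t lhs, std::size_t rhs, std::size_t* out) {
    if (lhs != 0 && rhs > std::numeric_limits<std::size_t>::max() / lhs) {
        return false;
    }
    *out = lhs * rhs;
    return true;
}

// Hypothetical helper (assumption, not in the patch): total bytes for a tensor
// with `rank` dimensions and `element_size` bytes per element.
// Returns false on overflow and leaves *out_bytes untouched.
inline bool checked_num_bytes(const std::size_t* dims, std::size_t rank,
                              std::size_t element_size, std::size_t* out_bytes) {
    assert(out_bytes != nullptr);
    std::size_t total = element_size;
    for (std::size_t i = 0; i < rank; ++i) {
        // checked_mul takes lhs by value, so writing back into `total` is safe.
        if (!checked_mul(total, dims[i], &total)) {
            return false;
        }
    }
    *out_bytes = total;
    return true;
}
```

A caller would typically pass `dtype_size(dtype)` as `element_size` and turn a
`false` return into a `Status::failure(...)` before calling `cudaMalloc`.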

diff --git a/CUDA/includes/common.h b/CUDA/includes/common.h
new file mode 100644
index 0000000..36df155
--- /dev/null
+++ b/CUDA/includes/common.h
@@ -0,0 +1,120 @@
+#pragma once
+
+#include <cuda_runtime.h>
+
+#include <cstddef>
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+#include <limits>
+
+namespace quadtrix {
+namespace cuda {
+
+enum class DType : std::uint8_t {
+    F32,
+    F16,
+    BF16,
+    I32,
+    U8,
+};
+
+enum class DeviceKind : std::uint8_t {
+    CPU,
+    CUDA,
+};
+
+struct Status {
+    bool ok;
+    cudaError_t cuda_error;
+    const char* message;
+
+    static Status success() {
+        return {true, cudaSuccess, "ok"};
+    }
+
+    static Status failure(cudaError_t error, const char* message) {
+        return {false, error, message};
+    }
+};
+
+inline const char* dtype_name(DType dtype) {
+    switch (dtype) {
+        case DType::F32:
+            return "f32";
+        case DType::F16:
+            return "f16";
+        case DType::BF16:
+            return "bf16";
+        case DType::I32:
+            return "i32";
+        case DType::U8:
+            return "u8";
+    }
+    return "unknown";
+}
+
+inline std::size_t dtype_size(DType dtype) {
+    switch (dtype) {
+        case DType::F32:
+            return 4;
+        case DType::F16:
+            return 2;
+        case DType::BF16:
+            return 2;
+        case DType::I32:
+            return 4;
+        case DType::U8:
+            return 1;
+    }
+
+    std::fprintf(stderr, "Unknown CUDA dtype value %u\n", static_cast<unsigned int>(dtype));
+    std::abort();
+}
+
+inline bool checked_mul(std::size_t lhs, std::size_t rhs, std::size_t* out) {
+    if (lhs != 0 && rhs > std::numeric_limits<std::size_t>::max() / lhs) {
+        return false;
+    }
+    *out = lhs * rhs;
+    return true;
+}
+
+inline Status check_cuda(cudaError_t error, const char* expression, const char* file, int line) {
+    if (error == cudaSuccess) {
+        return Status::success();
+    }
+
+    std::fprintf(
+        stderr,
+        "CUDA error at %s:%d: %s failed with %s\n",
+        file,
+        line,
+        expression,
+        cudaGetErrorString(error));
+    return Status::failure(error, expression);
+}
+
+inline void abort_on_cuda(cudaError_t error, const char* expression, const char* file, int line) {
+    if (error == cudaSuccess) {
+        return;
+    }
+
+    std::fprintf(
+        stderr,
+        "Fatal CUDA error at %s:%d: %s failed with %s\n",
+        file,
+        line,
+        expression,
+        cudaGetErrorString(error));
+    std::abort();
+}
+
+}  // namespace cuda
+}  // namespace quadtrix
+
+#define QUADTRIX_CUDA_CHECK(expr) \
+    ::quadtrix::cuda::check_cuda((expr), #expr, __FILE__, __LINE__)
+
+#define QUADTRIX_CUDA_ABORT(expr) \
+    ::quadtrix::cuda::abort_on_cuda((expr), #expr, __FILE__, __LINE__)

From 7c94958781dddc8d38a30d34dd343a00417c7fc7 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Mon, 1 Jun 2026 22:34:39 +0530
Subject: [PATCH 06/45] feat(cuda): add TokenBatchView struct and DataLoader
 stub class

---
 CUDA/includes/dataloader.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 CUDA/includes/dataloader.h

diff --git a/CUDA/includes/dataloader.h b/CUDA/includes/dataloader.h
new file mode 100644
index 0000000..fd3c47d
--- /dev/null
+++ b/CUDA/includes/dataloader.h
@@ -0,0 +1,29 @@
+#pragma once
+
+#include <cstddef>
+#include <cstdint>
+
+namespace quadtrix {
+namespace cuda {
+
+struct TokenBatchView {
+    const std::int32_t* inputs = nullptr;
+    const std::int32_t* targets = nullptr;
+    int batch_size = 0;
+    int sequence_length = 0;
+};
+
+class DataLoader {
+public:
+    DataLoader() = default;
+
+    bool next(TokenBatchView* batch) {
+        if (batch != nullptr) {
+            *batch = {};
+        }
+        return false;
+    }
+};
+
+}  // namespace cuda
+}  // namespace quadtrix

From c62c869527bcf83ab494341b1667b7ac95e9af95 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Mon, 1 Jun 2026 22:35:34 +0530
Subject: [PATCH 07/45] feat(cuda): add GeLU activation forward and backward
 declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
---
 CUDA/includes/gelu.cuh | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 CUDA/includes/gelu.cuh
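
Note on `GeluMode`: the header only declares the kernels, so the formulas are
not fixed by this patch. The sketch below is a CPU reference under the
assumption that `Exact` means the erf-based GELU and `Approximate` means the
common tanh approximation. The function names are hypothetical and exist only
to show what a kernel test could compare against.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Mirrors the enum declared in CUDA/includes/gelu.cuh.
enum class GeluMode : std::uint8_t { Exact, Approximate };

constexpr float kSqrt2OverPi = 0.7978845608f;  // sqrt(2 / pi)
constexpr float kInvSqrt2 = 0.7071067812f;     // 1 / sqrt(2)
constexpr float kInvSqrt2Pi = 0.3989422804f;   // 1 / sqrt(2 * pi)
constexpr float kCubicCoeff = 0.044715f;

// Assumed definitions: erf-based GELU for Exact, tanh approximation otherwise.
inline float gelu_reference(float x, GeluMode mode) {
    if (mode == GeluMode::Exact) {
        return 0.5f * x * (1.0f + std::erf(x * kInvSqrt2));
    }
    const float inner = kSqrt2OverPi * (x + kCubicCoeff * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(inner));
}

// Derivative of gelu_reference with respect to x, for the same two modes.
inline float gelu_grad_reference(float x, GeluMode mode) {
    if (mode == GeluMode::Exact) {
        const float cdf = 0.5f * (1.0f + std::erf(x * kInvSqrt2));
        const float pdf = kInvSqrt2Pi * std::exp(-0.5f * x * x);
        return cdf + x * pdf;
    }
    const float inner = kSqrt2OverPi * (x + kCubicCoeff * x * x * x);
    const float t = std::tanh(inner);
    const float d_inner = kSqrt2OverPi * (1.0f + 3.0f * kCubicCoeff * x * x);
    return 0.5f * (1.0f + t) + 0.5f * x * (1.0f - t * t) * d_inner;
}

// Hypothetical host loop with the same argument order as gelu_forward.
inline void gelu_forward_reference(const float* input, float* output,
                                   std::size_t count, GeluMode mode) {
    assert(input != nullptr && output != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        output[i] = gelu_reference(input[i], mode);
    }
}
```

The backward reference multiplies `gelu_grad_reference(input[i], mode)` by the
incoming gradient, which matches the argument list of `gelu_backward`.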

diff --git a/CUDA/includes/gelu.cuh b/CUDA/includes/gelu.cuh
new file mode 100644
index 0000000..af87e64
--- /dev/null
+++ b/CUDA/includes/gelu.cuh
@@ -0,0 +1,31 @@
+#pragma once
+
+#include "tensor.cuh"
+
+#include <cuda_runtime.h>
+
+#include <cstdint>
+
+namespace quadtrix {
+namespace cuda {
+
+enum class GeluMode : std::uint8_t {
+    Exact,
+    Approximate,
+};
+
+Status gelu_forward(
+    const TensorView& input,
+    TensorView output,
+    GeluMode mode = GeluMode::Approximate,
+    cudaStream_t stream = nullptr);
+
+Status gelu_backward(
+    const TensorView& grad_output,
+    const TensorView& input,
+    TensorView grad_input,
+    GeluMode mode = GeluMode::Approximate,
+    cudaStream_t stream = nullptr);
+
+}  // namespace cuda
+}  // namespace quadtrix

From 28117dc6f6e5bb2be6544f0a9007043a943686c1 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Mon, 1 Jun 2026 22:47:36 +0530
Subject: [PATCH 08/45] feat(cuda): add gradient norm calculation and clipping
 interfaces

---
 CUDA/includes/global_norm.cuh | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100644 CUDA/includes/global_norm.cuh
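
Note on `clip_scale`: a minimal host-side sketch of how the scale factor is
meant to combine with a global L2 norm. `clip_scale` is copied from this patch;
`host_global_norm` and `host_clip` are hypothetical CPU stand-ins for
`global_norm_squared` and `clip_gradients_by_global_norm`, whose device
implementations are not part of this patch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Copy of clip_scale from CUDA/includes/global_norm.cuh.
inline float clip_scale(float global_norm, float max_norm) {
    return global_norm > max_norm && global_norm > 0.0f ? max_norm / global_norm : 1.0f;
}

// Hypothetical CPU reference: L2 norm taken over every gradient buffer at once.
inline float host_global_norm(const std::vector<std::vector<float>>& grads) {
    double sum_squares = 0.0;  // accumulate in double to limit rounding error
    for (const auto& buffer : grads) {
        for (float value : buffer) {
            sum_squares += static_cast<double>(value) * value;
        }
    }
    return static_cast<float>(std::sqrt(sum_squares));
}

// Hypothetical CPU reference: scales all buffers in place so the global norm
// does not exceed max_norm. Returns the norm measured before clipping.
inline float host_clip(std::vector<std::vector<float>>& grads, float max_norm) {
    assert(max_norm > 0.0f);
    const float norm = host_global_norm(grads);
    const float scale = clip_scale(norm, max_norm);
    for (auto& buffer : grads) {
        for (float& value : buffer) {
            value *= scale;
        }
    }
    return norm;
}
```

Because the scale is 1.0f whenever the norm is already within `max_norm`,
calling the clip step unconditionally leaves small gradients unchanged.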

diff --git a/CUDA/includes/global_norm.cuh b/CUDA/includes/global_norm.cuh
new file mode 100644
index 0000000..f418ab7
--- /dev/null
+++ b/CUDA/includes/global_norm.cuh
@@ -0,0 +1,26 @@
+#pragma once
+
+#include "tensor.cuh"
+
+#include <cuda_runtime.h>
+
+namespace quadtrix {
+namespace cuda {
+
+Status global_norm_squared(
+    const TensorView& grads,
+    TensorView partial_sums,
+    cudaStream_t stream = nullptr);
+
+Status clip_gradients_by_global_norm(
+    TensorView grads,
+    float global_norm,
+    float max_norm,
+    cudaStream_t stream = nullptr);
+
+inline float clip_scale(float global_norm, float max_norm) {
+    return global_norm > max_norm && global_norm > 0.0f ? max_norm / global_norm : 1.0f;
+}
+
+}  // namespace cuda
+}  // namespace quadtrix

From 3bdf5bed6472cf21ed9904ebe99d50d98689e79d Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Mon, 1 Jun 2026 22:48:31 +0530
Subject: [PATCH 09/45] feat(cuda): add LayerNorm forward and backward kernel
 declarations

---
 CUDA/includes/layernorm.cuh | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 CUDA/includes/layernorm.cuh
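
Note on `mean` and `rstd`: the header declares these as outputs of
`layernorm_forward` but does not define them. The sketch below assumes the
usual convention: per-row mean, biased variance, and `rstd` as the reciprocal
standard deviation `1 / sqrt(variance + epsilon)`. `layernorm_row_reference`
is a hypothetical CPU helper for a single row, not the kernel from this series.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical CPU reference for one row of `channels` values.
// Assumes biased variance (divide by channels) and rstd = 1/sqrt(var + eps).
inline void layernorm_row_reference(const float* input, const float* gamma,
                                    const float* beta, float* output,
                                    std::size_t channels, float epsilon,
                                    float* mean_out, float* rstd_out) {
    assert(channels > 0);
    float mean = 0.0f;
    for (std::size_t i = 0; i < channels; ++i) {
        mean += input[i];
    }
    mean /= static_cast<float>(channels);

    float variance = 0.0f;
    for (std::size_t i = 0; i < channels; ++i) {
        const float centered = input[i] - mean;
        variance += centered * centered;
    }
    variance /= static_cast<float>(channels);

    const float rstd = 1.0f / std::sqrt(variance + epsilon);
    for (std::size_t i = 0; i < channels; ++i) {
        // Normalize, then apply the learned scale and shift.
        output[i] = (input[i] - mean) * rstd * gamma[i] + beta[i];
    }
    *mean_out = mean;
    *rstd_out = rstd;
}
```

Saving `mean` and `rstd` in the forward pass lets the backward pass reuse them
instead of recomputing the statistics, which is consistent with
`layernorm_backward` taking both as inputs.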

diff --git a/CUDA/includes/layernorm.cuh b/CUDA/includes/layernorm.cuh
new file mode 100644
index 0000000..2645537
--- /dev/null
+++ b/CUDA/includes/layernorm.cuh
@@ -0,0 +1,32 @@
+#pragma once
+
+#include "tensor.cuh"
+
+#include <cuda_runtime.h>
+
+namespace quadtrix {
+namespace cuda {
+
+Status layernorm_forward(
+    const TensorView& input,
+    const TensorView& gamma,
+    const TensorView& beta,
+    TensorView output,
+    TensorView mean,
+    TensorView rstd,
+    float epsilon = 1.0e-5f,
+    cudaStream_t stream = nullptr);
+
+Status layernorm_backward(
+    const TensorView& grad_output,
+    const TensorView& input,
+    const TensorView& gamma,
+    const TensorView& mean,
+    const TensorView& rstd,
+    TensorView grad_input,
+    TensorView grad_gamma,
+    TensorView grad_beta,
+    cudaStream_t stream = nullptr);
+
+}  // namespace cuda
+}  // namespace quadtrix

From 3dba73a7650cdec838f5907f8fafcdcc5fd5cd65 Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Tue, 2 Jun 2026 22:34:54 +0530
Subject: [PATCH 10/45] refactor(ci): organize workflow into push-triggered QA
 and manual docker builds

Restrict push-triggered runs to `master`, rename the `push` input to `push_image`, and expand the image-selection input description.
---
 .github/workflows/ci.yml | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index bf49286..c30d16b 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -2,11 +2,11 @@ name: CI
 
 on:
   push:
-    branches: [master, dev]
+    branches: [master]
   workflow_dispatch:
     inputs:
       image:
-        description: "Which image to build?"
+        description: "Which image to build? (cpp=C++ engine, cpu=PyTorch CPU, cuda=PyTorch CUDA, all=all three)"
         required: true
         type: choice
         options:
@@ -14,7 +14,7 @@ on:
           - cpu
           - cuda
           - all
-      push:
+      push_image:
         description: "Push to ghcr.io?"
         required: true
         default: "true"
@@ -27,6 +27,7 @@ env:
 
 jobs:
 
+ 
   file-integrity:
     name: File integrity
     if: github.event_name == 'push'
@@ -86,8 +87,9 @@ jobs:
         run: ./quadtrix --help || true
 
 
+ 
   build-cpp-image:
-    name: Build — cpp
+    name: "Build — cpp (C++ engine · linux/amd64 + arm64)"
     if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all')
     runs-on: ubuntu-latest
     permissions:
@@ -100,7 +102,7 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Login to GHCR
-        if: inputs.push == 'true'
+        if: inputs.push_image == 'true'
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
@@ -123,7 +125,7 @@ jobs:
           context: .
           file: .devops/Dockerfile.cpp
           platforms: linux/amd64,linux/arm64
-          push: ${{ inputs.push == 'true' }}
+          push: ${{ inputs.push_image == 'true' }}
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha,scope=cpp
@@ -131,7 +133,7 @@ jobs:
 
 
   build-cpu-image:
-    name: Build — cpu
+    name: "Build — cpu (PyTorch CPU · linux/amd64 + arm64)"
     if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all')
     runs-on: ubuntu-latest
     permissions:
@@ -144,7 +146,7 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Login to GHCR
-        if: inputs.push == 'true'
+        if: inputs.push_image == 'true'
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
@@ -167,7 +169,7 @@ jobs:
           context: .
           file: .devops/Dockerfile
           platforms: linux/amd64,linux/arm64
-          push: ${{ inputs.push == 'true' }}
+          push: ${{ inputs.push_image == 'true' }}
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha,scope=cpu
@@ -175,7 +177,7 @@ jobs:
 
 
   build-cuda-image:
-    name: Build — cuda
+    name: "Build — cuda (PyTorch CUDA · linux/amd64 only)"
     if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all')
     runs-on: ubuntu-latest
     permissions:
@@ -187,7 +189,7 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Login to GHCR
-        if: inputs.push == 'true'
+        if: inputs.push_image == 'true'
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
@@ -210,8 +212,8 @@ jobs:
           context: .
           file: .devops/Dockerfile.backend
           platforms: linux/amd64
-          push: ${{ inputs.push == 'true' }}
+          push: ${{ inputs.push_image == 'true' }}
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha,scope=cuda
-          cache-to: type=gha,mode=max,scope=cuda
\ No newline at end of file
+          cache-to: type=gha,mode=max,scope=cuda

From 309183fdefb5ad77fd64828d147f1f04037f47ea Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Tue, 2 Jun 2026 22:45:48 +0530
Subject: [PATCH 11/45] Fix formatting and update CI workflow steps

---
 .github/workflows/ci.yml | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index c30d16b..0423bd2 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -45,9 +45,9 @@ jobs:
           failed=0
           for f in "${files[@]}"; do
             if [ -f "$f" ]; then
-              echo "✅  $f"
+              echo "PASS: $f"
             else
-              echo "❌  $f — MISSING"
+              echo "FAIL: $f -- MISSING"
               failed=1
             fi
           done
@@ -86,11 +86,17 @@ jobs:
       - name: Smoke test
         run: ./quadtrix --help || true
 
+      - name: Upload binary
+        uses: actions/upload-artifact@v4
+        with:
+          name: quadtrix-linux-amd64
+          path: quadtrix
+          retention-days: 7
+
 
- 
   build-cpp-image:
-    name: "Build — cpp (C++ engine · linux/amd64 + arm64)"
-    if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all')
+    name: "Build -- cpp (C++ engine - linux/amd64 + arm64)"
+    if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -102,7 +108,7 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Login to GHCR
-        if: inputs.push_image == 'true'
+        if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
@@ -133,8 +139,8 @@ jobs:
 
 
   build-cpu-image:
-    name: "Build — cpu (PyTorch CPU · linux/amd64 + arm64)"
-    if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all')
+    name: "Build -- cpu (PyTorch CPU - linux/amd64 + arm64)"
+    if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all') }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -146,7 +152,7 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Login to GHCR
-        if: inputs.push_image == 'true'
+        if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
@@ -177,8 +183,8 @@ jobs:
 
 
   build-cuda-image:
-    name: "Build — cuda (PyTorch CUDA · linux/amd64 only)"
-    if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all')
+    name: "Build -- cuda (PyTorch CUDA - linux/amd64 only)"
+    if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all') }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -189,7 +195,7 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Login to GHCR
-        if: inputs.push_image == 'true'
+        if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}

From ac398662e4e8bfab63840f3a709afd3fd0d6e5e9 Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Tue, 2 Jun 2026 22:49:14 +0530
Subject: [PATCH 12/45] Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.
---
 .github/workflows/ci.yml | 64 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 0423bd2..d0f158b 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -3,6 +3,8 @@ name: CI
 on:
   push:
     branches: [master]
+    tags:
+      - 'v*'
   workflow_dispatch:
     inputs:
       image:
@@ -23,11 +25,10 @@ on:
 
 env:
   REGISTRY: ghcr.io
-  IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix
 
 jobs:
 
- 
+
   file-integrity:
     name: File integrity
     if: github.event_name == 'push'
@@ -67,8 +68,8 @@ jobs:
           args: "check engine/ --ignore E501 --exit-zero"
 
 
-  build-cpp:
-    name: C++ compile check
+  build-binary-linux:
+    name: Binary (ubuntu-latest)
     if: github.event_name == 'push'
     runs-on: ubuntu-latest
     steps:
@@ -94,6 +95,52 @@ jobs:
           retention-days: 7
 
 
+  build-binary-macos:
+    name: Binary (macos-14)
+    if: github.event_name == 'push'
+    runs-on: macos-14
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Compile main.cpp
+        run: |
+          g++ -std=c++17 -O3 \
+            -I. -Iinclude \
+            -o quadtrix main.cpp
+
+      - name: Smoke test
+        run: ./quadtrix --help || true
+
+      - name: Package binary
+        run: tar -czf quadtrix-macos-arm64.tar.gz quadtrix
+
+      - name: Upload binary
+        uses: actions/upload-artifact@v4
+        with:
+          name: quadtrix-macos-arm64
+          path: quadtrix-macos-arm64.tar.gz
+          retention-days: 7
+
+  release:
+    name: Publish release
+    if: startsWith(github.ref, 'refs/tags/v')
+    needs: [build-binary-linux, build-binary-macos]
+    runs-on: ubuntu-latest
+    permissions:
+      contents: write
+    steps:
+      - name: Download all artifacts
+        uses: actions/download-artifact@v4
+        with:
+          path: dist/
+
+      - name: Publish GitHub release
+        uses: softprops/action-gh-release@v2
+        with:
+          files: |
+            dist/quadtrix-linux-amd64/quadtrix
+            dist/quadtrix-macos-arm64/quadtrix-macos-arm64.tar.gz
+
   build-cpp-image:
     name: "Build -- cpp (C++ engine - linux/amd64 + arm64)"
     if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') }}
@@ -107,6 +154,9 @@ jobs:
       - uses: docker/setup-qemu-action@v3
       - uses: docker/setup-buildx-action@v3
 
+      - name: Set lowercase image prefix
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+
       - name: Login to GHCR
         if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3
@@ -151,6 +201,9 @@ jobs:
       - uses: docker/setup-qemu-action@v3
       - uses: docker/setup-buildx-action@v3
 
+      - name: Set lowercase image prefix
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+
       - name: Login to GHCR
         if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3
@@ -194,6 +247,9 @@ jobs:
 
       - uses: docker/setup-buildx-action@v3
 
+      - name: Set lowercase image prefix
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+
       - name: Login to GHCR
         if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3

From e38ff85eca57521327e30c8983512a2351855622 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Tue, 2 Jun 2026 23:24:15 +0530
Subject: [PATCH 13/45] feat(docker): add dev Dockerfile for frontend application

---
 .devops/Dockerfile.dev.frontend | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 .devops/Dockerfile.dev.frontend

diff --git a/.devops/Dockerfile.dev.frontend b/.devops/Dockerfile.dev.frontend
new file mode 100644
index 0000000..de054d6
--- /dev/null
+++ b/.devops/Dockerfile.dev.frontend
@@ -0,0 +1,12 @@
+FROM node:20-alpine
+
+WORKDIR /app
+
+COPY frontend/package*.json ./
+RUN npm ci
+
+COPY frontend/ ./
+
+EXPOSE 5173
+
+CMD ["npm", "run", "dev", "--", "--host", "0.0.0.0"]

From b120ffd1b2726e11a8fbb23eed07223d0ea3136b Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Tue, 2 Jun 2026 23:24:32 +0530
Subject: [PATCH 14/45] feat(docker): add Dockerfile for frontend application

---
 .devops/Dockerfile.frontend | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 .devops/Dockerfile.frontend

diff --git a/.devops/Dockerfile.frontend b/.devops/Dockerfile.frontend
new file mode 100644
index 0000000..70ca5aa
--- /dev/null
+++ b/.devops/Dockerfile.frontend
@@ -0,0 +1,22 @@
+FROM node:20-alpine AS build
+
+WORKDIR /app
+
+COPY frontend/package*.json ./
+RUN npm ci
+
+COPY frontend/ ./
+
+ARG VITE_API_BASE_URL=/api
+ENV VITE_API_BASE_URL=${VITE_API_BASE_URL}
+
+RUN npm run build
+
+FROM nginx:1.27-alpine AS runtime
+
+COPY .devops/nginx.conf /etc/nginx/conf.d/default.conf
+COPY --from=build /app/dist /usr/share/nginx/html
+
+EXPOSE 80
+
+CMD ["nginx", "-g", "daemon off;"]

From 9156bba064a027ed817156b2d5ab287977f999bd Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Tue, 2 Jun 2026 23:25:34 +0530
Subject: [PATCH 15/45] refactor(ci): remove release job from GitHub actions

---
 .github/workflows/ci.yml | 22 ----------------------
 1 file changed, 22 deletions(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index d0f158b..e6502d0 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -3,8 +3,6 @@ name: CI
 on:
   push:
     branches: [master]
-    tags:
-      - 'v*'
   workflow_dispatch:
     inputs:
       image:
@@ -121,26 +119,6 @@ jobs:
           path: quadtrix-macos-arm64.tar.gz
           retention-days: 7
 
-  release:
-    name: Publish release
-    if: startsWith(github.ref, 'refs/tags/v')
-    needs: [build-binary-linux, build-binary-macos]
-    runs-on: ubuntu-latest
-    permissions:
-      contents: write
-    steps:
-      - name: Download all artifacts
-        uses: actions/download-artifact@v4
-        with:
-          path: dist/
-
-      - name: Publish GitHub release
-        uses: softprops/action-gh-release@v2
-        with:
-          files: |
-            dist/quadtrix-linux-amd64/quadtrix
-            dist/quadtrix-macos-arm64/quadtrix-macos-arm64.tar.gz
-
   build-cpp-image:
     name: "Build -- cpp (C++ engine - linux/amd64 + arm64)"
     if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') }}

From 8898418b0722fa8178fc1b7738a56432f884448b Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Tue, 2 Jun 2026 23:26:24 +0530
Subject: [PATCH 16/45] ci: add unified release and docker build workflow

---
 .github/workflows/docker-publish.yml | 214 ++++++++++++++++-----------
 1 file changed, 128 insertions(+), 86 deletions(-)

diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml
index ca9493f..3a8fbae 100644
--- a/.github/workflows/docker-publish.yml
+++ b/.github/workflows/docker-publish.yml
@@ -1,66 +1,86 @@
-name: Release
+name: Docker Images
 
 on:
+  push:
+    tags:
+      - "v*"
   workflow_dispatch:
     inputs:
+      image:
+        description: "Which image to build? (cpp=C++ engine, cpu=PyTorch CPU, cuda=PyTorch CUDA, all=all three)"
+        required: true
+        type: choice
+        options:
+          - cpp
+          - cpu
+          - cuda
+          - all
       version:
-        description: "Version tag (e.g. 1.2.3)"
+        description: "Optional image tag for manual runs"
+        required: false
+      push_image:
+        description: "Push to ghcr.io?"
         required: true
+        default: "true"
+        type: choice
+        options: ["true", "false"]
 
 env:
   REGISTRY: ghcr.io
   IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix
 
 jobs:
-
-  build-binaries:
-    name: Binary (${{ matrix.os }})
-    runs-on: ${{ matrix.os }}
-    strategy:
-      matrix:
-        os: [ubuntu-22.04, macos-14]
-        include:
-          - os: ubuntu-22.04
-            artifact_name: quadtrix-linux-x64
-            binary: quadtrix
-          - os: macos-14
-            artifact_name: quadtrix-macos-arm64
-            binary: quadtrix
+  build-cpp-image:
+    name: "Build -- cpp (C++ engine - linux/amd64 + arm64)"
+    if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all')) }}
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
     steps:
       - uses: actions/checkout@v4
 
-      - name: Compile (Linux)
-        if: runner.os == 'Linux'
-        run: |
-          sudo apt-get update && sudo apt-get install -y g++
-          g++ -std=c++17 -O3 -march=native \
-              -I. -Iinclude \
-              -o ${{ matrix.binary }} main.cpp
-          strip ${{ matrix.binary }}
-
-      - name: Compile (macOS)
-        if: runner.os == 'macOS'
-        run: |
-          g++ -std=c++17 -O3 -march=native \
-              -I. -Iinclude \
-              -o ${{ matrix.binary }} main.cpp
-
-      - name: Package
-        run: |
-          mkdir dist
-          cp ${{ matrix.binary }} dist/
-          cp README.md LICENSE dist/
-          tar -czf ${{ matrix.artifact_name }}.tar.gz -C dist .
-
-      - name: Upload to Release
-        uses: softprops/action-gh-release@v2
+      - uses: docker/setup-qemu-action@v3
+      - uses: docker/setup-buildx-action@v3
+
+      - name: Set lowercase image prefix
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+
+      - name: Login to GHCR
+        if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+        uses: docker/login-action@v3
         with:
-          tag_name: v${{ github.event.inputs.version }}
-          files: ${{ matrix.artifact_name }}.tar.gz
-          generate_release_notes: true
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
 
-  publish-images:
-    name: Publish Docker images
+      - name: Extract metadata
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_PREFIX }}-cpp
+          tags: |
+            type=ref,event=branch
+            type=sha,prefix=sha-
+            type=raw,value=latest,enable={{is_default_branch}}
+            type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }}
+            type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }}
+
+      - name: Build & push
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: .devops/Dockerfile.cpp
+          platforms: linux/amd64,linux/arm64
+          push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha,scope=cpp
+          cache-to: type=gha,mode=max,scope=cpp
+
+  build-cpu-image:
+    name: "Build -- cpu (PyTorch CPU - linux/amd64 + arm64)"
+    if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all')) }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -71,62 +91,84 @@ jobs:
       - uses: docker/setup-qemu-action@v3
       - uses: docker/setup-buildx-action@v3
 
+      - name: Set lowercase image prefix
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+
       - name: Login to GHCR
+        if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}
 
-      - name: Parse tag
-        id: tag
-        run: echo "VERSION=${{ github.event.inputs.version }}" >> $GITHUB_OUTPUT
-
-      - name: Build & push backend
-        uses: docker/build-push-action@v6
+      - name: Extract metadata
+        id: meta
+        uses: docker/metadata-action@v5
         with:
-          context: .
-          file: .devops/Dockerfile.backend
-          platforms: linux/amd64,linux/arm64
-          push: true
+          images: ${{ env.IMAGE_PREFIX }}-cpu
           tags: |
-            ${{ env.IMAGE_PREFIX }}-backend:latest
-            ${{ env.IMAGE_PREFIX }}-backend:${{ steps.tag.outputs.VERSION }}
-          cache-from: type=gha,scope=backend
-          cache-to: type=gha,mode=max,scope=backend
+            type=ref,event=branch
+            type=sha,prefix=sha-
+            type=raw,value=latest,enable={{is_default_branch}}
+            type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }}
+            type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }}
 
-      - name: Build & push frontend
+      - name: Build & push
         uses: docker/build-push-action@v6
         with:
           context: .
-          file: .devops/Dockerfile.frontend
+          file: .devops/Dockerfile
           platforms: linux/amd64,linux/arm64
-          push: true
+          push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha,scope=cpu
+          cache-to: type=gha,mode=max,scope=cpu
+
+  build-cuda-image:
+    name: "Build -- cuda (PyTorch CUDA - linux/amd64 only)"
+    if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all')) }}
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: docker/setup-buildx-action@v3
+
+      - name: Set lowercase image prefix
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+
+      - name: Login to GHCR
+        if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Extract metadata
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_PREFIX }}-cuda
           tags: |
-            ${{ env.IMAGE_PREFIX }}-frontend:latest
-            ${{ env.IMAGE_PREFIX }}-frontend:${{ steps.tag.outputs.VERSION }}
-          cache-from: type=gha,scope=frontend
-          cache-to: type=gha,mode=max,scope=frontend
+            type=ref,event=branch
+            type=sha,prefix=sha-
+            type=raw,value=latest,enable={{is_default_branch}}
+            type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }}
+            type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }}
 
-      - name: Build & push cpp
+      - name: Build & push
         uses: docker/build-push-action@v6
         with:
           context: .
-          file: .devops/Dockerfile.cpp
-          platforms: linux/amd64,linux/arm64
-          push: true
-          tags: |
-            ${{ env.IMAGE_PREFIX }}-cpp:latest
-            ${{ env.IMAGE_PREFIX }}-cpp:${{ steps.tag.outputs.VERSION }}
-          cache-from: type=gha,scope=cpp
-          cache-to: type=gha,mode=max,scope=cpp
-
-      - name: Create Release summary
-        run: |
-          echo "## Docker images published" >> $GITHUB_STEP_SUMMARY
-          echo "" >> $GITHUB_STEP_SUMMARY
-          echo "| Image | Tags |" >> $GITHUB_STEP_SUMMARY
-          echo "|-------|------|" >> $GITHUB_STEP_SUMMARY
-          echo "| \`quadtrix-backend\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY
-          echo "| \`quadtrix-frontend\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY
-          echo "| \`quadtrix-cpp\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY
+          file: .devops/Dockerfile.backend
+          platforms: linux/amd64
+          push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha,scope=cuda
+          cache-to: type=gha,mode=max,scope=cuda

From f4f3bf3daffe4f6786a085416dc48a3a4507c29f Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Tue, 2 Jun 2026 23:26:40 +0530
Subject: [PATCH 17/45] ci: add unified release and docker build workflow

---
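Note on the tag handling in prepare-release: the step strips an optional
leading "v" from the input and then re-adds it, so 1.2.3 and v1.2.3 both
yield the tag v1.2.3. A minimal sketch for trying this locally; the helper
name normalize_tag is hypothetical and not part of the workflow:

```shell
#!/bin/sh
# Hypothetical helper (not in the workflow): mirrors the prepare-release step.
normalize_tag() {
  raw_version="${1#v}"          # drop one leading "v" if present
  printf 'v%s\n' "$raw_version" # re-add it so the tag always starts with "v"
}

normalize_tag "1.2.3"
normalize_tag "v1.2.3"
```

Both calls print the same tag, which is what lets the workflow accept either
form from a tag push or a manual dispatch.
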
 .github/workflows/release.yml | 236 ++++++++++++++++++++++++++++++++++
 1 file changed, 236 insertions(+)
 create mode 100644 .github/workflows/release.yml

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
new file mode 100644
index 0000000..219b56c
--- /dev/null
+++ b/.github/workflows/release.yml
@@ -0,0 +1,236 @@
+name: Release
+
+on:
+  push:
+    tags:
+      - "v*"
+  workflow_dispatch:
+    inputs:
+      version:
+        description: "Release version, with or without a leading v"
+        required: true
+
+env:
+  ARTIFACT_ROOT: release-assets
+
+jobs:
+  prepare-release:
+    name: Prepare release metadata
+    runs-on: ubuntu-latest
+    outputs:
+      tag_name: ${{ steps.meta.outputs.tag_name }}
+      version: ${{ steps.meta.outputs.version }}
+    steps:
+      - id: meta
+        shell: bash
+        run: |
+          set -euo pipefail
+          if [ "${GITHUB_EVENT_NAME}" = "workflow_dispatch" ]; then
+            raw_version="${{ inputs.version }}"
+          else
+            raw_version="${GITHUB_REF_NAME}"
+          fi
+
+          raw_version="${raw_version#v}"
+          tag_name="v${raw_version}"
+
+          echo "tag_name=${tag_name}" >> "$GITHUB_OUTPUT"
+          echo "version=${raw_version}" >> "$GITHUB_OUTPUT"
+
+  build-linux:
+    name: Linux ${{ matrix.arch }} CPU
+    needs: prepare-release
+    runs-on: ${{ matrix.runner }}
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - arch: x64
+            runner: ubuntu-22.04
+            compiler: g++
+            packages: build-essential file
+            cxxflags: -std=c++17 -O3 -march=native
+            artifact: quadtrix-ubuntu-x64-cpu.tar.gz
+          - arch: arm64
+            runner: ubuntu-24.04-arm
+            compiler: g++
+            packages: build-essential file
+            cxxflags: -std=c++17 -O3 -march=native
+            artifact: quadtrix-ubuntu-arm64-cpu.tar.gz
+          - arch: s390x
+            runner: ubuntu-22.04
+            compiler: s390x-linux-gnu-g++
+            packages: g++-s390x-linux-gnu file
+            cxxflags: -std=c++17 -O3
+            artifact: quadtrix-ubuntu-s390x-cpu.tar.gz
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install toolchain
+        shell: bash
+        run: |
+          set -euo pipefail
+          sudo apt-get update
+          sudo apt-get install -y ${{ matrix.packages }}
+
+      - name: Build binary
+        shell: bash
+        run: |
+          set -euo pipefail
+          ${{ matrix.compiler }} ${{ matrix.cxxflags }} \
+            -I. -Iinclude \
+            -o quadtrix main.cpp
+          file quadtrix
+
+      - name: Smoke test
+        shell: bash
+        run: |
+          set +e
+          ./quadtrix --chat >/dev/null 2>&1
+          exit 0
+
+      - name: Package artifact
+        shell: bash
+        run: |
+          set -euo pipefail
+          mkdir -p "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu"
+          cp quadtrix "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu/"
+          cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu/"
+          tar -czf "${{ matrix.artifact }}" -C "${ARTIFACT_ROOT}" "quadtrix-ubuntu-${{ matrix.arch }}-cpu"
+
+      - name: Upload artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: quadtrix-ubuntu-${{ matrix.arch }}-cpu
+          path: ${{ matrix.artifact }}
+          if-no-files-found: error
+          retention-days: 30
+
+  build-windows:
+    name: Windows ${{ matrix.arch }} CPU
+    needs: prepare-release
+    runs-on: ${{ matrix.runner }}
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - arch: x64
+            runner: windows-latest
+            msvc_arch: x64
+            artifact: quadtrix-windows-x64-cpu.zip
+          - arch: arm64
+            runner: windows-11-arm
+            msvc_arch: arm64
+            artifact: quadtrix-windows-arm64-cpu.zip
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up MSVC
+        uses: ilammy/msvc-dev-cmd@v1
+        with:
+          arch: ${{ matrix.msvc_arch }}
+
+      - name: Build binary
+        shell: cmd
+        run: |
+          cl /nologo /std:c++17 /O2 /EHsc /Iinclude /I. main.cpp /Fe:quadtrix.exe
+
+      - name: Smoke test
+        shell: pwsh
+        run: |
+          $ErrorActionPreference = 'Continue'
+          & .\quadtrix.exe --chat | Out-Null
+          exit 0
+
+      - name: Package artifact
+        shell: pwsh
+        run: |
+          New-Item -ItemType Directory -Force "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu" | Out-Null
+          Copy-Item quadtrix.exe "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\"
+          Copy-Item README.md, LICENSE "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\"
+          Compress-Archive -Path "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\*" -DestinationPath "${{ matrix.artifact }}" -Force
+
+      - name: Upload artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: quadtrix-windows-${{ matrix.arch }}-cpu
+          path: ${{ matrix.artifact }}
+          if-no-files-found: error
+          retention-days: 30
+
+  build-macos:
+    name: macOS ${{ matrix.arch }} CPU
+    needs: prepare-release
+    runs-on: ${{ matrix.runner }}
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - arch: x64
+            arch_flag: x86_64
+            runner: macos-13
+            artifact: quadtrix-macos-x64-cpu.tar.gz
+          - arch: arm64
+            arch_flag: arm64
+            runner: macos-14
+            artifact: quadtrix-macos-arm64-cpu.tar.gz
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Build binary
+        shell: bash
+        run: |
+          set -euo pipefail
+          clang++ -std=c++17 -O3 -arch ${{ matrix.arch_flag }} \
+            -I. -Iinclude \
+            -o quadtrix main.cpp
+          file quadtrix
+
+      - name: Smoke test
+        shell: bash
+        run: |
+          set +e
+          ./quadtrix --chat >/dev/null 2>&1
+          exit 0
+
+      - name: Package artifact
+        shell: bash
+        run: |
+          set -euo pipefail
+          mkdir -p "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu"
+          cp quadtrix "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu/"
+          cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu/"
+          tar -czf "${{ matrix.artifact }}" -C "${ARTIFACT_ROOT}" "quadtrix-macos-${{ matrix.arch }}-cpu"
+
+      - name: Upload artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: quadtrix-macos-${{ matrix.arch }}-cpu
+          path: ${{ matrix.artifact }}
+          if-no-files-found: error
+          retention-days: 30
+
+  publish-release:
+    name: Publish GitHub release
+    needs:
+      - prepare-release
+      - build-linux
+      - build-windows
+      - build-macos
+    runs-on: ubuntu-latest
+    permissions:
+      contents: write
+    steps:
+      - name: Download all artifacts
+        uses: actions/download-artifact@v4
+        with:
+          path: dist
+          merge-multiple: true
+
+      - name: Publish release
+        uses: softprops/action-gh-release@v2
+        with:
+          tag_name: ${{ needs.prepare-release.outputs.tag_name }}
+          target_commitish: ${{ github.sha }}
+          files: dist/*
+          generate_release_notes: true

From af5a20756767eb7227d7b51ae8110ca1979f0a23 Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Wed, 3 Jun 2026 01:17:06 +0530
Subject: [PATCH 18/45] Replace macOS build matrix with a single arm64 job

---
 .github/workflows/release.yml | 48 ++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 219b56c..4f14816 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -158,22 +158,11 @@ jobs:
           if-no-files-found: error
           retention-days: 30
 
-  build-macos:
-    name: macOS ${{ matrix.arch }} CPU
+
+  build-macos-arm64:
+    name: macOS arm64 CPU
     needs: prepare-release
-    runs-on: ${{ matrix.runner }}
-    strategy:
-      fail-fast: false
-      matrix:
-        include:
-          - arch: x64
-            arch_flag: x86_64
-            runner: macos-13
-            artifact: quadtrix-macos-x64-cpu.tar.gz
-          - arch: arm64
-            arch_flag: arm64
-            runner: macos-14
-            artifact: quadtrix-macos-arm64-cpu.tar.gz
+    runs-on: macos-14
     steps:
       - uses: actions/checkout@v4
 
@@ -181,7 +170,7 @@ jobs:
         shell: bash
         run: |
           set -euo pipefail
-          clang++ -std=c++17 -O3 -arch ${{ matrix.arch_flag }} \
+          clang++ -std=c++17 -O3 -arch arm64 \
             -I. -Iinclude \
             -o quadtrix main.cpp
           file quadtrix
@@ -197,16 +186,16 @@ jobs:
         shell: bash
         run: |
           set -euo pipefail
-          mkdir -p "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu"
-          cp quadtrix "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu/"
-          cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu/"
-          tar -czf "${{ matrix.artifact }}" -C "${ARTIFACT_ROOT}" "quadtrix-macos-${{ matrix.arch }}-cpu"
+          mkdir -p "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu"
+          cp quadtrix "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu/"
+          cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu/"
+          tar -czf "quadtrix-macos-arm64-cpu.tar.gz" -C "${ARTIFACT_ROOT}" "quadtrix-macos-arm64-cpu"
 
       - name: Upload artifact
         uses: actions/upload-artifact@v4
         with:
-          name: quadtrix-macos-${{ matrix.arch }}-cpu
-          path: ${{ matrix.artifact }}
+          name: quadtrix-macos-arm64-cpu
+          path: quadtrix-macos-arm64-cpu.tar.gz
           if-no-files-found: error
           retention-days: 30
 
@@ -216,8 +205,21 @@ jobs:
       - prepare-release
       - build-linux
       - build-windows
-      - build-macos
+      - build-macos-x64
+      - build-macos-arm64
     runs-on: ubuntu-latest
+    
+    if: |
+      always() &&
+      needs.prepare-release.result == 'success' &&
+      needs.build-linux.result == 'success' &&
+      needs.build-windows.result == 'success' &&
+      needs.build-macos-x64.result == 'success' &&
+      (
+        needs.build-macos-arm64.result == 'success' ||
+        needs.build-macos-arm64.result == 'cancelled' ||
+        needs.build-macos-arm64.result == 'skipped'
+      )
     permissions:
       contents: write
     steps:

From 58f89df8fe246569804a147df893e3e9ebd2262f Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Wed, 3 Jun 2026 01:20:45 +0530
Subject: [PATCH 19/45] Update release workflow to drop macOS x64 dependency

Remove build-macos-x64 from the publish-release job's needs list and if
condition; no job with that name is defined in the workflow.
---
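Note on the publish-release gate after this change: the three required jobs
must succeed, while build-macos-arm64 may be successful, cancelled or
skipped; a failed arm64 build still blocks the release. A rough shell model
of that condition, assuming a hypothetical helper name should_publish that
takes the four job results in order:

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when publish-release would run.
# Arguments: prepare-release, build-linux, build-windows, build-macos-arm64.
should_publish() {
  [ "$1" = "success" ] && [ "$2" = "success" ] && [ "$3" = "success" ] || return 1
  case "$4" in
    success|cancelled|skipped) return 0 ;;  # optional job did not fail
    *) return 1 ;;                          # e.g. "failure" blocks the release
  esac
}

should_publish success success success cancelled && echo "publish"
should_publish success success success failure || echo "blocked"
```
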
 .github/workflows/release.yml | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 4f14816..9d73bc0 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -158,7 +158,7 @@ jobs:
           if-no-files-found: error
           retention-days: 30
 
-
+  # Optional — cancelling this job will not block the release
   build-macos-arm64:
     name: macOS arm64 CPU
     needs: prepare-release
@@ -205,16 +205,13 @@ jobs:
       - prepare-release
       - build-linux
       - build-windows
-      - build-macos-x64
       - build-macos-arm64
     runs-on: ubuntu-latest
-    
     if: |
       always() &&
       needs.prepare-release.result == 'success' &&
       needs.build-linux.result == 'success' &&
       needs.build-windows.result == 'success' &&
-      needs.build-macos-x64.result == 'success' &&
       (
         needs.build-macos-arm64.result == 'success' ||
         needs.build-macos-arm64.result == 'cancelled' ||

From 1718c3df29b14e9b2398e7e6f02ddf0fe3f2cb19 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Wed, 3 Jun 2026 11:44:19 +0530
Subject: [PATCH 20/45] perf: add Python execution time benchmark results CSV

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>
---
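Note on reading the CSV: the tokens_per_sec column appears to be tokens
divided by avg_ms expressed in seconds. For example, the forward/batch4_seq32
row has 128 tokens at roughly 44.68 ms. A small sketch for checking a row,
assuming awk is available; the helper name tokens_per_sec is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: recompute throughput from a row's tokens and avg_ms.
tokens_per_sec() {
  awk -v tokens="$1" -v avg_ms="$2" \
    'BEGIN { printf "%.1f\n", tokens / (avg_ms / 1000) }'
}

# forward,batch4_seq32 row: 128 tokens, avg_ms 44.681840017437935
tokens_per_sec 128 44.681840017437935   # prints 2864.7
```
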
 benchmark/results/python_benchmark.csv | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100644 benchmark/results/python_benchmark.csv

diff --git a/benchmark/results/python_benchmark.csv b/benchmark/results/python_benchmark.csv
new file mode 100644
index 0000000..c264086
--- /dev/null
+++ b/benchmark/results/python_benchmark.csv
@@ -0,0 +1,13 @@
+suite,name,backend,batch_size,sequence_length,tokens,avg_ms,median_ms,min_ms,max_ms,p90_ms,p95_ms,std_ms,tokens_per_sec,samples,loss,memory_mb,notes
+data,tokenizer_encode,python,0,0,220975,169.76018999121152,164.62069997214712,124.44350001169369,211.44290000665933,204.09656001720577,207.76973001193255,29.81756930091779,1301689.165236207,10,,188.39453125,
+data,batch_sample_to_device,python,4,32,128,0.34600000944919884,0.2575500402599573,0.2452000044286251,0.8668999653309584,0.48601999878883345,0.6764599820598955,0.1852238791693057,369942.18642873614,10,,189.2734375,
+primitive,matmul_3d_1x16,python,1,16,16,0.028490001568570733,0.026749970857053995,0.024800014216452837,0.04350004019215703,0.03351001651026308,0.038505028351210044,0.005415998067507546,561600.5306805843,10,,181.234375,
+primitive,matmul_3d_4x32,python,4,32,128,0.047069991705939174,0.043849984649568796,0.03890000516548753,0.07130001904442906,0.05185997579246759,0.06157999741844831,0.008578937902311791,2719354.632557738,10,,181.296875,
+primitive,attention_scores_4x32,python,4,32,128,0.11958999675698578,0.10689999908208847,0.10410003596916795,0.20840001525357366,0.12946999049745497,0.16893500287551425,0.030181239200475163,1070323.6346774376,10,,181.93359375,
+forward,batch1_seq8,python,1,8,8,16.073119995417073,15.318600024329498,14.594200009014457,20.715299993753433,17.489159997785464,19.102229995769445,1.798887105385644,497.7253950870173,10,10.797359466552734,166.43359375,
+forward,batch1_seq32,python,1,32,32,21.528740011854097,21.653899981174618,20.405600022058934,22.147400013636798,22.095200035255402,22.1213000244461,0.548371285312407,1486.3851754622074,10,10.882255554199219,190.01171875,
+forward,batch4_seq32,python,4,32,128,44.681840017437935,45.51370002445765,37.46199997840449,54.08870003884658,48.26489001279697,51.17679502582177,4.5654932173684655,2864.6984983171146,10,10.885703086853027,253.171875,
+training,adamw_step_b4_s32,python,4,32,128,229.80256001465023,207.2436999878846,200.93890000134706,321.9230000395328,279.9100400414318,300.9165200404823,46.404669570312535,556.9998871720134,5,10.602718353271484,392.30078125,
+generation,empty,python,1,1,32,563.3423800056335,548.9804000244476,466.00820001913235,704.8150000046007,643.7829400005285,674.2989700025645,72.57013670803387,56.80382150492566,10,,218.44140625,
+generation,short,python,1,6,32,524.1239399998449,524.1038500098512,493.7280000303872,561.7482999805361,549.8817999905441,555.8150499855401,20.612269243289685,61.054261326070076,10,,218.47265625,
+generation,long,python,1,32,32,561.3779200008139,560.0390000035986,545.9933000383899,574.2078999755904,570.0918399612419,572.1498699684162,7.699534842668483,57.00259817834233,10,,218.14453125,

From 48971226c6cdc2f270e5f82b53781b116b1885e0 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Wed, 3 Jun 2026 22:30:41 +0530
Subject: [PATCH 21/45] ci: remove master push trigger from CI workflow

---
 .github/workflows/ci.yml | 2 --
 1 file changed, 2 deletions(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index e6502d0..992ef59 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -1,8 +1,6 @@
 name: CI
 
 on:
-  push:
-    branches: [master]
   workflow_dispatch:
     inputs:
       image:

From 275ecd12300dc8aec4d215fb19eab49cbb653982 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Wed, 3 Jun 2026 22:31:07 +0530
Subject: [PATCH 22/45] ci(docker): refactor image build workflow and add
 frontend job

---
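Note on the "Set lowercase image prefix" step kept in every job: image
repository names must be lowercase, so the owner name is lowercased before
it is used in IMAGE_PREFIX. The workflow uses the bash-only ${VAR,,}
expansion; the sketch below shows a portable equivalent with tr. The
function name image_prefix and the sample owner are illustrative
assumptions:

```shell
#!/bin/sh
# Hypothetical helper: build the image prefix from a repository owner name.
image_prefix() {
  owner_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf 'ghcr.io/%s/quadtrix\n' "$owner_lc"
}

image_prefix "Example-Owner"   # prints ghcr.io/example-owner/quadtrix
```
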
 .github/workflows/docker-publish.yml | 116 +++++++++++++++++++--------
 1 file changed, 81 insertions(+), 35 deletions(-)

diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml
index 3a8fbae..b7c1584 100644
--- a/.github/workflows/docker-publish.yml
+++ b/.github/workflows/docker-publish.yml
@@ -1,38 +1,40 @@
 name: Docker Images
 
 on:
-  push:
-    tags:
-      - "v*"
   workflow_dispatch:
     inputs:
       image:
-        description: "Which image to build? (cpp=C++ engine, cpu=PyTorch CPU, cuda=PyTorch CUDA, all=all three)"
+        description: "Image variant to build"
         required: true
         type: choice
         options:
           - cpp
           - cpu
           - cuda
+          - frontend
           - all
       version:
         description: "Optional image tag for manual runs"
         required: false
       push_image:
-        description: "Push to ghcr.io?"
+        description: "Push to ghcr.io"
         required: true
         default: "true"
         type: choice
         options: ["true", "false"]
 
+concurrency:
+  group: docker-images-${{ github.ref }}
+  cancel-in-progress: true
+
 env:
   REGISTRY: ghcr.io
   IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix
 
 jobs:
   build-cpp-image:
-    name: "Build -- cpp (C++ engine - linux/amd64 + arm64)"
-    if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all')) }}
+    name: Docker cpp
+    if: ${{ inputs.image == 'cpp' || inputs.image == 'all' }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -44,10 +46,10 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Set lowercase image prefix
-        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV"
 
       - name: Login to GHCR
-        if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+        if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
@@ -60,27 +62,26 @@ jobs:
         with:
           images: ${{ env.IMAGE_PREFIX }}-cpp
           tags: |
-            type=ref,event=branch
+            type=ref,event=tag
             type=sha,prefix=sha-
-            type=raw,value=latest,enable={{is_default_branch}}
-            type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }}
-            type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }}
+            type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }}
+            type=raw,value=latest,enable=${{ inputs.push_image == 'true' }}
 
-      - name: Build & push
+      - name: Build and push
         uses: docker/build-push-action@v6
         with:
           context: .
           file: .devops/Dockerfile.cpp
           platforms: linux/amd64,linux/arm64
-          push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+          push: ${{ inputs.push_image == 'true' }}
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha,scope=cpp
           cache-to: type=gha,mode=max,scope=cpp
 
   build-cpu-image:
-    name: "Build -- cpu (PyTorch CPU - linux/amd64 + arm64)"
-    if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all')) }}
+    name: Docker cpu
+    if: ${{ inputs.image == 'cpu' || inputs.image == 'all' }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -92,10 +93,10 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Set lowercase image prefix
-        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV"
 
       - name: Login to GHCR
-        if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+        if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
@@ -108,27 +109,26 @@ jobs:
         with:
           images: ${{ env.IMAGE_PREFIX }}-cpu
           tags: |
-            type=ref,event=branch
+            type=ref,event=tag
             type=sha,prefix=sha-
-            type=raw,value=latest,enable={{is_default_branch}}
-            type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }}
-            type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }}
+            type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }}
+            type=raw,value=latest,enable=${{ inputs.push_image == 'true' }}
 
-      - name: Build & push
+      - name: Build and push
         uses: docker/build-push-action@v6
         with:
           context: .
           file: .devops/Dockerfile
           platforms: linux/amd64,linux/arm64
-          push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+          push: ${{ inputs.push_image == 'true' }}
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha,scope=cpu
           cache-to: type=gha,mode=max,scope=cpu
 
   build-cuda-image:
-    name: "Build -- cuda (PyTorch CUDA - linux/amd64 only)"
-    if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all')) }}
+    name: Docker cuda
+    if: ${{ inputs.image == 'cuda' || inputs.image == 'all' }}
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -139,10 +139,10 @@ jobs:
       - uses: docker/setup-buildx-action@v3
 
       - name: Set lowercase image prefix
-        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV"
 
       - name: Login to GHCR
-        if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+        if: ${{ inputs.push_image == 'true' }}
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
@@ -155,20 +155,66 @@ jobs:
         with:
           images: ${{ env.IMAGE_PREFIX }}-cuda
           tags: |
-            type=ref,event=branch
+            type=ref,event=tag
             type=sha,prefix=sha-
-            type=raw,value=latest,enable={{is_default_branch}}
-            type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }}
-            type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }}
+            type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }}
+            type=raw,value=latest,enable=${{ inputs.push_image == 'true' }}
 
-      - name: Build & push
+      - name: Build and push
         uses: docker/build-push-action@v6
         with:
           context: .
           file: .devops/Dockerfile.backend
           platforms: linux/amd64
-          push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }}
+          push: ${{ inputs.push_image == 'true' }}
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha,scope=cuda
           cache-to: type=gha,mode=max,scope=cuda
+
+  build-frontend-image:
+    name: Docker frontend
+    if: ${{ inputs.image == 'frontend' || inputs.image == 'all' }}
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: docker/setup-qemu-action@v3
+      - uses: docker/setup-buildx-action@v3
+
+      - name: Set lowercase image prefix
+        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV"
+
+      - name: Login to GHCR
+        if: ${{ inputs.push_image == 'true' }}
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Extract metadata
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_PREFIX }}-frontend
+          tags: |
+            type=ref,event=tag
+            type=sha,prefix=sha-
+            type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }}
+            type=raw,value=latest,enable=${{ inputs.push_image == 'true' }}
+
+      - name: Build and push
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: .devops/Dockerfile.frontend
+          platforms: linux/amd64,linux/arm64
+          push: ${{ inputs.push_image == 'true' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha,scope=frontend
+          cache-to: type=gha,mode=max,scope=frontend

From 947c760b69d77a7a9ab0c0ab38314e0b5cfe5fec Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Wed, 3 Jun 2026 22:31:15 +0530
Subject: [PATCH 23/45] ci(docker): refactor image build workflow and add
 frontend job

---
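Reviewer note: the `release-metadata` job in this patch accepts the version with or without a leading `v`, and every build job derives its archive name from the resulting tag. The Python sketch below mirrors that naming logic for illustration only. `normalize_tag` and `package_name` are hypothetical helper names and are not part of the workflow.

```python
def normalize_tag(raw: str) -> str:
    """Return the tag with a leading 'v', mirroring the bash step."""
    return raw if raw.startswith("v") else f"v{raw}"


def package_name(tag: str, platform: str, build: str) -> str:
    """Archive base name in the style used by the 'Pack artifacts' steps."""
    return f"quadtrix-{tag}-bin-{platform}-{build}-cpu"


if __name__ == "__main__":
    tag = normalize_tag("1.2.3")
    print(tag)
    print(package_name(tag, "ubuntu", "x64") + ".tar.gz")
```

Because the tag is embedded in the archive name, the `path` passed to `actions/upload-artifact` has to be rebuilt from the same job output, which is what the diff below does.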
 .github/workflows/release.yml | 212 +++++++++++++++++-----------------
 1 file changed, 104 insertions(+), 108 deletions(-)

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 9d73bc0..7b8e9ab 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -1,83 +1,79 @@
 name: Release
 
 on:
-  push:
-    tags:
-      - "v*"
   workflow_dispatch:
     inputs:
       version:
-        description: "Release version, with or without a leading v"
+        description: "Release version, for example v1.2.3 or 1.2.3"
         required: true
 
+concurrency:
+  group: release
+  cancel-in-progress: false
+
 env:
   ARTIFACT_ROOT: release-assets
 
 jobs:
-  prepare-release:
-    name: Prepare release metadata
+  release-metadata:
+    name: Release metadata
     runs-on: ubuntu-latest
     outputs:
-      tag_name: ${{ steps.meta.outputs.tag_name }}
-      version: ${{ steps.meta.outputs.version }}
+      tag_name: ${{ steps.tag.outputs.tag_name }}
     steps:
-      - id: meta
+      - id: tag
         shell: bash
         run: |
           set -euo pipefail
-          if [ "${GITHUB_EVENT_NAME}" = "workflow_dispatch" ]; then
-            raw_version="${{ inputs.version }}"
+          raw_tag="${{ inputs.version }}"
+          if [[ "${raw_tag}" == v* ]]; then
+            tag_name="${raw_tag}"
           else
-            raw_version="${GITHUB_REF_NAME}"
+            tag_name="v${raw_tag}"
           fi
-
-          raw_version="${raw_version#v}"
-          tag_name="v${raw_version}"
-
           echo "tag_name=${tag_name}" >> "$GITHUB_OUTPUT"
-          echo "version=${raw_version}" >> "$GITHUB_OUTPUT"
 
-  build-linux:
-    name: Linux ${{ matrix.arch }} CPU
-    needs: prepare-release
-    runs-on: ${{ matrix.runner }}
+  ubuntu-cpu:
+    name: Ubuntu ${{ matrix.build }} CPU
+    needs: release-metadata
+    runs-on: ${{ matrix.os }}
     strategy:
       fail-fast: false
       matrix:
         include:
-          - arch: x64
-            runner: ubuntu-22.04
-            compiler: g++
-            packages: build-essential file
-            cxxflags: -std=c++17 -O3 -march=native
-            artifact: quadtrix-ubuntu-x64-cpu.tar.gz
-          - arch: arm64
-            runner: ubuntu-24.04-arm
-            compiler: g++
-            packages: build-essential file
-            cxxflags: -std=c++17 -O3 -march=native
-            artifact: quadtrix-ubuntu-arm64-cpu.tar.gz
-          - arch: s390x
-            runner: ubuntu-22.04
-            compiler: s390x-linux-gnu-g++
-            packages: g++-s390x-linux-gnu file
-            cxxflags: -std=c++17 -O3
-            artifact: quadtrix-ubuntu-s390x-cpu.tar.gz
+          - build: x64
+            os: ubuntu-22.04
+          - build: arm64
+            os: ubuntu-24.04-arm
+          - build: s390x
+            os: ubuntu-24.04-s390x
     steps:
-      - uses: actions/checkout@v4
+      - name: Clone
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
 
-      - name: Install toolchain
+      - name: Dependencies
         shell: bash
         run: |
           set -euo pipefail
           sudo apt-get update
-          sudo apt-get install -y ${{ matrix.packages }}
+          sudo apt-get install -y build-essential file
+
+      - name: Toolchain workaround
+        if: ${{ contains(matrix.os, 'ubuntu-24.04') }}
+        shell: bash
+        run: |
+          set -euo pipefail
+          sudo apt-get install -y gcc-14 g++-14
+          echo "CC=gcc-14" >> "$GITHUB_ENV"
+          echo "CXX=g++-14" >> "$GITHUB_ENV"
 
-      - name: Build binary
+      - name: Build
         shell: bash
         run: |
           set -euo pipefail
-          ${{ matrix.compiler }} ${{ matrix.cxxflags }} \
+          ${CXX:-g++} -std=c++17 -O3 -DNDEBUG \
             -I. -Iinclude \
             -o quadtrix main.cpp
           file quadtrix
@@ -89,88 +85,97 @@ jobs:
           ./quadtrix --chat >/dev/null 2>&1
           exit 0
 
-      - name: Package artifact
+      - name: Pack artifacts
         shell: bash
         run: |
           set -euo pipefail
-          mkdir -p "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu"
-          cp quadtrix "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu/"
-          cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu/"
-          tar -czf "${{ matrix.artifact }}" -C "${ARTIFACT_ROOT}" "quadtrix-ubuntu-${{ matrix.arch }}-cpu"
+          package="quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-ubuntu-${{ matrix.build }}-cpu"
+          mkdir -p "${ARTIFACT_ROOT}/${package}"
+          cp quadtrix README.md LICENSE "${ARTIFACT_ROOT}/${package}/"
+          tar -czf "${package}.tar.gz" -C "${ARTIFACT_ROOT}" "${package}"
 
-      - name: Upload artifact
+      - name: Upload artifacts
         uses: actions/upload-artifact@v4
         with:
-          name: quadtrix-ubuntu-${{ matrix.arch }}-cpu
-          path: ${{ matrix.artifact }}
+          name: quadtrix-bin-ubuntu-${{ matrix.build }}-cpu
+          path: quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-ubuntu-${{ matrix.build }}-cpu.tar.gz
           if-no-files-found: error
           retention-days: 30
 
-  build-windows:
+  windows-cpu:
     name: Windows ${{ matrix.arch }} CPU
-    needs: prepare-release
-    runs-on: ${{ matrix.runner }}
+    needs: release-metadata
+    runs-on: windows-2022
     strategy:
       fail-fast: false
       matrix:
         include:
           - arch: x64
-            runner: windows-latest
-            msvc_arch: x64
-            artifact: quadtrix-windows-x64-cpu.zip
+            vcvars: x64
           - arch: arm64
-            runner: windows-11-arm
-            msvc_arch: arm64
-            artifact: quadtrix-windows-arm64-cpu.zip
+            vcvars: amd64_arm64
     steps:
-      - uses: actions/checkout@v4
-
-      - name: Set up MSVC
-        uses: ilammy/msvc-dev-cmd@v1
+      - name: Clone
+        uses: actions/checkout@v4
         with:
-          arch: ${{ matrix.msvc_arch }}
+          fetch-depth: 0
 
-      - name: Build binary
+      - name: Build
         shell: cmd
         run: |
-          cl /nologo /std:c++17 /O2 /EHsc /Iinclude /I. main.cpp /Fe:quadtrix.exe
+          call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" ${{ matrix.vcvars }}
+          cl /nologo /std:c++17 /O2 /DNDEBUG /EHsc /Iinclude /I. main.cpp /Fe:quadtrix.exe
 
       - name: Smoke test
+        if: ${{ matrix.arch == 'x64' }}
         shell: pwsh
         run: |
           $ErrorActionPreference = 'Continue'
           & .\quadtrix.exe --chat | Out-Null
           exit 0
 
-      - name: Package artifact
+      - name: Pack artifacts
         shell: pwsh
         run: |
-          New-Item -ItemType Directory -Force "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu" | Out-Null
-          Copy-Item quadtrix.exe "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\"
-          Copy-Item README.md, LICENSE "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\"
-          Compress-Archive -Path "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\*" -DestinationPath "${{ matrix.artifact }}" -Force
+          $package = "quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-windows-${{ matrix.arch }}-cpu"
+          New-Item -ItemType Directory -Force "${env:ARTIFACT_ROOT}\${package}" | Out-Null
+          Copy-Item quadtrix.exe "${env:ARTIFACT_ROOT}\${package}\"
+          Copy-Item README.md, LICENSE "${env:ARTIFACT_ROOT}\${package}\"
+          Compress-Archive -Path "${env:ARTIFACT_ROOT}\${package}\*" -DestinationPath "${package}.zip" -Force
 
-      - name: Upload artifact
+      - name: Upload artifacts
         uses: actions/upload-artifact@v4
         with:
-          name: quadtrix-windows-${{ matrix.arch }}-cpu
-          path: ${{ matrix.artifact }}
+          name: quadtrix-bin-windows-${{ matrix.arch }}-cpu
+          path: quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-windows-${{ matrix.arch }}-cpu.zip
           if-no-files-found: error
           retention-days: 30
 
-  # Optional — cancelling this job will not block the release
-  build-macos-arm64:
-    name: macOS arm64 CPU
-    needs: prepare-release
-    runs-on: macos-14
+  macos-cpu:
+    name: macOS ${{ matrix.build }} CPU
+    needs: release-metadata
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - build: arm64
+            arch: arm64
+            os: macos-14
+          - build: x64
+            arch: x86_64
+            os: macos-13
     steps:
-      - uses: actions/checkout@v4
+      - name: Clone
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
 
-      - name: Build binary
+      - name: Build
         shell: bash
         run: |
           set -euo pipefail
-          clang++ -std=c++17 -O3 -arch arm64 \
+          clang++ -std=c++17 -O3 -DNDEBUG -arch ${{ matrix.arch }} \
             -I. -Iinclude \
             -o quadtrix main.cpp
           file quadtrix
@@ -182,45 +187,35 @@ jobs:
           ./quadtrix --chat >/dev/null 2>&1
           exit 0
 
-      - name: Package artifact
+      - name: Pack artifacts
         shell: bash
         run: |
           set -euo pipefail
-          mkdir -p "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu"
-          cp quadtrix "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu/"
-          cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu/"
-          tar -czf "quadtrix-macos-arm64-cpu.tar.gz" -C "${ARTIFACT_ROOT}" "quadtrix-macos-arm64-cpu"
+          package="quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-macos-${{ matrix.build }}-cpu"
+          mkdir -p "${ARTIFACT_ROOT}/${package}"
+          cp quadtrix README.md LICENSE "${ARTIFACT_ROOT}/${package}/"
+          tar -czf "${package}.tar.gz" -C "${ARTIFACT_ROOT}" "${package}"
 
-      - name: Upload artifact
+      - name: Upload artifacts
         uses: actions/upload-artifact@v4
         with:
-          name: quadtrix-macos-arm64-cpu
-          path: quadtrix-macos-arm64-cpu.tar.gz
+          name: quadtrix-bin-macos-${{ matrix.build }}-cpu
+          path: quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-macos-${{ matrix.build }}-cpu.tar.gz
           if-no-files-found: error
           retention-days: 30
 
   publish-release:
     name: Publish GitHub release
     needs:
-      - prepare-release
-      - build-linux
-      - build-windows
-      - build-macos-arm64
+      - release-metadata
+      - ubuntu-cpu
+      - windows-cpu
+      - macos-cpu
     runs-on: ubuntu-latest
-    if: |
-      always() &&
-      needs.prepare-release.result == 'success' &&
-      needs.build-linux.result == 'success' &&
-      needs.build-windows.result == 'success' &&
-      (
-        needs.build-macos-arm64.result == 'success' ||
-        needs.build-macos-arm64.result == 'cancelled' ||
-        needs.build-macos-arm64.result == 'skipped'
-      )
     permissions:
       contents: write
     steps:
-      - name: Download all artifacts
+      - name: Download artifacts
         uses: actions/download-artifact@v4
         with:
           path: dist
@@ -229,7 +224,8 @@ jobs:
       - name: Publish release
         uses: softprops/action-gh-release@v2
         with:
-          tag_name: ${{ needs.prepare-release.outputs.tag_name }}
+          tag_name: ${{ needs.release-metadata.outputs.tag_name }}
           target_commitish: ${{ github.sha }}
+          prerelease: false
           files: dist/*
           generate_release_notes: true

From 3b6555384f10e4d876866ea8f77e902fdcaa01c8 Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Wed, 3 Jun 2026 22:41:20 +0530
Subject: [PATCH 24/45] Remove frontend job from Docker Images workflow

---
 .github/workflows/docker-publish.yml | 48 ----------------------------
 1 file changed, 48 deletions(-)

diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml
index b7c1584..0986534 100644
--- a/.github/workflows/docker-publish.yml
+++ b/.github/workflows/docker-publish.yml
@@ -11,7 +11,6 @@ on:
           - cpp
           - cpu
           - cuda
-          - frontend
           - all
       version:
         description: "Optional image tag for manual runs"
@@ -171,50 +170,3 @@ jobs:
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha,scope=cuda
           cache-to: type=gha,mode=max,scope=cuda
-
-  build-frontend-image:
-    name: Docker frontend
-    if: ${{ inputs.image == 'frontend' || inputs.image == 'all' }}
-    runs-on: ubuntu-latest
-    permissions:
-      contents: read
-      packages: write
-    steps:
-      - uses: actions/checkout@v4
-
-      - uses: docker/setup-qemu-action@v3
-      - uses: docker/setup-buildx-action@v3
-
-      - name: Set lowercase image prefix
-        run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV"
-
-      - name: Login to GHCR
-        if: ${{ inputs.push_image == 'true' }}
-        uses: docker/login-action@v3
-        with:
-          registry: ${{ env.REGISTRY }}
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-
-      - name: Extract metadata
-        id: meta
-        uses: docker/metadata-action@v5
-        with:
-          images: ${{ env.IMAGE_PREFIX }}-frontend
-          tags: |
-            type=ref,event=tag
-            type=sha,prefix=sha-
-            type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }}
-            type=raw,value=latest,enable=${{ inputs.push_image == 'true' }}
-
-      - name: Build and push
-        uses: docker/build-push-action@v6
-        with:
-          context: .
-          file: .devops/Dockerfile.frontend
-          platforms: linux/amd64,linux/arm64
-          push: ${{ inputs.push_image == 'true' }}
-          tags: ${{ steps.meta.outputs.tags }}
-          labels: ${{ steps.meta.outputs.labels }}
-          cache-from: type=gha,scope=frontend
-          cache-to: type=gha,mode=max,scope=frontend

From 1d63e8b4775c1c2afb45e39d19113a11ef10456d Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Thu, 4 Jun 2026 00:07:57 +0530
Subject: [PATCH 25/45] Update release workflow to remove s390x and add notes

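Remove the Ubuntu s390x and macOS x64 (macos-13) build configurations, and
add a step that writes a static release-notes file. The file is passed to the
release action via `body_path`, replacing `generate_release_notes`.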
---
 .github/workflows/release.yml | 44 ++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 7b8e9ab..e92b486 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -45,8 +45,6 @@ jobs:
             os: ubuntu-22.04
           - build: arm64
             os: ubuntu-24.04-arm
-          - build: s390x
-            os: ubuntu-24.04-s390x
     steps:
       - name: Clone
         uses: actions/checkout@v4
@@ -162,9 +160,6 @@ jobs:
           - build: arm64
             arch: arm64
             os: macos-14
-          - build: x64
-            arch: x86_64
-            os: macos-13
     steps:
       - name: Clone
         uses: actions/checkout@v4
@@ -221,11 +216,48 @@ jobs:
           path: dist
           merge-multiple: true
 
+      - name: Write release notes
+        shell: bash
+        run: |
+          cat > release-notes.md <<'EOF'
+          macOS/iOS:
+
+          macOS Apple Silicon (arm64)
+          macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
+          macOS Intel (x64) SKIPPED
+          iOS XCFramework DISABLED
+
+          Linux:
+
+          Ubuntu x64 (CPU)
+          Ubuntu arm64 (CPU)
+          Ubuntu s390x (CPU) SKIPPED
+          Ubuntu x64 (Vulkan) DISABLED
+          Ubuntu arm64 (Vulkan) DISABLED
+          Ubuntu x64 (ROCm 7.2) DISABLED
+          Ubuntu x64 (OpenVINO) DISABLED
+          Ubuntu x64 (SYCL FP32) DISABLED
+
+          Android:
+
+          Android arm64 (CPU) DISABLED
+
+          Windows:
+
+          Windows x64 (CPU)
+          Windows arm64 (CPU)
+          Windows x64 (CUDA 12) - CUDA 12.4 DLLs DISABLED
+          Windows x64 (CUDA 13) - CUDA 13.3 DLLs DISABLED
+          Windows x64 (Vulkan) DISABLED
+          Windows x64 (SYCL) DISABLED
+          Windows x64 (HIP) DISABLED
+          EOF
+
       - name: Publish release
         uses: softprops/action-gh-release@v2
         with:
           tag_name: ${{ needs.release-metadata.outputs.tag_name }}
           target_commitish: ${{ github.sha }}
           prerelease: false
+          body_path: release-notes.md
           files: dist/*
-          generate_release_notes: true

From e29f1bf3fb5336aab12b0523f00f5dc71700f069 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Thu, 4 Jun 2026 00:18:45 +0530
Subject: [PATCH 26/45] feat: add local orchestration script for frontend and
 backend servers

Introduces a central Python script (`init.py`) that starts the backend and
frontend development servers together and stops both when either one exits.
- Selects the platform-specific `npm` binary (`npm.cmd` on Windows) and prefers
  the `.venv` Python interpreter, falling back to the current interpreter.
- Verifies that the local PyTorch `.pt` checkpoint exists before starting
  (default `engine/best_model.pt`, overridable via `TORCH_CHECKPOINT_PATH`).
- Sets environment variables for Uvicorn (FastAPI) and Vite from `API_PORT`
  (default 3001) and `FRONTEND_PORT` (default 5173).
- Derives `CORS_ORIGINS` from the frontend port so the backend accepts requests
  from the Vite dev server.
- On `Ctrl+C`, terminates both child processes and kills any that have not
  exited after 8 seconds.
- Opens the frontend URL in the system web browser unless `NO_BROWSER=1` is set.
---
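Reviewer note: the environment wiring between the two servers can be sketched as a pure function. This is a simplified illustration under stated assumptions, not the code in `init.py`: `build_envs` is a hypothetical name, and the sketch leaves out the checkpoint path and the `VITE_TORCH_ONLY` flag that the script also sets.

```python
import os


def build_envs(api_port="3001", frontend_port="5173", base=None):
    """Return (backend_env, frontend_env) derived from the two ports."""
    # Start from the caller's environment unless an explicit base is given.
    base = dict(os.environ if base is None else base)

    backend_env = dict(base)
    backend_env["API_PORT"] = api_port
    # Allow both spellings of the local dev origin for CORS.
    backend_env["CORS_ORIGINS"] = (
        f"http://localhost:{frontend_port},http://127.0.0.1:{frontend_port}"
    )

    frontend_env = dict(base)
    # The frontend only needs to know where the API is listening.
    frontend_env["VITE_API_BASE_URL"] = f"http://localhost:{api_port}"
    return backend_env, frontend_env


if __name__ == "__main__":
    backend_env, frontend_env = build_envs(base={})
    print(backend_env["CORS_ORIGINS"])
    print(frontend_env["VITE_API_BASE_URL"])
```

Keeping the derivation in one place means changing `FRONTEND_PORT` updates the allowed origins at the same time, so the two servers should not drift apart.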
 init.py | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 init.py

diff --git a/init.py b/init.py
new file mode 100644
index 0000000..e870447
--- /dev/null
+++ b/init.py
@@ -0,0 +1,113 @@
+from __future__ import annotations
+
+import os
+import signal
+import subprocess
+import sys
+import time
+import webbrowser
+from pathlib import Path
+
+
+ROOT = Path(__file__).resolve().parent
+BACKEND = ROOT / "backend"
+FRONTEND = ROOT / "frontend"
+DEFAULT_CHECKPOINT = ROOT / "engine" / "best_model.pt"
+
+
+def npm_command() -> str:
+    return "npm.cmd" if os.name == "nt" else "npm"
+
+
+def python_command() -> str:
+    venv_python = ROOT / ".venv" / ("Scripts/python.exe" if os.name == "nt" else "bin/python")
+    return str(venv_python) if venv_python.exists() else sys.executable
+
+
+def start_process(name: str, command: list[str], cwd: Path, env: dict[str, str]) -> subprocess.Popen:
+    print(f"[start] {name}: {' '.join(command)}")
+    return subprocess.Popen(command, cwd=str(cwd), env=env)
+
+
+def stop_process(process: subprocess.Popen) -> None:
+    if process.poll() is not None:
+        return
+    if os.name == "nt":
+        process.terminate()
+    else:
+        process.send_signal(signal.SIGTERM)
+    try:
+        process.wait(timeout=8)
+    except subprocess.TimeoutExpired:
+        process.kill()
+
+
+def main() -> int:
+    api_port = os.environ.get("API_PORT", "3001")
+    frontend_port = os.environ.get("FRONTEND_PORT", "5173")
+    checkpoint = Path(os.environ.get("TORCH_CHECKPOINT_PATH", str(DEFAULT_CHECKPOINT))).resolve()
+
+    if not checkpoint.exists():
+        print(f"[error] .pt checkpoint not found: {checkpoint}")
+        print("        Set TORCH_CHECKPOINT_PATH to your best_model.pt file.")
+        return 1
+
+    backend_env = os.environ.copy()
+    backend_env.update(
+        {
+            "API_PORT": api_port,
+            "CORS_ORIGINS": f"http://localhost:{frontend_port},http://127.0.0.1:{frontend_port}",
+            "TORCH_CHECKPOINT_PATH": str(checkpoint),
+        }
+    )
+
+    frontend_env = os.environ.copy()
+    frontend_env.update(
+        {
+            "VITE_API_BASE_URL": f"http://localhost:{api_port}",
+            "VITE_TORCH_ONLY": "1",
+        }
+    )
+
+    backend = start_process(
+        "backend (.pt)",
+        [python_command(), "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", api_port, "--reload"],
+        BACKEND,
+        backend_env,
+    )
+    frontend = start_process(
+        "frontend",
+        [npm_command(), "run", "dev", "--", "--port", frontend_port],
+        FRONTEND,
+        frontend_env,
+    )
+
+    url = f"http://localhost:{frontend_port}"
+    print(f"[ready] frontend: {url}")
+    print(f"[ready] backend : http://localhost:{api_port}")
+    print("[mode]  PyTorch .pt only")
+    print("[stop]  Press Ctrl+C to stop both servers.")
+
+    if os.environ.get("NO_BROWSER") != "1":
+        time.sleep(2)
+        webbrowser.open(url)
+
+    try:
+        while True:
+            if backend.poll() is not None:
+                print(f"[exit] backend stopped with code {backend.returncode}")
+                return backend.returncode or 1
+            if frontend.poll() is not None:
+                print(f"[exit] frontend stopped with code {frontend.returncode}")
+                return frontend.returncode or 1
+            time.sleep(1)
+    except KeyboardInterrupt:
+        print("\n[stop] stopping servers...")
+        return 0
+    finally:
+        stop_process(frontend)
+        stop_process(backend)
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())

From 5f95d0f218298a13bcacc3bc7bbc3c5249f20dd8 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Thu, 4 Jun 2026 10:28:52 +0530
Subject: [PATCH 27/45] chore(deps): bump actions/github-script from 7 to 9
 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---
 .github/workflows/pr-check.yml | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/pr-check.yml b/.github/workflows/pr-check.yml
index 4824b9e..b50de6b 100644
--- a/.github/workflows/pr-check.yml
+++ b/.github/workflows/pr-check.yml
@@ -15,7 +15,7 @@ jobs:
       pr-sha: ${{ steps.get-sha.outputs.sha }}
     steps:
       - name: Check commenter permission
-        uses: actions/github-script@v7
+        uses: actions/github-script@v9
         with:
           script: |
             const { data } = await github.rest.repos.getCollaboratorPermissionLevel({
@@ -34,7 +34,7 @@ jobs:
             }
 
       - name: React with rocket
-        uses: actions/github-script@v7
+        uses: actions/github-script@v9
         with:
           script: |
             await github.rest.reactions.createForIssueComment({
@@ -46,7 +46,7 @@ jobs:
 
       - name: Get PR head SHA
         id: get-sha
-        uses: actions/github-script@v7
+        uses: actions/github-script@v9
         with:
           script: |
             const { data: pr } = await github.rest.pulls.get({
@@ -57,7 +57,7 @@ jobs:
             core.setOutput('sha', pr.head.sha);
 
       - name: Set checks to pending
-        uses: actions/github-script@v7
+        uses: actions/github-script@v9
         with:
           script: |
             const sha = '${{ steps.get-sha.outputs.sha }}';
@@ -96,7 +96,7 @@ jobs:
 
       - name: Report status
         if: always()
-        uses: actions/github-script@v7
+        uses: actions/github-script@v9
         with:
           script: |
             await github.rest.repos.createCommitStatus({
@@ -158,7 +158,7 @@ jobs:
 
       - name: Report status
         if: always()
-        uses: actions/github-script@v7
+        uses: actions/github-script@v9
         with:
           script: |
             await github.rest.repos.createCommitStatus({
@@ -218,7 +218,7 @@ jobs:
 
       - name: Report status
         if: always()
-        uses: actions/github-script@v7
+        uses: actions/github-script@v9
         with:
           script: |
             await github.rest.repos.createCommitStatus({
@@ -237,7 +237,7 @@ jobs:
     runs-on: ubuntu-latest
     if: always()
     steps:
-      - uses: actions/github-script@v7
+      - uses: actions/github-script@v9
         with:
           script: |
             const jobs   = ${{ toJSON(needs) }};

From e4d340985734637061d6615619cae8f7a8d861be Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Thu, 4 Jun 2026 10:41:34 +0530
Subject: [PATCH 28/45] feat(cuda): introduce log_message utility and LogLevel
 enum

---
 CUDA/includes/logger.h | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 CUDA/includes/logger.h

diff --git a/CUDA/includes/logger.h b/CUDA/includes/logger.h
new file mode 100644
index 0000000..219c50f
--- /dev/null
+++ b/CUDA/includes/logger.h
@@ -0,0 +1,37 @@
+#pragma once
+
+#include <cstdarg>
+#include <cstdio>
+
+namespace quadtrix {
+namespace cuda {
+
+enum class LogLevel {
+    Info,
+    Warn,
+    Error,
+};
+
+inline const char* log_level_name(LogLevel level) {
+    switch (level) {
+        case LogLevel::Info:
+            return "info";
+        case LogLevel::Warn:
+            return "warn";
+        case LogLevel::Error:
+            return "error";
+    }
+    return "unknown";
+}
+
+inline void log_message(LogLevel level, const char* format, ...) {
+    std::fprintf(level == LogLevel::Error ? stderr : stdout, "[cuda:%s] ", log_level_name(level));
+    va_list args;
+    va_start(args, format);
+    std::vfprintf(level == LogLevel::Error ? stderr : stdout, format, args);
+    va_end(args);
+    std::fprintf(level == LogLevel::Error ? stderr : stdout, "\n");
+}
+
+}  // namespace cuda
+}  // namespace quadtrix

From 71e9abea4ec5477f07dbf096a551b1634828982a Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Thu, 4 Jun 2026 10:42:49 +0530
Subject: [PATCH 29/45] feat(cuda): add cuBLAS handle wrapper and matmul
 operations

---
 CUDA/includes/matmul.cuh | 99 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)
 create mode 100644 CUDA/includes/matmul.cuh

diff --git a/CUDA/includes/matmul.cuh b/CUDA/includes/matmul.cuh
new file mode 100644
index 0000000..12dd4b2
--- /dev/null
+++ b/CUDA/includes/matmul.cuh
@@ -0,0 +1,99 @@
+#pragma once
+
+#include "tensor.cuh"
+
+#include <cublas_v2.h>
+#include <cuda_runtime.h>
+
+#include <cstdint>
+
+namespace quadtrix {
+namespace cuda {
+
+enum class MatmulTranspose : std::uint8_t {
+    None,
+    Transpose,
+};
+
+struct BlasStatus {
+    bool ok;
+    cublasStatus_t cublas_status;
+    const char* message;
+
+    static BlasStatus success() {
+        return {true, CUBLAS_STATUS_SUCCESS, "ok"};
+    }
+
+    static BlasStatus failure(cublasStatus_t status, const char* message) {
+        return {false, status, message};
+    }
+};
+
+const char* cublas_status_name(cublasStatus_t status);
+
+class BlasHandle {
+public:
+    explicit BlasHandle(int device_id = 0);
+    ~BlasHandle();
+
+    BlasHandle(const BlasHandle&) = delete;
+    BlasHandle& operator=(const BlasHandle&) = delete;
+
+    BlasHandle(BlasHandle&& other) noexcept;
+    BlasHandle& operator=(BlasHandle&& other) noexcept;
+
+    cublasHandle_t get() const {
+        return handle_;
+    }
+
+    int device_id() const {
+        return device_id_;
+    }
+
+    BlasStatus set_stream(cudaStream_t stream);
+
+private:
+    cublasHandle_t handle_ = nullptr;
+    int device_id_ = 0;
+};
+
+BlasStatus matmul(
+    BlasHandle& handle,
+    const TensorView& a,
+    MatmulTranspose op_a,
+    const TensorView& b,
+    MatmulTranspose op_b,
+    TensorView c,
+    float alpha = 1.0f,
+    float beta = 0.0f,
+    cudaStream_t stream = nullptr);
+
+BlasStatus matmul_forward(
+    BlasHandle& handle,
+    const TensorView& input,
+    const TensorView& weight,
+    TensorView output,
+    cudaStream_t stream = nullptr,
+    float alpha = 1.0f,
+    float beta = 0.0f);
+
+BlasStatus matmul_backward_input(
+    BlasHandle& handle,
+    const TensorView& grad_output,
+    const TensorView& weight,
+    TensorView grad_input,
+    cudaStream_t stream = nullptr,
+    float alpha = 1.0f,
+    float beta = 0.0f);
+
+BlasStatus matmul_backward_weight(
+    BlasHandle& handle,
+    const TensorView& input,
+    const TensorView& grad_output,
+    TensorView grad_weight,
+    cudaStream_t stream = nullptr,
+    float alpha = 1.0f,
+    float beta = 0.0f);
+
+}  // namespace cuda
+}  // namespace quadtrix

From 7c9db4e009998859ecd1f30fdb7a749340a89c12 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Thu, 4 Jun 2026 10:43:39 +0530
Subject: [PATCH 30/45] feat(cuda): implement core Tensor, TensorShape, and
 TensorView abstractions

---
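Reviewer note: the row-major stride rule that `TensorShape::contiguous` applies can be illustrated in a few lines of Python. This is a sketch of the general technique rather than a port of the header. `contiguous_strides` and `is_contiguous` are illustrative names, strides are counted in elements (not bytes), and the sketch raises an exception where the header aborts.

```python
def contiguous_strides(dims):
    """Row-major strides in elements: the last dimension has stride 1."""
    if not dims or any(d <= 0 for d in dims):
        raise ValueError("dims must be non-empty and strictly positive")
    strides = [0] * len(dims)
    stride = 1
    # Walk from the innermost dimension outwards, accumulating the product.
    for i in range(len(dims) - 1, -1, -1):
        strides[i] = stride
        stride *= dims[i]
    return strides


def is_contiguous(dims, strides):
    """True when the strides match the row-major layout for dims."""
    return list(strides) == contiguous_strides(dims)


if __name__ == "__main__":
    print(contiguous_strides([2, 3, 4]))
```

For a shape of `[2, 3, 4]`, advancing one step along the first axis skips a full `3 x 4` block of elements, which is why its stride is the product of the trailing dimensions.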
 CUDA/includes/tensor.cuh | 168 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 168 insertions(+)
 create mode 100644 CUDA/includes/tensor.cuh

diff --git a/CUDA/includes/tensor.cuh b/CUDA/includes/tensor.cuh
new file mode 100644
index 0000000..c61d77e
--- /dev/null
+++ b/CUDA/includes/tensor.cuh
@@ -0,0 +1,168 @@
+#pragma once
+
+#include "common.h"
+#include "memory.cuh"
+
+#include <array>
+#include <cstddef>
+#include <cstdint>
+
+namespace quadtrix {
+namespace cuda {
+
+constexpr int kMaxTensorDims = 8;
+
+struct TensorShape {
+    int rank = 0;
+    std::array<std::int64_t, kMaxTensorDims> dims{};
+    std::array<std::int64_t, kMaxTensorDims> strides{};
+
+    static TensorShape contiguous(const std::int64_t* sizes, int ndim) {
+        if (ndim < 1 || ndim > kMaxTensorDims) {
+            std::fprintf(stderr, "Tensor rank %d is outside supported range [1, %d]\n", ndim, kMaxTensorDims);
+            std::abort();
+        }
+
+        TensorShape shape;
+        shape.rank = ndim;
+        for (int i = 0; i < ndim; ++i) {
+            if (sizes[i] <= 0) {
+                std::fprintf(stderr, "Tensor dimension %d must be positive, got %lld\n", i, static_cast<long long>(sizes[i]));
+                std::abort();
+            }
+            shape.dims[i] = sizes[i];
+        }
+
+        std::int64_t stride = 1;
+        for (int i = ndim - 1; i >= 0; --i) {
+            shape.strides[i] = stride;
+            stride *= shape.dims[i];
+        }
+        return shape;
+    }
+
+    std::size_t numel() const {
+        std::size_t total = 1;
+        for (int i = 0; i < rank; ++i) {
+            if (dims[i] <= 0) {
+                return 0;
+            }
+            std::size_t next = 0;
+            if (!checked_mul(total, static_cast<std::size_t>(dims[i]), &next)) {
+                return 0;
+            }
+            total = next;
+        }
+        return rank == 0 ? 0 : total;
+    }
+
+    bool is_contiguous() const {
+        std::int64_t expected = 1;
+        for (int i = rank - 1; i >= 0; --i) {
+            if (strides[i] != expected) {
+                return false;
+            }
+            expected *= dims[i];
+        }
+        return true;
+    }
+};
+
+struct TensorView {
+    void* data = nullptr;
+    TensorShape shape;
+    DType dtype = DType::F32;
+    DeviceKind device = DeviceKind::CUDA;
+    int device_id = 0;
+
+    std::size_t numel() const {
+        return shape.numel();
+    }
+
+    std::size_t bytes() const {
+        std::size_t out = 0;
+        if (!checked_mul(numel(), dtype_size(dtype), &out)) {
+            return 0;
+        }
+        return out;
+    }
+
+    template <typename T>
+    T* data_as() {
+        return static_cast<T*>(data);
+    }
+
+    template <typename T>
+    const T* data_as() const {
+        return static_cast<const T*>(data);
+    }
+};
+
+class Tensor {
+public:
+    Tensor() = default;
+
+    Tensor(const std::int64_t* dims, int rank, DType dtype, int device_id = 0)
+        : shape_(TensorShape::contiguous(dims, rank)), dtype_(dtype), device_id_(device_id) {
+        allocate();
+    }
+
+    Tensor(const Tensor&) = delete;
+    Tensor& operator=(const Tensor&) = delete;
+    Tensor(Tensor&&) noexcept = default;
+    Tensor& operator=(Tensor&&) noexcept = default;
+
+    TensorView view() {
+        return {storage_.data(), shape_, dtype_, DeviceKind::CUDA, device_id_};
+    }
+
+    TensorView view() const {
+        return {const_cast<void*>(storage_.data()), shape_, dtype_, DeviceKind::CUDA, device_id_};
+    }
+
+    const TensorShape& shape() const {
+        return shape_;
+    }
+
+    DType dtype() const {
+        return dtype_;
+    }
+
+    int device_id() const {
+        return device_id_;
+    }
+
+    std::size_t numel() const {
+        return shape_.numel();
+    }
+
+    std::size_t bytes() const {
+        return storage_.bytes();
+    }
+
+    void* data() {
+        return storage_.data();
+    }
+
+    const void* data() const {
+        return storage_.data();
+    }
+
+private:
+    void allocate() {
+        std::size_t bytes = 0;
+        if (!checked_mul(shape_.numel(), dtype_size(dtype_), &bytes)) {
+            std::fprintf(stderr, "Tensor allocation size overflow\n");
+            std::abort();
+        }
+        storage_.allocate(bytes, device_id_);
+    }
+
+    TensorShape shape_;
+    DType dtype_ = DType::F32;
+    int device_id_ = 0;
+    DeviceBuffer storage_;
+};
+
+}  // namespace cuda
+}  // namespace quadtrix
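
Illustrative note, not part of the patch: `TensorShape::contiguous` fills strides from the innermost dimension outwards, so the last dimension always has stride 1. The host-only sketch below repeats that loop and adds a flat-offset helper. The names `row_major_strides`, `flat_offset`, and `kMaxDims` are hypothetical and do not appear in tensor.cuh.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for kMaxTensorDims.
constexpr int kMaxDims = 8;

// Row-major (C-order) strides, measured in elements, for `ndim` sizes.
std::array<std::int64_t, kMaxDims> row_major_strides(const std::int64_t* sizes, int ndim) {
    assert(ndim >= 1 && ndim <= kMaxDims);
    std::array<std::int64_t, kMaxDims> strides{};
    std::int64_t stride = 1;
    for (int i = ndim - 1; i >= 0; --i) {
        strides[i] = stride;  // the innermost dimension gets stride 1
        stride *= sizes[i];   // each outer dimension skips a full inner block
    }
    return strides;
}

// Flat element offset of a multi-index under the given strides.
std::int64_t flat_offset(const std::int64_t* index,
                         const std::array<std::int64_t, kMaxDims>& strides,
                         int ndim) {
    std::int64_t offset = 0;
    for (int i = 0; i < ndim; ++i) {
        offset += index[i] * strides[i];
    }
    return offset;
}
```

For a shape of {2, 3, 4} this gives strides {12, 4, 1}, which is also the condition `is_contiguous()` checks for.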

From dbf79df0dba61b4e0ab6fc92dc1dc5656bf80fc4 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Thu, 4 Jun 2026 11:04:01 +0530
Subject: [PATCH 31/45] refactor: untie embedding and lm_head weights and to
 quadtrix

---
 CUDA/includes/memory.cuh | 120 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)
 create mode 100644 CUDA/includes/memory.cuh
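
Illustrative note, not part of the patch: `DeviceBuffer` is a move-only RAII owner whose move operations are built on `swap`. The sketch below shows the same ownership pattern on the host, with `std::malloc`/`std::free` standing in for `cudaMalloc`/`cudaFree` so it runs without a GPU. `HostBuffer` is a hypothetical name, and device selection is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Host-only stand-in for DeviceBuffer (assumption: malloc/free replace the CUDA calls).
class HostBuffer {
public:
    HostBuffer() = default;
    explicit HostBuffer(std::size_t bytes) { allocate(bytes); }
    ~HostBuffer() { release(); }

    // Copying is disabled so that exactly one object owns the allocation.
    HostBuffer(const HostBuffer&) = delete;
    HostBuffer& operator=(const HostBuffer&) = delete;

    // Move construction: start empty, then trade state with the source.
    HostBuffer(HostBuffer&& other) noexcept { swap(other); }

    // Move assignment: free our own memory first, then take over the source's.
    HostBuffer& operator=(HostBuffer&& other) noexcept {
        if (this != &other) {
            release();
            swap(other);
        }
        return *this;
    }

    void allocate(std::size_t bytes) {
        release();
        assert(empty());  // release() must leave the buffer empty
        if (bytes == 0) { return; }
        ptr_ = std::malloc(bytes);
        if (ptr_ == nullptr) { std::abort(); }
        bytes_ = bytes;
    }

    void release() {
        std::free(ptr_);  // free(nullptr) is a no-op
        ptr_ = nullptr;
        bytes_ = 0;
    }

    void* data() { return ptr_; }
    std::size_t bytes() const { return bytes_; }
    bool empty() const { return ptr_ == nullptr || bytes_ == 0; }

    void swap(HostBuffer& other) noexcept {
        std::swap(ptr_, other.ptr_);
        std::swap(bytes_, other.bytes_);
    }

private:
    void* ptr_ = nullptr;
    std::size_t bytes_ = 0;
};
```

After a move the source is empty, so its destructor has nothing to free and no double free can occur.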

diff --git a/CUDA/includes/memory.cuh b/CUDA/includes/memory.cuh
new file mode 100644
index 0000000..e08fa4a
--- /dev/null
+++ b/CUDA/includes/memory.cuh
@@ -0,0 +1,120 @@
+#pragma once
+
+#include "common.h"
+#include "runtime.cuh"
+
+#include <cuda_runtime.h>
+
+#include <cstddef>
+#include <utility>
+
+namespace quadtrix {
+namespace cuda {
+
+class DeviceBuffer {
+public:
+    DeviceBuffer() = default;
+
+    explicit DeviceBuffer(std::size_t bytes, int device_id = -1) {
+        allocate(bytes, device_id);
+    }
+
+    ~DeviceBuffer() {
+        release();
+    }
+
+    DeviceBuffer(const DeviceBuffer&) = delete;
+    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
+
+    DeviceBuffer(DeviceBuffer&& other) noexcept {
+        swap(other);
+    }
+
+    DeviceBuffer& operator=(DeviceBuffer&& other) noexcept {
+        if (this != &other) {
+            release();
+            swap(other);
+        }
+        return *this;
+    }
+
+    void allocate(std::size_t bytes, int device_id = -1) {
+        release();
+        if (bytes == 0) {
+            return;
+        }
+        if (device_id >= 0) {
+            device_id_ = device_id;
+            DeviceGuard guard(device_id);
+            QUADTRIX_CUDA_ABORT(cudaMalloc(&ptr_, bytes));
+        } else {
+            device_id_ = current_device();
+            QUADTRIX_CUDA_ABORT(cudaMalloc(&ptr_, bytes));
+        }
+        bytes_ = bytes;
+    }
+
+    void release() {
+        if (ptr_ != nullptr) {
+            if (device_id_ >= 0) {
+                DeviceGuard guard(device_id_);
+                cudaFree(ptr_);
+            } else {
+                cudaFree(ptr_);
+            }
+            ptr_ = nullptr;
+            bytes_ = 0;
+            device_id_ = -1;
+        }
+    }
+
+    void* data() {
+        return ptr_;
+    }
+
+    const void* data() const {
+        return ptr_;
+    }
+
+    std::size_t bytes() const {
+        return bytes_;
+    }
+
+    bool empty() const {
+        return ptr_ == nullptr || bytes_ == 0;
+    }
+
+    int device_id() const {
+        return device_id_;
+    }
+
+    void swap(DeviceBuffer& other) noexcept {
+        std::swap(ptr_, other.ptr_);
+        std::swap(bytes_, other.bytes_);
+        std::swap(device_id_, other.device_id_);
+    }
+
+private:
+    void* ptr_ = nullptr;
+    std::size_t bytes_ = 0;
+    int device_id_ = -1;
+};
+
+inline Status copy_h2d(void* dst_device, const void* src_host, std::size_t bytes, cudaStream_t stream = nullptr) {
+    return QUADTRIX_CUDA_CHECK(cudaMemcpyAsync(dst_device, src_host, bytes, cudaMemcpyHostToDevice, stream));
+}
+
+inline Status copy_d2h(void* dst_host, const void* src_device, std::size_t bytes, cudaStream_t stream = nullptr) {
+    return QUADTRIX_CUDA_CHECK(cudaMemcpyAsync(dst_host, src_device, bytes, cudaMemcpyDeviceToHost, stream));
+}
+
+inline Status copy_d2d(void* dst_device, const void* src_device, std::size_t bytes, cudaStream_t stream = nullptr) {
+    return QUADTRIX_CUDA_CHECK(cudaMemcpyAsync(dst_device, src_device, bytes, cudaMemcpyDeviceToDevice, stream));
+}
+
+inline Status memset_device(void* dst_device, int value, std::size_t bytes, cudaStream_t stream = nullptr) {
+    return QUADTRIX_CUDA_CHECK(cudaMemsetAsync(dst_device, value, bytes, stream));
+}
+
+}  // namespace cuda
+}  // namespace quadtrix

From 7c461b8e36084249a9b89e247ef47d6e4fc59b31 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Thu, 4 Jun 2026 11:04:49 +0530
Subject: [PATCH 32/45] feat(cuda): add NCCL communicator wrapper and
 all-reduce primitives

---
 CUDA/includes/nccl_all_reduce.cuh | 96 +++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 CUDA/includes/nccl_all_reduce.cuh
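
Illustrative note, not part of the patch: the header only declares `all_reduce_sum` and `all_reduce_average`, so how the average is computed is not visible here. One common approach, assumed in the sketch below, is a sum all-reduce followed by scaling with 1 / world_size. The sketch simulates ranks as host vectors and makes no NCCL calls; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only simulation: each inner vector plays the role of one rank's buffer.
// Afterwards every rank holds the element-wise sum, as an all-reduce would leave it.
void simulated_all_reduce_sum(std::vector<std::vector<float>>& ranks) {
    assert(!ranks.empty());
    const std::size_t n = ranks[0].size();
    std::vector<float> total(n, 0.0f);
    for (const auto& r : ranks) {
        assert(r.size() == n);  // all ranks must contribute the same element count
        for (std::size_t i = 0; i < n; ++i) {
            total[i] += r[i];
        }
    }
    for (auto& r : ranks) {
        r = total;
    }
}

// Average expressed as "sum, then scale by 1 / world_size".
// This is an assumption about the implementation, which the patch does not show.
void simulated_all_reduce_average(std::vector<std::vector<float>>& ranks) {
    simulated_all_reduce_sum(ranks);
    const float inv_world = 1.0f / static_cast<float>(ranks.size());
    for (auto& r : ranks) {
        for (float& x : r) {
            x *= inv_world;
        }
    }
}
```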

diff --git a/CUDA/includes/nccl_all_reduce.cuh b/CUDA/includes/nccl_all_reduce.cuh
new file mode 100644
index 0000000..c712a6a
--- /dev/null
+++ b/CUDA/includes/nccl_all_reduce.cuh
@@ -0,0 +1,96 @@
+#pragma once
+
+#include "tensor.cuh"
+
+#include <cuda_runtime.h>
+
+#ifdef QUADTRIX_ENABLE_NCCL
+#include <nccl.h>
+#else
+typedef struct {
+    char internal[128];
+} ncclUniqueId;
+typedef struct ncclComm* ncclComm_t;
+typedef enum {
+    ncclSuccess = 0,
+    ncclUnhandledCudaError = 1,
+    ncclSystemError = 2,
+    ncclInternalError = 3,
+    ncclInvalidArgument = 4,
+    ncclInvalidUsage = 5,
+    ncclNumResults = 6
+} ncclResult_t;
+#endif
+
+namespace quadtrix {
+namespace cuda {
+
+struct NcclStatus {
+    bool ok;
+    ncclResult_t nccl_status;
+    const char* message;
+
+    static NcclStatus success() {
+        return {true, ncclSuccess, "ok"};
+    }
+
+    static NcclStatus failure(ncclResult_t status, const char* message) {
+        return {false, status, message};
+    }
+};
+
+const char* nccl_status_name(ncclResult_t status);
+
+class NcclCommunicator {
+public:
+    NcclCommunicator() = default;
+    NcclCommunicator(ncclUniqueId unique_id, int world_size, int rank, int device_id);
+    ~NcclCommunicator();
+
+    NcclCommunicator(const NcclCommunicator&) = delete;
+    NcclCommunicator& operator=(const NcclCommunicator&) = delete;
+
+    NcclCommunicator(NcclCommunicator&& other) noexcept;
+    NcclCommunicator& operator=(NcclCommunicator&& other) noexcept;
+
+    ncclComm_t get() const {
+        return comm_;
+    }
+
+    int world_size() const {
+        return world_size_;
+    }
+
+    int rank() const {
+        return rank_;
+    }
+
+    int device_id() const {
+        return device_id_;
+    }
+
+    bool valid() const {
+        return comm_ != nullptr;
+    }
+
+private:
+    ncclComm_t comm_ = nullptr;
+    int world_size_ = 1;
+    int rank_ = 0;
+    int device_id_ = 0;
+};
+
+NcclStatus create_unique_id(ncclUniqueId* unique_id);
+
+NcclStatus all_reduce_sum(
+    NcclCommunicator& communicator,
+    TensorView tensor,
+    cudaStream_t stream = nullptr);
+
+NcclStatus all_reduce_average(
+    NcclCommunicator& communicator,
+    TensorView tensor,
+    cudaStream_t stream = nullptr);
+
+}  // namespace cuda
+}  // namespace quadtrix

From c5d06b6f3b8f70e2af6da338d7eeb1ebaf4bd94b Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Thu, 4 Jun 2026 22:04:35 +0530
Subject: [PATCH 33/45] Update README.md with workflow badges

Added badges for release, package, and CI workflows.
---
 README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/README.md b/README.md
index 56f99cc..d8a6ca1 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,9 @@
 
 <p align="center">
   <img width="785" height="261" alt="image" src="https://github.com/user-attachments/assets/7bd2d8c6-d1e3-4ca0-96c0-0161d3cf235a" />
+
+  [![Release](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml)  [![Package](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml)
+  [![CI](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml)
 </p>
 
 A local large language model with a modular, multi-path execution architecture. Train, run inference, and serve a chat interface — all from a single repository, across bare-metal C++, PyTorch, and a React frontend.

From e0400256d1c0f3540c708e340812079a12da4318 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sat, 6 Jun 2026 18:04:07 +0530
Subject: [PATCH 34/45] kernels: add AdamW optimization kernel with stochastic
 rounding  Introduces the AdamW fused CUDA kernel including linear
 interpolation  optimizations (`lerp`), multi-slice batching support via 2D
 grids, and  `init_from_master` utility functions for low-precision parameter
 handling.

---
 CUDA/llmcpp/adamw.cuh | 98 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 CUDA/llmcpp/adamw.cuh
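
Illustrative note, not part of the patch: the sketch below is a scalar host reference for the math in the kernel, which can help when checking device results. It uses the same two-FMA `lerp` form and the same order of operations, but omits stochastic rounding, the master-weight copy, and `grad_scale`. The names `lerp2`, `AdamWState`, and `adamw_step` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Two-FMA linear interpolation: start + weight * (end - start).
inline float lerp2(float start, float end, float weight) {
    return std::fma(weight, end, std::fma(-weight, start, start));
}

struct AdamWState {
    float param;
    float m;  // first moment
    float v;  // second moment
};

// One AdamW step for a single scalar parameter.
// `t` is the 1-based step count used for bias correction.
inline AdamWState adamw_step(AdamWState s, float grad, float lr, float beta1, float beta2,
                             int t, float eps, float weight_decay) {
    assert(t >= 1);
    s.m = lerp2(grad, s.m, beta1);         // m = beta1 * m + (1 - beta1) * grad
    s.v = lerp2(grad * grad, s.v, beta2);  // v = beta2 * v + (1 - beta2) * grad^2
    const float m_hat = s.m / (1.0f - std::pow(beta1, static_cast<float>(t)));
    const float v_hat = s.v / (1.0f - std::pow(beta2, static_cast<float>(t)));
    // Decoupled weight decay is applied to the old parameter value, as in the kernel.
    s.param -= lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * s.param);
    return s;
}
```

On the first step the bias correction cancels the (1 - beta) factors, so with zero weight decay the parameter moves by roughly `lr` in the direction opposite to the gradient's sign.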

diff --git a/CUDA/llmcpp/adamw.cuh b/CUDA/llmcpp/adamw.cuh
new file mode 100644
index 0000000..4453576
--- /dev/null
+++ b/CUDA/llmcpp/adamw.cuh
@@ -0,0 +1,98 @@
+/*
+AdamW kernel
+*/
+
+// llmc internal imports
+#include "cuda_common.h"
+#include "cuda_utils.cuh"
+
+// ----------------------------------------------------------------------------
+// CUDA kernels
+
+// Implements linear interpolation using only two floating-point operations (as opposed to three in a naive implementation).
+// Reference: https://developer.nvidia.com/blog/lerp-faster-cuda
+__device__ float lerp(float start, float end, float weight) {
+    return fma(weight, end, fma(-weight, start, start));
+}
+
+template <typename Tp, typename Tg>
+__device__ void adamw_update(Tp* params_memory, float* master_params_memory, Tg* grads_memory, float* m_memory, float* v_memory, size_t num_parameters,
+                             float learning_rate, float beta1, float beta2, float beta1_correction, float beta2_correction, float eps, float weight_decay,
+                             float grad_scale, unsigned int seed) {
+    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx >= num_parameters) { return; }  // guard
+
+    // get the gradient, m, and v for this parameter
+    float grad = grad_scale * (float)grads_memory[idx];
+    float m = m_memory[idx];
+    float v = v_memory[idx];
+    // update the first moment (momentum)
+    m = lerp(grad, m, beta1);
+    m_memory[idx] = m;
+    // update the second moment (RMSprop)
+    v = lerp(grad * grad, v, beta2);
+    v_memory[idx] = v;
+    m /= beta1_correction;  // m_hat
+    v /= beta2_correction;  // v_hat
+    // fetch the old value of this parameter as a float, from either source
+    float old_param = (master_params_memory != NULL) ? master_params_memory[idx] : (float)params_memory[idx];
+    // update this parameter
+    float param = old_param - (learning_rate * (m / (sqrtf(v) + eps) + weight_decay * old_param));
+    // update our low precision version of the parameters using stochastic rounding
+    // this will be used in the next forward pass
+    stochastic_rounding(param, &params_memory[idx], seed);
+    // write the full, float version of the param into our master copy, if we maintain one
+    // this will be used in the next update
+    if (master_params_memory != NULL) { master_params_memory[idx] = param; }
+}
+
+template <typename Tp, typename Tg>
+__global__ void adamw_kernel3(Tp* params_memory, float* master_params_memory, Tg* grads_memory, float* m_memory, float* v_memory, size_t num_parameters,
+                              ptrdiff_t w_stride, ptrdiff_t g_stride, ptrdiff_t s_stride,
+                              float learning_rate, float beta1, float beta2, float beta1_correction, float beta2_correction, float eps, float weight_decay,
+                              float grad_scale, unsigned int seed) {
+    adamw_update(params_memory + blockIdx.y * w_stride,
+                 master_params_memory ? master_params_memory + blockIdx.y * s_stride : NULL,
+                 grads_memory + blockIdx.y * g_stride,
+                 m_memory + blockIdx.y * s_stride,
+                 v_memory + blockIdx.y * s_stride,
+                 num_parameters, learning_rate, beta1, beta2, beta1_correction, beta2_correction, eps, weight_decay, grad_scale,
+                 seed
+                 );
+}
+
+template <typename Tp>
+__global__ void init_from_master_kernel(Tp* params_memory, float* master_params_memory, size_t num_parameters,
+                                          ptrdiff_t w_stride, ptrdiff_t s_stride, unsigned int seed) {
+    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx >= num_parameters) { return; }
+    params_memory += blockIdx.y * w_stride; // adjust for layer offset
+    master_params_memory += blockIdx.y * s_stride;
+    stochastic_rounding(master_params_memory[idx], &params_memory[idx], seed);
+}
+
+template <typename Tp, typename Tg>
+void adamw_update(Tp* params_memory, float* master_params_memory, Tg* grads_memory, float* m_memory, float* v_memory, size_t num_parameters,
+                  ptrdiff_t w_stride, ptrdiff_t g_stride, ptrdiff_t s_stride,  int num_slices, float learning_rate, float beta1, float beta2, int t, float eps, float weight_decay,
+                  float grad_scale, unsigned int seed, cudaStream_t stream) {
+    // AdamW update
+    int block_size = 512;
+    int num_blocks = CEIL_DIV(num_parameters, block_size);
+    float beta1_correction = 1.0f - powf(beta1, t);
+    float beta2_correction = 1.0f - powf(beta2, t);
+    adamw_kernel3<<<dim3(num_blocks, num_slices), block_size, 0, stream>>>(params_memory, master_params_memory, grads_memory,
+                                                         m_memory, v_memory, num_parameters, w_stride, g_stride, s_stride,
+                                                         learning_rate, beta1, beta2, beta1_correction, beta2_correction, eps, weight_decay,
+                                                         grad_scale, seed);
+    cudaCheck(cudaGetLastError());
+}
+
+template <typename Tp>
+void init_from_master(Tp* params_memory, float* master_params_memory, size_t num_parameters,
+                        ptrdiff_t w_stride, ptrdiff_t s_stride, int num_slices, unsigned int seed, cudaStream_t stream) {
+    int block_size = 512; // must match block size of adamw_update so that RNG also matches
+    int num_blocks = CEIL_DIV(num_parameters, block_size);
+    init_from_master_kernel<<<dim3(num_blocks, num_slices), block_size, 0, stream>>>
+                             (params_memory, master_params_memory, num_parameters, w_stride, s_stride, seed);
+    cudaCheck(cudaGetLastError());
+}

From c3dc5ae4c37280839ebf95feb4ce8acd51da8b1c Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sat, 6 Jun 2026 18:07:43 +0530
Subject: [PATCH 35/45] cudnn: implement cached SDPA forward graph using cuDNN
 frontend

---
 CUDA/llmcpp/cudnn_att.cpp | 297 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 297 insertions(+)
 create mode 100644 CUDA/llmcpp/cudnn_att.cpp
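
Illustrative note, not part of the patch: the graph lookup is a memoization keyed on the problem shape, so the slow build runs once per distinct shape. The sketch below shows that pattern with a stand-in type instead of `fe::graph::Graph`. The names `FakeGraph`, `build_count`, and `lookup_or_build` are hypothetical, and like the function-local static map in the patch, this is not thread-safe.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <tuple>

// Stand-in for fe::graph::Graph; it only records the shape it was built for.
struct FakeGraph {
    int B, H, T, HS;
};

// Counts how often the "expensive" build path runs.
inline int& build_count() {
    static int count = 0;
    return count;
}

std::shared_ptr<FakeGraph> lookup_or_build(int B, int H, int T, int HS) {
    assert(B > 0 && H > 0 && T > 0 && HS > 0);
    static std::map<std::tuple<int, int, int, int>, std::shared_ptr<FakeGraph>> cache;

    const auto key = std::make_tuple(B, H, T, HS);
    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second;  // cache hit: reuse the graph built earlier
    }

    ++build_count();  // cache miss: this is where the slow build would happen
    auto graph = std::make_shared<FakeGraph>(FakeGraph{B, H, T, HS});
    cache.insert({key, graph});
    return graph;
}
```

A second call with the same shape returns the same `shared_ptr`; only a new shape triggers another build.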

diff --git a/CUDA/llmcpp/cudnn_att.cpp b/CUDA/llmcpp/cudnn_att.cpp
new file mode 100644
index 0000000..0330abe
--- /dev/null
+++ b/CUDA/llmcpp/cudnn_att.cpp
@@ -0,0 +1,297 @@
+// all cudnn-related functions are in this file, so that they don't need to be recompiled every time
+// we change some unrelated piece of the code.
+// TODO this currently duplicates some of the utilities from the main file
+
+#define NOMINMAX
+#include <unistd.h>
+#include "cudnn_att.h"
+#include <cudnn_frontend.h>
+
+namespace fe = cudnn_frontend;
+
+// Specific configurations based on the enabled precision
+#if defined(ENABLE_FP32)
+static_assert(false, "cuDNN is not supported in FP32 mode.");
+// use fp16 (note: this may require gradient scaler, currently not implemented!)
+#elif defined(ENABLE_FP16)
+#define CUDNN_16BIT fe::DataType_t::HALF
+#else // Default to bfloat16
+#define CUDNN_16BIT fe::DataType_t::BFLOAT16
+#endif
+
+static cudnnHandle_t cudnn_handle;
+static size_t cudnn_workspace_size = 0; // dynamically allocated as needed (up to 256MiB!)
+static void* cudnn_workspace = NULL;
+
+static void cuDNNCheck(cudnnStatus_t error, const char *file, int line) {
+    if (error != CUDNN_STATUS_SUCCESS) {
+        printf("[CUDNN ERROR] at file %s:%d:\n%s\n", file, line, cudnnGetErrorString(error));
+        exit(EXIT_FAILURE);
+    }
+}
+#define cuDNNCheck(err) (cuDNNCheck(err, __FILE__, __LINE__))
+
+static void checkCudnnFE(const fe::error_object& e, const char *file, int line) {
+    if(!e.is_good()) {
+        printf("[CUDNN ERROR] at file %s:%d:\n%s\n", file, line, e.err_msg.c_str());
+        exit(EXIT_FAILURE);
+    }
+}
+#define checkCudnnFE(err) checkCudnnFE(err, __FILE__, __LINE__)
+
+enum UIDs {
+    Q_UID,
+    K_UID,
+    V_UID,
+    Attn_scale_UID,
+    O_UID,
+    Stats_UID,
+    dO_UID,
+    dQ_UID,
+    dK_UID,
+    dV_UID
+};
+
+// Need a cache because graph->build_operation_graph() is slow but everything else seems fast
+using cache_type_fwd = std::map<std::tuple<int,int,int,int, int>, std::shared_ptr<fe::graph::Graph>>;
+using cache_type_bwd = std::map<std::tuple<int,int,int,int>, std::shared_ptr<fe::graph::Graph>>;
+
+// Loosely based on cuDNN frontend samples functions and massively simplified
+auto lookup_cache_or_build_graph_fwd(int B,int H,int T,int HS, int is_inference_only) {
+
+    static cache_type_fwd user_maintained_cache_fwd;
+
+    auto key = std::make_tuple(B, H, T, HS, is_inference_only);
+
+    auto it = user_maintained_cache_fwd.find(key);
+    if (it != user_maintained_cache_fwd.end()) {
+        return it->second;
+    }
+
+    auto graph = std::make_shared<fe::graph::Graph>();
+    graph->set_io_data_type(CUDNN_16BIT)
+          .set_intermediate_data_type(fe::DataType_t::FLOAT)
+          .set_compute_data_type(fe::DataType_t::FLOAT);
+
+    // QKV is (B, T, 3, NH, HS) which cuDNN can handle directly without an external permute
+    auto Q = graph->tensor(fe::graph::Tensor_attributes().set_name("Q")
+                               .set_dim({B, H, T, HS})
+                               .set_uid(Q_UID)
+                               .set_stride({3 * H * HS * T,  HS, 3 * H * HS, 1}));
+    auto K = graph->tensor(fe::graph::Tensor_attributes().set_name("K")
+                               .set_dim({B, H, T, HS})
+                               .set_uid(K_UID)
+                               .set_stride({3 * H * HS * T, HS, 3 * H * HS, 1}));
+    auto V = graph->tensor(fe::graph::Tensor_attributes().set_name("V")
+                               .set_dim({B, H, T, HS})
+                               .set_uid(V_UID)
+                               .set_stride({3 * H * HS * T, HS, 3 * H * HS, 1}));
+    auto attn_scale = graph->tensor(fe::graph::Tensor_attributes().set_name("attn_scale")
+                               .set_dim({1, 1, 1, 1})
+                               .set_stride({1, 1, 1, 1})
+                               .set_uid(Attn_scale_UID)
+                               .set_is_pass_by_value(true)
+                               .set_data_type(fe::DataType_t::FLOAT));
+
+    auto sdpa_options = fe::graph::SDPA_attributes().set_name("flash_attention");
+    sdpa_options.set_is_inference(is_inference_only);
+    sdpa_options.set_attn_scale(attn_scale);
+    sdpa_options.set_causal_mask(true);
+
+    // Create the graph operation and get the output tensors back
+    auto [O, stats] = graph->sdpa(Q, K, V, sdpa_options);
+
+    // Output is (B, T, NH, HS) BF16/FP16 and stats for backward pass is (B, NH, T) FP32
+    O->set_output(true).set_dim({B, H, T, HS}).set_stride({H * HS * T, HS, H * HS, 1}).set_uid(O_UID);
+
+    assert(stats == nullptr || is_inference_only == false);
+    if (is_inference_only == false) {
+        stats->set_output(true).set_data_type(fe::DataType_t::FLOAT)
+                               .set_dim({B, H, T, 1})
+                               .set_stride({H * T, T, 1, 1})
+                               .set_uid(Stats_UID);
+    }
+
+    checkCudnnFE(graph->validate());
+
+    // Build the operation graph and execution part (this is the VERY SLOW PART)
+    checkCudnnFE(graph->build_operation_graph(cudnn_handle));
+    auto plans = graph->create_execution_plans({fe::HeurMode_t::A});
+    checkCudnnFE(graph->check_support(cudnn_handle));
+    checkCudnnFE(graph->build_plans(cudnn_handle));
+    // Reallocate the workspace if the required size is greater than the current workspace
+    // In H100 this may be around 16B
+    if (graph->get_workspace_size() > cudnn_workspace_size) {
+        if (cudnn_workspace_size > 0) {
+            cudaCheck(cudaFree(cudnn_workspace));
+        }
+        cudnn_workspace_size = graph->get_workspace_size();
+        cudaCheck(cudaMalloc(&cudnn_workspace, cudnn_workspace_size));
+    }
+
+    user_maintained_cache_fwd.insert({key, graph});
+
+    return graph;
+}
+
+auto lookup_cache_or_build_graph_bwd(int B, int NH, int T, int HS) {
+    static cache_type_bwd user_maintained_cache_bwd;
+
+    auto key = std::make_tuple(B, NH, T, HS);
+
+    auto it = user_maintained_cache_bwd.find(key);
+    if (it != user_maintained_cache_bwd.end()) {
+        return it->second;
+    }
+
+    auto graph = std::make_shared<fe::graph::Graph>();
+    graph->set_io_data_type(CUDNN_16BIT)
+          .set_intermediate_data_type(fe::DataType_t::FLOAT)
+          .set_compute_data_type(fe::DataType_t::FLOAT);
+
+    // (B, T, 3, NH, HS)
+    // must come from inp (which means we also need to convert THAT to FP16)
+    auto Q = graph->tensor(fe::graph::Tensor_attributes().set_name("Q")
+                            .set_dim({B, NH, T, HS})
+                            .set_uid(Q_UID)
+                            .set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}));
+    auto K = graph->tensor(fe::graph::Tensor_attributes().set_name("K")
+                            .set_dim({B, NH, T, HS})
+                            .set_uid(K_UID)
+                            .set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}));
+    auto V = graph->tensor(fe::graph::Tensor_attributes().set_name("V")
+                            .set_dim({B, NH, T, HS})
+                            .set_uid(V_UID)
+                            .set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}));
+    auto O = graph->tensor(fe::graph::Tensor_attributes().set_name("O")
+                            .set_dim({B, NH, T, HS})
+                            .set_uid(O_UID)
+                            .set_stride({NH * HS * T, HS, NH * HS, 1}));
+    auto dO = graph->tensor(fe::graph::Tensor_attributes().set_name("dO")
+                            .set_dim({B, NH, T, HS})
+                            .set_uid(dO_UID)
+                            .set_stride({NH * HS * T, HS, NH * HS, 1}));
+
+    auto stats = graph->tensor(fe::graph::Tensor_attributes().set_name("stats")
+                            .set_dim({B, NH, T, 1})
+                            .set_uid(Stats_UID)
+                            .set_stride({NH * T, T, 1, 1})
+                            .set_data_type(fe::DataType_t::FLOAT));
+    auto attn_scale = graph->tensor(fe::graph::Tensor_attributes().set_name("attn_scale")
+                            .set_dim({1, 1, 1, 1})
+                            .set_stride({1, 1, 1, 1})
+                            .set_is_pass_by_value(true)
+                            .set_uid(Attn_scale_UID)
+                            .set_data_type(fe::DataType_t::FLOAT));
+    auto sdpa_backward_options = fe::graph::SDPA_backward_attributes().set_name("flash_attention_backward")
+#if CUDNN_FRONTEND_MAJOR_VERSION > 1 || CUDNN_FRONTEND_MINOR_VERSION >= 5
+                            .set_deterministic_algorithm(true) // 1.5+ needs this for determinism
+#endif
+                            .set_causal_mask(true)
+                            .set_attn_scale(attn_scale);
+
+    // Create the graph operation and get the output tensors back
+    auto [dQ, dK, dV] = graph->sdpa_backward(Q, K, V, O, dO, stats, sdpa_backward_options);
+
+    dQ->set_output(true).set_dim({B, NH, T, HS}).set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}).set_uid(dQ_UID);
+    dK->set_output(true).set_dim({B, NH, T, HS}).set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}).set_uid(dK_UID);
+    dV->set_output(true).set_dim({B, NH, T, HS}).set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}).set_uid(dV_UID);
+
+    checkCudnnFE(graph->validate());
+
+    // Build the operation graph and execution part (this is the VERY SLOW PART)
+    checkCudnnFE(graph->build_operation_graph(cudnn_handle));
+    auto plans = graph->create_execution_plans({fe::HeurMode_t::A});
+    checkCudnnFE(graph->check_support(cudnn_handle));
+    checkCudnnFE(graph->build_plans(cudnn_handle));
+
+    // Reallocate the workspace if the required size is greater than the current workspace
+    // By default, cuDNN uses up to 256MiB of workspace, so we don't want to just allocate the maximum
+    if (graph->get_workspace_size() > cudnn_workspace_size) {
+        if (cudnn_workspace_size > 0) {
+            cudaCheck(cudaFree(cudnn_workspace));
+        }
+        cudnn_workspace_size = graph->get_workspace_size();
+        cudaCheck(cudaMalloc(&cudnn_workspace, cudnn_workspace_size));
+    }
+
+    user_maintained_cache_bwd.insert({key, graph});
+    return graph;
+}
+
+void attention_forward_cudnn(floatX* out,  // output: (B, T, NH, HS)
+                             float* stats, // output for backward pass: (B, NH, T)
+                             floatX* inp,  // input: (B, T, 3, NH, HS) QKV
+                             int B, int T, int NH, int C, cudaStream_t stream) {
+    NVTX_RANGE_FN();
+    int HS = C / NH; // number of features per head
+    bool is_inference_only = (stats == nullptr);
+
+    cuDNNCheck(cudnnSetStream(cudnn_handle, stream));
+
+    // Get graph and tensors from cache (or generate it on first use)
+    auto graph = lookup_cache_or_build_graph_fwd(B, NH, T, HS, is_inference_only);
+
+    // Prepare all the tensor pointers for executing the graph
+    void* devPtrQ = inp;
+    void* devPtrK = (inp + C);
+    void* devPtrV = (inp + 2 * C);
+    float attn_scale_cpu = 1.0 / sqrtf(HS);
+    void* devPtrO = out;
+
+    // Build variant pack
+    std::unordered_map<int64_t , void*> variant_pack = {
+        {Q_UID, devPtrQ}, {K_UID, devPtrK}, {V_UID, devPtrV}, {Attn_scale_UID, &attn_scale_cpu}, {O_UID, devPtrO}};
+
+    // Add the stats tensor unless we are only doing inference (only needed for backward pass)
+    if (is_inference_only == false) {
+        variant_pack[Stats_UID] = stats;
+    }
+
+    // Execute graph
+    checkCudnnFE(graph->execute(cudnn_handle, variant_pack, cudnn_workspace));
+    cudaCheck(cudaGetLastError());
+}
+
+void attention_backward_cudnn(floatX* dqkvr,                                       // output
+                              floatX* dout, floatX* qkvr, floatX* o, float* stats, // inputs
+                              int B, int T, int NH, int C, cudaStream_t stream) {
+    NVTX_RANGE_FN();
+    int HS = C / NH; // number of features per head
+
+    // Get graph and tensors from cache (or generate it on first use)
+    auto graph = lookup_cache_or_build_graph_bwd(B, NH, T, HS);
+
+    // Prepare all the tensor pointers for executing the graph
+    void* devPtrQ = qkvr;
+    void* devPtrK = (qkvr + NH * HS);
+    void* devPtrV = (qkvr + 2 * NH * HS);
+    void* devPtrO = o;
+    void* devPtrdO = dout;
+    void* devPtrStats = stats;
+    float attn_scale_cpu = 1.0 / sqrtf(HS);
+
+    void* devPtrdQ = dqkvr;
+    void* devPtrdK = (dqkvr + NH * HS);
+    void* devPtrdV = (dqkvr + 2 * NH * HS);
+
+    // Build variant pack that links each tensor to its data pointer
+    std::unordered_map<int64_t, void*> variant_pack = {
+        {Q_UID, devPtrQ}, {K_UID, devPtrK}, {V_UID, devPtrV}, {O_UID, devPtrO}, {dO_UID, devPtrdO}, {Stats_UID, devPtrStats},
+        {dQ_UID, devPtrdQ}, {dK_UID, devPtrdK}, {dV_UID, devPtrdV},
+        {Attn_scale_UID, &attn_scale_cpu}};
+
+    // Execute graph
+    cuDNNCheck(cudnnSetStream(cudnn_handle, stream));
+    checkCudnnFE(graph->execute(cudnn_handle, variant_pack, cudnn_workspace));
+    cudaCheck(cudaGetLastError());
+}
+
+void create_cudnn() {
+    cuDNNCheck(cudnnCreate(&cudnn_handle));
+}
+
+void destroy_cudnn() {
+    if (cudnn_workspace != NULL) { cudaCheck(cudaFree(cudnn_workspace)); }
+    cuDNNCheck(cudnnDestroy(cudnn_handle));
+}
\ No newline at end of file
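
Illustrative note, not part of the patch: the Q, K, and V tensors are declared as (B, NH, T, HS) views with strides {3*NH*HS*T, HS, 3*NH*HS, 1} over a packed (B, T, 3, NH, HS) buffer, with the K and V base pointers offset by NH*HS and 2*NH*HS elements. The host-only sketch below checks that both addressing schemes reach the same element. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Flat index into a packed QKV buffer laid out as (B, T, 3, NH, HS).
// `which` selects Q (0), K (1), or V (2).
inline std::int64_t packed_qkv_index(int b, int t, int which, int h, int d,
                                     int T, int NH, int HS) {
    assert(which >= 0 && which < 3);
    return (((static_cast<std::int64_t>(b) * T + t) * 3 + which) * NH + h) * HS + d;
}

// The same element addressed as a (B, NH, T, HS) view with strides
// {3*NH*HS*T, HS, 3*NH*HS, 1}, starting from a base offset of which*NH*HS.
inline std::int64_t strided_view_index(int b, int h, int t, int d, int which,
                                       int T, int NH, int HS) {
    const std::int64_t base = static_cast<std::int64_t>(which) * NH * HS;
    const std::int64_t stride_b = 3LL * NH * HS * T;
    const std::int64_t stride_h = HS;
    const std::int64_t stride_t = 3LL * NH * HS;
    return base + b * stride_b + h * stride_h + t * stride_t + d;
}
```

Because the two indices agree for every (b, t, which, h, d), the graph can read the packed buffer in place without a separate permute step.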

From 49099aeb88a87ed8a1f95493afd1827ed5507257 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sat, 6 Jun 2026 18:08:37 +0530
Subject: [PATCH 36/45] feat(cuda): implement Packed128 memory vectorization
 utilities

---
 CUDA/llmcpp/cuda_utils.cuh | 286 +++++++++++++++++++++++++++++++++++++
 1 file changed, 286 insertions(+)
 create mode 100644 CUDA/llmcpp/cuda_utils.cuh
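
Illustrative note, not part of the patch: `Packed128` moves 16 bytes at a time by copying between an element array and an `int4`. The sketch below reproduces that round trip on the host so the packing logic can be exercised without a GPU. `Bits128` is a hypothetical stand-in for CUDA's `int4`, `HostPacked128` is a hypothetical name, and the cache-hint variants (`__ldcs`, `__stcs`, `__stcg`) have no host equivalent here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host stand-in for CUDA's int4: four 32-bit lanes, 16 bytes in total.
struct alignas(16) Bits128 {
    std::int32_t x, y, z, w;
};

template <class ElementType>
struct alignas(16) HostPacked128 {
    // Number of elements that fit into 128 bits (4 for float, 8 for 16-bit types).
    static constexpr std::size_t size = sizeof(Bits128) / sizeof(ElementType);
    ElementType payload[size];

    static HostPacked128 from_bits(Bits128 bits) {
        static_assert(sizeof(Bits128) == sizeof(ElementType) * size, "Size mismatch.");
        HostPacked128 out;
        std::memcpy(out.payload, &bits, sizeof(bits));
        return out;
    }

    Bits128 get_bits() const {
        Bits128 bits;
        std::memcpy(&bits, payload, sizeof(bits));
        return bits;
    }
};

// Load 16 bytes from a 16-byte-aligned address.
template <class ElementType>
HostPacked128<ElementType> load128(const ElementType* address) {
    assert(reinterpret_cast<std::uintptr_t>(address) % 16 == 0);
    Bits128 bits;
    std::memcpy(&bits, address, sizeof(bits));
    return HostPacked128<ElementType>::from_bits(bits);
}

// Store 16 bytes to a 16-byte-aligned address.
template <class ElementType>
void store128(ElementType* target, const HostPacked128<ElementType>& value) {
    assert(reinterpret_cast<std::uintptr_t>(target) % 16 == 0);
    const Bits128 bits = value.get_bits();
    std::memcpy(target, &bits, sizeof(bits));
}
```

The alignment asserts mirror the requirement of the device version, where a misaligned 128-bit access is an error rather than merely a slow path.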

diff --git a/CUDA/llmcpp/cuda_utils.cuh b/CUDA/llmcpp/cuda_utils.cuh
new file mode 100644
index 0000000..030ec07
--- /dev/null
+++ b/CUDA/llmcpp/cuda_utils.cuh
@@ -0,0 +1,286 @@
+// Utilities for use in __device__ code
+
+#ifndef CUDA_UTILS_CUH
+#define CUDA_UTILS_CUH
+
+#include "cuda_common.h"
+
+// ----------------------------------------------------------------------------
+// Packed128 data structure that forces the compiler to use 128-bit loads/stores
+// on GPUs that support them (the LDG.128 and STS.128 instructions).
+// This is a bit similar to the use of float4 in the case of 32-bit floats, but
+// supports arbitrary precision.
+
+template<class ElementType>
+struct alignas(16) Packed128 {
+    Packed128() = default;
+    __device__ explicit Packed128(int4 bits) {
+        static_assert(sizeof(bits) == sizeof(payload), "Size mismatch.");
+        memcpy(&payload, &bits, sizeof(bits));
+    }
+
+    __device__  static Packed128 constant(ElementType value) {
+        Packed128 result;
+        for(int k = 0; k < size; ++k) {
+            result.payload[k] = value;
+        }
+        return result;
+    }
+    __device__ static Packed128 zeros() {
+        return constant(0.f);
+    }
+    __device__ static Packed128 ones() {
+        return constant(1.f);
+    }
+
+    __device__ ElementType& operator[](int index) {
+        return payload[index];
+    }
+    __device__ const ElementType& operator[](int index) const {
+        return payload[index];
+    }
+    __device__ int4 get_bits() const {
+        int4 bits;
+        static_assert(sizeof(bits) == sizeof(payload), "Size mismatch.");
+        memcpy(&bits, &payload, sizeof(bits));
+        return bits;
+    }
+    static constexpr const size_t size = sizeof(int4) / sizeof(ElementType);
+    ElementType payload[size];
+};
+
+// load a Packed128 from an aligned memory address
+template<class ElementType>
+__device__ Packed128<ElementType> load128(const ElementType* address) {
+    return Packed128<ElementType>{*reinterpret_cast<const int4*>(address)};
+}
+// load a Packed128 from an aligned memory address with streaming cache hint
+template<class ElementType>
+__device__ Packed128<ElementType> load128cs(const ElementType* address) {
+    return Packed128<ElementType>{__ldcs(reinterpret_cast<const int4*>(address))};
+}
+// store a Packed128 to an aligned memory address
+template<class ElementType>
+__device__ void store128(ElementType* target, Packed128<ElementType> value) {
+    *reinterpret_cast<int4*>(target) = value.get_bits();
+}
+// store a Packed128 to an aligned memory address with streaming cache hint
+template<class ElementType>
+__device__ void store128cs(ElementType* target, Packed128<ElementType> value) {
+    __stcs(reinterpret_cast<int4*>(target), value.get_bits());
+}
+// store a Packed128 to an aligned memory address while caching in L2 but bypassing L1
+template<class ElementType>
+__device__ void store128cg(ElementType* target, Packed128<ElementType> value) {
+    __stcg(reinterpret_cast<int4*>(target), value.get_bits());
+}
+
+// short-form typedefs
+typedef Packed128<float> f128;
+typedef Packed128<floatX> x128;
+
+// ----------------------------------------------------------------------------
+// DType support
+
+// enumerator to identify the datatype of a tensor.
+enum class DType : uint8_t {
+    FP32, FP16, BF16
+};
+
+// Given a datatype enum, returns the underlying number of bytes
+// for a scalar of that type
+size_t sizeof_dtype(DType type) {
+    switch (type) {
+        case DType::FP32:
+            return sizeof(float);
+        case DType::FP16:
+            return sizeof(half);
+        case DType::BF16:
+            return sizeof(nv_bfloat16);
+        default: // handle or get compiler warning
+            fprintf(stderr, "Unknown datatype\n");
+            exit(EXIT_FAILURE);
+    }
+}
+
+DType dtype_of(float* f) { return DType::FP32; }
+DType dtype_of(nv_bfloat16 * f) { return DType::BF16; }
+DType dtype_of(half * f) { return DType::FP16; }
+
+
+
+// ----------------------------------------------------------------------------
+// Copy, cast functions
+
+// device functions and the kernel to cast data between types
+template<typename Td, typename Ts>
+__device__ Td cast_value(Ts val);
+
+template<>
+__device__ float cast_value<float, float>(float val) {
+    return val;
+}
+
+template<>
+__device__ float cast_value<float, half>(half val) {
+    return __half2float(val);
+}
+
+template<>
+__device__ float cast_value<float, __nv_bfloat16>(__nv_bfloat16 val) {
+    return __bfloat162float(val);
+}
+
+template<typename Td, typename Ts>
+__global__ void copy_and_cast_kernel(Td* dst, const Ts* src, size_t n, ptrdiff_t stride_dst, ptrdiff_t stride_src) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    // need to try grid stride looping for more perf later
+    if (idx < n) {
+        dst[idx + stride_dst * blockIdx.y] = cast_value<Td, Ts>(src[idx + stride_src * blockIdx.y]);
+    }
+}
+
+// ----------------------------------------------------------------------------
+// Warp/Block communication primitives
+
+// warp-level reduction for summing values
+__device__ inline float warpReduceSum(float val) {
+    for (int offset = 16; offset > 0; offset /= 2) {
+        val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
+    }
+    return val;
+}
+// warp-level reduction for finding the maximum value
+__device__ inline float warpReduceMax(float val) {
+    for (int offset = 16; offset > 0; offset /= 2) {
+        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, offset));
+    }
+    return val;
+}
+// requires all 32 threads in the warp to be active, but should work for any block size
+// uses non-dynamic shared memory so every call increases shared memory requirements by 128 bytes
+// the fact it's unique shared memory allows us to avoid an extra __syncthreads() call at the end
+// but if called inside a loop, the shared memory will be implicitly reused, so set final_sync to 1
+using reduction_func_t = float (*) (float);
+template<reduction_func_t warp_reduction>
+__device__ inline float blockReduce(float val, bool final_sync=false, float out_of_bounds=0.0f) {
+    // two warp-level reductions cover up to 1024 threads (32 warps of 32 lanes):
+    // 1) inside warp (shuffle), 2) cross-warp (shared memory), 3) inside warp (shuffle)
+    __shared__ float shared_val[WARP_SIZE];
+    const int lane_id = threadIdx.x % WARP_SIZE;
+    const int warp_id = threadIdx.x / WARP_SIZE;
+    const int num_warps = blockDim.x / WARP_SIZE;
+
+    float warp_val = warp_reduction(val);
+    if (lane_id == 0) { shared_val[warp_id] = warp_val; }
+    __syncthreads();
+    warp_val = (lane_id < num_warps) ? shared_val[lane_id] : out_of_bounds;
+    float block_val = warp_reduction(warp_val);
+
+    if (final_sync) {
+        __syncthreads(); // only needed in loops when effectively reusing shared memory etc.
+    }
+    return block_val;
+}
+
+// Performs a _deterministic_ sum reduction. determinism is achieved by requiring that only
+// a single block be used.
+template<class Float>
+__global__ void global_sum_single_block_kernel(float* result, const Float* values, size_t count) {
+    assert(gridDim.x == 1);     // only a single block!
+    float thread_sum = 0;
+    for(size_t index = threadIdx.x; index < count; index += blockDim.x) {
+        thread_sum += (float)values[index];
+    }
+
+    float reduction = blockReduce<warpReduceSum>(thread_sum, true);
+    if(threadIdx.x == 0) {
+        *result = reduction;
+    }
+}
+
+template<class Float>
+void global_sum_deterministic(float* result, const Float* values, int count, cudaStream_t stream) {
+    global_sum_single_block_kernel<<<1, 1024, 0, stream>>>(result, values, count);
+    cudaCheck(cudaGetLastError());
+}
+
+// ----------------------------------------------------------------------------
+// memory management
+
+// allocate memory, preferably on the device
+// returns a status code. 0 = OK, 1 = fell back to managed memory
+int cudaMallocConditionallyManaged(void** out, size_t bytes, const char *file, int line) {
+    // try to allocate
+    cudaError_t err = cudaMalloc(out, bytes);
+    if(err == cudaErrorMemoryAllocation) {
+        // if we OOM, fallback to a managed allocation. slower but at least won't crash.
+        cudaGetLastError(); // reset the error before the next API call
+        cudaCheck_(cudaMallocManaged(out, bytes), file, line);
+        cudaCheck_(cudaMemAdvise(*out, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId), file, line);
+        return 1;
+    } else {
+        cudaCheck_(err, file, line);
+        return 0;
+    }
+}
+
+#define cudaMallocConditionallyManaged(out, bytes)\
+(cudaMallocConditionallyManaged((void**)out, bytes, __FILE__, __LINE__))
+
+// ----------------------------------------------------------------------------
+// Random Number Generation used in Stochastic Rounding
+
+// SquirrelNoise5 - Squirrel's Raw Noise utilities (version 5)
+// This gives us a random number from threadIdx/blockIdx + a single seed for the entire GPU
+// todo - possibly overkill and we don't need such high quality random numbers? (tbd)
+// http://eiserloh.net/noise/SquirrelNoise5.hpp
+__device__ __host__ constexpr unsigned int SquirrelNoise5(unsigned int positionX, unsigned int seed)
+{
+    constexpr unsigned int SQ5_BIT_NOISE1 = 0xd2a80a3f;	// 11010010101010000000101000111111
+    constexpr unsigned int SQ5_BIT_NOISE2 = 0xa884f197;	// 10101000100001001111000110010111
+    constexpr unsigned int SQ5_BIT_NOISE3 = 0x6C736F4B; // 01101100011100110110111101001011
+    constexpr unsigned int SQ5_BIT_NOISE4 = 0xB79F3ABB;	// 10110111100111110011101010111011
+    constexpr unsigned int SQ5_BIT_NOISE5 = 0x1b56c4f5;	// 00011011010101101100010011110101
+    unsigned int mangledBits = positionX;
+    mangledBits *= SQ5_BIT_NOISE1;
+    mangledBits += seed;
+    mangledBits ^= (mangledBits >> 9);
+    mangledBits += SQ5_BIT_NOISE2;
+    mangledBits ^= (mangledBits >> 11);
+    mangledBits *= SQ5_BIT_NOISE3;
+    mangledBits ^= (mangledBits >> 13);
+    mangledBits += SQ5_BIT_NOISE4;
+    mangledBits ^= (mangledBits >> 15);
+    mangledBits *= SQ5_BIT_NOISE5;
+    mangledBits ^= (mangledBits >> 17);
+    return mangledBits;
+}
+__device__ __host__ constexpr unsigned int Get2dNoiseUint(int indexX, int indexY, unsigned int seed)
+{
+    constexpr unsigned int PRIME_NUMBER = 198491317u; // Large prime number with non-boring bits
+    unsigned int x = static_cast<unsigned int>(indexX);
+    unsigned int y = static_cast<unsigned int>(indexY);
+
+    return SquirrelNoise5(x + (PRIME_NUMBER * y), seed);
+}
+
+// stochastic rounding built on top of SquirrelNoise5 above (with seed updated per step via xorshift)
+__device__ __forceinline__ void stochastic_rounding(float in, __nv_bfloat16 *out, unsigned int seed) {
+    // todo - is this stochastic rounding *too good*? can we cut any corners?
+    // makes sure each thread gets a different random number
+    unsigned int random = Get2dNoiseUint(threadIdx.x, blockIdx.x * blockDim.x + blockIdx.y, seed);
+    unsigned int threshold = random & 0xFFFF;
+    unsigned int float_bits = __float_as_uint(in);
+    unsigned int rounded_bits = float_bits & 0x0000FFFF;
+    float_bits = (rounded_bits > threshold) ? (float_bits | 0xFFFF) : (float_bits  & ~0xFFFF);
+    *out = __float2bfloat16_rn(__uint_as_float(float_bits));
+}
+__device__ __forceinline__ void stochastic_rounding(float in, half *out, unsigned int random) {
+    *out = (float)in; // todo - implement this; for now a plain conversion, `random` is unused
+}
+__device__ __forceinline__ void stochastic_rounding(float in, float *out, unsigned int random) {
+    *out = in; // dummy function for when floatX is float (FP32 mode)
+}
+
+#endif
\ No newline at end of file

From b41b9892eebc894445e19b34027bf1e6a4d5b2d6 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sat, 6 Jun 2026 18:09:59 +0530
Subject: [PATCH 37/45] feat: add distributed sharded DataLoader for binary
 token files

---
 CUDA/llmcpp/dataloader.h | 496 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 496 insertions(+)
 create mode 100644 CUDA/llmcpp/dataloader.h
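
Note on the sharding arithmetic (illustrative only, not part of the
patch): each shard is a 256-int header followed by uint16_t tokens. One
"sample" is a global batch of num_processes * B * T tokens, and each rank
reads its own B * T slice plus one extra token for the final target. The
C sketch below restates that offset math with hypothetical helper names
(sketch_batch_offset_bytes, sketch_num_samples) and assumes 4-byte header
integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HEADER_INTS 256  // mirrors HEADER_SIZE in the patch

// Hypothetical helper: byte offset at which `rank` starts reading the
// global batch number `sample_idx` inside a shard file.
static int64_t sketch_batch_offset_bytes(size_t B, size_t T, int rank,
                                         int num_processes, size_t sample_idx) {
    assert(rank >= 0 && rank < num_processes);
    size_t header_bytes = SKETCH_HEADER_INTS * sizeof(int32_t);
    size_t total_batch_bytes = (size_t)num_processes * B * T * sizeof(uint16_t);
    size_t local_offset_bytes = (size_t)rank * B * T * sizeof(uint16_t);
    return (int64_t)(header_bytes + sample_idx * total_batch_bytes + local_offset_bytes);
}

// Hypothetical helper: number of complete global batches in a shard of
// `ntok` tokens. One token is held back so the last target always exists.
static size_t sketch_num_samples(int64_t ntok, size_t B, size_t T, int num_processes) {
    assert(ntok > 0);
    size_t total_batch_bytes = (size_t)num_processes * B * T * sizeof(uint16_t);
    return ((size_t)ntok * sizeof(uint16_t) - sizeof(uint16_t)) / total_batch_bytes;
}
```

With B = 4, T = 8 and two processes, a global batch is 128 bytes, so rank
1 reads sample 3 starting at byte 1024 + 3 * 128 + 64 = 1472.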

diff --git a/CUDA/llmcpp/dataloader.h b/CUDA/llmcpp/dataloader.h
new file mode 100644
index 0000000..0ee0588
--- /dev/null
+++ b/CUDA/llmcpp/dataloader.h
@@ -0,0 +1,496 @@
+#ifndef DATALOADER_H
+#define DATALOADER_H
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <assert.h>
+#include <string.h>
+#include "utils.h"
+#include "rand.h"
+#ifndef _WIN32
+#include <glob.h>
+#endif
+#define HEADER_SIZE 256
+
+typedef struct
+{
+
+    int process_rank;
+    int num_processes;
+
+    size_t B;
+    size_t T;
+    size_t num_tokens;
+    size_t shard_num_samples;
+
+    glob_t glob_result;
+    size_t current_shard_idx;
+    size_t current_sample_idx;
+
+    FILE *tokens_file;
+
+    uint16_t *buffer;
+    int *inputs;
+    int *targets;
+
+    mt19937_state shuffle_rng;
+    int should_shuffle;
+    int *shard_indices;
+    int *intra_shard_indices;
+
+    size_t total_batch_size_bytes;
+    size_t local_batch_offset_bytes;
+    size_t header_bytes;
+    int64_t file_size_bytes;
+} DataLoader;
+
+int64_t dataloader_load_shard_(DataLoader *loader, int shard_index)
+{
+    if (loader->should_shuffle)
+    {
+        shard_index = loader->shard_indices[shard_index];
+    }
+
+    const char *filename = loader->glob_result.gl_pathv[shard_index];
+
+    if (loader->tokens_file != NULL)
+    {
+        fcloseCheck(loader->tokens_file);
+    }
+    loader->tokens_file = fopenCheck(filename, "rb");
+
+    int header[HEADER_SIZE];
+    freadCheck(header, sizeof(int), HEADER_SIZE, loader->tokens_file);
+    if (header[0] != 20240520)
+    {
+        printf("Bad magic in the data file\n");
+        printf("---> HINT: Are you passing in a correct file?\n");
+        printf("---> HINT: The data encoding may have changed, re-run data prepro or refer again to README.\n");
+        exit(EXIT_FAILURE);
+    }
+    if (header[1] != 1)
+    {
+        printf("Bad version in data file\n");
+        exit(EXIT_FAILURE);
+    }
+    int64_t ntok = header[2]; // number of tokens in this shard
+    assert(ntok > 0);
+    fseekCheck(loader->tokens_file, 0, SEEK_END);
+    loader->file_size_bytes = ftell(loader->tokens_file);
+    fseekCheck(loader->tokens_file, 0, SEEK_SET);
+    int64_t expected_file_size = HEADER_SIZE * sizeof(int) + ntok * sizeof(uint16_t);
+    if (loader->file_size_bytes != expected_file_size)
+    {
+        printf("Error: file size is not as expected\n");
+        exit(EXIT_FAILURE);
+    }
+
+    loader->shard_num_samples = (ntok * sizeof(uint16_t) - sizeof(uint16_t)) / loader->total_batch_size_bytes;
+    return ntok;
+}
+
+void prepare_intra_shard_indices_(DataLoader *loader)
+{
+
+    if (loader->intra_shard_indices != NULL)
+    {
+
+        free(loader->intra_shard_indices);
+    }
+    loader->intra_shard_indices = (int *)mallocCheck(loader->shard_num_samples * sizeof(int));
+    init_identity_permutation(loader->intra_shard_indices, (int)loader->shard_num_samples);
+    random_permutation(loader->intra_shard_indices, (int)loader->shard_num_samples, &loader->shuffle_rng);
+}
+
+void dataloader_reset(DataLoader *loader)
+{
+    loader->current_shard_idx = 0;
+    loader->current_sample_idx = 0;
+
+    if (loader->should_shuffle)
+    {
+        random_permutation(loader->shard_indices, (int)loader->glob_result.gl_pathc, &loader->shuffle_rng);
+    }
+
+    dataloader_load_shard_(loader, (int)loader->current_shard_idx);
+
+    if (loader->should_shuffle)
+    {
+        prepare_intra_shard_indices_(loader);
+    }
+}
+
+void dataloader_advance_(DataLoader *loader)
+{
+    if (loader->current_shard_idx == loader->glob_result.gl_pathc - 1)
+    {
+
+        dataloader_reset(loader);
+        return;
+    }
+
+    loader->current_shard_idx = (loader->current_shard_idx + 1) % loader->glob_result.gl_pathc;
+    loader->current_sample_idx = 0;
+    dataloader_load_shard_(loader, (int)loader->current_shard_idx);
+
+    if (loader->should_shuffle)
+    {
+        prepare_intra_shard_indices_(loader);
+    }
+}
+
+void dataloader_init(DataLoader *loader,
+                     const char *filename_pattern,
+                     size_t B,
+                     size_t T,
+                     int process_rank,
+                     int num_processes,
+                     int should_shuffle)
+{
+    loader->process_rank = process_rank;
+    loader->num_processes = num_processes;
+    loader->B = B;
+    loader->T = T;
+    loader->tokens_file = NULL;
+    loader->should_shuffle = should_shuffle;
+    loader->header_bytes = HEADER_SIZE * sizeof(int);
+    loader->total_batch_size_bytes = ((loader->num_processes * (loader->B * loader->T)) * sizeof(uint16_t));
+    loader->local_batch_offset_bytes = loader->process_rank * loader->B * loader->T * sizeof(uint16_t);
+
+    int glob_status = glob(filename_pattern, 0, NULL, &loader->glob_result);
+    if (glob_status != 0)
+    {
+        printf("Error: failed to glob pattern: %s\n", filename_pattern);
+        exit(EXIT_FAILURE);
+    }
+    if (loader->glob_result.gl_pathc == 0)
+    {
+        printf("Error: no files found matching the pattern: %s\n", filename_pattern);
+        exit(EXIT_FAILURE);
+    }
+
+    if (should_shuffle)
+    {
+        mt19937_state shuffle_rng;
+        manual_seed(&shuffle_rng, 42 + process_rank);
+        loader->shuffle_rng = shuffle_rng;
+        loader->shard_indices = (int *)mallocCheck(loader->glob_result.gl_pathc * sizeof(int));
+        init_identity_permutation(loader->shard_indices, (int)loader->glob_result.gl_pathc);
+        loader->intra_shard_indices = NULL;
+    }
+
+    int64_t ntok_total = 0;
+    for (int shard_index = 0; shard_index < loader->glob_result.gl_pathc; shard_index++)
+    {
+        int64_t shard_ntok = dataloader_load_shard_(loader, shard_index);
+
+        assert(shard_ntok >= (int64_t)(num_processes * B * T + 1));
+        ntok_total += shard_ntok;
+    }
+
+    loader->buffer = (uint16_t *)mallocCheck((B * T + 1) * sizeof(uint16_t));
+    loader->inputs = (int *)mallocCheck(B * T * sizeof(int));
+    loader->targets = (int *)mallocCheck(B * T * sizeof(int));
+    loader->num_tokens = ntok_total;
+
+    dataloader_reset(loader);
+}
+
+void dataloader_load_batch(DataLoader *loader)
+{
+    assert(!loader->should_shuffle || (loader->should_shuffle && loader->intra_shard_indices != NULL));
+    assert(loader->current_sample_idx < loader->shard_num_samples);
+    size_t idx = loader->should_shuffle ? loader->intra_shard_indices[loader->current_sample_idx] : loader->current_sample_idx;
+    size_t global_batch_offset_bytes = idx * loader->total_batch_size_bytes;
+    int64_t current_offset = loader->header_bytes + global_batch_offset_bytes + loader->local_batch_offset_bytes;
+
+    size_t B = loader->B;
+    size_t T = loader->T;
+
+    fseekCheck(loader->tokens_file, (long)current_offset, SEEK_SET); // (int) would truncate offsets past 2 GiB
+    freadCheck(loader->buffer, sizeof(uint16_t), B * T + 1, loader->tokens_file);
+
+    for (int i = 0; i < B * T; i++)
+    {
+        loader->inputs[i] = (int)loader->buffer[i];
+        loader->targets[i] = (int)loader->buffer[i + 1];
+    }
+}
+
+void dataloader_next_batch(DataLoader *loader)
+{
+
+    if (loader->current_sample_idx >= loader->shard_num_samples)
+    {
+        dataloader_advance_(loader);
+    }
+    dataloader_load_batch(loader);
+    loader->current_sample_idx += 1;
+}
+
+void dataloader_resume(DataLoader *loader, size_t current_shard_idx, size_t current_sample_idx)
+{
+
+    loader->current_shard_idx = current_shard_idx;
+    loader->current_sample_idx = current_sample_idx;
+    dataloader_load_shard_(loader, (int)loader->current_shard_idx);
+}
+
+void dataloader_free(DataLoader *loader)
+{
+    free(loader->buffer);
+    free(loader->inputs);
+    free(loader->targets);
+    if (loader->should_shuffle)
+    {
+        free(loader->shard_indices);
+        free(loader->intra_shard_indices);
+    }
+    fcloseCheck(loader->tokens_file);
+    globfree(&loader->glob_result);
+}
+
+#define ASSUMED_NUM_COMPLETIONS 4
+#define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))
+
+typedef struct
+{
+
+    int process_rank;
+    int num_processes;
+
+    size_t B;
+    size_t T;
+    FILE *eval_file;
+    uint16_t *buffer;
+    int num_examples;
+    int num_batches;
+    int start_example_index;
+    int end_example_index;
+    int current_example_index;
+    int *inputs;
+    int *targets;
+    char *mask;
+    int *label;
+    int num_completions;
+} EvalLoader;
+
+void evalloader_reset(EvalLoader *loader)
+{
+    int examples_per_process = CEIL_DIV(loader->num_examples, loader->num_processes);
+    int can_fit_examples = (int)(loader->B / ASSUMED_NUM_COMPLETIONS);
+    if (can_fit_examples == 0)
+    {
+
+        printf("HellaSwag EvalLoader: batch size %zu is < %d\n", loader->B, ASSUMED_NUM_COMPLETIONS);
+        printf("---> HINT: Disable HellaSwag eval with -h 0, or increase batch size with -b\n");
+        exit(EXIT_FAILURE);
+    }
+    loader->num_batches = CEIL_DIV(examples_per_process, can_fit_examples);
+
+    loader->start_example_index = examples_per_process * loader->process_rank;
+    loader->end_example_index = examples_per_process * (loader->process_rank + 1);
+
+    if (loader->end_example_index > loader->num_examples)
+    {
+        loader->end_example_index = loader->num_examples;
+    }
+
+    int64_t header_bytes = HEADER_SIZE * sizeof(int);
+    fseekCheck(loader->eval_file, (int)header_bytes, SEEK_SET);
+    for (int i = 0; i < loader->start_example_index; i++)
+    {
+        uint16_t example_header[3];
+        // read 3 uint16_t values: <START_EXAMPLE>, <EXAMPLE_BYTES>, <EXAMPLE_INDEX>
+        freadCheck(&example_header[0], sizeof(uint16_t), 3, loader->eval_file);
+        // validate the <START_EXAMPLE> delimiter
+        assert(example_header[0] == 65535); // <START_EXAMPLE> delimiter
+        // validate the <EXAMPLE_INDEX>
+        assert(example_header[2] == i); // <EXAMPLE_INDEX> should match the loop index
+        // skip to the next example, keeping in mind that we already read the header
+        size_t remaining_bytes = example_header[1] - sizeof(uint16_t) * 3;
+        assert(remaining_bytes > 0); // we expect some bytes in the example
+        fseekCheck(loader->eval_file, (int)remaining_bytes, SEEK_CUR);
+    }
+    // now we are at the start of the example we want to start at, pointing at <START_EXAMPLE>
+    loader->current_example_index = loader->start_example_index;
+}
+
+void evalloader_init(EvalLoader *loader,
+                     const char *filename,
+                     size_t B,
+                     size_t T,
+                     int process_rank,
+                     int num_processes)
+{
+    loader->process_rank = process_rank;
+    loader->num_processes = num_processes;
+    loader->B = B;
+    loader->T = T;
+
+    // open the file and validate the header
+    loader->eval_file = fopenCheck(filename, "rb");
+    // validate the header
+    int header[HEADER_SIZE];
+    freadCheck(header, sizeof(int), HEADER_SIZE, loader->eval_file);
+    if (header[0] != 20240522)
+    {
+        printf("Bad magic in eval file\n");
+        exit(EXIT_FAILURE);
+    }
+    if (header[1] != 1)
+    {
+        printf("Bad version in data file\n");
+        exit(EXIT_FAILURE);
+    }
+    loader->num_examples = header[2];              // number of examples in the file
+    assert(loader->num_examples >= num_processes); // avoid headaches for now
+    size_t longest_example_bytes = header[3];      // longest example in the file
+    // basic sensibility check we could relax later. but roughly each example
+    // contains the prompt (or "context") and 4 completions, all of these have to be
+    // up to T tokens, and their tokens are uint16_t (so 2 bytes/token).
+    // There's a few more things in each example but they are minor.
+    // So longest example should be roughly this. Just trying to make sure it's sensible.
+    assert(longest_example_bytes > 0 && longest_example_bytes < (1 + ASSUMED_NUM_COMPLETIONS) * T * 2);
+
+    // allocate all the space we'll need
+    int can_fit_examples = (int)(B / ASSUMED_NUM_COMPLETIONS);
+    loader->buffer = (uint16_t *)mallocCheck(longest_example_bytes);
+    loader->inputs = (int *)calloc(B * T, sizeof(int));
+    loader->targets = (int *)calloc(B * T, sizeof(int));
+    loader->mask = (char *)mallocCheck(B * T * sizeof(char));
+    loader->label = (int *)mallocCheck(can_fit_examples * sizeof(int));
+
+    // reset the loader, to initialize it
+    evalloader_reset(loader);
+}
+
+void evalloader_next_example_(EvalLoader *loader, int example_batch_index)
+{
+    size_t B = loader->B;
+    size_t T = loader->T;
+    int batch_dim_offset = example_batch_index * ASSUMED_NUM_COMPLETIONS;
+    uint16_t example_header[3];
+    freadCheck(&example_header[0], sizeof(uint16_t), 3, loader->eval_file);
+    assert(example_header[0] == 65535);
+    assert(example_header[2] == loader->current_example_index);
+    assert(example_header[2] >= loader->start_example_index && example_header[2] < loader->end_example_index);
+
+    size_t example_bytes = example_header[1] - sizeof(uint16_t) * 3;
+    freadCheck(loader->buffer, sizeof(char), example_bytes, loader->eval_file);
+    int label = (int)loader->buffer[0];
+    int can_fit_examples = (int)(loader->B / ASSUMED_NUM_COMPLETIONS);
+    assert(label >= 0 && label < ASSUMED_NUM_COMPLETIONS);
+    assert(example_batch_index >= 0 && example_batch_index < can_fit_examples);
+    loader->label[example_batch_index] = label;
+    int num_completions = (int)loader->buffer[1];
+    assert(num_completions == ASSUMED_NUM_COMPLETIONS);
+    assert(batch_dim_offset + num_completions <= B);
+    loader->num_completions = num_completions;
+
+    int context_length = (int)loader->buffer[2];
+    uint16_t *context_tokens_start = &loader->buffer[3];
+    assert(context_length > 0 && context_length < T);
+    for (int b = 0; b < num_completions; b++)
+    {
+        for (int i = 0; i < context_length; i++)
+        {
+            int boff = batch_dim_offset + b;
+            int tok_cur = (int)context_tokens_start[i];
+            loader->inputs[boff * T + i] = tok_cur;
+        }
+    }
+    uint16_t *completions_iter = loader->buffer + 3 + context_length;
+    for (int c = 0; c < num_completions; c++)
+    {
+        int coff = batch_dim_offset + c;
+        int completion_length = (int)completions_iter[0];
+        uint16_t *completion_tokens_start = completions_iter + 1;
+        assert(completion_length > 0 && context_length + completion_length < T);
+        for (int i = 0; i < completion_length; i++)
+        {
+            int tok_cur = (int)completion_tokens_start[i];
+            loader->inputs[coff * T + context_length + i] = tok_cur;
+            loader->targets[coff * T + context_length + i - 1] = tok_cur;
+            loader->mask[coff * T + context_length + i - 1] = 1;
+        }
+        completions_iter += 1 + completion_length;
+    }
+
+    // advance once per example, not once per completion
+    loader->current_example_index += 1;
+}
+
+void evalloader_next_batch(EvalLoader *loader)
+{
+    size_t B = loader->B;
+    size_t T = loader->T;
+    memset(loader->mask, 0, B * T * sizeof(char));
+    int can_fit_examples = (int)(B / ASSUMED_NUM_COMPLETIONS);
+    for (int i = 0; i < can_fit_examples; i++)
+    {
+        if (loader->current_example_index >= loader->end_example_index)
+        {
+            break;
+        }
+        evalloader_next_example_(loader, i);
+    }
+}
+
+int evalloader_stat_losses(EvalLoader *loader, float *losses)
+{
+    int correct = 0;
+    size_t B = loader->B;
+    size_t T = loader->T;
+    int can_fit_examples = (int)(B / ASSUMED_NUM_COMPLETIONS);
+    for (int i = 0; i < can_fit_examples; i++)
+    {
+        float min_loss = 0.0f;
+        int min_loss_index = -1;
+        char active = 0;
+        for (int b = 0; b < ASSUMED_NUM_COMPLETIONS; b++)
+        {
+            int boff = i * ASSUMED_NUM_COMPLETIONS + b;
+            float average_loss = 0.0f;
+            int count = 0;
+            for (int t = 0; t < T; t++)
+            {
+                char mask = loader->mask[boff * T + t];
+                if (mask == 1)
+                {
+                    active = 1;
+                    average_loss += losses[boff * T + t];
+                    count++;
+                }
+            }
+            if (count > 0)
+            {
+                average_loss /= count;
+            }
+            if (b == 0 || average_loss < min_loss)
+            {
+                min_loss = average_loss;
+                min_loss_index = b;
+            }
+        }
+        if (active && (min_loss_index == loader->label[i]))
+        {
+            correct += 1;
+        }
+    }
+    return correct;
+}
+
+void evalloader_free(EvalLoader *loader)
+{
+    free(loader->buffer);
+    free(loader->inputs);
+    free(loader->targets);
+    free(loader->mask);
+    free(loader->label);
+    fcloseCheck(loader->eval_file);
+}
+
+#endif
\ No newline at end of file

From 58ab6040f8ecb8d7ba7111971ec88a7bec16379b Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sun, 7 Jun 2026 17:47:09 +0530
Subject: [PATCH 38/45] feat(multi-gpu): add foundational utilities for ZeRO
 sharding

---
 CUDA/llmcpp/zero.cuh | 597 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 597 insertions(+)
 create mode 100644 CUDA/llmcpp/zero.cuh
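
Note on shard_num_parameters (illustrative only, not part of the patch):
with ZeRO stage 1 each process owns the optimizer state for one
contiguous slice of the flat parameter buffer, and with stage 0 every
process keeps the whole buffer. The C sketch below shows one plausible
way to derive that slice. SketchShard and sketch_shard_for_rank are
hypothetical names, and the even-division requirement is an assumption
of this sketch rather than something the header states.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical description of one rank's slice of a flat parameter buffer.
typedef struct {
    size_t offset;  // index of the first element owned by this rank
    size_t size;    // number of elements owned by this rank
} SketchShard;

// Hypothetical helper: which elements does `rank` own for a given ZeRO stage?
static SketchShard sketch_shard_for_rank(size_t num_elements, int rank,
                                         int num_processes, int zero_stage) {
    assert(num_processes > 0 && rank >= 0 && rank < num_processes);
    SketchShard shard = {0, num_elements};  // stage 0: every rank keeps everything
    if (zero_stage >= 1) {
        // assumption of this sketch: the buffer divides evenly across ranks
        assert(num_elements % (size_t)num_processes == 0);
        shard.size = num_elements / (size_t)num_processes;
        shard.offset = (size_t)rank * shard.size;
    }
    return shard;
}
```

For 1024 elements on 4 processes at stage 1, rank 3 would own elements
768 through 1023.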

diff --git a/CUDA/llmcpp/zero.cuh b/CUDA/llmcpp/zero.cuh
new file mode 100644
index 0000000..e6c5b6e
--- /dev/null
+++ b/CUDA/llmcpp/zero.cuh
@@ -0,0 +1,597 @@
+/*
+Utilities for ZeRO sharding
+*/
+
+#ifndef LLMC_ZERO_CUH
+#define LLMC_ZERO_CUH
+
+#include <cuda_runtime_api.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stddef.h>
+
+#ifdef MULTI_GPU
+#include <nccl.h>
+#ifdef USE_MPI
+#include <mpi.h>
+#endif
+#endif
+
+// defines: fcloseCheck, fwriteCheck, scloseCheck, closesocketCheck
+#include "utils.h"
+
+// ----------------------------------------------------------------------------
+// Multi-GPU related
+#ifdef MULTI_GPU
+
+#if defined(ENABLE_FP32)
+const ncclDataType_t ncclFloatX = ncclFloat;
+#elif defined(ENABLE_FP16)
+const ncclDataType_t ncclFloatX = ncclHalf;
+#else // Default to bfloat16
+const ncclDataType_t ncclFloatX = ncclBfloat16;
+#endif
+
+void nccl_check(ncclResult_t status, const char *file, int line) {
+    if (status != ncclSuccess) {
+        printf("[NCCL ERROR] at file %s:%d:\n%s\n", file, line, ncclGetErrorString(status));
+        exit(EXIT_FAILURE);
+    }
+}
+#define ncclCheck(err) (nccl_check(err, __FILE__, __LINE__))
+
+#ifdef USE_MPI
+void mpi_check(int status, const char *file, int line) {
+    if (status != MPI_SUCCESS) {
+        char mpi_error[4096];
+        int mpi_error_len = 0;
+        if (MPI_Error_string(status, &mpi_error[0], &mpi_error_len) != MPI_SUCCESS) { mpi_error_len = 0; } // not inside assert(): NDEBUG would drop the call
+        printf("[MPI ERROR] at file %s:%d:\n%.*s\n", file, line, mpi_error_len, mpi_error);
+        exit(EXIT_FAILURE);
+    }
+}
+#define mpiCheck(err) (mpi_check(err, __FILE__, __LINE__))
+#endif
+
+#endif // MULTI_GPU
+
+// ----------------------------------------------------------------------------
+// Parameters specific to training on multiple GPUs.
+typedef struct {
+    int process_rank;      // Rank of this process among all processes. 0 if no multi-GPU.
+    int num_processes;     // Total number of processes. 1 if no multi-GPU.
+    int local_device_idx;  // This process's GPU index on the current machine. 0 if no multi-GPU.
+
+    // Zero Redundancy Optimizer stage - https://fairscale.readthedocs.io/en/stable/deep_dive/oss_sdp_fsdp.html
+    // 0-Disabled
+    // 1-Optimizer State Sharding (OSS)
+    // 2-Optimizer + Gradient State Sharding (SDP)
+    // 3-Optimizer + Gradient + Horizontal Model Sharding (FSDP)
+    int zero_stage;
+    size_t shard_num_parameters;
+#ifdef MULTI_GPU
+    ncclComm_t nccl_comm;       // NCCL communication primitive, used for collective multi-GPU work.
+    cudaStream_t nccl_stream;   // CUDA Stream to perform NCCL operations.
+    cudaEvent_t compute_nccl_sync; // Event used to synchronize NCCL with the compute
+    float* unified_buffer;
+#endif
+} MultiGpuConfig;
+
+// one global variable to hold the multi-GPU configuration for this process
+// inline, so we can include this header multiple times without getting multiple definitions
+inline MultiGpuConfig multi_gpu_config;
+
+#ifdef MULTI_GPU
+
+#ifdef _WIN32
+void send_nccl_id_to_clients_windows(ncclUniqueId *nccl_id, SOCKET client_sockets[], int num_clients) {
+    for (int i = 0; i < num_clients; ++i) {
+        if (send(client_sockets[i], (const char *)nccl_id, sizeof(*nccl_id), 0) == SOCKET_ERROR) {
+            printf("Failed to send nccl_id");
+            WSACleanup();
+            exit(EXIT_FAILURE);
+        }
+        closesocketCheck(client_sockets[i]);
+    }
+}
+#else
+void send_nccl_id_to_clients(ncclUniqueId *nccl_id, int client_sockets[], int num_clients) {
+    for (int i = 0; i < num_clients; ++i) {
+        if (send(client_sockets[i], nccl_id, sizeof(*nccl_id), 0) == -1) {
+            printf("Failed to send nccl_id");
+            exit(EXIT_FAILURE);
+        }
+        scloseCheck(client_sockets[i]);
+    }
+}
+#endif
+
+#ifdef _WIN32
+// Same as get_nccl_id_via_tcp but for Windows
+ncclUniqueId get_nccl_id_via_tcp_windows(MultiGpuConfig* result, const char* server_ip) {
+    ncclUniqueId nccl_id;
+
+    int SERVER_PORT = 12345;  // hardcoded an arbitrary port number between 1024 and 49151 (registered ports)
+    WSADATA wsaData;
+    if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0) {
+        printf("WSAStartup failed");
+        exit(EXIT_FAILURE);
+    }
+
+    if (result->process_rank == 0) {
+        ncclCheck(ncclGetUniqueId(&nccl_id));
+
+        int MAX_CLIENTS = result->num_processes - 1;
+        SOCKET client_sockets[MAX_CLIENTS];
+        int num_clients = 0;
+        SOCKET server_socket, new_socket;
+        struct sockaddr_in address;
+        int addrlen = sizeof(address);
+
+        // Step 1) create a server TCP socket
+        if ((server_socket = socket(AF_INET, SOCK_STREAM, 0)) == INVALID_SOCKET) {
+            printf("Socket failed");
+            WSACleanup();
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 2) set the server address and port
+        address.sin_family = AF_INET;  // IPv4
+        address.sin_addr.s_addr = inet_addr(server_ip);
+        address.sin_port = htons(SERVER_PORT);
+
+        // Step 3) bind the socket to the address and port
+        if (bind(server_socket, (struct sockaddr *)&address, sizeof(address)) == SOCKET_ERROR) {
+            printf("Bind failed");
+            closesocketCheck(server_socket);
+            WSACleanup();
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 4) MAX_CLIENTS specifies the maximum number of clients that can be queued for this server
+        if (listen(server_socket, MAX_CLIENTS) == SOCKET_ERROR) {
+            printf("Listen failed");
+            closesocketCheck(server_socket);
+            WSACleanup();
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 5) accept connections from clients
+        printf("Waiting for clients to connect...\n");
+        while (num_clients < MAX_CLIENTS) {
+            if ((new_socket = accept(server_socket, (struct sockaddr *)&address, &addrlen)) == INVALID_SOCKET) {
+                printf("Accept failed");
+                closesocketCheck(server_socket);
+                WSACleanup();
+                exit(EXIT_FAILURE);
+            }
+            client_sockets[num_clients++] = new_socket;
+            printf("Client %d connected\n", num_clients);
+        }
+
+        // Step 6) send the NCCL ID to all clients
+        send_nccl_id_to_clients_windows(&nccl_id, client_sockets, num_clients);
+        printf("NCCL ID sent to all clients\n");
+
+        closesocketCheck(server_socket);
+    } else {
+        int num_connection_attempts = 5;
+        int time_to_sleep = 2;
+        SOCKET client_socket;
+        struct sockaddr_in serv_addr;
+
+        // Step 1) create a client TCP socket
+        if ((client_socket = socket(AF_INET, SOCK_STREAM, 0)) == INVALID_SOCKET) {
+            printf("Socket creation error");
+            WSACleanup();
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 2) set the server address and port
+        serv_addr.sin_family = AF_INET;
+        serv_addr.sin_port = htons(SERVER_PORT);
+        if (inet_pton(AF_INET, server_ip, &serv_addr.sin_addr) <= 0) {
+            printf("Invalid address or address not supported");
+            closesocketCheck(client_socket);
+            WSACleanup();
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 3) Try to connect to the server - retry up to `num_connection_attempts` times if the connection fails
+        while (connect(client_socket, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) == SOCKET_ERROR) {
+            printf("%d Connection failed, retrying in %d seconds\n", result->process_rank, time_to_sleep);
+            if (--num_connection_attempts == 0) {
+                printf("Failed to connect to the server\n");
+                closesocketCheck(client_socket);
+                WSACleanup();
+                exit(EXIT_FAILURE);
+            }
+            Sleep(time_to_sleep * 1000);
+        }
+
+        // Step 4) receive the NCCL ID from the server
+        if (recv(client_socket, (char *)&nccl_id, sizeof(nccl_id), 0) <= 0) {
+            printf("Failed to receive nccl_id");
+            closesocketCheck(client_socket);
+            WSACleanup();
+            exit(EXIT_FAILURE);
+        }
+
+        printf("Received NCCL ID\n");
+        closesocketCheck(client_socket);
+    }
+
+    WSACleanup();
+    return nccl_id;
+}
+#else
+ncclUniqueId get_nccl_id_via_tcp(MultiGpuConfig* result, const char* server_ip) {
+    ncclUniqueId nccl_id;
+
+    int SERVER_PORT = 12345;  // hardcoded an arbitrary port number between 1024 and 49151 (registered ports)
+    if (result->process_rank == 0) {
+        ncclCheck(ncclGetUniqueId(&nccl_id));
+
+        int MAX_CLIENTS = result->num_processes - 1;
+        int client_sockets[MAX_CLIENTS];
+        int num_clients = 0;
+        int server_socket, new_socket;
+        struct sockaddr_in address;
+        int addrlen = sizeof(address);
+        int opt = 1;
+
+        // Step 1) create a server TCP socket
+        if ((server_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
+            printf("Socket failed");
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 2) set socket options
+        // SOL_SOCKET - the option is configured at the socket level
+        // SO_REUSEADDR - allows binding to an address that is still in TIME_WAIT from a previous socket - useful when restarting the server
+        // SO_REUSEPORT - allows binding to the same port multiple times; option names are not bit flags, so each needs its own setsockopt call
+        if (setsockopt(server_socket, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) < 0 || setsockopt(server_socket, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt)) < 0) {
+            printf("Setsockopt failed");
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 3) set the server address and port
+        address.sin_family = AF_INET;  // IPv4
+        address.sin_addr.s_addr = inet_addr(server_ip); // alternatively use INADDR_ANY to bind to all interfaces, currently we only allow ethernet
+        address.sin_port = htons(SERVER_PORT);
+
+        // Step 4) bind the socket to the address and port
+        if (bind(server_socket, (struct sockaddr *)&address, sizeof(address)) < 0) {
+            printf("Bind failed");
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 5) MAX_CLIENTS specifies the maximum number of clients that can be queued for this server
+        if (listen(server_socket, MAX_CLIENTS) < 0) {
+            printf("Listen failed");
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 6) accept connections from clients
+        printf("Waiting for clients to connect...\n");
+        while (num_clients < MAX_CLIENTS) {
+            if ((new_socket = accept(server_socket, (struct sockaddr *)&address, (socklen_t*)&addrlen)) < 0) {
+                printf("Accept failed");
+                exit(EXIT_FAILURE);
+            }
+            client_sockets[num_clients++] = new_socket;
+            printf("Client %d connected\n", num_clients);
+        }
+
+        // Step 7) send the NCCL ID to all clients
+        send_nccl_id_to_clients(&nccl_id, client_sockets, num_clients);
+        printf("NCCL ID sent to all clients\n");
+
+        scloseCheck(server_socket);
+    } else {
+        int num_connection_attempts = 5;
+        int time_to_sleep = 2;
+        int client_socket;
+        struct sockaddr_in serv_addr;
+
+        // Step 1) create a client TCP socket
+        if ((client_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
+            printf("Socket creation error");
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 2) set the server address and port
+        serv_addr.sin_family = AF_INET;
+        serv_addr.sin_port = htons(SERVER_PORT);
+        if (inet_pton(AF_INET, server_ip, &serv_addr.sin_addr) <= 0) {
+            printf("Invalid address or address not supported");
+            exit(EXIT_FAILURE);
+        }
+
+        // Step 3) Try to connect to the server - retry up to `num_connection_attempts` times if the connection fails
+        while (connect(client_socket, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
+            printf("%d Connection failed, retrying in %d seconds\n", result->process_rank, time_to_sleep);
+            if (--num_connection_attempts == 0) {
+                printf("Failed to connect to the server\n");
+                exit(EXIT_FAILURE);
+            }
+            sleep(time_to_sleep);
+        }
+
+        // Step 4) receive the NCCL ID from the server
+        if (recv(client_socket, &nccl_id, sizeof(nccl_id), 0) <= 0) {
+            printf("Failed to receive nccl_id");
+            exit(EXIT_FAILURE);
+        }
+
+        printf("Received NCCL ID\n");
+        scloseCheck(client_socket);
+    }
+
+    return nccl_id;
+}
+#endif
+
+ncclUniqueId get_nccl_id_via_fs(MultiGpuConfig* result, char* fs_path) {
+    // Works assuming that the filesystem is shared among all processes
+    ncclUniqueId nccl_id;
+    FILE* idFile;
+    static char filename[1024];
+    snprintf(filename, sizeof(filename), "%s/ncclUniqueId.sync", fs_path);
+
+    if (result->process_rank != 0) {  // client processes should wait for the server to write to the file
+        // This is a naive and not 100% robust way to synchronize the processes but it should work almost always
+        sleep(2);
+    }
+
+    if (result->process_rank == 0) {
+        ncclCheck(ncclGetUniqueId(&nccl_id));
+        idFile = fopen(filename, "wb");
+        assert(idFile != NULL);
+        fwriteCheck(&nccl_id, sizeof(nccl_id), 1, idFile);
+        fcloseCheck(idFile);
+    } else {
+        // Other ranks wait until the file is available and read the unique ID
+        do {
+            sleep(1);  // 1 second
+            idFile = fopen(filename, "rb");
+            if (idFile != NULL) break;
+        } while (idFile == NULL);
+        freadCheck(&nccl_id, sizeof(nccl_id), 1, idFile);
+        fcloseCheck(idFile);
+    }
+
+    return nccl_id;
+}
+
+#ifdef USE_MPI
+// Determine which GPU this process should use.
+// Processes on the same machine use different GPU indices. Processes on different machines may reuse the same index.
+// Copied from NCCL examples: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-2-one-device-per-process-or-thread
+int multi_gpu_get_local_device_idx(int process_rank, int num_processes) {
+    char hostname[1024];
+    hostname[1023] = '\0';
+    // All processes on the same machine will share the same hostname.
+    gethostname(hostname, 1023);
+    for (int i=0; i < 1024; i++) {
+        if (hostname[i] == '.') {
+            hostname[i] = '\0';
+            break;
+        }
+    }
+    uint64_t hostname_hash = 5381u;
+    for (int c = 0; hostname[c] != '\0'; c++){ hostname_hash = ((hostname_hash << 5u) + hostname_hash) ^ hostname[c]; }
+
+    // Distribute all hostname hashes to all processes.
+    uint64_t* all_hostsname_hashes = (uint64_t*)malloc(num_processes * sizeof(uint64_t));
+    all_hostsname_hashes[process_rank] = hostname_hash;
+    mpiCheck(MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, all_hostsname_hashes, sizeof(uint64_t), MPI_BYTE, MPI_COMM_WORLD));
+
+    // Identify which GPU we need to use.
+    int local_device_idx = 0;
+    for (int current_process = 0; current_process < num_processes; ++current_process) {
+        if (current_process == process_rank) {
+            // Found my gpu, local_device_idx now has my target GPU index.
+            break;
+        }
+        if (all_hostsname_hashes[current_process] == all_hostsname_hashes[process_rank]) {
+            // This process runs on the same machine, but it's not me, skip this GPU
+            local_device_idx++;
+        }
+    }
+
+    free(all_hostsname_hashes);
+    return local_device_idx;
+}
+#endif
+
+#endif
+
+MultiGpuConfig multi_gpu_config_init(int num_processes, int process_rank, int gpus_per_node, char* server_ip, char* fs_path, char* init_method) {
+#ifdef MULTI_GPU
+    MultiGpuConfig result;
+    ncclUniqueId nccl_id;
+    // Get nccl_id using MPI, TCP, or FS (file system synchronization) methods
+    // On newer slurm versions (slurm-wlm package) PMIx is disabled, so we cannot use MPI for NCCL init in a multi-node setup
+    if (strcmp(init_method, "mpi") == 0) {
+        #ifdef USE_MPI
+        mpiCheck(MPI_Init(NULL, NULL));
+        mpiCheck(MPI_Comm_rank(MPI_COMM_WORLD, &result.process_rank));
+        mpiCheck(MPI_Comm_size(MPI_COMM_WORLD, &result.num_processes));
+        result.local_device_idx = multi_gpu_get_local_device_idx(result.process_rank, result.num_processes);
+        if (result.process_rank == 0) {
+            ncclCheck(ncclGetUniqueId(&nccl_id));
+        }
+        mpiCheck(MPI_Bcast(&nccl_id, sizeof(nccl_id), MPI_BYTE, 0, MPI_COMM_WORLD));
+        #else
+        printf("MPI support is disabled. Please enable MPI support to use MPI-based NCCL-init method.\n");
+        exit(EXIT_FAILURE);
+        #endif
+    } else {
+        result.process_rank = process_rank;
+        result.num_processes = num_processes;
+        result.local_device_idx = process_rank % gpus_per_node;
+        if (strcmp(init_method, "tcp") == 0) {
+            #ifdef _WIN32
+            nccl_id = get_nccl_id_via_tcp_windows(&result, server_ip);
+            #else
+            nccl_id = get_nccl_id_via_tcp(&result, server_ip);
+            #endif
+        } else if (strcmp(init_method, "fs") == 0) {
+            nccl_id = get_nccl_id_via_fs(&result, fs_path);
+        } else {
+            printf("Invalid NCCL-init method\n");
+            exit(EXIT_FAILURE);
+        }
+    }
+    cudaCheck(cudaSetDevice(result.local_device_idx));
+    ncclCheck(ncclCommInitRank(&result.nccl_comm, result.num_processes, nccl_id, result.process_rank));
+    cudaCheck(cudaStreamCreate(&result.nccl_stream));
+    // event without timing for maximum performance
+    cudaCheck(cudaEventCreate(&result.compute_nccl_sync, cudaEventDisableTiming));
+    nvtxNameCudaStreamA(result.nccl_stream, "nccl stream");
+    nvtxNameCudaEventA(result.compute_nccl_sync, "nccl compute sync");
+    cudaCheck(cudaMallocManaged(&result.unified_buffer, sizeof(float)));
+    return result;
+#else
+    printf("Multi-GPU support is disabled. Using a single GPU.\n");
+    cudaCheck(cudaSetDevice(0));
+    MultiGpuConfig result;
+    result.process_rank = 0;
+    result.num_processes = 1;
+    result.local_device_idx = 0;
+    return result;
+#endif
+}
+
+void multi_gpu_config_free(MultiGpuConfig* config) {
+#ifdef MULTI_GPU
+    ncclCheck(ncclCommDestroy(config->nccl_comm));
+    cudaCheck(cudaStreamDestroy(config->nccl_stream));
+    cudaCheck(cudaEventDestroy(config->compute_nccl_sync));
+    cudaCheck(cudaFree(config->unified_buffer));
+    #ifdef USE_MPI
+    mpiCheck(MPI_Finalize());
+    #endif
+#endif
+}
+
+void multi_gpu_barrier(const MultiGpuConfig* config) {
+#ifdef MULTI_GPU
+    if (config->num_processes > 1) {
+        ncclCheck(ncclAllReduce(config->unified_buffer, config->unified_buffer, 1, ncclFloat, ncclSum, config->nccl_comm, config->nccl_stream)); // count is in elements, not bytes
+    }
+    cudaCheck(cudaDeviceSynchronize());
+#endif
+}
+
+// Offset and size of a tensor shard
+typedef struct {
+    ptrdiff_t offset;
+    size_t size;
+} ShardInfo;
+
+// Get the shard offset and size for a tensor with `elements` entries
+ShardInfo multi_gpu_get_shard_offset(size_t elements, const MultiGpuConfig* config, int shard_at_stage) {
+    const int nproc = config->num_processes;
+    if(config->zero_stage >= shard_at_stage) {
+        if (elements % nproc != 0) {
+            fprintf(stderr, "Number of elements %zu must be a multiple of the number of processes %d\n", elements, nproc);
+            exit(EXIT_FAILURE);
+        }
+        return {(ptrdiff_t) (config->process_rank * (elements / nproc)), elements / nproc};
+    } else {
+        return {0, elements};
+    }
+}
+
+// Block NCCL stream until computations on compute_stream are done, then aggregate multiple pointers in an NCCL group.
+// This can work either as an all-reduce (i.e., no ZeRO), or a reduce-scatter (ZeRO 1).
+// The awkward `(&pointers)[N]` syntax ensures we are capturing the parameters as sized arrays, so that it becomes impossible
+// to call this function if pointers and pointers_sizes do not match.
+template<int N>
+void multi_gpu_async_reduce_gradient(
+        floatX* const (&pointers)[N], const size_t (&pointers_sizes)[N],
+        MultiGpuConfig* config, cudaStream_t compute_stream) {
+    if (config->num_processes == 1) {
+        return; // no multi-GPU, just exit.
+    }
+
+#ifdef MULTI_GPU
+    NVTX_RANGE_FN();
+    // mark an event on the compute stream, and immediately wait on this in the nccl stream
+    // this means that the nccl stream won't start executing before all compute kernels that
+    // have been submitted before this point have finished.
+    // by using an event instead of cudaStreamSynchronize, we avoid having to synchronize the host, and
+    // can enqueue new work to the GPU right away.
+    cudaCheck(cudaEventRecord(config->compute_nccl_sync, compute_stream));
+    cudaCheck(cudaStreamWaitEvent(config->nccl_stream, config->compute_nccl_sync));
+    ncclCheck(ncclGroupStart()); // NCCL group: aggregate all pointers in a single NCCL GPU kernel.
+    for (int i = 0; i < N; ++i) {
+        if(config->zero_stage == 0) {
+            ncclCheck(ncclAllReduce(
+                    pointers[i], pointers[i],
+                    pointers_sizes[i],
+                    ncclFloatX, ncclAvg,
+                    config->nccl_comm, config->nccl_stream
+            ));
+        } else if(config->zero_stage == 1) {
+            assert(pointers_sizes[i] % config->num_processes == 0);
+            size_t shard_size = pointers_sizes[i] / config->num_processes;
+            ptrdiff_t shard_offset = (ptrdiff_t)shard_size * config->process_rank;
+            ncclCheck(ncclReduceScatter(
+                    pointers[i], pointers[i] + shard_offset,
+                    shard_size,
+                    ncclFloatX, ncclAvg,
+                    config->nccl_comm, config->nccl_stream
+            ));
+        }
+    }
+    ncclCheck(ncclGroupEnd());
+#endif
+}
+
+// convenience macro that only prints if the rank of process is zero
+#define printf0(...) if (::multi_gpu_config.process_rank == 0) { printf(__VA_ARGS__); }
+
+void set_zero_configs(MultiGpuConfig* config, int zero_stage, size_t total_parameters) {
+    config->zero_stage = 0;
+    config->shard_num_parameters = total_parameters;
+    // Check the Zero Stage and define sharding parameters
+    if (zero_stage == 0) {
+        printf0("| Zero Optimization is disabled                                              |\n");
+    }
+    else if (zero_stage == 1) {
+        if (total_parameters % config->num_processes != 0) {
+            printf0("| Zero Optimization is disabled, Can't equally partition parameters          |\n");
+            config->zero_stage = 0;
+        }
+        else {
+            config->zero_stage = 1;
+            config->shard_num_parameters = total_parameters / config->num_processes;
+        }
+    }
+    else{
+        printf0("| Disabling Zero Optimization, Zero Stage2 and Stage3 are not yet supported  |\n");
+        config->zero_stage = 0;
+    }
+}
+
+// Compute sum of a single CPU value across all GPU processes. No-op when multi-GPU is disabled.
+float multi_gpu_cpu_float_sum(float value, MultiGpuConfig* config) {
+#ifdef MULTI_GPU
+    if (config->num_processes == 1) return value;
+
+    float* unified_buffer = config->unified_buffer;
+    *unified_buffer = value;
+    ncclCheck(ncclAllReduce(unified_buffer, unified_buffer, 1, ncclFloat, ncclSum, config->nccl_comm, config->nccl_stream)); // count is in elements, not bytes
+    cudaCheck(cudaDeviceSynchronize());
+    return *unified_buffer;
+#else
+    return value;
+#endif
+}
+
+#endif
+

From b91f867b826f8b74772d7f7c236a544f84e0c083 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sun, 7 Jun 2026 17:49:34 +0530
Subject: [PATCH 39/45] feat(utils): add I/O and memory error-checking
 wrappers

---
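Reviewer note: the forward kernels in this patch compute, per row, the mean,
the reciprocal standard deviation (rstd, with eps = 1e-5), and then the scaled
and shifted output. A minimal host-side sketch of that math is given below as a
possible reference for checking kernel output. It is not part of the patch; the
function name layernorm_forward_cpu and the use of std::vector<float> are
assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not part of the patch): LayerNorm over N rows of C channels.
// out[n][c] = (inp[n][c] - mean[n]) * rstd[n] * weight[c] + bias[c]
void layernorm_forward_cpu(std::vector<float>& out, std::vector<float>& mean,
                           std::vector<float>& rstd, const std::vector<float>& inp,
                           const std::vector<float>& weight, const std::vector<float>& bias,
                           int N, int C) {
    assert(inp.size() == (size_t)N * C && out.size() == inp.size());
    assert(weight.size() == (size_t)C && bias.size() == (size_t)C);
    assert(mean.size() == (size_t)N && rstd.size() == (size_t)N);
    const float eps = 1e-5f;  // same epsilon the kernels use
    for (int n = 0; n < N; ++n) {
        const float* x = inp.data() + (size_t)n * C;
        float* o = out.data() + (size_t)n * C;
        // mean of the row
        float m = 0.0f;
        for (int c = 0; c < C; ++c) { m += x[c]; }
        m /= C;
        // biased variance of the row, then reciprocal standard deviation
        float v = 0.0f;
        for (int c = 0; c < C; ++c) { float d = x[c] - m; v += d * d; }
        v /= C;
        float s = 1.0f / std::sqrt(v + eps);
        // normalize, scale by weight, shift by bias
        for (int c = 0; c < C; ++c) { o[c] = s * (x[c] - m) * weight[c] + bias[c]; }
        // keep the statistics for the backward pass
        mean[n] = m;
        rstd[n] = s;
    }
}
```

Comparing a kernel's output against a reference like this with a small
tolerance is one way to catch indexing mistakes; the tolerance would need to
be looser when floatX is a 16-bit type.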
 CUDA/llmcpp/layernorm.cuh | 505 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 505 insertions(+)
 create mode 100644 CUDA/llmcpp/layernorm.cuh
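
Reviewer note: the backward kernel builds the input gradient from three terms
(the weighted upstream gradient, minus its mean, minus the normalized input
times the mean of their product), scales by rstd, and accumulates with += since
layernorm sits on the residual stream. The sketch below restates only that
input-gradient part on the host, under the assumption that mean and rstd were
saved by the forward pass. The name layernorm_backward_dinp_cpu is
hypothetical, and the dweight/dbias reductions are intentionally left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not part of the patch): input gradient of LayerNorm.
// dinp is accumulated into (+=), matching the residual-stream convention.
void layernorm_backward_dinp_cpu(std::vector<float>& dinp, const std::vector<float>& dout,
                                 const std::vector<float>& inp, const std::vector<float>& weight,
                                 const std::vector<float>& mean, const std::vector<float>& rstd,
                                 int N, int C) {
    assert(inp.size() == (size_t)N * C && dout.size() == inp.size() && dinp.size() == inp.size());
    assert(weight.size() == (size_t)C && mean.size() == (size_t)N && rstd.size() == (size_t)N);
    for (int n = 0; n < N; ++n) {
        const float* x = inp.data() + (size_t)n * C;
        const float* dy = dout.data() + (size_t)n * C;
        float* dx = dinp.data() + (size_t)n * C;
        const float m = mean[n];
        const float s = rstd[n];
        // first: two reductions over the row
        float dnorm_mean = 0.0f;
        float dnorm_norm_mean = 0.0f;
        for (int c = 0; c < C; ++c) {
            float norm = (x[c] - m) * s;
            float dnorm = weight[c] * dy[c];
            dnorm_mean += dnorm;
            dnorm_norm_mean += dnorm * norm;
        }
        dnorm_mean /= C;
        dnorm_norm_mean /= C;
        // second: combine the three terms and accumulate
        for (int c = 0; c < C; ++c) {
            float norm = (x[c] - m) * s;
            float dval = weight[c] * dy[c];   // term 1
            dval -= dnorm_mean;               // term 2
            dval -= norm * dnorm_norm_mean;   // term 3
            dval *= s;                        // final scale
            dx[c] += dval;
        }
    }
}
```

A quick sanity property: when mean[n] is the true row mean, the accumulated
gradient for a row sums to zero (up to rounding), because shifting every input
in a row by the same constant leaves the normalized output unchanged.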

diff --git a/CUDA/llmcpp/layernorm.cuh b/CUDA/llmcpp/layernorm.cuh
new file mode 100644
index 0000000..9777d06
--- /dev/null
+++ b/CUDA/llmcpp/layernorm.cuh
@@ -0,0 +1,505 @@
+/*
+LayerNorm CUDA kernel, and also Residual, because sometimes they are fused
+
+Note in llm.c we try to be clever in the backward pass to conserve memory.
+All parameters use a += in the backward pass, so we can do gradient accumulation.
+But all activations have = instead of += because these are faster (just a write, no read-modify-write).
+This is okay for all activations except for those in the residual stream, where the
+gradients have to add. We make sure that we do a += as necessary.
+E.g., the layernorms are connected to the residuals so we += in layernorm backward.
+*/
+
+#include <assert.h>
+// llmc internal imports
+#include "cuda_common.h"
+#include "cuda_utils.cuh"
+
+// ----------------------------------------------------------------------------
+// CUDA kernels
+
+__global__ void layernorm_forward_kernel3(floatX* __restrict__ out, float* __restrict__ mean, float* __restrict__ rstd,
+                                    const floatX*  __restrict__ inp, const floatX*  __restrict__ weight,
+                                    const floatX* __restrict__ bias, int N, int C) {
+    int lane_id = threadIdx.x % WARP_SIZE;
+    int warp_id = threadIdx.x / WARP_SIZE;
+    int num_warps = blockDim.x / WARP_SIZE;
+
+    int idx = blockIdx.x * num_warps + warp_id;
+    if(idx >= N) { return; } // guard
+
+    // the row of input that this group of threads is responsible for
+    const floatX* x = inp + idx * C;
+
+    // mean
+    float sum = 0.0f;
+    for (int i = lane_id; i < C; i += WARP_SIZE) {
+        sum += (float)x[i];
+    }
+    sum = warpReduceSum(sum);
+    float m = sum / C;
+    if(lane_id == 0 && mean != nullptr) {
+        __stcs(mean + idx, m);
+    }
+
+    // rstd
+    sum = 0.0f;
+    for (int i = lane_id; i < C; i += WARP_SIZE) {
+        float diff = (float)x[i] - m;
+        sum += diff * diff;
+    }
+    sum = warpReduceSum(sum);
+    float s = rsqrtf(sum / C + 1e-5f);
+    if(lane_id == 0 && rstd != nullptr) {
+        __stcs(rstd + idx, s);
+    }
+
+    // final normalization and scaling by weight/bias
+    floatX* o = out + idx * C;
+    for (int c = lane_id; c < C; c += WARP_SIZE) {
+        // load and store using the .cs "streaming" hint to the compiler,
+        // indicating that this data will not be reused soon, and can be streamed through the caches
+        // this allows the threads to get more cache-hits for the (shared) weight and bias parameters
+        float n = s * ((float)__ldcs(x+c) - m);
+        __stcs(o+c, (floatX)(n * (float)weight[c] + (float)bias[c]));
+    }
+}
+
+__global__ void layernorm_forward_kernel6(floatX* __restrict__ out, float* __restrict__ mean, float* __restrict__ rstd,
+                                    const floatX*  __restrict__ inp, const floatX*  __restrict__ weight,
+                                    const floatX* __restrict__ bias, int N, int C) {
+    assert(blockDim.x == WARP_SIZE);
+
+    // load weights and biases into shared memory
+    // do this before we allow any threads to exit!
+    extern __shared__ char* params[];
+    // load128/store128 sometimes generated multiple instructions when the types here were floatX*, so
+    // let's keep everything as x128
+    x128* s_weight = reinterpret_cast<x128*>(params);
+    x128* s_bias = reinterpret_cast<x128*>(params) + (C / x128::size);
+    x128* s_in = reinterpret_cast<x128*>(params) + ((2 + threadIdx.y) * C / x128::size);
+
+    int sidx = (threadIdx.x + WARP_SIZE * threadIdx.y) * x128::size;
+    for(int i = sidx; i < C; i += blockDim.y * WARP_SIZE * x128::size) {
+        s_weight[i/x128::size] = load128(weight + i);
+        s_bias[i/x128::size] = load128(bias + i);
+    }
+    __syncthreads();
+
+    int idx = blockIdx.x * blockDim.y + threadIdx.y;
+    if(idx >= N) { return; } // guard
+
+    // adjust pointers to current token
+    inp += idx * C;
+    out += idx * C;
+
+    const float eps = 1e-5f;
+    float sum = 0.0f;
+    for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) {
+        const x128 in_data = load128cs(inp + c);
+        for(int k = 0; k < x128::size; ++k) {
+            sum += (float)in_data[k];
+        }
+        s_in[c / x128::size] = in_data;
+    }
+
+    sum = warpReduceSum(sum);
+    float m = sum / C;
+    float v = 0.f;
+
+    for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) {
+        const x128 in_data = s_in[c / x128::size];
+        for(int k = 0; k < x128::size; ++k) {
+            v += ((float)in_data[k] - m) * ((float)in_data[k] - m);
+        }
+    }
+
+    v = warpReduceSum(v) / C;
+    float s = rsqrtf(v + eps);
+
+    for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) {
+        const x128 in_data = s_in[c / x128::size];
+        const x128 w = s_weight[c / x128::size];
+        const x128 b = s_bias[c / x128::size];
+        x128 out_data;
+        for(int k = 0; k < x128::size; ++k) {
+            float n = s * ((float)in_data[k] - m); // normalized output
+            float o = n * (float)w[k] + (float)b[k]; // scale and shift it
+            out_data[k] = (floatX)o;
+        }
+
+        store128cs(out + c, out_data);
+    }
+    // cache the mean and rstd for the backward pass later
+    if(threadIdx.x == 0 && mean != nullptr) {
+        __stcs(mean + idx, m);
+    }
+    // store the rstd, no need to cache it
+    if(threadIdx.x == 0 && rstd != nullptr) {
+        __stcs(rstd + idx, s);
+    }
+}
+
+__global__ void fused_residual_forward_kernel5(floatX* residual, floatX* normed, float* mean, float* rstd,
+                                               const floatX* inp1, const floatX* inp2,
+                                               const floatX* weight, const floatX* bias,
+                                               int N, int C) {
+    assert(blockDim.x == WARP_SIZE);
+
+    // load weights and biases into shared memory
+    // do this before we allow any threads to exit!
+    extern __shared__ char* params[];
+    // load128/store128 sometimes generated multiple instructions when the types here were floatX*, so
+    // let's keep everything as x128
+    x128* s_weight = reinterpret_cast<x128*>(params);
+    x128* s_bias = reinterpret_cast<x128*>(params) + (C / x128::size);
+    x128* s_res = reinterpret_cast<x128*>(params) + ((2 + threadIdx.y) * C / x128::size);
+
+    int sidx = (threadIdx.x + WARP_SIZE * threadIdx.y) * x128::size;
+    for(int i = sidx; i < C; i += blockDim.y * WARP_SIZE * x128::size) {
+        s_weight[i/x128::size] = load128(weight + i);
+        s_bias[i/x128::size] = load128(bias + i);
+    }
+    __syncthreads();
+
+    int idx = blockIdx.x * blockDim.y + threadIdx.y;
+    if(idx >= N) return; // guard: valid rows are 0..N-1
+
+    // adjust pointers to current token
+    residual += C * idx;
+    normed += C * idx;
+    inp1 += C * idx;
+    inp2 += C * idx;
+
+    const float eps = 1e-5f;
+    float sum = 0.0f;
+    for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) {
+        const x128 in1 = load128cs(inp1 + c);
+        const x128 in2 = load128cs(inp2 + c);
+        x128 out;
+        for(int k = 0; k < x128::size; ++k) {
+            out[k] = (float)in1[k] + (float)in2[k];
+            sum += (float)out[k];
+        }
+        store128cs(residual + c, out);
+        s_res[c / x128::size] = out;
+    }
+
+    sum = warpReduceSum(sum);
+    float m = sum / C;
+    float v = 0.f;
+
+    for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) {
+        const x128 res = s_res[c / x128::size];
+        for(int k = 0; k < x128::size; ++k) {
+            v += ((float)res[k] - m) * ((float)res[k] - m);
+        }
+    }
+
+    v = warpReduceSum(v) / C;
+    float s = rsqrtf(v + eps);
+
+    for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) {
+        const x128 res = s_res[c / x128::size];
+        const x128 w = s_weight[c / x128::size];
+        const x128 b = s_bias[c / x128::size];
+        x128 out;
+        for(int k = 0; k < x128::size; ++k) {
+            float n = s * ((float)res[k] - m); // normalized output
+            float o = n * (float)w[k] + (float)b[k]; // scale and shift it
+            out[k] = (floatX)o;
+        }
+
+        store128cs(normed + c, out);
+    }
+    // cache the mean and rstd for the backward pass later
+    if(threadIdx.x == 0) {
+        mean[idx] = m;
+        rstd[idx] = s;
+    }
+}
+
+__global__ void residual_forward_kernel(floatX* out, const floatX* inp1, const floatX* inp2) {
+    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * x128::size;
+
+    x128 packed_out;
+    x128 packed_inp1 = load128cs(inp1 + idx);
+    x128 packed_inp2 = load128cs(inp2 + idx);
+    for (int k = 0; k < packed_inp1.size; k++) {
+        packed_out[k] = (floatX)((float)packed_inp1[k] + (float)packed_inp2[k]);
+    }
+    store128(out + idx, packed_out);
+}
+
+__global__ void __launch_bounds__(512, 2) // todo - any warnings on Turing with only 1024 threads?
+    layernorm_backward_kernel10(floatX* dinp, floatX* dweight, floatX* dbias, float* scratch,
+                                const floatX* dout, const floatX* inp, const floatX* weight,
+                                const float* mean, const float* rstd,
+                                int B, int T, int C) {
+    int BLOCK_SIZE = blockDim.x;
+    int warpsInBlock = BLOCK_SIZE / WARP_SIZE; //number of warps in block
+    extern __shared__ float shared[];
+
+    int warpId = threadIdx.x / WARP_SIZE; // warp index within a block
+    int baseIdx = blockIdx.x * warpsInBlock + warpId;
+    int warpThreadIdx = threadIdx.x % WARP_SIZE; // Thread index within the warp
+    int warpsInGrid = gridDim.x * warpsInBlock;
+    int C_per_iteration = WARP_SIZE * x128::size;
+    int iterations_C = CEIL_DIV(C, C_per_iteration); // + 2;
+
+    // the first half of shared memory is bias, second is weight
+    size_t rounded_C = CEIL_DIV(C, (32 * x128::size)) * (32 * x128::size);
+    float* dbias_shared = shared;
+    float* dweight_shared = shared + rounded_C;
+    // warp zero doesn't actually write to the _tmp_shared memory locations, so we don't need to reserve memory
+    // the obvious solution is to change the addressing below to use (threadIdx.x-32) as offset, but that causes
+    // register spills, so instead we mess with the base pointer here, which doesn't increase register usage.
+    float* dbias_tmp_shared = shared + 2 * rounded_C - WARP_SIZE * f128::size;
+    float* dweight_tmp_shared = shared + 2 * rounded_C + f128::size * BLOCK_SIZE - 2 * WARP_SIZE * f128::size;
+
+    // init shared memory to zero
+    for(int i = threadIdx.x * f128::size; i < rounded_C; i += BLOCK_SIZE * f128::size) {
+        store128(dbias_shared + i, f128::zeros());
+        store128(dweight_shared + i, f128::zeros());
+    }
+    __syncthreads();
+
+    for (int bt = baseIdx; bt < B * T; bt += warpsInGrid) {
+        const floatX* dout_bt = dout + bt * C;
+        const floatX* inp_bt = inp + bt * C;
+        floatX* dinp_bt = dinp + bt * C;
+
+        // first: two reduce operations
+        float dnorm_mean = 0.0f;
+        float dnorm_norm_mean = 0.0f;
+        for (int i = warpThreadIdx * x128::size; i < C; i += WARP_SIZE * x128::size) {
+            x128 dout128_i   = load128(dout_bt + i);
+            x128 inp128_i    = load128(inp_bt  + i);
+            x128 weight128_i = load128(weight  + i);
+            for (int k = 0; k < x128::size; k++) {
+                float dnorm_i = (float)weight128_i[k] * (float)dout128_i[k];
+                dnorm_mean += dnorm_i;
+                dnorm_norm_mean += dnorm_i * (float)inp128_i[k];
+            }
+        }
+
+        const float mean_bt = mean[bt];
+        const float rstd_bt = rstd[bt];
+        dnorm_mean = warpReduceSum(dnorm_mean) / C;
+        dnorm_norm_mean = warpReduceSum(dnorm_norm_mean) / C * rstd_bt - dnorm_mean * mean_bt * rstd_bt;
+
+        for (int c = 0; c < iterations_C; c++) {
+            int global_index = (warpThreadIdx * x128::size) + (c * C_per_iteration);
+
+            x128 dout128   = x128::zeros();
+            x128 inp128    = x128::zeros();
+            x128 dinp128   = x128::zeros();
+            x128 weight128 = x128::zeros();
+
+            if(global_index < C) {
+                dout128 = load128cs(dout_bt + global_index);
+                inp128 = load128cs(inp_bt + global_index);
+                dinp128 = load128(dinp_bt + global_index);
+                weight128 = load128(weight + global_index);
+            }
+
+            for(int o = 0; o < x128::size / f128::size; ++o) {
+                f128 dbias_f;
+                f128 dweight_f;
+                for(int i = 0; i < f128::size; ++i) {
+                    int x = o * f128::size + i;
+                    float dout_i = (float)dout128[x];
+                    float norm_bti = ((float)inp128[x] - mean_bt) * rstd_bt;
+                    dbias_f[i] = dout_i;
+                    dweight_f[i] = norm_bti * dout_i;
+
+                    float dval = 0.0f;
+                    dval += (float) weight128[x] * (float)dout128[x]; // term 1
+                    dval -= dnorm_mean; // term 2
+                    dval -= norm_bti * dnorm_norm_mean; // term 3
+                    dval *= rstd_bt; // final scale
+                    dinp128[x] = (floatX) ((float) dinp128[x] + dval);
+                }
+
+                if (warpId != 0) {
+                    store128(dbias_tmp_shared + threadIdx.x * f128::size, dbias_f);
+                    // this seems to generate a 64-bit store, instead of 128-bit.
+                    // however, forcing 128-bit (e.g., using inline ptx), results in register
+                    // spilling and much worse performance, so we'll keep it like this for now
+                    // but ideally, we could reduce the register pressure a little.
+                    store128(dweight_tmp_shared + threadIdx.x * f128::size, dweight_f);
+                }
+                __syncthreads();
+                if (warpId == 0) {
+                    for (int j = 1; j < warpsInBlock; j++) {
+                        f128 dbias_tmp = load128(dbias_tmp_shared + f128::size * (threadIdx.x + j * WARP_SIZE));
+                        f128 dweight_tmp = load128(dweight_tmp_shared + f128::size * (threadIdx.x + j * WARP_SIZE));
+                        for(int i = 0; i < f128::size; ++i) {
+                            dbias_f[i] += dbias_tmp[i];
+                            dweight_f[i] += dweight_tmp[i];
+                        }
+                    }
+                }
+                __syncthreads();
+                if (warpId == 0) {
+                    f128 db_old = load128(dbias_shared + global_index + f128::size * o);
+                    f128 dw_old = load128(dweight_shared + global_index + f128::size * o);
+                    for(int i = 0; i < f128::size; ++i) {
+                        dbias_f[i] += db_old[i];
+                        dweight_f[i] += dw_old[i];
+                    }
+                    store128(dbias_shared + global_index + f128::size * o, dbias_f);
+                    store128(dweight_shared + global_index + f128::size * o, dweight_f);
+                }
+            }
+            if(global_index < C) {
+                // cache in L2 as this is read by the next kernel, but bypass L1 to minimise thrashing
+                store128cg(dinp_bt + global_index, dinp128);
+            }
+        }
+    }
+    __syncthreads();
+    // Each block writes its partial sum to global memory
+    // The last block to finish becomes responsible for summing up all the partial sums
+    // This is done by atomically incrementing a flag (cleared to 0 before launching the kernel)
+    unsigned int* scratchFlag = (unsigned int*)(scratch);
+    // Increment scratch pointer by a full cacheline so that everything remains cacheline aligned
+    scratch += 32;
+    float* scratch_dbias = scratch;
+    float* scratch_dweight = scratch + C;
+    for(int i = threadIdx.x * f128::size; i < C; i += BLOCK_SIZE * f128::size) {
+        // Write to global memory in the same "shared memory banking friendly" order
+        store128(scratch_dbias + i + 2*C*blockIdx.x, load128(dbias_shared + i));
+        store128(scratch_dweight + i + 2*C*blockIdx.x, load128(dweight_shared + i));
+    }
+    __syncthreads();
+    // that portion of shared memory is no longer used, so we can repurpose it for the scratch flag.
+    unsigned int *tmp_flag = (unsigned int*)(shared + 2*rounded_C);
+    if (threadIdx.x == 0) {
+        *tmp_flag = atomicInc(scratchFlag, gridDim.x);
+    }
+    __syncthreads();
+    if (*tmp_flag == gridDim.x-1) {
+        // Reduction of the partial sums by the final block
+        // todo - there isn't enough parallelism even inside that single SM...
+        // ==> so could maybe split into another kernel with YET ANOTHER level of reduction?!
+        for(int i = threadIdx.x * f128::size; i < C; i += BLOCK_SIZE * f128::size) {
+            f128 dbias_accum = f128::zeros();
+            f128 dweight_accum = f128::zeros();
+
+            for (int read_block_idx = 0; read_block_idx < gridDim.x; read_block_idx++) {
+                int offset = i + 2*C*read_block_idx;
+                f128 dbias128 = load128(scratch_dbias + offset);
+                f128 dweight128 = load128(scratch_dweight + offset);
+                for(int k = 0; k < f128::size; k++) {
+                    dbias_accum[k] += dbias128[k];
+                    dweight_accum[k] += dweight128[k];
+                }
+            }
+            store128(dbias_shared + i, dbias_accum);
+            store128(dweight_shared + i, dweight_accum);
+        }
+        __syncthreads();
+
+        // convert from float/FP32 to floatX/BF16 for the final write
+        // this is separate because it cannot use as many warps as the above (f128 vs x128)
+        // todo - if we split this code into another kernel, we could maybe do it at the same time?
+        for (int c = warpId; c < iterations_C; c += warpsInBlock) {
+            int global_index = (warpThreadIdx * x128::size) + (c * C_per_iteration);
+            if (global_index >= C) {
+                break;
+            }
+
+            x128 dbias128 = load128(dbias + global_index);
+            x128 dweight128 = load128(dweight + global_index);
+            for(int o = 0; o < x128::size / f128::size; ++o) {
+                f128 s_db = load128(dbias_shared + global_index + o * f128::size);
+                f128 s_dw = load128(dweight_shared + global_index + o * f128::size);
+                for(int i = 0; i < f128::size; ++i) {
+                    int x = o * f128::size + i;
+                    dbias128[x] = (floatX)(s_db[i] + (float)dbias128[x]);
+                    dweight128[x] = (floatX)(s_dw[i] + (float)dweight128[x]);
+                }
+            }
+            store128(dbias + global_index, dbias128);
+            store128(dweight + global_index, dweight128);
+        }
+    }
+}
+
+// ----------------------------------------------------------------------------
+// kernel launchers
+
+// similar to `fused_residual_forward5`
+void layernorm_forward(floatX* out, float* mean, float* rstd,
+                       floatX* inp, const floatX* weight, const floatX* bias,
+                       int B, int T, int C, cudaStream_t stream) {
+    NVTX_RANGE_FN();
+    const int block_size = 256;
+    int block_y = block_size / WARP_SIZE;
+    const int N = B * T;
+    const int grid_size = CEIL_DIV(N, block_y);
+    size_t smem = (2 + block_y) * C * sizeof(floatX);
+
+    // in order to use more than 48 KiB of smem, need to call cudaFuncSetAttribute
+    // this may fail, in which case we fall back to the smem free implementation.
+    cudaCheck(cudaGetLastError());
+    auto status = cudaFuncSetAttribute(layernorm_forward_kernel6, cudaFuncAttributeMaxDynamicSharedMemorySize, smem);
+    cudaCheck(cudaGetLastError());
+    if (status == cudaSuccess) {
+        layernorm_forward_kernel6<<<grid_size, dim3(WARP_SIZE, block_y), smem, stream>>>(out, mean, rstd, inp, weight, bias, N, C);
+    } else {
+        // fall back to the version without shared memory
+        const int grid_size_fb = CEIL_DIV(N * WARP_SIZE, block_size);
+        layernorm_forward_kernel3<<<grid_size_fb, block_size, 0, stream>>>(out, mean, rstd, inp, weight, bias, N, C);
+    }
+    cudaCheck(cudaGetLastError());
+}
+
+void residual_forward(floatX* out, const floatX* inp1, const floatX* inp2, int N, cudaStream_t stream) {
+    NVTX_RANGE_FN();
+    const int block_size = 256;
+    assert(N % (block_size * x128::size) == 0);
+    const int grid_size = CEIL_DIV(N, block_size * x128::size);
+    residual_forward_kernel<<<grid_size, block_size, 0, stream>>>(out, inp1, inp2);
+    cudaCheck(cudaGetLastError());
+}
+
+void fused_residual_forward5(floatX* residual, floatX* normed, float* mean, float* rstd,
+                             const floatX* inp1, const floatX* inp2,
+                             const floatX* weight, const floatX* bias,
+                             int N, int C, cudaStream_t stream) {
+    const int block_size = 256;
+    int block_y = block_size / WARP_SIZE;
+    const int grid_size = CEIL_DIV(N, block_y);
+    size_t smem = (2 + block_y) * C * sizeof(floatX);
+
+    // in order to use more than 48 KiB of smem, need to call cudaFuncSetAttribute
+    // this may fail, in which case we fall back to the smem free implementation.
+    cudaCheck(cudaGetLastError());
+    auto status = cudaFuncSetAttribute(fused_residual_forward_kernel5, cudaFuncAttributeMaxDynamicSharedMemorySize, smem);
+    cudaCheck(cudaGetLastError());
+    if(status == cudaSuccess) {
+        fused_residual_forward_kernel5<<<grid_size, dim3(WARP_SIZE, block_y), smem, stream>>>(residual, normed,
+                                                                                              mean, rstd, inp1, inp2,
+                                                                                              weight, bias, N, C);
+    } else {
+        residual_forward(residual, inp1, inp2, N*C, stream);
+        layernorm_forward(normed, mean, rstd, residual, weight, bias, N, 1, C, stream);
+    }
+    cudaCheck(cudaGetLastError());
+}
+
+void layernorm_backward(floatX* dinp, floatX* dweight, floatX* dbias, float* scratch,
+                        const floatX* dout, const floatX* inp, const floatX* weight, const float* mean, const float* rstd,
+                        int B, int T, int C, cudaStream_t stream) {
+    NVTX_RANGE_FN();
+    const int block_size = 512;
+    const int blocks_per_sm = 2; // supported on every architecture and less cache thrashing than 3
+    const int grid_size = blocks_per_sm * deviceProp.multiProcessorCount;
+    size_t rounded_C = CEIL_DIV(C, (32 * x128::size)) * (32 * x128::size);
+    size_t shared_mem_size = (2 * rounded_C + 2 * (block_size - 32) * f128::size) * sizeof(float);
+
+    cudaCheck(cudaMemsetAsync(scratch, 0, 1 * sizeof(float), stream)); // only need to reset the flag to 0
+    layernorm_backward_kernel10<<<grid_size, block_size, shared_mem_size, stream>>>(dinp, dweight, dbias, scratch, dout, inp, weight, mean, rstd, B, T, C);
+    cudaCheck(cudaGetLastError());
+}
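
Note on layernorm_backward_kernel10 above: the per-element input gradient it computes (terms 1 to 3 followed by the rstd scale) can be hard to pick out between the vectorized loads and the shared-memory reductions. The sketch below is a minimal CPU reference for a single (b, t) row, written only to illustrate that arithmetic. The function name layernorm_backward_row_ref is hypothetical and not part of this patch, and the sketch assumes plain float buffers instead of floatX.

```c
// Hypothetical CPU reference for one (b, t) row of the LayerNorm backward pass.
// mean and rstd are the values cached by the forward pass for this row.
// dinp is accumulated into (+=), matching the kernel's read-modify-write.
static void layernorm_backward_row_ref(float* dinp, const float* dout, const float* inp,
                                       const float* weight, float mean, float rstd, int C) {
    // first: the two reductions over the channel dimension
    float dnorm_mean = 0.0f;
    float dnorm_norm_mean = 0.0f;
    for (int i = 0; i < C; i++) {
        float norm_i = (inp[i] - mean) * rstd;
        float dnorm_i = weight[i] * dout[i];
        dnorm_mean += dnorm_i;
        dnorm_norm_mean += dnorm_i * norm_i;
    }
    dnorm_mean /= C;
    dnorm_norm_mean /= C;

    // second: combine the three terms and scale by rstd
    for (int i = 0; i < C; i++) {
        float norm_i = (inp[i] - mean) * rstd;
        float dval = weight[i] * dout[i];   // term 1
        dval -= dnorm_mean;                 // term 2
        dval -= norm_i * dnorm_norm_mean;   // term 3
        dinp[i] += dval * rstd;             // final scale
    }
}
```

A quick sanity property: when mean is the true row mean, the gradients written for a row sum to (approximately) zero, because shifting every input by the same constant leaves the LayerNorm output unchanged.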

From 811018613fdf9579daaaa35fbe324c0f6cbaa109 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sun, 7 Jun 2026 17:50:59 +0530
Subject: [PATCH 40/45] feat : add PyTorch-compatible Mersenne Twister random
 utilities

---
 CUDA/llmcpp/rand.h | 240 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
 create mode 100644 CUDA/llmcpp/rand.h
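
Note on stream position: matching torch requires consuming exactly as many 32-bit draws as the reference does, which is why the header's usage example interleaves randint32() calls with normal_(). The sketch below counts how many randint32() calls a single normal_() call consumes, as read from the two code paths in this header. The helper name normal_draw_count is hypothetical and is not part of rand.h.

```c
// Hypothetical helper (not part of rand.h): number of randint32() calls that
// normal_(data, numel, ...) consumes, derived from the two code paths in the header.
static unsigned int normal_draw_count(unsigned int numel) {
    if (numel >= 16) {
        // normal_fill: one randfloat32 (one 32-bit draw) per element ...
        unsigned int draws = numel;
        // ... plus 16 fresh draws when the last 16 values are recomputed
        if (numel % 16 != 0) { draws += 16; }
        return draws;
    }
    // scalar path: each Box-Muller pair needs two randfloat64 values,
    // and each randfloat64 is built from two 32-bit draws
    unsigned int pairs = (numel + 1) / 2;
    return pairs * 4;
}
```

For the header's own example, normal_(t8, 8, ...) takes the scalar path and advances the generator by 16 draws, and normal_(t16, 16, ...) takes the block path and also advances it by 16. A call with numel = 40 would consume 56 draws, since the final 16 values are regenerated.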

diff --git a/CUDA/llmcpp/rand.h b/CUDA/llmcpp/rand.h
new file mode 100644
index 0000000..b66aa04
--- /dev/null
+++ b/CUDA/llmcpp/rand.h
@@ -0,0 +1,240 @@
+/*
+Mersenne Twister implementation, numerically identical to torch.
+
+Example usage:
+
+    mt19937_state state;
+    manual_seed(&state, 137);
+    printf("%u\n", randint32(&state));
+    printf("%u\n", randint32(&state));
+    printf("%u\n", randint32(&state));
+    printf("%u\n", randint32(&state));
+    printf("%u\n", randint32(&state));
+
+    float t8[8];
+    normal_(t8, 8, 0, 1, &state);
+    for (int i = 0; i < 8; i++) {
+        printf("%f\n", t8[i]);
+    }
+    printf("%u\n", randint32(&state));
+
+    float t16[16];
+    normal_(t16, 16, 0, 1, &state);
+    for (int i = 0; i < 16; i++) {
+        printf("%f\n", t16[i]);
+    }
+    printf("%u\n", randint32(&state));
+
+PyTorch reference (producing identical results):
+
+    import torch
+    torch.manual_seed(137)
+    print(torch.randint(0, 0xFFFFFFFF, [1]).item())
+    print(torch.randint(0, 0xFFFFFFFF, [1]).item())
+    print(torch.randint(0, 0xFFFFFFFF, [1]).item())
+    print(torch.randint(0, 0xFFFFFFFF, [1]).item())
+    print(torch.randint(0, 0xFFFFFFFF, [1]).item())
+    t = torch.zeros(8)
+    t.normal_()
+    for i in range(len(t)):
+        print(t[i].item())
+    print(torch.randint(0, 0xFFFFFFFF, [1]).item())
+    t = torch.zeros(16)
+    t.normal_()
+    for i in range(len(t)):
+        print(t[i].item())
+    print(torch.randint(0, 0xFFFFFFFF, [1]).item())
+
+Both output:
+
+    4053805790
+    2173880614
+    380293709
+    1237255315
+    2986595568
+    0.7947664260864258
+    1.4369317293167114
+    -0.2292192131280899
+    0.47556325793266296
+    -0.6334410905838013
+    -0.5791953802108765
+    -0.0925704762339592
+    -0.8659197092056274
+    2186503452
+    -1.2813878059387207
+    -2.646395683288574
+    -0.06569503247737885
+    0.2180829495191574
+    -0.46536165475845337
+    -0.33108410239219666
+    2.5485482215881348
+    0.10425379872322083
+    0.8460659980773926
+    0.9462448358535767
+    -0.2913765013217926
+    0.34313806891441345
+    -1.1186704635620117
+    -0.18305328488349915
+    -2.3153159618377686
+    0.3961987793445587
+    2756748748
+*/
+
+#ifndef RAND_H
+#define RAND_H
+
+#include <math.h>
+
+#define MERSENNE_STATE_M 397u
+#define MERSENNE_STATE_N 624u
+
+#define LMASK 0x7ffffffful
+#define UMASK 0x80000000ul
+
+// Copyright(c) Makoto Matsumoto and Takuji Nishimura
+
+// This implementation follows PyTorch so that we are numerically identical when running verification tests.
+
+typedef struct {
+    unsigned long long seed_;
+    int left_;
+    unsigned int next_;
+    unsigned int state_[MERSENNE_STATE_N];
+    unsigned int MATRIX_A[2];
+} mt19937_state;
+
+void manual_seed(mt19937_state* state, unsigned int seed) {
+    state->MATRIX_A[0] = 0x0u;
+    state->MATRIX_A[1] = 0x9908b0df;
+    state->state_[0] = seed & 0xffffffff;
+    for (unsigned int j = 1; j < MERSENNE_STATE_N; j++) {
+        state->state_[j] = 1812433253 * (state->state_[j - 1] ^ (state->state_[j - 1] >> 30)) + j;
+        state->state_[j] &= 0xffffffff;
+    }
+    state->left_ = 1;
+    state->next_ = 0;
+}
+
+void next_state(mt19937_state* state) {
+    state->left_ = MERSENNE_STATE_N;
+    state->next_ = 0;
+    unsigned int y, j;
+    for (j = 0; j < MERSENNE_STATE_N - MERSENNE_STATE_M; j++) {
+        y = (state->state_[j] & UMASK) | (state->state_[j + 1] & LMASK);
+        state->state_[j] = state->state_[j + MERSENNE_STATE_M] ^ (y >> 1) ^ state->MATRIX_A[y & 0x1];
+    }
+    for (; j < MERSENNE_STATE_N - 1; j++) {
+        y = (state->state_[j] & UMASK) | (state->state_[j + 1] & LMASK);
+        state->state_[j] = state->state_[j + (MERSENNE_STATE_M - MERSENNE_STATE_N)] ^ (y >> 1) ^ state->MATRIX_A[y & 0x1];
+    }
+    y = (state->state_[MERSENNE_STATE_N - 1] & UMASK) | (state->state_[0] & LMASK);
+    state->state_[MERSENNE_STATE_N - 1] = state->state_[MERSENNE_STATE_M - 1] ^ (y >> 1) ^ state->MATRIX_A[y & 0x1];
+}
+
+unsigned int randint32(mt19937_state* state) {
+    if (!state) return 0;
+    if (state->MATRIX_A[0] != 0 || state->MATRIX_A[1] != 0x9908b0df) manual_seed(state, 5489); // auto-initialize
+    if (--state->left_ <= 0) {
+        next_state(state);
+    }
+    unsigned int y = state->state_[state->next_++];
+    y ^= y >> 11;
+    y ^= (y << 7) & 0x9d2c5680;
+    y ^= (y << 15) & 0xefc60000;
+    y ^= y >> 18;
+    return y;
+}
+
+inline unsigned long long randint64(mt19937_state* state) {
+    return (((unsigned long long)(randint32(state)) << 32) | randint32(state));
+}
+
+inline float randfloat32(mt19937_state* state) {
+    return (randint32(state) & ((1ull << 24) - 1)) * (1.0f / (1ull << 24));
+}
+
+inline double randfloat64(mt19937_state* state) {
+    return (randint64(state) & ((1ull << 53) - 1)) * (1.0 / (1ull << 53));
+}
+
+void uniform_(float* data, unsigned int numel, float from, float to, mt19937_state* state) {
+    for (unsigned int t = 0; t < numel; t++) {
+        data[t] = randfloat32(state) * (to - from) + from;
+    }
+}
+
+// Box-Muller transform: maps uniform random numbers to Gaussian distributed numbers
+// https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
+void normal_fill_16(float* data, float mean, float std) {
+    #define EPSILONE 1e-12f
+    for (unsigned int t = 0; t < 8; t++) {
+        float u1 = 1 - data[t];
+        float u2 = data[t + 8];
+        float radius = sqrtf(-2 * logf(u1 + EPSILONE));
+        float theta = (float) (2.0 * M_PI * u2);
+        data[t] = (radius * cosf(theta) * std + mean);
+        data[t + 8] = (radius * sinf(theta) * std + mean);
+    }
+}
+
+void normal_fill(float* data, unsigned int numel, float mean, float std, mt19937_state* state) {
+    for (unsigned int t = 0; t < numel; t++) {
+        data[t] = randfloat32(state);
+    }
+    for (unsigned int i = 0; i < numel - 15; i += 16) {
+        normal_fill_16(data + i, mean, std);
+    }
+    if (numel % 16 != 0) {
+        // recompute the last 16 values
+        data = data + numel - 16;
+        for (unsigned int i = 0; i < 16; i++) {
+            data[i] = randfloat32(state);
+        }
+        normal_fill_16(data, mean, std);
+    }
+}
+
+void normal_(float* data, unsigned int numel, float mean, float std, mt19937_state* state) {
+    #define EPSILONE 1e-12f
+    if (numel >= 16) {
+        normal_fill(data, numel, mean, std, state);
+    }
+    else {
+        double next_double_normal_sample = 0.0; // initialized only to silence a compiler warning; never read before being set
+        int has_next_double_normal_sample = 0;
+        for (unsigned int  t = 0; t < numel; t++) {
+            if (has_next_double_normal_sample) {
+                data[t] = (float)(next_double_normal_sample * std + mean);
+                has_next_double_normal_sample = 0;
+                continue;
+            }
+            // for numel < 16 we draw a double (float64)
+            float u1 = (float) randfloat64(state);
+            float u2 = (float) randfloat64(state);
+            float radius = sqrtf(-2 * logf(1 - u2 + EPSILONE));
+            float theta = (float) (2.0 * M_PI * u1);
+            next_double_normal_sample = radius * sinf(theta);
+            has_next_double_normal_sample = 1;
+            data[t] = (radius * cosf(theta) * std + mean);
+        }
+    }
+}
+
+void init_identity_permutation(int *data, int numel) {
+    for (int i = 0; i < numel; i++) {
+        data[i] = i;
+    }
+}
+
+void random_permutation(int* data, int numel, mt19937_state* state) {
+    for (int i = numel - 1; i > 0; i--) {
+        // pick an index j in [0, i] with equal probability
+        int j = randint32(state) % (i + 1);
+        // swap i <-> j
+        int tmp = data[i];
+        data[i] = data[j];
+        data[j] = tmp;
+    }
+}
+
+#endif
\ No newline at end of file

From 54b727bcfaaac00e6476d5326adbd8c3b64df022 Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Sun, 7 Jun 2026 17:59:57 +0530
Subject: [PATCH 41/45] README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.
---
 README.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index d8a6ca1..6a0931c 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,13 @@
 # Quadtrix.cpp
 
-<p align="center">
+<h1 align="center">
   <img width="785" height="261" alt="image" src="https://github.com/user-attachments/assets/7bd2d8c6-d1e3-4ca0-96c0-0161d3cf235a" />
+</h1><br>
+
 
   [![Release](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml)  [![Package](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml)
   [![CI](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml)
-</p>
+
 
 A local large language model with a modular, multi-path execution architecture. Train, run inference, and serve a chat interface — all from a single repository, across bare-metal C++, PyTorch, and a React frontend.
 

From 9b34e36e0b5f22d1bef168b18cad3185146f532f Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sun, 7 Jun 2026 18:06:49 +0530
Subject: [PATCH 42/45] utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`,
 `fcloseCheck`, and `fseekCheck` with explicit crash details and
 project-specific troubleshooting hints. - Add cross-platform socket closure
 wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

---
 CUDA/llmcpp/utils.h | 223 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)
 create mode 100644 CUDA/llmcpp/utils.h
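
Note on the xxxCheck pattern: every wrapper in this header pairs a checked function that takes the call site (file, line) with a macro that fills those in from __FILE__ and __LINE__. The sketch below applies the same pattern to calloc as an illustration. The names calloc_check and callocCheck are hypothetical and do not exist in utils.h, and the sketch uses static inline rather than the header's extern inline so that it compiles on its own.

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical wrapper in the style of malloc_check; calloc_check and
// callocCheck are illustrative names that do not exist in utils.h.
static inline void *calloc_check(size_t nmemb, size_t size, const char *file, int line) {
    void *ptr = calloc(nmemb, size);
    if (ptr == NULL) {
        fprintf(stderr, "Error: Memory allocation failed at %s:%d\n", file, line);
        fprintf(stderr, "Error details:\n");
        fprintf(stderr, "  Elements: %zu\n", nmemb);
        fprintf(stderr, "  Element size: %zu bytes\n", size);
        exit(EXIT_FAILURE);
    }
    return ptr;
}

// The macro captures the call site, so the report points at the caller
// rather than at the wrapper itself.
#define callocCheck(nmemb, size) calloc_check(nmemb, size, __FILE__, __LINE__)
```

A caller would write int *buf = (int *)callocCheck(n, sizeof(int)); and can then use buf without a NULL test, since a failed allocation terminates the program with the caller's file and line in the message.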

diff --git a/CUDA/llmcpp/utils.h b/CUDA/llmcpp/utils.h
new file mode 100644
index 0000000..775534c
--- /dev/null
+++ b/CUDA/llmcpp/utils.h
@@ -0,0 +1,223 @@
+/*
+ This file contains utilities shared between the different training scripts.
+ In particular, we define a series of macros xxxCheck that call the corresponding
+ C standard library function and check its return code. If an error was reported,
+ the program prints some debug information and exits.
+*/
+#ifndef UTILS_H
+#define UTILS_H
+
+#include <unistd.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+// implementation of dirent for Windows is in dev/unistd.h
+#ifndef _WIN32
+#include <dirent.h>
+#include <arpa/inet.h>
+#endif
+
+// ----------------------------------------------------------------------------
+// fread convenience utils, with nice handling of error checking using macros
+// simply replace fopen, fread, fclose, fseek
+// with fopenCheck, freadCheck, fcloseCheck, fseekCheck
+
+extern inline FILE *fopen_check(const char *path, const char *mode, const char *file, int line) {
+    FILE *fp = fopen(path, mode);
+    if (fp == NULL) {
+        fprintf(stderr, "Error: Failed to open file '%s' at %s:%d\n", path, file, line);
+        fprintf(stderr, "Error details:\n");
+        fprintf(stderr, "  File: %s\n", file);
+        fprintf(stderr, "  Line: %d\n", line);
+        fprintf(stderr, "  Path: %s\n", path);
+        fprintf(stderr, "  Mode: %s\n", mode);
+        fprintf(stderr, "---> HINT 1: dataset files/code have moved to dev/data recently (May 20, 2024). You may have to mv them from the legacy data/ dir to dev/data/(dataset), or re-run the data preprocessing script. Refer back to the main README\n");
+        fprintf(stderr, "---> HINT 2: possibly try to re-run `python train_gpt2.py`\n");
+        exit(EXIT_FAILURE);
+    }
+    return fp;
+}
+
+#define fopenCheck(path, mode) fopen_check(path, mode, __FILE__, __LINE__)
+
+extern inline void fread_check(void *ptr, size_t size, size_t nmemb, FILE *stream, const char *file, int line) {
+    size_t result = fread(ptr, size, nmemb, stream);
+    if (result != nmemb) {
+        if (feof(stream)) {
+            fprintf(stderr, "Error: Unexpected end of file at %s:%d\n", file, line);
+        } else if (ferror(stream)) {
+            fprintf(stderr, "Error: File read error at %s:%d\n", file, line);
+        } else {
+            fprintf(stderr, "Error: Partial read at %s:%d. Expected %zu elements, read %zu\n",
+                    file, line, nmemb, result);
+        }
+        fprintf(stderr, "Error details:\n");
+        fprintf(stderr, "  File: %s\n", file);
+        fprintf(stderr, "  Line: %d\n", line);
+        fprintf(stderr, "  Expected elements: %zu\n", nmemb);
+        fprintf(stderr, "  Read elements: %zu\n", result);
+        exit(EXIT_FAILURE);
+    }
+}
+
+#define freadCheck(ptr, size, nmemb, stream) fread_check(ptr, size, nmemb, stream, __FILE__, __LINE__)
+
+extern inline void fclose_check(FILE *fp, const char *file, int line) {
+    if (fclose(fp) != 0) {
+        fprintf(stderr, "Error: Failed to close file at %s:%d\n", file, line);
+        fprintf(stderr, "Error details:\n");
+        fprintf(stderr, "  File: %s\n", file);
+        fprintf(stderr, "  Line: %d\n", line);
+        exit(EXIT_FAILURE);
+    }
+}
+
+#define fcloseCheck(fp) fclose_check(fp, __FILE__, __LINE__)
+
+extern inline void sclose_check(int sockfd, const char *file, int line) {
+    if (close(sockfd) != 0) {
+        fprintf(stderr, "Error: Failed to close socket at %s:%d\n", file, line);
+        fprintf(stderr, "Error details:\n");
+        fprintf(stderr, "  File: %s\n", file);
+        fprintf(stderr, "  Line: %d\n", line);
+        exit(EXIT_FAILURE);
+    }
+}
+
+#define scloseCheck(sockfd) sclose_check(sockfd, __FILE__, __LINE__)
+
+#ifdef _WIN32
+extern inline void closesocket_check(int sockfd, const char *file, int line) {
+    if (closesocket(sockfd) != 0) {
+        fprintf(stderr, "Error: Failed to close socket at %s:%d\n", file, line);
+        fprintf(stderr, "Error details:\n");
+        fprintf(stderr, "  File: %s\n", file);
+        fprintf(stderr, "  Line: %d\n", line);
+        exit(EXIT_FAILURE);
+    }
+}
+
+#define closesocketCheck(sockfd) closesocket_check(sockfd, __FILE__, __LINE__)
+#endif
+
+extern inline void fseek_check(FILE *fp, long off, int whence, const char *file, int line) {
+    if (fseek(fp, off, whence) != 0) {
+        fprintf(stderr, "Error: Failed to seek in file at %s:%d\n", file, line);
+        fprintf(stderr, "Error details:\n");
+        fprintf(stderr, "  Offset: %ld\n", off);
+        fprintf(stderr, "  Whence: %d\n", whence);
+        fprintf(stderr, "  File:   %s\n", file);
+        fprintf(stderr, "  Line:   %d\n", line);
+        exit(EXIT_FAILURE);
+    }
+}
+
+#define fseekCheck(fp, off, whence) fseek_check(fp, off, whence, __FILE__, __LINE__)
+
+extern inline void fwrite_check(void *ptr, size_t size, size_t nmemb, FILE *stream, const char *file, int line) {
+    size_t result = fwrite(ptr, size, nmemb, stream);
+    if (result != nmemb) {
+        if (feof(stream)) {
+            fprintf(stderr, "Error: Unexpected end of file at %s:%d\n", file, line);
+        } else if (ferror(stream)) {
+            fprintf(stderr, "Error: File write error at %s:%d\n", file, line);
+        } else {
+            fprintf(stderr, "Error: Partial write at %s:%d. Expected %zu elements, wrote %zu\n",
+                    file, line, nmemb, result);
+        }
+        fprintf(stderr, "Error details:\n");
+        fprintf(stderr, "  File: %s\n", file);
+        fprintf(stderr, "  Line: %d\n", line);
+        fprintf(stderr, "  Expected elements: %zu\n", nmemb);
+        fprintf(stderr, "  Written elements: %zu\n", result);
+        exit(EXIT_FAILURE);
+    }
+}
+
+#define fwriteCheck(ptr, size, nmemb, stream) fwrite_check(ptr, size, nmemb, stream, __FILE__, __LINE__)
+
+// ----------------------------------------------------------------------------
+// malloc error-handling wrapper util
+
+extern inline void *malloc_check(size_t size, const char *file, int line) {
+    void *ptr = malloc(size);
+    if (ptr == NULL) {
+        fprintf(stderr, "Error: Memory allocation failed at %s:%d\n", file, line);
+        fprintf(stderr, "Error details:\n");
+        fprintf(stderr, "  File: %s\n", file);
+        fprintf(stderr, "  Line: %d\n", line);
+        fprintf(stderr, "  Size: %zu bytes\n", size);
+        exit(EXIT_FAILURE);
+    }
+    return ptr;
+}
+
+#define mallocCheck(size) malloc_check(size, __FILE__, __LINE__)
+
+
+// ----------------------------------------------------------------------------
+// check that all tokens are within range
+extern inline void token_check(const int* tokens, int token_count, int vocab_size, const char *file, int line) {
+    for(int i = 0; i < token_count; i++) {
+        if(!(0 <= tokens[i] && tokens[i] < vocab_size)) {
+            fprintf(stderr, "Error: Token out of vocabulary at %s:%d\n", file, line);
+            fprintf(stderr, "Error details:\n");
+            fprintf(stderr, "  File: %s\n", file);
+            fprintf(stderr, "  Line: %d\n", line);
+            fprintf(stderr, "  Token: %d\n", tokens[i]);
+            fprintf(stderr, "  Position: %d\n", i);
+            fprintf(stderr, "  Vocab: %d\n", vocab_size);
+            exit(EXIT_FAILURE);
+        }
+    }
+}
+#define tokenCheck(tokens, count, vocab) token_check(tokens, count, vocab, __FILE__, __LINE__)
+
+// ----------------------------------------------------------------------------
+// I/O ops
+
+extern inline void create_dir_if_not_exists(const char *dir) {
+    if (dir == NULL) { return; }
+    struct stat st = {0};
+    if (stat(dir, &st) == -1) {
+        if (mkdir(dir, 0700) == -1) {
+            printf("ERROR: could not create directory: %s\n", dir);
+            exit(EXIT_FAILURE);
+        }
+        printf("created directory: %s\n", dir);
+    }
+}
+
+extern inline int find_max_step(const char* output_log_dir) {
+    // find the DONE file in the log dir with highest step count
+    if (output_log_dir == NULL) { return -1; }
+    DIR* dir;
+    struct dirent* entry;
+    int max_step = -1;
+    dir = opendir(output_log_dir);
+    if (dir == NULL) { return -1; }
+    while ((entry = readdir(dir)) != NULL) {
+        if (strncmp(entry->d_name, "DONE_", 5) == 0) {
+            int step = atoi(entry->d_name + 5);
+            if (step > max_step) {
+                max_step = step;
+            }
+        }
+    }
+    closedir(dir);
+    return max_step;
+}
+
+extern inline int ends_with_bin(const char* str) {
+    // checks if str ends with ".bin". could be generalized in the future.
+    if (str == NULL) { return 0; }
+    size_t len = strlen(str);
+    const char* suffix = ".bin";
+    size_t suffix_len = strlen(suffix);
+    if (len < suffix_len) { return 0; }
+    int suffix_matches = strncmp(str + len - suffix_len, suffix, suffix_len) == 0;
+    return suffix_matches;
+}
+
+#endif
\ No newline at end of file

From a89ab1c819c08cc2e78217b3dc0ae0a70e990591 Mon Sep 17 00:00:00 2001
From: Eamon <eamon112009@gmail.com>
Date: Sun, 7 Jun 2026 18:08:44 +0530
Subject: [PATCH 43/45] mfu: add GPU specifications database and utilities for
 MFU estimation

---
 CUDA/llmcpp/mfu.h | 244 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 244 insertions(+)
 create mode 100644 CUDA/llmcpp/mfu.h
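
Note on the per-GPU overrides: each gpu_db entry pairs a baseline PerfData table with a board-specific core count and clock. The code that consumes these entries is not shown in this excerpt, so the sketch below is only an assumption about how such overrides could be applied: it rescales a baseline throughput linearly in cores and clock, and passes through the -1 values that appear in the tables where an architecture has no figure for a precision mode. The helper name scale_peak_tflops is hypothetical.

```c
// Hypothetical helper: rescale a baseline throughput figure (in TFLOP/s) to a
// specific board. Linear scaling in core count and clock is an assumption here.
static float scale_peak_tflops(float baseline_tflops, float baseline_cores, float baseline_mhz,
                               float new_cores, float new_mhz) {
    if (baseline_tflops < 0.0f) {
        return -1.0f;  // negative table entries are treated as "no figure available"
    }
    return baseline_tflops * (new_cores / baseline_cores) * (new_mhz / baseline_mhz);
}
```

With the AMPERE_DATACENTER row (312 for BF16, 432 cores, 1410 MHz), an entry that overrides with the same 432 cores and 1410 MHz keeps the value at 312, while a hypothetical board with half the cores at the same clock would come out at 156.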

diff --git a/CUDA/llmcpp/mfu.h b/CUDA/llmcpp/mfu.h
new file mode 100644
index 0000000..1c40b7b
--- /dev/null
+++ b/CUDA/llmcpp/mfu.h
@@ -0,0 +1,244 @@
+#ifndef MFU_H
+#define MFU_H
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#if __has_include(<nvml.h>)
+#define USE_NVML 1
+#include <nvml.h>
+#else
+#define USE_NVML 0
+#endif
+
+// tied to enum PrecisionMode, in a future refactor make them the same
+#define MFUH_PRECISION_FP32 0
+#define MFUH_PRECISION_FP16 1
+#define MFUH_PRECISION_BF16 2
+
+#if USE_NVML
+inline void nvml_check(nvmlReturn_t status, const char *file, int line) {
+    if (status != NVML_SUCCESS) {
+        printf("[NVML ERROR] at file %s:%d:\n%s\n", file, line, nvmlErrorString(status));
+        exit(EXIT_FAILURE);
+    }
+}
+#define nvmlCheck(err) (nvml_check(err, __FILE__, __LINE__))
+#endif
+
+
+typedef struct {
+    float TF_32;       // tensor-core performance 32 bit
+    float BF_16_32;    // bf16 with 32 bit accumulate
+    float FP_16_32;    // fp16 with 32 bit accumulate
+    float FP_16_16;    // fp16 with 16 bit accumulate
+    float FP_8_32;     // and so on
+    float FP_8_16;
+    float CLOCK;        // clock frequency from the spec sheet
+    float CORES;        // #TCs from the spec sheet
+} PerfData;
+
+// basic default data from the nvidia whitepapers
+static const PerfData VOLTA = {125.0f, -1.f, 125.f, -1.f, -1.f, -1.f, 1530.f, 640.f};
+static const PerfData AMPERE_DATACENTER = {156.f, 312.f, 312.f, 312.f, -1.f, -1.f, 1410.f, 432.f};
+static const PerfData AMPERE_CONSUMER = {40.f, 80.f, 80.f, 160.f, -1.f, -1.f, 1860.f, 336.f};
+static const PerfData HOPPER = {378.f, 756.f, 756.f, 756.f, 1513.f, 1513.f, 1620.f, 456.f};
+static const PerfData ADA = {82.6f, 165.2f, 165.2f, 330.3f, 330.3f, 660.6f, 2520.f, 512.f};
+
+typedef struct {
+    const char* name;
+    const PerfData* perf_data;
+    float new_cores;
+    float new_mhz;
+} GPUEntry;
+
+// the overrides for each specific GPU
+static GPUEntry gpu_db[] = {
+    {"Tesla V100-SXM2-16GB", &VOLTA, 640, 1530},
+    {"Tesla V100-PCIE-32GB", &VOLTA, 640, 1530},
+    {"NVIDIA A100-PCIE-40GB", &AMPERE_DATACENTER, 432, 1410},
+    {"NVIDIA A100-PCIE-80GB", &AMPERE_DATACENTER, 432, 1410},
+    {"NVIDIA A100-SXM4-40GB", &AMPERE_DATACENTER, 432, 1410},
+    {"NVIDIA A100-SXM4-80GB", &AMPERE_DATACENTER, 432, 1410},
+    {"NVIDIA RTX A2000", &AMPERE_CONSUMER, 104, 1200},
+    {"NVIDIA RTX A4000", &AMPERE_CONSUMER, 192, 1560},
+    {"NVIDIA RTX A4500", &AMPERE_CONSUMER, 224, 1650},
+    {"NVIDIA RTX A5000", &AMPERE_CONSUMER, 256, 1695},
+    {"NVIDIA RTX A5500", &AMPERE_CONSUMER, 320, 1770},
+    {"NVIDIA RTX A6000", &AMPERE_CONSUMER, 336, 1800},
+    {"NVIDIA GeForce RTX 3090 Ti", &AMPERE_CONSUMER, 336, 1860},
+    {"NVIDIA GeForce RTX 3090", &AMPERE_CONSUMER, 328, 1695},
+    {"NVIDIA GeForce RTX 3080 Ti", &AMPERE_CONSUMER, 320, 1665},
+    {"NVIDIA GeForce RTX 3080", &AMPERE_CONSUMER, 272, 1710},
+    {"NVIDIA GeForce RTX 3070 Ti", &AMPERE_CONSUMER, 192, 1770},
+    {"NVIDIA GeForce RTX 3070", &AMPERE_CONSUMER, 184, 1725},
+    {"NVIDIA GeForce RTX 3060 Ti", &AMPERE_CONSUMER, 152, 1665},
+    {"NVIDIA GeForce RTX 3060", &AMPERE_CONSUMER, 112, 1777},
+    {"NVIDIA RTX A2000 ADA", &ADA, 88, 2130},
+    {"NVIDIA RTX A4000 ADA", &ADA, 192, 2175},
+    {"NVIDIA RTX A4500 ADA", &ADA, 224, 2580},
+    {"NVIDIA RTX A5000 ADA", &ADA, 400, 2550},
+    {"NVIDIA RTX A5880 ADA", &ADA, 440, 2460},
+    {"NVIDIA RTX A6000 ADA", &ADA, 568, 2505},
+    {"NVIDIA GeForce RTX 4090", &ADA, 512, 2520},
+    {"NVIDIA GeForce RTX 4080 SUPER", &ADA, 320, 2550},
+    {"NVIDIA GeForce RTX 4080", &ADA, 304, 2505},
+    {"NVIDIA GeForce RTX 4070 Ti SUPER", &ADA, 264, 2610},
+    {"NVIDIA GeForce RTX 4070 Ti", &ADA, 240, 2610},
+    {"NVIDIA GeForce RTX 4070 SUPER", &ADA, 224, 2475},
+    {"NVIDIA GeForce RTX 4070", &ADA, 184, 2475},
+    {"NVIDIA GeForce RTX 4070", &ADA, 184, 2475},
+    {"NVIDIA GeForce RTX 4060 Ti", &ADA, 136, 2535},
+    {"NVIDIA GeForce RTX 4060", &ADA, 96, 2460},
+    {"NVIDIA H100 PCIe", &HOPPER, 456, 1620},
+    {"NVIDIA H100 80GB HBM3", &HOPPER, 528, 1830}, // HBM3 = SXM5
+};
+
+float get_flops_promised(const char* device, int precision_mode) {
+    /*
+    This function is used to estimate the Model Flops Utilization (MFU).
+    Basically, we have to figure out how many flops the GPU can do per second.
+    Note that this is not a simple endeavor and may well go wrong! The details are tricky.
+    The returned value is in units of 1e12 flops per second (TFLOPS).
+
+    For the non-top models, actual performance numbers aren't that easy to find, e.g.,
+    at https://www.techpowerup.com/gpu-specs/rtx-a4000.c3756 the "Theoretical Performance"
+    seems to be without tensor cores.
+
+    So instead we rely on the fact that all these cards use the same types of tensor cores, just in
+    different numbers and at different frequencies. Then we just need to look up these two easily
+    accessible numbers for all the other GPUs.
+    Linear scaling seems to work. Comparing spec sheet and calculation:
+    4080: 304 TCs, 2505 MHz; 97.5 TFlops = 165.2/512*304 /2520 * 2505
+
+    Original numbers for the top GPUs are from:
+    https://resources.nvidia.com/en-us-tensor-core
+    https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf
+    */
+
+    // validate the precision mode as one of the three possible values
+    if (!(precision_mode == MFUH_PRECISION_FP32 || precision_mode == MFUH_PRECISION_FP16 || precision_mode == MFUH_PRECISION_BF16)) {
+        fprintf(stderr, "Invalid precision mode: %d\n", precision_mode);
+        return -1.0f;
+    }
+
+    // do a linear search until you find our GPU, then calculate the flops promised
+    int num_gpu_entries = sizeof(gpu_db) / sizeof(gpu_db[0]);
+    for (int i = 0; i < num_gpu_entries; i++) {
+        if (strcmp(gpu_db[i].name, device) == 0) {
+            const PerfData* perf_data = gpu_db[i].perf_data;
+
+            // look up the default flops value for the given precision mode
+            float value = -1.0f;
+            if (precision_mode == MFUH_PRECISION_BF16) { value = perf_data->BF_16_32; }
+            if (precision_mode == MFUH_PRECISION_FP32) { value = perf_data->TF_32; }
+            if (precision_mode == MFUH_PRECISION_FP16) { value = perf_data->FP_16_32; }
+
+            // we'd get here if we're e.g. trying to use BF16 on Volta GPU or something...
+            if (value < 0.0f) {
+                fprintf(stderr, "No data for GPU %s and precision mode %d\n", device, precision_mode);
+                return -1.0f;
+            }
+
+            // adjust flops based on the specific core count and clock frequency of this GPU
+            float new_cores = gpu_db[i].new_cores;
+            float new_mhz = gpu_db[i].new_mhz;
+            float adjusted = value * (new_cores / perf_data->CORES) * (new_mhz / perf_data->CLOCK);
+            return adjusted;
+        }
+    }
+
+    return -1.0f; // ¯\_(ツ)_/¯
+}
+
+struct GPUUtilInfo {
+    unsigned int clock;
+    unsigned int max_clock;
+    unsigned int power;
+    unsigned int power_limit;
+    unsigned int fan;
+    unsigned int temperature;
+    unsigned int temp_slowdown;
+
+    float gpu_utilization;
+    float mem_utilization;
+    const char* throttle_reason;
+};
+
+// lazily initialize nvml and generate a handle to the GPU
+#if USE_NVML
+nvmlDevice_t nvml_get_device() {
+    static bool needs_init = true;
+    static nvmlDevice_t device;
+    if(needs_init) {
+        needs_init = false;
+        nvmlCheck(nvmlInit());
+        nvmlCheck(nvmlDeviceGetHandleByIndex_v2(0, &device));
+    }
+    return device;
+}
+
+// convert throttle reason bitfield into a text reason.
+// this is a lossy conversion; we just want to give some idea of what is happening
+const char* get_throttle_reason(unsigned long long bits) {
+    if(bits & (nvmlClocksThrottleReasonSwPowerCap | nvmlClocksThrottleReasonHwPowerBrakeSlowdown)) {
+        return "power cap";
+    } else if (bits & (nvmlClocksThrottleReasonSwThermalSlowdown | nvmlClocksThrottleReasonHwThermalSlowdown)) {
+        return "thermal cap";
+    } else if (bits & (nvmlClocksThrottleReasonAll)) {
+        return "other cap";
+    } else {
+        return "no cap";
+    }
+}
+
+// gather data for a GPUUtilInfo object
+GPUUtilInfo get_gpu_utilization_info() {
+    GPUUtilInfo info;
+    nvmlDevice_t device = nvml_get_device();
+    // query different infos directly
+    nvmlCheck(nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &info.clock));
+    nvmlCheck(nvmlDeviceGetMaxClockInfo(device, NVML_CLOCK_SM, &info.max_clock));
+    nvmlCheck(nvmlDeviceGetPowerManagementLimit(device, &info.power_limit));
+    nvmlCheck(nvmlDeviceGetPowerUsage(device, &info.power));
+    nvmlCheck(nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &info.temperature));
+    nvmlCheck(nvmlDeviceGetTemperatureThreshold(device, NVML_TEMPERATURE_THRESHOLD_SLOWDOWN, &info.temp_slowdown));
+    unsigned long long throttle;
+    nvmlCheck(nvmlDeviceGetCurrentClocksThrottleReasons(device, &throttle));
+    info.throttle_reason = get_throttle_reason(throttle);
+    nvmlCheck(nvmlDeviceGetFanSpeed(device, &info.fan));
+
+    // for "utilization", we look at recorded samples. In principle, we could query the driver for how many samples
+    // to request, but then we'd need to dynamically allocate sufficient space. Let's just hard-code a limit of 128,
+    // and have no memory management required
+    constexpr const int BUFFER_LIMIT = 128;
+    nvmlSample_t buffer[BUFFER_LIMIT];
+    nvmlValueType_t v_type;
+    unsigned int sample_count = BUFFER_LIMIT;
+    nvmlCheck(nvmlDeviceGetSamples(device, NVML_GPU_UTILIZATION_SAMPLES, 0, &v_type, &sample_count, buffer));
+    float gpu_utilization = 0.f;
+    for(unsigned i = 0; i < sample_count; ++i) {
+        gpu_utilization += (float)buffer[i].sampleValue.uiVal;
+    }
+    gpu_utilization = (sample_count > 0) ? gpu_utilization / (float)sample_count : 0.f; // avoid 0/0 if no samples
+
+    // sample count may have been modified by the query above; reset back to buffer size
+    sample_count = BUFFER_LIMIT;
+    nvmlCheck(nvmlDeviceGetSamples(device, NVML_MEMORY_UTILIZATION_SAMPLES, 0, &v_type, &sample_count, buffer));
+    float mem_utilization = 0.f;
+    for(unsigned i = 0; i < sample_count; ++i) {
+        mem_utilization += (float)buffer[i].sampleValue.uiVal;
+    }
+    mem_utilization = (sample_count > 0) ? mem_utilization / (float)sample_count : 0.f; // avoid 0/0 if no samples
+
+    info.gpu_utilization = gpu_utilization;
+    info.mem_utilization = mem_utilization;
+    return info;
+}
+#else
+GPUUtilInfo get_gpu_utilization_info() {
+    fprintf(stderr, "Error: Compiled without nvml support. Cannot perform additional GPU state tracking.\n");
+    exit(EXIT_FAILURE);
+}
+#endif
+#endif // MFU_H

From fd41e1b0e916eec812068f9a422ac07cf70dd6b7 Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Mon, 8 Jun 2026 01:06:55 +0530
Subject: [PATCH 44/45] Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6a0931c..b544f73 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Quadtrix.cpp
+# Quadtrix.cpp (llm.cpp)
 
 <h1 align="center">
   <img width="785" height="261" alt="image" src="https://github.com/user-attachments/assets/7bd2d8c6-d1e3-4ca0-96c0-0161d3cf235a" />

From d5cadb603f562f7db4c9a0fe5560a7884a256d63 Mon Sep 17 00:00:00 2001
From: Eamon Sippy <eamon112009@gmail.com>
Date: Mon, 8 Jun 2026 13:08:20 +0530
Subject: [PATCH 45/45] Update README to remove image and clean up content

Removed image from README and adjusted formatting.
---
 README.md | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/README.md b/README.md
index 888604e..b441cf4 100644
--- a/README.md
+++ b/README.md
@@ -8,11 +8,6 @@
   [![Release](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml)  [![Package](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml)
   [![CI](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml)
 
-
-<p align="center">
-  <img width="785" height="261" alt="image" src="https://github.com/user-attachments/assets/7bd2d8c6-d1e3-4ca0-96c0-0161d3cf235a" />
-</p>
-
 A local large language model with a modular, multi-path execution architecture. Train, run inference, and serve a chat interface — all from a single repository, across bare-metal C++, PyTorch, and a React frontend.
 
 > Full technical reference: [docs](https://eamon2009.github.io/LLMs/)