From 4ebd73f067cf52f46b7774238107dd95d0d14224 Mon Sep 17 00:00:00 2001 From: Eamon Date: Sun, 31 May 2026 19:26:54 +0530 Subject: [PATCH 01/45] exp(#58) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * refactor(ui): complete inline-style migration and 
update auto-scroll implementation * refactor(ui): complete inline-style migration for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. 
* Update README.md with new banner for Quadtrix.cpp --------- Co-authored-by: Max --- Dockerfile => .devops/Dockerfile | 2 +- Dockerfile.cuda => .devops/Dockerfile.backend | 0 .devops/Dockerfile.cpp | 65 + .devops/nginx.conf | 47 + .dockerignore | 57 +- .github/workflows/ci.yml | 238 +- .github/workflows/docker-publish.yml | 163 +- .github/workflows/pr-check.yml | 238 ++ CUDA/main.cu | 2070 +++++++++++++++++ LICENSE | 2 +- Makefile | 104 + README.md | 4 + config/config.h | 20 +- docker-compose.dev.yml | 45 + docker-compose.gpu.yml | 32 + docker-compose.yml | 181 +- frontend/src/components/chat/EmptyState.tsx | 96 +- .../src/components/chat/MessageAvatar.tsx | 45 +- frontend/src/components/chat/MessageList.tsx | 21 +- frontend/src/components/chat/MessageRow.tsx | 87 +- .../src/components/chat/ThinkingIndicator.tsx | 28 +- include/tensor.h | 749 ++++-- main.cpp | 193 +- run.md | 492 ---- scripts/build.sh | 161 ++ 25 files changed, 4077 insertions(+), 1063 deletions(-) rename Dockerfile => .devops/Dockerfile (94%) rename Dockerfile.cuda => .devops/Dockerfile.backend (100%) create mode 100644 .devops/Dockerfile.cpp create mode 100644 .devops/nginx.conf create mode 100644 .github/workflows/pr-check.yml create mode 100644 CUDA/main.cu create mode 100644 Makefile create mode 100644 docker-compose.dev.yml create mode 100644 docker-compose.gpu.yml delete mode 100644 run.md create mode 100644 scripts/build.sh diff --git a/Dockerfile b/.devops/Dockerfile similarity index 94% rename from Dockerfile rename to .devops/Dockerfile index 65fcca9..c7c0061 100644 --- a/Dockerfile +++ b/.devops/Dockerfile @@ -35,4 +35,4 @@ COPY . . 
ENV PATH="/app/venv/bin:$PATH" ENV PYTHONUNBUFFERED=1 -ENTRYPOINT ["python3", "engine/main.py"] \ No newline at end of file +ENTRYPOINT ["python3", "engine/main.py"] diff --git a/Dockerfile.cuda b/.devops/Dockerfile.backend similarity index 100% rename from Dockerfile.cuda rename to .devops/Dockerfile.backend diff --git a/.devops/Dockerfile.cpp b/.devops/Dockerfile.cpp new file mode 100644 index 0000000..0a1ce15 --- /dev/null +++ b/.devops/Dockerfile.cpp @@ -0,0 +1,65 @@ + +FROM ubuntu:24.04 AS builder + +LABEL stage=builder + +ARG DEBIAN_FRONTEND=noninteractive +ARG BUILD_TYPE=Release +ARG CMAKE_EXTRA_FLAGS="" + +RUN apt-get update && apt-get install -y --no-install-recommends \ + build-essential \ + g++ \ + cmake \ + ninja-build \ + ccache \ + git \ + ca-certificates \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /src + +COPY main.cpp ./ +COPY benchmark.cpp ./ +COPY config/ ./config/ +COPY include/ ./include/ +COPY data/ ./data/ + +# If model/Cmakelists.txt exists, use cmake; else fall back to direct g++ +RUN set -e; \ + if [ -f model/Cmakelists.txt ] || [ -f CMakeLists.txt ]; then \ + cmake -B build -G Ninja \ + -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \ + -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \ + ${CMAKE_EXTRA_FLAGS} .; \ + cmake --build build --parallel "$(nproc)"; \ + else \ + g++ -std=c++17 -O3 -march=native \ + -I. 
-Iinclude \ + -o /usr/local/bin/quadtrix \ + main.cpp; \ + fi +FROM ubuntu:24.04 AS runtime + +LABEL org.opencontainers.image.title="Quadtrix.cpp Engine" +LABEL org.opencontainers.image.description="C++ transformer engine for local LM inference" +LABEL org.opencontainers.image.source="https://github.com/Eamon2009/Quadtrix.cpp" + +RUN apt-get update && apt-get install -y --no-install-recommends \ + libstdc++6 \ + libgomp1 \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /app + +COPY --from=builder /usr/local/bin/quadtrix /usr/local/bin/quadtrix +COPY --from=builder /src/data/ ./data/ +VOLUME ["/models"] + +ENV GPT_DATA_PATH=/app/data/input.txt \ + GPT_MODEL_PATH=/models/best_model.bin + +EXPOSE 8080 + +ENTRYPOINT ["/usr/local/bin/quadtrix"] +CMD ["data/input.txt", "--chat"] diff --git a/.devops/nginx.conf b/.devops/nginx.conf new file mode 100644 index 0000000..5804e6e --- /dev/null +++ b/.devops/nginx.conf @@ -0,0 +1,47 @@ +# Quadtrix.cpp — Nginx config +# Serves the Vite SPA and proxies /api/* to the FastAPI backend + +server { + listen 80; + server_name _; + + root /usr/share/nginx/html; + index index.html; + + # Gzip + gzip on; + gzip_types text/plain text/css application/json application/javascript + text/xml application/xml application/xml+rss text/javascript + application/wasm; + gzip_min_length 1024; + + # SPA fallback — all unknown routes return index.html + location / { + try_files $uri $uri/ /index.html; + } + + # Proxy API calls to FastAPI backend + location /api/ { + proxy_pass http://backend:3001; + proxy_http_version 1.1; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_set_header X-Forwarded-Proto $scheme; + proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection "upgrade"; + proxy_read_timeout 120s; + proxy_send_timeout 120s; + } + + # Static asset cache + location ~* \.(js|css|png|svg|ico|woff2|woff|ttf|webmanifest)$ { + expires 1y; + 
add_header Cache-Control "public, immutable"; + } + + # Service worker must not be cached + location = /sw.js { + add_header Cache-Control "no-cache"; + } +} diff --git a/.dockerignore b/.dockerignore index f001789..603874e 100644 --- a/.dockerignore +++ b/.dockerignore @@ -1,35 +1,44 @@ + .git .gitignore .github .venv -**/__pycache__ -**/*.pyc -**/*.pyo -**/*.pyd -engine/logs/ +__pycache__ +*.pyc +*.pyo +*.pyd +*.egg-info +.pytest_cache +.ruff_cache +dist/ +build/ +*.egg node_modules frontend/node_modules -.npm-cache -frontend/.vite frontend/dist - -# Model weights -*.pt -*.bin -models/ - -# Windows build artifacts -*.exe +frontend/.vite +*.npm-cache +.npmignore +*.o +*.a +*.so +*.dylib quadtrix.exe -*.png -*.jpg -*.jpeg -*.md -LICENSE -contributing.md -SECURITY.md -run.md +quadtrix +build/ +cmake-build-*/ +.vscode +*.bin +*.pt +*.gguf +*.safetensors +engine/best_model.pt +engine/logs/ +engine/fineweb_30mb.txt +data/input.txt .DS_Store Thumbs.db +*.swp +*.swo .idea -.vscode \ No newline at end of file +docker-compose.override.yml diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 311ad33..bf49286 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -2,74 +2,216 @@ name: CI on: push: - branches: - - exp - - master - pull_request: - -permissions: - contents: read + branches: [master, dev] + workflow_dispatch: + inputs: + image: + description: "Which image to build?" + required: true + type: choice + options: + - cpp + - cpu + - cuda + - all + push: + description: "Push to ghcr.io?" 
+ required: true + default: "true" + type: choice + options: ["true", "false"] + +env: + REGISTRY: ghcr.io + IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix jobs: - cpp-build: - name: C++ build + + file-integrity: + name: File integrity + if: github.event_name == 'push' runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - name: Check required files exist + run: | + files=( + "main.cpp" + "engine/main.py" + "requirements.txt" + ) + failed=0 + for f in "${files[@]}"; do + if [ -f "$f" ]; then + echo "✅ $f" + else + echo "❌ $f — MISSING" + failed=1 + fi + done + exit $failed + + + lint-python: + name: Python lint + if: github.event_name == 'push' + runs-on: ubuntu-latest steps: - - name: Check out repository - uses: actions/checkout@v4 + - uses: actions/checkout@v4 - - name: Install compiler - run: sudo apt-get update && sudo apt-get install -y g++ + - name: Lint engine/ (ruff) + uses: chartboost/ruff-action@v1 + with: + args: "check engine/ --ignore E501 --exit-zero" - - name: Build Quadtrix - run: g++ -std=c++17 -O2 -I. -Iinclude -o quadtrix main.cpp - backend-smoke: - name: Backend smoke checks + build-cpp: + name: C++ compile check + if: github.event_name == 'push' runs-on: ubuntu-latest - steps: - - name: Check out repository - uses: actions/checkout@v4 + - uses: actions/checkout@v4 - - name: Set up Python - uses: actions/setup-python@v5 - with: - python-version: "3.11" + - name: Install g++ + run: sudo apt-get update && sudo apt-get install -y g++ - - name: Install backend runtime dependencies + - name: Compile main.cpp run: | - python -m pip install --upgrade pip - pip install fastapi "uvicorn[standard]" pydantic pydantic-settings httpx redis + g++ -std=c++17 -O3 \ + -I. 
-Iinclude \ + -o quadtrix main.cpp - - name: Compile Python sources - run: python -m compileall backend engine + - name: Smoke test + run: ./quadtrix --help || true - - name: Import FastAPI application - working-directory: backend - run: | - python -c "from main import app; print(app.title)" - frontend-build: - name: Frontend build + build-cpp-image: + name: Build — cpp + if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') runs-on: ubuntu-latest + permissions: + contents: read + packages: write + steps: + - uses: actions/checkout@v4 + + - uses: docker/setup-qemu-action@v3 + - uses: docker/setup-buildx-action@v3 + + - name: Login to GHCR + if: inputs.push == 'true' + uses: docker/login-action@v3 + with: + registry: ${{ env.REGISTRY }} + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - name: Extract metadata + id: meta + uses: docker/metadata-action@v5 + with: + images: ${{ env.IMAGE_PREFIX }}-cpp + tags: | + type=ref,event=branch + type=sha,prefix=sha- + type=raw,value=latest,enable={{is_default_branch}} + + - name: Build & push + uses: docker/build-push-action@v6 + with: + context: . 
+ file: .devops/Dockerfile.cpp + platforms: linux/amd64,linux/arm64 + push: ${{ inputs.push == 'true' }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + cache-from: type=gha,scope=cpp + cache-to: type=gha,mode=max,scope=cpp + + + build-cpu-image: + name: Build — cpu + if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all') + runs-on: ubuntu-latest + permissions: + contents: read + packages: write steps: - - name: Check out repository - uses: actions/checkout@v4 + - uses: actions/checkout@v4 + + - uses: docker/setup-qemu-action@v3 + - uses: docker/setup-buildx-action@v3 + + - name: Login to GHCR + if: inputs.push == 'true' + uses: docker/login-action@v3 + with: + registry: ${{ env.REGISTRY }} + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} - - name: Set up Node.js - uses: actions/setup-node@v4 + - name: Extract metadata + id: meta + uses: docker/metadata-action@v5 with: - node-version: "20" - cache: "npm" - cache-dependency-path: frontend/package-lock.json + images: ${{ env.IMAGE_PREFIX }}-cpu + tags: | + type=ref,event=branch + type=sha,prefix=sha- + type=raw,value=latest,enable={{is_default_branch}} + + - name: Build & push + uses: docker/build-push-action@v6 + with: + context: . 
+ file: .devops/Dockerfile + platforms: linux/amd64,linux/arm64 + push: ${{ inputs.push == 'true' }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + cache-from: type=gha,scope=cpu + cache-to: type=gha,mode=max,scope=cpu + + + build-cuda-image: + name: Build — cuda + if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all') + runs-on: ubuntu-latest + permissions: + contents: read + packages: write + steps: + - uses: actions/checkout@v4 - - name: Install frontend dependencies - working-directory: frontend - run: npm ci + - uses: docker/setup-buildx-action@v3 - - name: Build frontend - working-directory: frontend - run: npm run build + - name: Login to GHCR + if: inputs.push == 'true' + uses: docker/login-action@v3 + with: + registry: ${{ env.REGISTRY }} + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + + - name: Extract metadata + id: meta + uses: docker/metadata-action@v5 + with: + images: ${{ env.IMAGE_PREFIX }}-cuda + tags: | + type=ref,event=branch + type=sha,prefix=sha- + type=raw,value=latest,enable={{is_default_branch}} + + - name: Build & push + uses: docker/build-push-action@v6 + with: + context: . + file: .devops/Dockerfile.backend + platforms: linux/amd64 + push: ${{ inputs.push == 'true' }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + cache-from: type=gha,scope=cuda + cache-to: type=gha,mode=max,scope=cuda \ No newline at end of file diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml index 1431739..ca9493f 100644 --- a/.github/workflows/docker-publish.yml +++ b/.github/workflows/docker-publish.yml @@ -1,73 +1,132 @@ -name: Publish Docker image +name: Release + on: - workflow_dispatch: -concurrency: - group: ${{ github.workflow }}-${{ github.ref }} - cancel-in-progress: true + workflow_dispatch: + inputs: + version: + description: "Version tag (e.g. 
1.2.3)" + required: true + env: REGISTRY: ghcr.io + IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix + jobs: - build-and-push: - name: Build & push (${{ matrix.variant }}) - runs-on: ubuntu-latest - permissions: - contents: read - packages: write + + build-binaries: + name: Binary (${{ matrix.os }}) + runs-on: ${{ matrix.os }} strategy: - fail-fast: false matrix: + os: [ubuntu-22.04, macos-14] include: - - variant: cpu - dockerfile: Dockerfile - tag_suffix: "" - - variant: cuda - dockerfile: Dockerfile.cuda - tag_suffix: "-cuda" + - os: ubuntu-22.04 + artifact_name: quadtrix-linux-x64 + binary: quadtrix + - os: macos-14 + artifact_name: quadtrix-macos-arm64 + binary: quadtrix steps: - - name: Checkout repository - uses: actions/checkout@v4 - - name: Set lowercase image name - id: image + - uses: actions/checkout@v4 + + - name: Compile (Linux) + if: runner.os == 'Linux' + run: | + sudo apt-get update && sudo apt-get install -y g++ + g++ -std=c++17 -O3 -march=native \ + -I. -Iinclude \ + -o ${{ matrix.binary }} main.cpp + strip ${{ matrix.binary }} + + - name: Compile (macOS) + if: runner.os == 'macOS' + run: | + g++ -std=c++17 -O3 -march=native \ + -I. -Iinclude \ + -o ${{ matrix.binary }} main.cpp + + - name: Package run: | - echo "name=$(echo '${{ github.repository }}' | tr '[:upper:]' '[:lower:]')" >> $GITHUB_OUTPUT - - name: Set up QEMU - uses: docker/setup-qemu-action@v3 - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - name: Log in to ghcr.io + mkdir dist + cp ${{ matrix.binary }} dist/ + cp README.md LICENSE dist/ + tar -czf ${{ matrix.artifact_name }}.tar.gz -C dist . 
+ + - name: Upload to Release + uses: softprops/action-gh-release@v2 + with: + tag_name: v${{ github.event.inputs.version }} + files: ${{ matrix.artifact_name }}.tar.gz + generate_release_notes: true + + publish-images: + name: Publish Docker images + runs-on: ubuntu-latest + permissions: + contents: read + packages: write + steps: + - uses: actions/checkout@v4 + + - uses: docker/setup-qemu-action@v3 + - uses: docker/setup-buildx-action@v3 + + - name: Login to GHCR uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} username: ${{ github.actor }} password: ${{ secrets.GITHUB_TOKEN }} - - name: Extract Docker metadata - id: meta - uses: docker/metadata-action@v5 + + - name: Parse tag + id: tag + run: echo "VERSION=${{ github.event.inputs.version }}" >> $GITHUB_OUTPUT + + - name: Build & push backend + uses: docker/build-push-action@v6 with: - images: ${{ env.REGISTRY }}/${{ steps.image.outputs.name }} + context: . + file: .devops/Dockerfile.backend + platforms: linux/amd64,linux/arm64 + push: true tags: | - type=raw,value=latest${{ matrix.tag_suffix }},enable={{is_default_branch}} - type=semver,pattern={{version}},suffix=${{ matrix.tag_suffix }} - type=semver,pattern={{major}}.{{minor}},suffix=${{ matrix.tag_suffix }} - type=ref,event=pr,suffix=${{ matrix.tag_suffix }} - - name: Free disk space - if: matrix.variant == 'cuda' - run: | - sudo rm -rf /usr/share/dotnet - sudo rm -rf /opt/ghc - sudo rm -rf /usr/local/share/boost - df -h - - name: Build and push Docker image + ${{ env.IMAGE_PREFIX }}-backend:latest + ${{ env.IMAGE_PREFIX }}-backend:${{ steps.tag.outputs.VERSION }} + cache-from: type=gha,scope=backend + cache-to: type=gha,mode=max,scope=backend + + - name: Build & push frontend uses: docker/build-push-action@v6 with: context: . 
- file: ./${{ matrix.dockerfile }} + file: .devops/Dockerfile.frontend + platforms: linux/amd64,linux/arm64 push: true - tags: ${{ steps.meta.outputs.tags }} - labels: ${{ steps.meta.outputs.labels }} - cache-from: type=gha,scope=${{ matrix.variant }} - cache-to: type=gha,mode=max,scope=${{ matrix.variant }} - - name: Image published + tags: | + ${{ env.IMAGE_PREFIX }}-frontend:latest + ${{ env.IMAGE_PREFIX }}-frontend:${{ steps.tag.outputs.VERSION }} + cache-from: type=gha,scope=frontend + cache-to: type=gha,mode=max,scope=frontend + + - name: Build & push cpp + uses: docker/build-push-action@v6 + with: + context: . + file: .devops/Dockerfile.cpp + platforms: linux/amd64,linux/arm64 + push: true + tags: | + ${{ env.IMAGE_PREFIX }}-cpp:latest + ${{ env.IMAGE_PREFIX }}-cpp:${{ steps.tag.outputs.VERSION }} + cache-from: type=gha,scope=cpp + cache-to: type=gha,mode=max,scope=cpp + + - name: Create Release summary run: | - echo "[${{ matrix.variant }}] published:" - echo " docker pull ${{ env.REGISTRY }}/${{ steps.image.outputs.name }}:latest${{ matrix.tag_suffix }}" + echo "## Docker images published" >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + echo "| Image | Tags |" >> $GITHUB_STEP_SUMMARY + echo "|-------|------|" >> $GITHUB_STEP_SUMMARY + echo "| \`quadtrix-backend\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY + echo "| \`quadtrix-frontend\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY + echo "| \`quadtrix-cpp\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY diff --git a/.github/workflows/pr-check.yml b/.github/workflows/pr-check.yml new file mode 100644 index 0000000..c52ae09 --- /dev/null +++ b/.github/workflows/pr-check.yml @@ -0,0 +1,238 @@ +name: PR Checks + +on: + issue_comment: + types: [created] + +jobs: + slash-command: + name: Parse /run-checks + if: | + github.event.issue.pull_request != null && + contains(github.event.comment.body, 
'/run-checks') + runs-on: ubuntu-latest + outputs: + pr-sha: ${{ steps.get-sha.outputs.sha }} + steps: + - name: Check commenter permission + uses: actions/github-script@v7 + with: + script: | + const { data } = await github.rest.repos.getCollaboratorPermissionLevel({ + owner: context.repo.owner, + repo: context.repo.repo, + username: context.actor, + }); + if (!['admin', 'write'].includes(data.permission)) { + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + body: `@${context.actor} Only maintainers can trigger checks.`, + }); + core.setFailed('Unauthorized'); + } + + - name: React with rocket + uses: actions/github-script@v7 + with: + script: | + await github.rest.reactions.createForIssueComment({ + owner: context.repo.owner, + repo: context.repo.repo, + comment_id: ${{ github.event.comment.id }}, + content: 'rocket', + }); + + - name: Get PR head SHA + id: get-sha + uses: actions/github-script@v7 + with: + script: | + const { data: pr } = await github.rest.pulls.get({ + owner: context.repo.owner, + repo: context.repo.repo, + pull_number: context.issue.number, + }); + core.setOutput('sha', pr.head.sha); + + + lint: + name: Lint + needs: slash-command + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ needs.slash-command.outputs.pr-sha }} + + - name: C++ format check + run: | + sudo apt-get install -y clang-format + find . 
-name "*.cpp" -o -name "*.h" | grep -v "build/" | \ + xargs clang-format --dry-run --Werror --style=LLVM || true + + - name: Python lint (ruff) + uses: chartboost/ruff-action@v1 + with: + args: "check engine/ --ignore E501 --exit-zero" + + - name: TypeScript lint (eslint) + working-directory: frontend + run: | + npm ci --prefer-offline + npx eslint src/ --ext .ts,.tsx --max-warnings 20 || true + + + build-cpp: + name: Build C++ (${{ matrix.os }}) + needs: slash-command + runs-on: ${{ matrix.os }} + strategy: + fail-fast: false + matrix: + os: [ubuntu-22.04, ubuntu-24.04, macos-14] + include: + - os: ubuntu-22.04 + artifact: quadtrix-linux-x64 + - os: ubuntu-24.04 + artifact: quadtrix-linux-x64-noble + - os: macos-14 + artifact: quadtrix-macos-arm64 + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ needs.slash-command.outputs.pr-sha }} + + - name: Install GCC (Linux) + if: runner.os == 'Linux' + run: sudo apt-get update && sudo apt-get install -y g++ ccache + + - name: Cache ccache + uses: actions/cache@v4 + with: + path: ~/.ccache + key: ccache-${{ matrix.os }}-${{ hashFiles('**/*.cpp', '**/*.h') }} + restore-keys: ccache-${{ matrix.os }}- + + - name: Compile main.cpp + run: | + g++ -std=c++17 -O3 -march=native \ + -I. -Iinclude \ + -o quadtrix main.cpp + + - name: Smoke test + run: ./quadtrix --help || true + + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: ${{ matrix.artifact }} + path: quadtrix + retention-days: 7 + + + validate-dockerfiles: + name: Validate Dockerfiles + needs: slash-command + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ needs.slash-command.outputs.pr-sha }} + + + - name: Check required files exist + run: | + echo "Checking files referenced by Dockerfiles..." 
+ files=( + "main.cpp" + "engine/main.py" + "requirements.txt" + ) + failed=0 + for f in "${files[@]}"; do + if [ -f "$f" ]; then + echo "✅ $f" + else + echo "❌ $f — MISSING" + failed=1 + fi + done + exit $failed + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + + - name: Build check — Dockerfile.cpp (C++ engine) + uses: docker/build-push-action@v6 + with: + context: . + file: .devops/Dockerfile.cpp + platforms: linux/amd64 + push: false + cache-from: type=gha,scope=cpp + cache-to: type=gha,mode=max,scope=cpp + + + - name: Build check — Dockerfile (PyTorch CPU) + uses: docker/build-push-action@v6 + with: + context: . + file: .devops/Dockerfile + platforms: linux/amd64 + push: false + cache-from: type=gha,scope=cpu + cache-to: type=gha,mode=max,scope=cpu + + - name: Skip CUDA build check + run: echo "CUDA build skipped on PR checks — run publish-docker workflow to build cuda image." + + + test-frontend: + name: Frontend Tests + needs: [slash-command, lint] + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + with: + ref: ${{ needs.slash-command.outputs.pr-sha }} + + - uses: actions/setup-node@v4 + with: + node-version: "20" + cache: npm + cache-dependency-path: frontend/package-lock.json + + - name: Install + working-directory: frontend + run: npm ci --prefer-offline + + - name: Type-check + working-directory: frontend + run: npx tsc --noEmit + + - name: Build check + working-directory: frontend + run: npm run build + + + post-result: + name: Post result + needs: [slash-command, lint, build-cpp, validate-dockerfiles, test-frontend] + runs-on: ubuntu-latest + if: always() + steps: + - uses: actions/github-script@v7 + with: + script: | + const jobs = ${{ toJSON(needs) }}; + const failed = Object.values(jobs).some(j => j.result === 'failure'); + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + body: failed + ? 
' Some checks failed — see Actions for details.' + : ' All checks passed!', + }); \ No newline at end of file diff --git a/CUDA/main.cu b/CUDA/main.cu new file mode 100644 index 0000000..4b24fec --- /dev/null +++ b/CUDA/main.cu @@ -0,0 +1,2070 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include "llmcpp/utils.h" + +#include "llmcpp/tokenizer.h" + +#include "llmcpp/dataloader.h" + +#include "llmcpp/rand.h" + +#include "llmcpp/schedulers.h" + +#include "llmcpp/sampler.h" + +#include "llmcpp/logger.h" + +#include "llmcpp/mfu.h" + +#include "llmcpp/outlier_detector.h" + +#include "llmcpp/cuda_common.h" + +#include "llmcpp/cuda_utils.cuh" + +#include "llmcpp/cublas_common.h" + +#include "llmcpp/encoder.cuh" + +#include "llmcpp/layernorm.cuh" + +#include "llmcpp/matmul.cuh" +#ifdef ENABLE_CUDNN + +#include "llmcpp/cudnn_att.h" +#else + +#include "llmcpp/attention.cuh" +#endif + +#include "llmcpp/fused_classifier.cuh" + +#include "llmcpp/adamw.cuh" + +#include "llmcpp/global_norm.cuh" + +#include "llmcpp/zero.cuh" + +char filename_buffer[512]; + +cudaDeviceProp deviceProp; +cudaStream_t main_stream; + +constexpr const size_t IO_BUF_SIZE = 32 * 1024 * 1024; + +typedef struct +{ + int max_seq_len; + int vocab_size; + int padded_vocab_size; + int num_layers; + int num_heads; + int channels; +} GPT2Config; + +constexpr const int NUM_PARAMETER_TENSORS = 16; +typedef struct +{ + floatX *wte; + floatX *wpe; + floatX *ln1w; + floatX *ln1b; + floatX *qkvw; + floatX *qkvb; + floatX *attprojw; + floatX *attprojb; + floatX *ln2w; + floatX *ln2b; + floatX *fcw; + floatX *fcb; + floatX *fcprojw; + floatX *fcprojb; + floatX *lnfw; + floatX *lnfb; +} ParameterTensors; +static_assert(sizeof(ParameterTensors) == NUM_PARAMETER_TENSORS * sizeof(void *), "Inconsistent sizes!"); + +void fill_in_parameter_sizes(size_t *param_sizes, size_t *param_sizeof, GPT2Config config) +{ + size_t Vp = config.padded_vocab_size; + size_t C = config.channels; + size_t 
maxT = config.max_seq_len; + size_t L = config.num_layers; + param_sizes[0] = Vp * C; + param_sizes[1] = maxT * C; + param_sizes[2] = L * C; + param_sizes[3] = L * C; + param_sizes[4] = L * (3 * C) * C; + param_sizes[5] = L * (3 * C); + param_sizes[6] = L * C * C; + param_sizes[7] = L * C; + param_sizes[8] = L * C; + param_sizes[9] = L * C; + param_sizes[10] = L * (4 * C) * C; + param_sizes[11] = L * (4 * C); + param_sizes[12] = L * C * (4 * C); + param_sizes[13] = L * C; + param_sizes[14] = C; + param_sizes[15] = C; + + for (int i = 0; i < NUM_PARAMETER_TENSORS; i++) + { + param_sizeof[i] = sizeof(floatX); + } +} + +void *malloc_and_point_parameters(ParameterTensors *params, size_t *param_elements, size_t *param_sizeof) +{ + + size_t num_parameters_bytes = 0; + for (int i = 0; i < NUM_PARAMETER_TENSORS; i++) + { + num_parameters_bytes += param_elements[i] * param_sizeof[i]; + } + + void *params_memory; + cudaCheck(cudaMalloc((void **)&params_memory, num_parameters_bytes)); + + floatX **ptrs[] = { + &params->wte, &params->wpe, &params->ln1w, &params->ln1b, &params->qkvw, &params->qkvb, + &params->attprojw, &params->attprojb, &params->ln2w, &params->ln2b, &params->fcw, &params->fcb, + &params->fcprojw, &params->fcprojb, &params->lnfw, &params->lnfb}; + char *params_memory_iterator = (char *)params_memory; + for (int i = 0; i < NUM_PARAMETER_TENSORS; i++) + { + *(ptrs[i]) = (floatX *)params_memory_iterator; + params_memory_iterator += param_elements[i] * param_sizeof[i]; + } + return params_memory; +} + +constexpr int NUM_ACTIVATION_TENSORS = 21; +typedef struct +{ + floatX *encoded; + floatX *ln1; + float *ln1_mean; + float *ln1_rstd; + floatX *atty; + +#if ENABLE_CUDNN + float *att; +#else + floatX *att; +#endif + + floatX *residual2; + floatX *ln2; + float *ln2_mean; + float *ln2_rstd; + floatX *fch; + floatX *fch_gelu; + floatX *residual3; + floatX *lnf; + float *lnf_mean; + float *lnf_rstd; + float *losses; + + floatX *qkvr; + + floatX *output; + + floatX *scratch_bt4c; + floatX *scratch_btc; +} ActivationTensors; + +struct 
TensorSpec +{ + void **ptr; + size_t size; + DType type; +}; + +#define TENSOR_SPEC(pointer, size) TensorSpec{(void **)(&pointer), (size), dtype_of(pointer)}; + +void fill_in_activation_sizes(const ActivationTensors *data, TensorSpec (&tensors)[NUM_ACTIVATION_TENSORS], size_t B, size_t T, GPT2Config config, int recompute) +{ + size_t Vp = config.padded_vocab_size; + size_t L = config.num_layers; + size_t NH = config.num_heads; + size_t C = config.channels; + tensors[0] = TENSOR_SPEC(data->encoded, B * T * C); + + tensors[1] = TENSOR_SPEC(data->ln1, (recompute < 2) ? L * B * T * C : 0); + tensors[2] = TENSOR_SPEC(data->ln1_mean, L * B * T); + tensors[3] = TENSOR_SPEC(data->ln1_rstd, L * B * T); + tensors[4] = TENSOR_SPEC(data->atty, L * B * T * C); +#ifdef ENABLE_CUDNN + + tensors[5] = TENSOR_SPEC(data->att, L * B * NH * T); +#else + tensors[5] = TENSOR_SPEC(data->att, L * B * NH * T * T); +#endif + tensors[6] = TENSOR_SPEC(data->residual2, L * B * T * C); + + tensors[7] = TENSOR_SPEC(data->ln2, (recompute < 2) ? L * B * T * C : 0); + tensors[8] = TENSOR_SPEC(data->ln2_mean, L * B * T); + tensors[9] = TENSOR_SPEC(data->ln2_rstd, L * B * T); + tensors[10] = TENSOR_SPEC(data->fch, L * B * T * 4 * C); + + tensors[11] = TENSOR_SPEC(data->fch_gelu, (recompute < 1) ? 
L * B * T * 4 * C : B * T * 4 * C); + tensors[12] = TENSOR_SPEC(data->residual3, L * B * T * C); + tensors[13] = TENSOR_SPEC(data->lnf, B * T * C); + tensors[14] = TENSOR_SPEC(data->lnf_mean, B * T); + tensors[15] = TENSOR_SPEC(data->lnf_rstd, B * T); + tensors[16] = TENSOR_SPEC(data->losses, B * T); + tensors[17] = TENSOR_SPEC(data->qkvr, L * B * T * 3 * C); + tensors[18] = TENSOR_SPEC(data->output, B * T * max(3 * C, max(NH * T, Vp))); + + tensors[19] = TENSOR_SPEC(data->scratch_bt4c, B * T * 4 * C); + tensors[20] = TENSOR_SPEC(data->scratch_btc, B * T * C); +} + +void *malloc_and_point_activations(TensorSpec (&tensors)[NUM_ACTIVATION_TENSORS]) +{ + size_t bytes = 0; + for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++) + { + bytes += tensors[i].size * sizeof_dtype(tensors[i].type); + } + + printf0("allocating %d MiB for activations\n", (int)round(bytes / (1024 * 1024))); + + void *acts_memory; + cudaCheck(cudaMalloc((void **)&acts_memory, bytes)); + + cudaCheck(cudaMemset(acts_memory, 0, bytes)); + + char *acts_memory_iterator = (char *)acts_memory; + for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++) + { + + if (tensors[i].size == 0) + { + *(tensors[i].ptr) = NULL; + } + else + { + *(tensors[i].ptr) = acts_memory_iterator; + acts_memory_iterator += tensors[i].size * sizeof_dtype(tensors[i].type); + } + } + return acts_memory; +} + +typedef struct +{ + GPT2Config config; + + ParameterTensors params; + size_t param_elements[NUM_PARAMETER_TENSORS]; + size_t param_sizeof[NUM_PARAMETER_TENSORS]; + void *params_memory; + size_t num_parameters; + size_t num_parameters_bytes; + + ParameterTensors grads; + void *grads_memory; + + float *m_memory; + float *v_memory; + float *master_weights; + + ActivationTensors acts; + TensorSpec acts_specs[NUM_ACTIVATION_TENSORS]; + void *acts_memory; + + int batch_size; + int seq_len; + int *inputs; + int *targets; + float mean_loss; + float *accumulated_mean_loss; + float *cpu_losses; + unsigned long long rng_state; + unsigned long 
long rng_state_last_update; + int use_master_weights; + bool init_state; + int gelu_fusion; + int recompute; + + int *workload_indices; + int4 *bucket_info; +} GPT2; + +void gpt2_init_common(GPT2 *model) +{ + + model->acts_memory = NULL; + model->inputs = NULL; + model->targets = NULL; + model->accumulated_mean_loss = NULL; + model->cpu_losses = NULL; + + model->batch_size = 0; + model->seq_len = 0; + model->mean_loss = -1.0f; + model->params_memory = NULL; + + model->grads_memory = NULL; + model->workload_indices = NULL; + model->bucket_info = NULL; + + model->m_memory = NULL; + model->v_memory = NULL; + model->master_weights = NULL; + + model->rng_state = 13371337 + multi_gpu_config.process_rank; + model->use_master_weights = 1; + model->init_state = true; + model->recompute = 1; + model->gelu_fusion = 0; +} + +void gpt2_allocate_weights(GPT2 *model) +{ + + fill_in_parameter_sizes(model->param_elements, model->param_sizeof, model->config); + model->num_parameters = 0; + model->num_parameters_bytes = 0; + for (int i = 0; i < NUM_PARAMETER_TENSORS; i++) + { + model->num_parameters += model->param_elements[i]; + model->num_parameters_bytes += model->param_elements[i] * model->param_sizeof[i]; + } + + assert(model->params_memory == nullptr); + model->params_memory = malloc_and_point_parameters(&model->params, model->param_elements, model->param_sizeof); +} + +void gpt2_allocate_state(GPT2 *model, int B, int T) +{ + printf0("allocating %d MiB for parameter gradients\n", (int)round(model->num_parameters * sizeof(floatX) / (1024 * 1024))); + assert(model->grads_memory == nullptr); + model->grads_memory = malloc_and_point_parameters(&model->grads, model->param_elements, model->param_sizeof); + + model->batch_size = B; + model->seq_len = T; + + fill_in_activation_sizes(&model->acts, model->acts_specs, B, T, model->config, model->recompute); + model->acts_memory = malloc_and_point_activations(model->acts_specs); + + cudaCheck(cudaMalloc((void **)&model->inputs, B * T * 
sizeof(int))); + cudaCheck(cudaMalloc((void **)&model->targets, B * T * sizeof(int))); + cudaCheck(cudaMalloc(((void **)&model->accumulated_mean_loss), sizeof(float))); + cudaCheck(cudaMallocHost((void **)&model->cpu_losses, B * T * sizeof(float))); + + size_t num_c_groups = CEIL_DIV(model->config.channels, (WARP_SIZE * x128::size)); + assert((size_t)(model->batch_size * model->seq_len) * num_c_groups < (1ULL << 31ULL)); + model->workload_indices = (int *)mallocCheck(sizeof(int) * model->batch_size * model->seq_len * num_c_groups); + model->bucket_info = (int4 *)mallocCheck(sizeof(int4) * model->batch_size * model->seq_len * num_c_groups); + + int memory_status = 0; + + size_t shard_num_parameters = multi_gpu_config.shard_num_parameters; + printf0("allocating %zu MiB for AdamW optimizer state m\n", (shard_num_parameters * sizeof(float)) >> 20); + printf0("allocating %zu MiB for AdamW optimizer state v\n", (shard_num_parameters * sizeof(float)) >> 20); + assert(model->m_memory == nullptr); + assert(model->v_memory == nullptr); + memory_status |= cudaMallocConditionallyManaged((void **)&model->m_memory, shard_num_parameters * sizeof(float)); + memory_status |= cudaMallocConditionallyManaged((void **)&model->v_memory, shard_num_parameters * sizeof(float)); + + if (model->use_master_weights == 1) + { + assert(model->master_weights == nullptr); + printf0("allocating %zu MiB for master copy of params\n", (shard_num_parameters * sizeof(float)) >> 20); + memory_status |= cudaMallocConditionallyManaged((void **)&model->master_weights, shard_num_parameters * sizeof(float)); + } + + int reduced_memory_status = (int)multi_gpu_cpu_float_sum((float)memory_status, &multi_gpu_config); + if (reduced_memory_status >= 1) + { + printf0("WARNING: Fell back to cudaMallocManaged when initializing m,v,master_weights on %d GPUs\n", reduced_memory_status); + printf0(" Prevents an OOM, but code may run much slower due to device <-> host memory movement\n"); + } + + size_t free, total; + 
cudaCheck(cudaMemGetInfo(&free, &total)); + printf0("device memory usage: %zd MiB / %zd MiB\n", (total - free) / 1024 / 1024, total / 1024 / 1024); + + size_t bytes_per_sequence = 0; + for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++) + { + bytes_per_sequence += model->acts_specs[i].size * sizeof_dtype(model->acts_specs[i].type) / B; + } + printf0("memory per sequence: %zu MiB\n", bytes_per_sequence / 1024 / 1024); + printf0(" -> estimated maximum batch size: %zu\n", B + free / bytes_per_sequence); +} + +void gpt2_write_to_checkpoint(GPT2 *model, const char *checkpoint_path) +{ + + printf0("Writing model to %s\n", checkpoint_path); + FILE *model_file = fopenCheck(checkpoint_path, "wb"); + + int model_header[256]; + memset(model_header, 0, sizeof(model_header)); + model_header[0] = 20240326; + assert(PRECISION_MODE == PRECISION_FP32 || PRECISION_MODE == PRECISION_BF16); + model_header[1] = PRECISION_MODE == PRECISION_FP32 ? 3 : 5; + model_header[2] = model->config.max_seq_len; + model_header[3] = model->config.vocab_size; + model_header[4] = model->config.num_layers; + model_header[5] = model->config.num_heads; + model_header[6] = model->config.channels; + model_header[7] = model->config.padded_vocab_size; + fwriteCheck(model_header, sizeof(int), 256, model_file); + + device_to_file(model_file, model->params_memory, model->num_parameters_bytes, + IO_BUF_SIZE, main_stream); + + fcloseCheck(model_file); +} + +void gpt2_build_from_checkpoint(GPT2 *model, const char *checkpoint_path, bool weight_init = true) +{ + + if (PRECISION_MODE == PRECISION_FP16) + { + + fprintf(stderr, "build_from_checkpoint() does not support fp16 right now.\n"); + exit(EXIT_FAILURE); + } + + FILE *model_file = fopenCheck(checkpoint_path, "rb"); + int model_header[256]; + freadCheck(model_header, sizeof(int), 256, model_file); + if (model_header[0] != 20240326) + { + printf("Bad magic model file\n"); + exit(EXIT_FAILURE); + } + int version = model_header[1]; + if (!(version == 3 || version == 
5)) + { + + fprintf(stderr, "Bad version in model file\n"); + fprintf(stderr, "---> HINT: try to re-run `python train_gpt2.py`\n"); + exit(EXIT_FAILURE); + } + + if (weight_init) + { + if (PRECISION_MODE == PRECISION_BF16 && version != 5) + { + fprintf(stderr, "Precision is configured as BF16 but model at %s is not.\n", checkpoint_path); + fprintf(stderr, "---> HINT: are you sure you're loading a _bf16.bin file?\n"); + exit(EXIT_FAILURE); + } + if (PRECISION_MODE == PRECISION_FP32 && version != 3) + { + fprintf(stderr, "Precision is configured as FP32 but model at %s is not.\n", checkpoint_path); + fprintf(stderr, "---> HINT: to turn on FP32 you have to compile like: `make train_gpt2cu PRECISION=FP32`\n"); + fprintf(stderr, "---> HINT: are you sure you're loading a .bin file without any _bf16 in the name?\n"); + exit(EXIT_FAILURE); + } + } + + model->config.max_seq_len = model_header[2]; + model->config.vocab_size = model_header[3]; + model->config.num_layers = model_header[4]; + model->config.num_heads = model_header[5]; + model->config.channels = model_header[6]; + model->config.padded_vocab_size = model_header[7]; + + gpt2_allocate_weights(model); + + if (weight_init) + { + assert(model->params_memory != NULL); + file_to_device(model->params_memory, model_file, model->num_parameters_bytes, IO_BUF_SIZE, main_stream); + } + fcloseCheck(model_file); + + cudaCheck(cudaDeviceSynchronize()); +} + +void gpt2_set_hyperparameters(GPT2Config *config, const char *depth_str) +{ + int depth = atoi(depth_str); + assert(depth > 0); + int channels, num_heads; + if (depth == 6) + { + channels = 384; + num_heads = 6; + } + else if (depth == 12) + { + channels = 768; + num_heads = 12; + } + else if (depth == 24) + { + channels = 1024; + num_heads = 16; + } + else if (depth == 36) + { + channels = 1280; + num_heads = 20; + } + else if (depth == 48) + { + channels = 1600; + num_heads = 25; + } + else if (depth == 60) + { + channels = 1920; + num_heads = 30; + } + else if (depth == 
72) + { + channels = 2880; + num_heads = 30; + } + else if (depth == 84) + { + channels = 3456; + num_heads = 36; + } + else + { + fprintf(stderr, "Unsupported GPT-2 depth: %d\n", depth); + exit(EXIT_FAILURE); + } + config->num_layers = depth; + config->channels = channels; + config->num_heads = num_heads; + config->max_seq_len = 1024; +} + +void gpt3_set_hyperparameters(GPT2Config *config, const char *channels_str) +{ + + int channels = atoi(channels_str); + assert(channels > 0); + int depth, head_size; + if (channels == 384) + { + depth = 6; + head_size = 64; + } + else if (channels == 768) + { + depth = 12; + head_size = 64; + } + else if (channels == 1024) + { + depth = 24; + head_size = 64; + } + else if (channels == 1536) + { + depth = 24; + head_size = 96; + } + else if (channels == 2048) + { + depth = 24; + head_size = 128; + } + else if (channels == 2560) + { + depth = 32; + head_size = 80; + } + else if (channels == 4096) + { + depth = 32; + head_size = 128; + } + else if (channels == 5140) + { + depth = 40; + head_size = 128; + } + else if (channels == 12288) + { + depth = 96; + head_size = 128; + } + else + { + fprintf(stderr, "Unsupported GPT-3 channels: %d\n", channels); + exit(EXIT_FAILURE); + } + assert(channels % head_size == 0); + config->num_layers = depth; + config->channels = channels; + config->num_heads = channels / head_size; + config->max_seq_len = 2048; +} + +void gpt_build_from_descriptor(GPT2 *model, const char *descriptor) +{ + + assert(descriptor != NULL); + size_t len = strlen(descriptor); + if (len > 1 && descriptor[0] == 'd') + { + gpt2_set_hyperparameters(&model->config, descriptor + 1); + } + else if (len > 6 && strncmp(descriptor, "gpt2:d", 6) == 0) + { + gpt2_set_hyperparameters(&model->config, descriptor + 6); + } + else if (len > 6 && strncmp(descriptor, "gpt3:c", 6) == 0) + { + gpt3_set_hyperparameters(&model->config, descriptor + 6); + } + else + { + fprintf(stderr, "Unsupported model descriptor: %s\n", descriptor); + 
exit(EXIT_FAILURE); + } + + model->config.vocab_size = 50257; + model->config.padded_vocab_size = 50304; + + gpt2_allocate_weights(model); + + mt19937_state init_rng; + manual_seed(&init_rng, 42); + floatX *params_memory_cpu = (floatX *)mallocCheck(model->num_parameters_bytes); + memset(params_memory_cpu, 0, model->num_parameters_bytes); + + float residual_scale = 1.0f / sqrtf(2.0f * model->config.num_layers); + + size_t L = model->config.num_layers; + size_t offset = 0; + for (int l = 0; l < L; l++) + { + offset = 0; + for (int i = 0; i < NUM_PARAMETER_TENSORS; i++) + { + + if (l == 0 && (i == 2 || i == 8 || i == 14)) + { + for (size_t j = 0; j < model->param_elements[i]; j++) + { + params_memory_cpu[offset + j] = 1.0f; + } + } + + if ((l == 0 && (i == 0 || i == 1)) || i == 4 || i == 6 || i == 10 || i == 12) + { + size_t n = model->param_elements[i]; + size_t layer_offset = 0; + if (i == 0) + { + + n = model->config.vocab_size * model->config.channels; + } + if (i == 4 || i == 6 || i == 10 || i == 12) + { + + assert(n % L == 0); + n = n / L; + layer_offset = l * n; + } + + float scale = (i == 6 || i == 12) ? 
0.02f * residual_scale : 0.02f; + + float *fp32_buffer = (float *)mallocCheck(n * sizeof(float)); + normal_(fp32_buffer, n, 0.0f, scale, &init_rng); + for (size_t j = 0; j < n; j++) + { + params_memory_cpu[offset + layer_offset + j] = (floatX)fp32_buffer[j]; + } + free(fp32_buffer); + } + offset += model->param_elements[i]; + } + } + + cudaCheck(cudaMemcpy(model->params_memory, params_memory_cpu, model->num_parameters_bytes, cudaMemcpyHostToDevice)); + free(params_memory_cpu); +} + +void gpt2_forward(GPT2 *model, const int *inputs, size_t B, size_t T) +{ + NVTX_RANGE_FN(); + + if (model->params_memory == NULL) + { + printf("Error: model was not initialized properly.\n"); + exit(EXIT_FAILURE); + } + + const size_t V = model->config.vocab_size; + const size_t Vp = model->config.padded_vocab_size; + const size_t L = model->config.num_layers; + const size_t NH = model->config.num_heads; + const size_t C = model->config.channels; + + if (B > model->batch_size || T > model->seq_len) + { + printf("Model: B=%d T=%d, Desired: B=%d T=%d\n", model->batch_size, model->seq_len, (int)B, (int)T); + exit(EXIT_FAILURE); + } + + cudaCheck(cudaMemcpy(model->inputs, inputs, B * T * sizeof(int), cudaMemcpyHostToDevice)); + + tokenCheck(inputs, B * T, V); + + ParameterTensors params = model->params; + ActivationTensors acts = model->acts; + encoder_forward(acts.encoded, model->inputs, params.wte, params.wpe, B, T, C, main_stream); + + layernorm_forward((model->recompute < 2) ? acts.ln1 : acts.lnf, acts.ln1_mean, acts.ln1_rstd, acts.encoded, params.ln1w, params.ln1b, B, T, C, main_stream); + + for (int l = 0; l < L; l++) + { + NvtxRange layer_range("Layer", l); + + floatX *residual = l == 0 ? 
acts.encoded : acts.residual3 + (l - 1) * B * T * C; + + floatX *l_qkvw = params.qkvw + l * 3 * C * C; + floatX *l_qkvb = params.qkvb + l * 3 * C; + floatX *l_attprojw = params.attprojw + l * C * C; + floatX *l_attprojb = params.attprojb + l * C; + floatX *l_ln2w = params.ln2w + l * C; + floatX *l_ln2b = params.ln2b + l * C; + floatX *l_fcw = params.fcw + l * 4 * C * C; + floatX *l_fcb = params.fcb + l * 4 * C; + floatX *l_fcprojw = params.fcprojw + l * C * 4 * C; + floatX *l_fcprojb = params.fcprojb + l * C; + + floatX *l_ln1 = (model->recompute < 2) ? acts.ln1 + l * B * T * C : acts.lnf; + floatX *l_qkvr = acts.qkvr + l * B * T * 3 * C; + floatX *l_atty = acts.atty + l * B * T * C; + floatX *l_residual2 = acts.residual2 + l * B * T * C; + floatX *l_ln2 = (model->recompute < 2) ? acts.ln2 + l * B * T * C : acts.lnf; + float *l_ln2_mean = acts.ln2_mean + l * B * T; + float *l_ln2_rstd = acts.ln2_rstd + l * B * T; + floatX *l_fch = acts.fch + l * B * T * 4 * C; + + floatX *l_fch_gelu = (model->recompute < 1) ? 
acts.fch_gelu + l * B * T * 4 * C : acts.fch_gelu; + floatX *l_residual3 = acts.residual3 + l * B * T * C; + floatX *scratch = (floatX *)acts.output; + +#ifdef ENABLE_CUDNN + float *l_att = (float *)acts.att + l * B * NH * T; + matmul_forward_cublaslt(l_qkvr, l_ln1, l_qkvw, l_qkvb, B, T, C, 3 * C, main_stream); + attention_forward_cudnn(l_atty, (float *)l_att, l_qkvr, B, T, NH, C, main_stream); +#else + floatX *l_att = acts.att + l * B * NH * T * T; + if (T != model->seq_len) + { + cudaCheck(cudaMemset(l_att, 0, B * NH * T * T * sizeof(floatX))); + } + + matmul_forward_cublaslt(scratch, l_ln1, l_qkvw, l_qkvb, B, T, C, 3 * C, main_stream); + attention_forward(l_atty, l_qkvr, l_att, scratch, B, T, C, NH, main_stream); +#endif + + matmul_forward_cublaslt(scratch, l_atty, l_attprojw, l_attprojb, B, T, C, C, main_stream); + fused_residual_forward5(l_residual2, l_ln2, l_ln2_mean, l_ln2_rstd, residual, scratch, l_ln2w, l_ln2b, B * T, C, main_stream); + matmul_forward_cublaslt(l_fch_gelu, l_ln2, l_fcw, l_fcb, B, T, C, 4 * C, main_stream, l_fch, model->gelu_fusion); + matmul_forward_cublaslt(scratch, l_fch_gelu, l_fcprojw, l_fcprojb, B, T, 4 * C, C, main_stream); + + if (l + 1 != L) + { + floatX *l_ln1 = (model->recompute < 2) ? 
acts.ln1 + (l + 1) * B * T * C : acts.lnf; + float *l_ln1_mean = acts.ln1_mean + (l + 1) * B * T; + float *l_ln1_rstd = acts.ln1_rstd + (l + 1) * B * T; + const floatX *l_ln1w = params.ln1w + (l + 1) * C; + const floatX *l_ln1b = params.ln1b + (l + 1) * C; + fused_residual_forward5(l_residual3, l_ln1, l_ln1_mean, l_ln1_rstd, l_residual2, scratch, l_ln1w, l_ln1b, + B * T, C, main_stream); + } + else + { + fused_residual_forward5(l_residual3, acts.lnf, acts.lnf_mean, acts.lnf_rstd, l_residual2, scratch, + params.lnfw, params.lnfb, + B * T, C, main_stream); + } + } + + matmul_forward_cublaslt(acts.output, acts.lnf, params.wte, NULL, B, T, C, Vp, main_stream); + cudaCheck(cudaDeviceSynchronize()); +} + +float gpt2_validate(GPT2 *model, const int *inputs, const int *targets, size_t B, size_t T) +{ + assert(targets != NULL); + + gpt2_forward(model, inputs, B, T); + + const size_t V = model->config.vocab_size; + const size_t Vp = model->config.padded_vocab_size; + + NvtxRange classifier_and_loss_range("classifier_and_loss"); + ActivationTensors acts = model->acts; + float mean_loss = 0.0f; + + const float dloss = 1.0f / (B * T); + + cudaCheck(cudaMemset(acts.losses, 0, B * T * sizeof(float))); + cudaCheck(cudaMemcpy(model->targets, targets, B * T * sizeof(int), cudaMemcpyHostToDevice)); + tokenCheck(targets, B * T, V); + fused_classifier(acts.output, acts.losses, dloss, model->targets, B, T, V, Vp, False, main_stream); + cudaCheck(cudaMemcpy(model->cpu_losses, acts.losses, B * T * sizeof(float), cudaMemcpyDeviceToHost)); + for (int i = 0; i < B * T; i++) + { + mean_loss += model->cpu_losses[i]; + } + mean_loss /= B * T; + cudaCheck(cudaDeviceSynchronize()); + return mean_loss; +} + +void gpt2_backward_and_reduce(GPT2 *model, int *inputs, const int *targets, int grad_accum_steps, int micro_step) +{ + if (model->grads_memory == nullptr) + { + fprintf(stderr, "Need to allocate gradients before backward"); + exit(EXIT_FAILURE); + } + NVTX_RANGE_FN(); + bool last_step = 
micro_step == grad_accum_steps - 1; + + if (micro_step == 0) + { + + cudaCheck(cudaMemsetAsync(model->acts.losses, 0, model->batch_size * model->seq_len * sizeof(float), main_stream)); + cudaCheck(cudaMemsetAsync(model->grads_memory, 0, model->num_parameters * sizeof(floatX), main_stream)); + } + + const size_t B = model->batch_size; + const size_t T = model->seq_len; + const size_t V = model->config.vocab_size; + const size_t Vp = model->config.padded_vocab_size; + const size_t L = model->config.num_layers; + const size_t NH = model->config.num_heads; + const size_t C = model->config.channels; + + ParameterTensors params = model->params; + ParameterTensors grads = model->grads; + ActivationTensors acts = model->acts; + + NvtxRange classifier_and_loss_range("classifier_and_loss"); + const float dloss = 1.0f / (float)(B * T * grad_accum_steps); + cudaCheck(cudaMemcpy(model->targets, targets, B * T * sizeof(int), cudaMemcpyHostToDevice)); + tokenCheck(targets, B * T, V); + fused_classifier(acts.output, acts.losses, dloss, model->targets, B, T, V, Vp, True, main_stream); + + floatX *dresidual = (floatX *)model->acts.scratch_btc; + cudaCheck(cudaMemset(dresidual, 0, B * T * C * sizeof(floatX))); + + float *scratchF = (float *)acts.output; + floatX *scratchX = (floatX *)acts.output; + + matmul_backward(model->acts.scratch_bt4c, grads.wte, NULL, acts.output, acts.lnf, params.wte, NULL, B, T, C, Vp, main_stream); + + floatX *residual = acts.residual3 + (L - 1) * B * T * C; + layernorm_backward(dresidual, grads.lnfw, grads.lnfb, scratchF, model->acts.scratch_bt4c, residual, params.lnfw, acts.lnf_mean, acts.lnf_rstd, B, T, C, main_stream); + + floatX *dl_btc = residual; + + for (int l = L - 1; l >= 0; l--) + { + NvtxRange layer_range("Layer", l); + + residual = l == 0 ? 
acts.encoded : acts.residual3 + (l - 1) * B * T * C; + + floatX *l_ln1w = params.ln1w + l * C; + floatX *l_ln1b = params.ln1b + l * C; + floatX *l_qkvw = params.qkvw + l * 3 * C * C; + floatX *l_attprojw = params.attprojw + l * C * C; + floatX *l_ln2w = params.ln2w + l * C; + floatX *l_ln2b = params.ln2b + l * C; + floatX *l_fcw = params.fcw + l * 4 * C * C; + floatX *l_fcprojw = params.fcprojw + l * C * 4 * C; + + floatX *dl_ln1w = grads.ln1w + l * C; + floatX *dl_ln1b = grads.ln1b + l * C; + floatX *dl_qkvw = grads.qkvw + l * 3 * C * C; + floatX *dl_qkvb = grads.qkvb + l * 3 * C; + floatX *dl_attprojw = grads.attprojw + l * C * C; + floatX *dl_attprojb = grads.attprojb + l * C; + floatX *dl_ln2w = grads.ln2w + l * C; + floatX *dl_ln2b = grads.ln2b + l * C; + floatX *dl_fcw = grads.fcw + l * 4 * C * C; + floatX *dl_fcb = grads.fcb + l * 4 * C; + floatX *dl_fcprojw = grads.fcprojw + l * C * 4 * C; + floatX *dl_fcprojb = grads.fcprojb + l * C; + + floatX *l_ln1 = (model->recompute < 2) ? acts.ln1 + l * B * T * C : acts.lnf; + float *l_ln1_mean = acts.ln1_mean + l * B * T; + float *l_ln1_rstd = acts.ln1_rstd + l * B * T; + floatX *l_qkvr = acts.qkvr + l * B * T * 3 * C; + floatX *l_atty = acts.atty + l * B * T * C; + floatX *l_residual2 = acts.residual2 + l * B * T * C; + floatX *l_ln2 = (model->recompute < 2) ? acts.ln2 + l * B * T * C : acts.lnf; + float *l_ln2_mean = acts.ln2_mean + l * B * T; + float *l_ln2_rstd = acts.ln2_rstd + l * B * T; + floatX *l_fch_pre_gelu = acts.fch + l * B * T * 4 * C; + floatX *l_fch_gelu = (model->recompute < 1) ? 
acts.fch_gelu + l * B * T * 4 * C : acts.fch_gelu; + + floatX *dl_bt4c = (floatX *)model->acts.scratch_bt4c; + + if (model->recompute >= 1) + { + + gelu_forward(l_fch_gelu, l_fch_pre_gelu, B * T * 4 * C, main_stream); + } + matmul_backward(dl_bt4c, dl_fcprojw, dl_fcprojb, dresidual, l_fch_gelu, l_fcprojw, scratchF, B, T, 4 * C, C, main_stream, l_fch_pre_gelu, model->gelu_fusion); + if (model->recompute >= 2) + { + + layernorm_forward(l_ln2, l_ln2_mean, l_ln2_rstd, l_residual2, l_ln2w, l_ln2b, B, T, C, main_stream); + } + matmul_backward(dl_btc, dl_fcw, dl_fcb, dl_bt4c, l_ln2, l_fcw, scratchF, B, T, C, 4 * C, main_stream); + + layernorm_backward(dresidual, dl_ln2w, dl_ln2b, scratchF, dl_btc, l_residual2, l_ln2w, l_ln2_mean, l_ln2_rstd, B, T, C, main_stream); + matmul_backward(dl_btc, dl_attprojw, dl_attprojb, dresidual, l_atty, l_attprojw, scratchF, B, T, C, C, main_stream); + +#ifdef ENABLE_CUDNN + float *l_att = (float *)acts.att + l * B * NH * T; + attention_backward_cudnn(dl_bt4c, dl_btc, l_qkvr, l_atty, (float *)l_att, B, T, NH, C, main_stream); +#else + floatX *l_att = acts.att + l * B * NH * T * T; + + floatX *buffer_a = l_atty; + floatX *buffer_b = l_fch_pre_gelu; + attention_backward(dl_bt4c, buffer_b, scratchX, buffer_a, dl_btc, l_qkvr, l_att, B, T, C, NH, main_stream); +#endif + if (model->recompute >= 2) + { + layernorm_forward(l_ln1, l_ln1_mean, l_ln1_rstd, residual, l_ln1w, l_ln1b, B, T, C, main_stream); + } + + matmul_backward(dl_btc, dl_qkvw, dl_qkvb, dl_bt4c, l_ln1, l_qkvw, scratchF, B, T, C, 3 * C, main_stream); + + layernorm_backward(dresidual, dl_ln1w, dl_ln1b, scratchF, dl_btc, residual, l_ln1w, l_ln1_mean, l_ln1_rstd, B, T, C, main_stream); + + if (last_step) + { + floatX *const pointers[] = { + dl_ln1w, dl_ln1b, + dl_qkvw, dl_qkvb, + dl_attprojw, dl_attprojb, + dl_ln2w, dl_ln2b, + dl_fcw, dl_fcb, + dl_fcprojw, dl_fcprojb}; + const size_t nelem[] = { + C, C, + 3 * C * C, 3 * C, + C * C, C, + C, C, + 4 * C * C, 4 * C, + C * 4 * C, C}; + 
multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream); + } + } + encoder_backward(grads.wte, grads.wpe, scratchX, model->workload_indices, model->bucket_info, + dresidual, model->inputs, inputs, B, T, C, random_u32(&model->rng_state), main_stream); + + if (last_step) + { + + global_sum_deterministic(model->accumulated_mean_loss, acts.losses, B * T, main_stream); + +#if MULTI_GPU + ncclCheck(ncclAllReduce(model->accumulated_mean_loss, model->accumulated_mean_loss, sizeof(float), ncclFloat, ncclAvg, multi_gpu_config.nccl_comm, main_stream)); +#endif + cudaCheck(cudaMemcpyAsync(&model->mean_loss, model->accumulated_mean_loss, sizeof(float), cudaMemcpyDeviceToHost, main_stream)); + + floatX *const pointers[] = {grads.wte, grads.wpe, grads.lnfw, grads.lnfb}; + const size_t nelem[] = {Vp * C, T * C, C, C}; + multi_gpu_async_reduce_gradient(pointers, nelem, &multi_gpu_config, main_stream); + } + + cudaCheck(cudaDeviceSynchronize()); + if (last_step) + { + model->mean_loss /= B * T * grad_accum_steps; + } + else + { + model->mean_loss = -1.f; + } +} + +ShardInfo gpt2_get_tensor_at_layer(const GPT2 *model, int layer_id, int param_tensor_id) +{ + + ptrdiff_t offset = 0; + for (int i = 0; i < param_tensor_id; i++) + { + offset += (ptrdiff_t)model->param_elements[i]; + } + size_t size = model->param_elements[param_tensor_id]; + + if (2 <= param_tensor_id && param_tensor_id <= 13) + { + size /= model->config.num_layers; + offset += (ptrdiff_t)(layer_id * size); + } + return {offset, size}; +} + +float gpt2_calculate_grad_norm(GPT2 *model, MultiGpuConfig *multi_gpu_config) +{ + NVTX_RANGE_FN(); + floatX *grads_memory = (floatX *)model->grads_memory; + + float *grad_norm_squared = (float *)model->acts.output; + float grad_norm_squared_cpu = 0.0f; + + int num_slices[2] = {1, model->config.num_layers}; + int max_num_block_sums = get_max_num_block_sums(num_slices, 2); + if (multi_gpu_config->zero_stage == 1) + { + + for (int i = 0; i < 
NUM_PARAMETER_TENSORS; i++) + { + ShardInfo tensor = gpt2_get_tensor_at_layer(model, 0, i); + ShardInfo shard = multi_gpu_get_shard_offset(tensor.size, multi_gpu_config, 1); + ptrdiff_t offset = tensor.offset + shard.offset; + bool is_first_pass = (i == 0); + if ((i < 2 || i > 13)) + { + global_norm_squared(grad_norm_squared, grads_memory + offset, shard.size, 0, 1, + max_num_block_sums, is_first_pass, main_stream); + } + else + { + global_norm_squared(grad_norm_squared, grads_memory + offset, shard.size, tensor.size, model->config.num_layers, + max_num_block_sums, is_first_pass, main_stream); + } + } + global_sum_deterministic(grad_norm_squared, grad_norm_squared, max_num_block_sums, main_stream); +#if MULTI_GPU + + ncclCheck(ncclAllReduce(grad_norm_squared, grad_norm_squared, sizeof(float), ncclFloat, ncclSum, multi_gpu_config->nccl_comm, main_stream)); +#endif + } + else + { + + global_norm_squared(grad_norm_squared, grads_memory, model->num_parameters, 0, 1, max_num_block_sums, true, main_stream); + global_sum_deterministic(grad_norm_squared, grad_norm_squared, max_num_block_sums, main_stream); + } + cudaCheck(cudaMemcpy(&grad_norm_squared_cpu, grad_norm_squared, sizeof(float), cudaMemcpyDeviceToHost)); + float grad_norm_cpu = sqrtf(grad_norm_squared_cpu); + return grad_norm_cpu; +} + +void gpt2_update(GPT2 *model, float learning_rate, float beta1, float beta2, float eps, float weight_decay, float grad_scale, int t, + MultiGpuConfig *multi_gpu_config, bool init_from_master_only = false) +{ + + NVTX_RANGE_FN(); + if (model->grads_memory == nullptr || model->m_memory == nullptr || model->v_memory == nullptr) + { + fprintf(stderr, "Need to allocate optimizer state before update"); + exit(EXIT_FAILURE); + } + + bool init_state = model->init_state; + if (init_state) + { + model->init_state = false; + NvtxRange rng("InitOpt"); + cudaCheck(cudaMemset(model->m_memory, 0, multi_gpu_config->shard_num_parameters * sizeof(float))); + cudaCheck(cudaMemset(model->v_memory, 
0, multi_gpu_config->shard_num_parameters * sizeof(float))); + } + + model->rng_state_last_update = model->rng_state; + + for (int i = 0; i < NUM_PARAMETER_TENSORS; i++) + { + + unsigned int seed = random_u32(&model->rng_state); + + int num_layers = model->config.num_layers; + if ((i < 2 || i > 13)) + { + num_layers = 1; + } + + ShardInfo tensor = gpt2_get_tensor_at_layer(model, 0, i); + ShardInfo shard = multi_gpu_get_shard_offset(tensor.size, multi_gpu_config, 1); + ptrdiff_t local_offset_full = tensor.offset + shard.offset; + ptrdiff_t local_offset_partial = tensor.offset / multi_gpu_config->num_processes; + + float wd = (i == 0 || i == 1 || i == 4 || i == 6 || i == 10 || i == 12) ? weight_decay : 0.0f; + floatX *param_ptr = (floatX *)model->params_memory + local_offset_full; + floatX *grad_ptr = (floatX *)model->grads_memory + local_offset_full; + + ptrdiff_t opt_state_offset = multi_gpu_config->zero_stage < 1 ? local_offset_full : local_offset_partial; + float *m_ptr = model->m_memory + opt_state_offset; + float *v_ptr = model->v_memory + opt_state_offset; + float *master_ptr = nullptr; + if (model->master_weights != nullptr) + { + master_ptr = model->master_weights + opt_state_offset; + } + if (init_state && model->master_weights != nullptr) + { + size_t grid_size = CEIL_DIV(shard.size, 512); + copy_and_cast_kernel<<<dim3(grid_size, num_layers), 512, 0, main_stream>>>(master_ptr, param_ptr, shard.size, + shard.size, tensor.size); + cudaCheck(cudaGetLastError()); + } + + if (init_from_master_only) + { + + init_from_master(param_ptr, master_ptr, shard.size, tensor.size, shard.size, num_layers, seed, main_stream); + } + else + { + + adamw_update(param_ptr, master_ptr, grad_ptr, + m_ptr, v_ptr, + shard.size, tensor.size, tensor.size, shard.size, num_layers, + learning_rate, + beta1, beta2, t, eps, wd, grad_scale, seed, main_stream); + } + + if (multi_gpu_config->zero_stage == 1) + { +#if MULTI_GPU + ncclCheck(ncclGroupStart()); + for (int l = 0; l < num_layers; ++l) + { + + ncclCheck(ncclAllGather(param_ptr + l
* tensor.size, + (floatX *)model->params_memory + tensor.offset + l * tensor.size, + shard.size, ncclFloatX, + multi_gpu_config->nccl_comm, multi_gpu_config->nccl_stream)); + } + ncclCheck(ncclGroupEnd()); +#endif + } + } + + cudaCheck(cudaDeviceSynchronize()); +} + +float gpt2_estimate_mfu(GPT2 *model, int num_tokens, float dt) +{ + + size_t N = model->num_parameters; + int L = model->config.num_layers; + int C = model->config.channels; + int T = model->seq_len; + size_t flops_per_token = 6 * N + (size_t)6 * L * C * T; + size_t flops_per_step = flops_per_token * num_tokens; + + float flops_achieved = (float)flops_per_step * (1.0f / dt); + float flops_promised = get_flops_promised(deviceProp.name, PRECISION_MODE) * 1e12f; + if (flops_promised < 0) + { + return -1.f; + } + float mfu = flops_achieved / flops_promised; + return mfu; +} + +void gpt2_free(GPT2 *model) +{ + cudaFreeCheck(&model->params_memory); + cudaFreeCheck(&model->grads_memory); + cudaFreeCheck(&model->m_memory); + cudaFreeCheck(&model->v_memory); + cudaFreeCheck(&model->master_weights); + cudaFreeCheck(&model->acts_memory); + cudaFreeCheck(&model->inputs); + cudaFreeCheck(&model->targets); + cudaFreeCheck(&model->accumulated_mean_loss); + cudaCheck(cudaFreeHost(model->cpu_losses)); + free(model->workload_indices); + free(model->bucket_info); +} + +void common_start(bool override_enable_tf32 = true, bool print_device_info = true) +{ + + cudaCheck(cudaGetDeviceProperties(&deviceProp, multi_gpu_config.local_device_idx)); + if (print_device_info) + { + printf("[System]\n"); + printf("Device %d: %s\n", multi_gpu_config.local_device_idx, deviceProp.name); + } + + cudaCheck(cudaStreamCreate(&main_stream)); + nvtxNameCudaStreamA(main_stream, "main stream"); + + cublasCheck(cublasLtCreate(&cublaslt_handle)); + cudaCheck(cudaMalloc(&cublaslt_workspace, cublaslt_workspace_size)); + + bool enable_tf32 = PRECISION_MODE == PRECISION_FP32 && deviceProp.major >= 8 && override_enable_tf32; + cublas_compute = 
enable_tf32 ? CUBLAS_COMPUTE_32F_FAST_TF32 : CUBLAS_COMPUTE_32F; + +#ifdef ENABLE_CUDNN + create_cudnn(); +#endif +} + +void common_free(GPT2 &model) +{ + cudaCheck(cudaStreamDestroy(main_stream)); + cudaCheck(cudaFree(cublaslt_workspace)); + cublasCheck(cublasLtDestroy(cublaslt_handle)); +#ifdef ENABLE_CUDNN + destroy_cudnn(); +#endif +} + +void save_state(const char *filename, int step, GPT2 *model, DataLoader *loader) +{ + printf("Writing state to %s\n", filename); + FILE *state_file = fopenCheck(filename, "wb"); + int state_header[256]; + memset(state_header, 0, sizeof(state_header)); + + state_header[0] = 20240527; + state_header[1] = 1; + state_header[2] = multi_gpu_config.num_processes; + state_header[3] = multi_gpu_config.process_rank; + state_header[4] = model->use_master_weights; + state_header[5] = loader->should_shuffle; + + state_header[10] = step; + + *((unsigned long long *)&state_header[20]) = model->rng_state; + *((unsigned long long *)&state_header[22]) = model->rng_state_last_update; + + *((size_t *)&state_header[30]) = loader->current_shard_idx; + *((size_t *)&state_header[32]) = loader->current_sample_idx; + fwriteCheck(state_header, sizeof(int), 256, state_file); + + size_t shard_num_parameters = multi_gpu_config.shard_num_parameters; + device_to_file(state_file, model->m_memory, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream); + device_to_file(state_file, model->v_memory, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream); + if (model->use_master_weights) + { + device_to_file(state_file, model->master_weights, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream); + } + + if (loader->should_shuffle) + { + fwriteCheck(&loader->glob_result.gl_pathc, sizeof(size_t), 1, state_file); + fwriteCheck(loader->shard_indices, sizeof(int), loader->glob_result.gl_pathc, state_file); + fwriteCheck(&loader->shard_num_samples, sizeof(size_t), 1, state_file); + fwriteCheck(loader->intra_shard_indices, sizeof(int), 
loader->shard_num_samples, state_file); + fwriteCheck(&loader->shuffle_rng, sizeof(mt19937_state), 1, state_file); + } + fcloseCheck(state_file); +} + +void load_state(int *step, GPT2 *model, DataLoader *loader, const char *filename) +{ + FILE *state_file = fopenCheck(filename, "rb"); + int state_header[256]; + freadCheck(state_header, sizeof(int), 256, state_file); + assert(state_header[0] == 20240527); + assert(state_header[1] == 1); + assert(state_header[2] == multi_gpu_config.num_processes); + assert(state_header[3] == multi_gpu_config.process_rank); + int use_master_weights = state_header[4]; + int should_shuffle = state_header[5]; + *step = state_header[10]; + model->rng_state = *((unsigned long long *)&state_header[20]); + model->rng_state_last_update = *((unsigned long long *)&state_header[22]); + size_t current_shard_idx = *((size_t *)&state_header[30]); + size_t current_sample_idx = *((size_t *)&state_header[32]); + + size_t shard_num_parameters = multi_gpu_config.shard_num_parameters; + if (use_master_weights == 1 && !model->use_master_weights) + { + printf0("Warning: Master weights are present in state, but not enabled for current run."); + } + else if (use_master_weights == 0 && model->use_master_weights) + { + printf0("Error: Master weights requested, but not present in state file."); + exit(EXIT_FAILURE); + } + + model->init_state = false; + assert(model->m_memory != nullptr); + assert(model->v_memory != nullptr); + file_to_device(model->m_memory, state_file, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream); + file_to_device(model->v_memory, state_file, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream); + if (model->use_master_weights) + { + assert(model->master_weights != nullptr); + file_to_device(model->master_weights, state_file, shard_num_parameters * sizeof(float), IO_BUF_SIZE, main_stream); + + model->rng_state = model->rng_state_last_update; + gpt2_update(model, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0, 
&multi_gpu_config, true); + model->rng_state = *((unsigned long long *)&state_header[20]); + } + + loader->should_shuffle = should_shuffle; + if (should_shuffle == 1) + { + + size_t glob_result_gl_pathc; + freadCheck(&glob_result_gl_pathc, sizeof(size_t), 1, state_file); + assert(glob_result_gl_pathc == loader->glob_result.gl_pathc); + + loader->shard_indices = (int *)mallocCheck(loader->glob_result.gl_pathc * sizeof(int)); + freadCheck(loader->shard_indices, sizeof(int), loader->glob_result.gl_pathc, state_file); + + size_t shard_num_samples; + freadCheck(&shard_num_samples, sizeof(size_t), 1, state_file); + assert(shard_num_samples == loader->shard_num_samples); + + loader->intra_shard_indices = (int *)mallocCheck(loader->shard_num_samples * sizeof(int)); + freadCheck(loader->intra_shard_indices, sizeof(int), loader->shard_num_samples, state_file); + + freadCheck(&loader->shuffle_rng, sizeof(mt19937_state), 1, state_file); + } + dataloader_resume(loader, current_shard_idx, current_sample_idx); + + fcloseCheck(state_file); +} + +void write_checkpoint(const char *output_log_dir, int step, GPT2 *model, DataLoader *train_loader, MultiGpuConfig *multi_gpu_config) +{ + + printf0("Writing checkpoint at step %d\n", step); + int rank = multi_gpu_config->process_rank; + + if (rank == 0) + { + snprintf(filename_buffer, sizeof(filename_buffer), "%s/model_%08d.bin", output_log_dir, step); + gpt2_write_to_checkpoint(model, filename_buffer); + } + + snprintf(filename_buffer, sizeof(filename_buffer), "%s/state_%08d_%05d.bin", output_log_dir, step, rank); + save_state(filename_buffer, step, model, train_loader); + + multi_gpu_barrier(multi_gpu_config); + if (rank == 0) + { + snprintf(filename_buffer, sizeof(filename_buffer), "%s/DONE_%08d", output_log_dir, step); + FILE *done_file = fopenCheck(filename_buffer, "w"); + fcloseCheck(done_file); + } +} + +void delete_checkpoint(const char *output_log_dir, int step, MultiGpuConfig *multi_gpu_config) +{ + + printf0("Deleting checkpoint 
at step %d\n", step); + int rank = multi_gpu_config->process_rank; + if (rank == 0) + { + snprintf(filename_buffer, sizeof(filename_buffer), "%s/model_%08d.bin", output_log_dir, step); + remove(filename_buffer); + } + snprintf(filename_buffer, sizeof(filename_buffer), "%s/state_%08d_%05d.bin", output_log_dir, step, rank); + remove(filename_buffer); + if (rank == 0) + { + snprintf(filename_buffer, sizeof(filename_buffer), "%s/DONE_%08d", output_log_dir, step); + remove(filename_buffer); + } +} + +#ifndef TESTING + +void error_usage() +{ + fprintf(stderr, "Usage: ./train_gpt2cu [options]\n"); + fprintf(stderr, "Options:\n"); + + fprintf(stderr, " -i train data filename pattern (default = dev/data/tinyshakespeare/tiny_shakespeare_train.bin)\n"); + fprintf(stderr, " -j val data filename pattern (default = dev/data/tinyshakespeare/tiny_shakespeare_val.bin)\n"); + fprintf(stderr, " -e input .bin filename or descriptor, see code comments as docs. (default = gpt2_124M_bf16.bin)\n"); + fprintf(stderr, " -o output log dir (default = NULL, no logging)\n"); + fprintf(stderr, " -lg log gpu info every x steps (default = -1; disabled)\n"); + fprintf(stderr, " -n write optimization checkpoints every how many steps? (default 0, don't)\n"); + fprintf(stderr, " -nk max number of checkpoints to keep in the directory, removing old ones (0 = disable, default)\n"); + fprintf(stderr, " -nm every how many step checkpoints are considered major? major checkpoints never get deleted.\n"); + fprintf(stderr, " -y resume optimization found inside output log dir? (0=restart/overwrite, 1=resume/append)\n"); + + fprintf(stderr, " -b (per-GPU, micro) batch size B (default = 4)\n"); + fprintf(stderr, " -t sequence length T (default = 1024)\n"); + fprintf(stderr, " -d total desired batch size (default = B * T * num_processes, i.e. 
no grad accumulation)\n"); + + fprintf(stderr, " -x max_steps of optimization to run (-1 (default) = disable, run 1 epoch)\n"); + + fprintf(stderr, " -k learning rate scheduler (default = cosine)\n"); + fprintf(stderr, " -l learning rate (default = 3e-4f)\n"); + fprintf(stderr, " -u learning rate warmup iterations (default = 0, no warmup)\n"); + fprintf(stderr, " -q learning rate decay: final fraction, at end of training (default = 1.0 (no decay))\n"); + fprintf(stderr, " -c weight decay (default = 0.0f)\n"); + fprintf(stderr, " -sl outlier stability: skip update if loss goes above this in zscore (0.0f=off)\n"); + fprintf(stderr, " -sg outlier stability: skip update if grad_norm goes above this in zscore (0.0f=off)\n"); + + fprintf(stderr, " -v val_loss_every, how often we evaluate val loss (default = 20)\n"); + fprintf(stderr, " -m val_max_steps, up to how many val batches to estimate val loss? (default = 20)\n"); + fprintf(stderr, " -s sample_every, how often we inference the model (default = 20)\n"); + fprintf(stderr, " -g genT, how many steps of inference we do (default = 64)\n"); + fprintf(stderr, " -h hellaswag eval run? (default = 0)\n"); + + fprintf(stderr, " -a overfit a single batch? 0/1. useful for debugging\n"); + + fprintf(stderr, " -f enable_tf32 override (default: 1, set to 0 to disable tf32)\n"); + fprintf(stderr, " -w keep f32 copy of weights for the optimizer? (default: 1)\n"); + fprintf(stderr, " -ge gelu fusion: 0=none, 1=forward, 2=forward+backward (default: 0)\n"); + + fprintf(stderr, " -z zero_stage, Zero Optimization Stage, 0,1,2,3 (default = 0)\n"); + fprintf(stderr, " -r recompute: less memory but less speed. 
(default = 1), 0|1|2 = none,gelu,gelu+ln\n"); + + fprintf(stderr, " -pn num_processes (default = 1)\n"); + fprintf(stderr, " -pr process_rank (default = 0)\n"); + fprintf(stderr, " -pg gpus_per_node (default = 8)\n"); + fprintf(stderr, " -pi nccl_init_method: tcp,fs,mpi (default = mpi)\n"); + fprintf(stderr, " -ps server_ip - used only when nccl_init_method is tcp (default = -1)\n"); + fprintf(stderr, " -pf fs_path - used only when nccl_init_method is fs (default = /tmp)\n"); + exit(EXIT_FAILURE); +} + +int main(int argc, char *argv[]) +{ + + const char *train_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_train.bin"; + const char *val_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_val.bin"; + const char *load_filename = "gpt2_124M_bf16.bin"; + const char *lr_scheduler_type = "cosine"; + const char *output_log_dir = NULL; + int checkpoint_every = 0; + int checkpoints_keep = 0; + int major_checkpoint_every = 0; + int resume = 0; + int B = 4; + int T = 1024; + int total_batch_size = -1; + float learning_rate = 3e-4f; + int log_gpu_every = -1; + int warmup_iterations = 0; + float final_learning_rate_frac = 1.0f; + float weight_decay = 0.0f; + float skip_update_lossz = 0.0f; + float skip_update_gradz = 0.0f; + int val_loss_every = 20; + int val_max_steps = 20; + int sample_every = 20; + int genT = 64; + int overfit_single_batch = 0; + int max_steps = -1; + int override_enable_tf32 = 1; + int use_master_weights = 1; + int gelu_fusion = -1; + int recompute = 1; + int zero_stage = 0; + int hellaswag_eval = 0; + + int num_processes = 1; + int process_rank = 0; + int gpus_per_node = 8; + char nccl_init_method[256] = "mpi"; + char server_ip[256] = ""; + char fs_path[256] = ""; + for (int i = 1; i < argc; i += 2) + { + if (i + 1 >= argc) + { + error_usage(); + } + if (argv[i][0] != '-') + { + error_usage(); + } + if (!(strlen(argv[i]) == 2 || strlen(argv[i]) == 3)) + { + error_usage(); + } + + if (argv[i][1] == 'i') + { + train_data_pattern = argv[i + 
1]; + } + else if (argv[i][1] == 'j') + { + val_data_pattern = argv[i + 1]; + } + else if (argv[i][1] == 'e') + { + load_filename = argv[i + 1]; + } + else if (argv[i][1] == 'o') + { + output_log_dir = argv[i + 1]; + } + else if (argv[i][1] == 'n' && argv[i][2] == '\0') + { + checkpoint_every = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'y') + { + resume = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'b') + { + B = atoi(argv[i + 1]); + } + else if (argv[i][1] == 't') + { + T = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'd') + { + total_batch_size = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'l' && argv[i][2] == '\0') + { + learning_rate = atof(argv[i + 1]); + } + else if (argv[i][1] == 'l' && argv[i][2] == 'g') + { + log_gpu_every = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'u') + { + warmup_iterations = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'q') + { + final_learning_rate_frac = atof(argv[i + 1]); + } + else if (argv[i][1] == 'c') + { + weight_decay = atof(argv[i + 1]); + } + else if (argv[i][1] == 'x') + { + max_steps = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'v') + { + val_loss_every = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'm') + { + val_max_steps = atoi(argv[i + 1]); + } + else if (argv[i][1] == 's' && argv[i][2] == '\0') + { + sample_every = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'g' && argv[i][2] == 'e') + { + gelu_fusion = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'g') + { + genT = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'a') + { + overfit_single_batch = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'f') + { + override_enable_tf32 = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'w') + { + use_master_weights = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'z') + { + zero_stage = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'r') + { + recompute = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'h') + { + hellaswag_eval = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'k') + { + 
lr_scheduler_type = argv[i + 1]; + } + else if (argv[i][1] == 'p' && argv[i][2] == 'i') + { + strcpy(nccl_init_method, argv[i + 1]); + } + else if (argv[i][1] == 'p' && argv[i][2] == 'f') + { + strcpy(fs_path, argv[i + 1]); + } + else if (argv[i][1] == 'p' && argv[i][2] == 's') + { + strcpy(server_ip, argv[i + 1]); + } + else if (argv[i][1] == 'p' && argv[i][2] == 'n') + { + num_processes = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'p' && argv[i][2] == 'r') + { + process_rank = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'p' && argv[i][2] == 'g') + { + gpus_per_node = atoi(argv[i + 1]); + } + else if (argv[i][1] == 's' && argv[i][2] == 'l') + { + skip_update_lossz = atof(argv[i + 1]); + } + else if (argv[i][1] == 's' && argv[i][2] == 'g') + { + skip_update_gradz = atof(argv[i + 1]); + } + else if (argv[i][1] == 'n' && argv[i][2] == 'k') + { + checkpoints_keep = atoi(argv[i + 1]); + } + else if (argv[i][1] == 'n' && argv[i][2] == 'm') + { + major_checkpoint_every = atoi(argv[i + 1]); + } + else + { + error_usage(); + } + } + + multi_gpu_config = multi_gpu_config_init(num_processes, process_rank, gpus_per_node, server_ip, fs_path, nccl_init_method); + common_start(override_enable_tf32, false); + + assert(warmup_iterations >= 0); + if (output_log_dir != NULL) + { + assert(strlen(output_log_dir) < 400); + } + int tokens_per_fwdbwd = B * T * multi_gpu_config.num_processes; + + if (total_batch_size == -1) + { + total_batch_size = tokens_per_fwdbwd; + } + + if (gelu_fusion == -1) + { + gelu_fusion = 0; + } + + assert(total_batch_size % tokens_per_fwdbwd == 0); + int grad_accum_steps = total_batch_size / tokens_per_fwdbwd; + + if (overfit_single_batch == 1) + { + train_data_pattern = val_data_pattern; + } + printf0("+-----------------------+----------------------------------------------------+\n"); + printf0("| Parameter | Value |\n"); + printf0("+-----------------------+----------------------------------------------------+\n"); + printf0("| train data pattern | 
%-50s |\n", train_data_pattern); + printf0("| val data pattern | %-50s |\n", val_data_pattern); + printf0("| output log dir | %-50s |\n", output_log_dir == NULL ? "NULL" : output_log_dir); + printf0("| checkpoint_every | %-50d |\n", checkpoint_every); + printf0("| resume | %-50d |\n", resume); + printf0("| micro batch size B | %-50d |\n", B); + printf0("| sequence length T | %-50d |\n", T); + printf0("| total batch size | %-50d |\n", total_batch_size); + printf0("| LR scheduler | %-50s |\n", lr_scheduler_type); + printf0("| learning rate (LR) | %-50e |\n", learning_rate); + printf0("| warmup iterations | %-50d |\n", warmup_iterations); + printf0("| final LR fraction | %-50e |\n", final_learning_rate_frac); + printf0("| weight decay | %-50e |\n", weight_decay); + printf0("| skip update lossz | %-50f |\n", skip_update_lossz); + printf0("| skip update gradz | %-50f |\n", skip_update_gradz); + printf0("| max_steps | %-50d |\n", max_steps); + printf0("| val_loss_every | %-50d |\n", val_loss_every); + printf0("| val_max_steps | %-50d |\n", val_max_steps); + printf0("| sample_every | %-50d |\n", sample_every); + printf0("| genT | %-50d |\n", genT); + printf0("| overfit_single_batch | %-50d |\n", overfit_single_batch); + printf0("| use_master_weights | %-50s |\n", use_master_weights ? "enabled" : "disabled"); + printf0("| gelu_fusion | %-50d |\n", gelu_fusion); + printf0("| recompute | %-50d |\n", recompute); + printf0("+-----------------------+----------------------------------------------------+\n"); + const char *precision_str = (PRECISION_MODE == PRECISION_FP32) + ? (cublas_compute == CUBLAS_COMPUTE_32F_FAST_TF32 ? "TF32" : "FP32") + : (PRECISION_MODE == PRECISION_FP16 ? 
"FP16" : "BF16"); + printf0("| device | %-50s |\n", deviceProp.name); + printf0("| peak TFlops | %-50.1f |\n", get_flops_promised(deviceProp.name, PRECISION_MODE)); + printf0("| precision | %-50s |\n", precision_str); + printf0("+-----------------------+----------------------------------------------------+\n"); + + int resuming = 0; + + int resume_max_step = find_max_step(output_log_dir); + if (resume == 1) + { + assert(output_log_dir != NULL); + if (resume_max_step != -1) + { + resuming = 1; + snprintf(filename_buffer, sizeof(filename_buffer), "%s/model_%08d.bin", output_log_dir, resume_max_step); + } + } + + GPT2 model; + gpt2_init_common(&model); + if (resuming == 1) + { + + bool weight_init = !use_master_weights; + gpt2_build_from_checkpoint(&model, filename_buffer, weight_init); + } + else if (ends_with_bin(load_filename)) + { + + gpt2_build_from_checkpoint(&model, load_filename); + } + else + { + + gpt_build_from_descriptor(&model, load_filename); + } + + model.use_master_weights = use_master_weights; + model.gelu_fusion = gelu_fusion; + model.recompute = recompute; + printf0("| weight init method | %-50s |\n", resuming == 1 ? "intermediate checkpoint" : load_filename); + printf0("| max_sequence_length T | %-50d |\n", model.config.max_seq_len); + printf0("| vocab_size V | %-50d |\n", model.config.vocab_size); + printf0("| padded_vocab_size Vp | %-50d |\n", model.config.padded_vocab_size); + printf0("| num_layers L | %-50d |\n", model.config.num_layers); + printf0("| num_heads NH | %-50d |\n", model.config.num_heads); + printf0("| channels C | %-50d |\n", model.config.channels); + printf0("| num_parameters | %-50zu |\n", model.num_parameters); + printf0("+-----------------------+----------------------------------------------------+\n"); + + int permute_train_loader = (overfit_single_batch == 1) ? 
0 : 1; + DataLoader train_loader, val_loader; + dataloader_init(&train_loader, train_data_pattern, B, T, multi_gpu_config.process_rank, multi_gpu_config.num_processes, permute_train_loader); + dataloader_init(&val_loader, val_data_pattern, B, T, multi_gpu_config.process_rank, multi_gpu_config.num_processes, 0); + + int train_num_batches = max_steps; + if (train_num_batches == -1) + { + + size_t ntok = train_loader.num_tokens; + + train_num_batches = ntok / total_batch_size; + } + + int val_num_batches = val_max_steps; + if (val_num_batches == -1) + { + + size_t ntok = val_loader.num_tokens; + + val_num_batches = ntok / tokens_per_fwdbwd; + } + printf0("| train_num_batches | %-50d |\n", train_num_batches); + printf0("| val_num_batches | %-50d |\n", val_num_batches); + printf0("+-----------------------+----------------------------------------------------+\n"); + + EvalLoader eval_loader; + const char *hellaswag_path = "dev/data/hellaswag/hellaswag_val.bin"; + const bool hellaswag_available = access(hellaswag_path, F_OK) == 0; + const bool run_hellaswag = hellaswag_eval && hellaswag_available; + if (run_hellaswag) + { + evalloader_init(&eval_loader, hellaswag_path, B, T, multi_gpu_config.process_rank, multi_gpu_config.num_processes); + } + printf0("| run hellaswag | %-50s |\n", run_hellaswag ? 
"yes" : "no"); + printf0("+-----------------------+----------------------------------------------------+\n"); + + set_zero_configs(&multi_gpu_config, zero_stage, model.num_parameters); + printf0("| num_processes | %-50d |\n", multi_gpu_config.num_processes); + printf0("| zero_stage | %-50d |\n", multi_gpu_config.zero_stage); + printf0("+-----------------------+----------------------------------------------------+\n"); + + if (!hellaswag_available) + { + printf0("HellaSwag eval not found at %s, skipping its evaluation\n", hellaswag_path); + printf0("You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.\n"); + } + + printf0("num_parameters: %zu => bytes: %zu\n", model.num_parameters, model.num_parameters_bytes); + printf0("allocated %d MiB for model parameters\n", (int)round(model.num_parameters_bytes / (1024 * 1024))); + + printf0("batch_size B=%d * seq_len T=%d * num_processes=%d and total_batch_size=%d\n", + B, T, multi_gpu_config.num_processes, total_batch_size); + printf0("=> setting grad_accum_steps=%d\n", grad_accum_steps); + + if (multi_gpu_config.process_rank == 0) + { + create_dir_if_not_exists(output_log_dir); + } + Logger logger; + logger_init(&logger, output_log_dir, multi_gpu_config.process_rank, resume); + + Tokenizer tokenizer; + tokenizer_init(&tokenizer, "gpt2_tokenizer.bin"); + + LearningRateScheduler lr_scheduler; + lr_scheduler_init(&lr_scheduler, lr_scheduler_type, learning_rate, + warmup_iterations, train_num_batches, final_learning_rate_frac); + + int *gen_tokens = (int *)mallocCheck(B * T * sizeof(int)); + floatX *cpu_logits_raw = (floatX *)mallocCheck(model.config.vocab_size * sizeof(floatX)); + float *cpu_logits = (float *)mallocCheck(model.config.vocab_size * sizeof(float)); + + int step = 0; + gpt2_allocate_state(&model, B, T); + if (resuming == 1) + { + snprintf(filename_buffer, sizeof(filename_buffer), "%s/state_%08d_%05d.bin", output_log_dir, resume_max_step, multi_gpu_config.process_rank); + load_state(&step, 
&model, &train_loader, filename_buffer); + } + + OutlierDetector loss_outlier_detector, grad_norm_outlier_detector; + init_detector(&loss_outlier_detector); + init_detector(&grad_norm_outlier_detector); + + if (T < model.config.max_seq_len) + { + printf0("!!!!!!!!\n"); + printf0("WARNING:\n"); + printf0("- The training sequence length is: T=%d (set with -t)\n", T); + printf0("- The model's max sequence length is: max_seq_len=%d\n", model.config.max_seq_len); + printf0("You are attempting to train with a sequence length shorter than the model's max.\n"); + printf0("This will lead to unused parameters in the wpe position embedding weights.\n"); + printf0("If you know what you're doing you can ignore this warning.\n"); + printf0("If you're like ???, you are most likely misconfiguring your training run.\n"); + printf0("---> HINT: If you're training GPT-2 use -t 1024. If GPT-3, use -t 2048.\n"); + printf0("!!!!!!!!\n"); + } + + assert(T <= model.config.max_seq_len); + + cudaEvent_t start, end; + cudaCheck(cudaEventCreate(&start)); + cudaCheck(cudaEventCreate(&end)); + cudaCheck(cudaProfilerStart()); + double total_sum_iteration_time_s = 0.0; + float ema_tokens_per_second = 0.0f; + for (; step <= train_num_batches; step++) + { + NvtxRange step_range("Train step", step); + + int last_step = step == train_num_batches; + + if (step % val_loss_every == 0 || last_step) + { + NvtxRange validation_range("validation"); + float val_loss = 0.0f; + dataloader_reset(&val_loader); + for (int i = 0; i < val_num_batches; i++) + { + dataloader_next_batch(&val_loader); + val_loss += gpt2_validate(&model, val_loader.inputs, val_loader.targets, B, T); + } + val_loss /= val_num_batches; + val_loss = multi_gpu_cpu_float_sum(val_loss, &multi_gpu_config) / multi_gpu_config.num_processes; + printf0("val loss %f\n", val_loss); + logger_log_val(&logger, step, val_loss); + } + + if (run_hellaswag && + ((step > 0 && step % val_loss_every == 0) || last_step)) + { + NvtxRange 
evaluation_range("evaluation"); + float eval_acc_norm = 0.0f; + evalloader_reset(&eval_loader); + for (int i = 0; i < eval_loader.num_batches; i++) + { + if (i % 10 == 0) + { + printf("evaluating HellaSwag: %d/%d\r", i, eval_loader.num_batches); + } + evalloader_next_batch(&eval_loader); + gpt2_validate(&model, eval_loader.inputs, eval_loader.targets, B, T); + int correct = evalloader_stat_losses(&eval_loader, model.cpu_losses); + eval_acc_norm += (float)correct; + } + + eval_acc_norm = multi_gpu_cpu_float_sum(eval_acc_norm, &multi_gpu_config); + printf0("HellaSwag: %d/%d = %f\n", (int)eval_acc_norm, eval_loader.num_examples, eval_acc_norm / eval_loader.num_examples); + logger_log_eval(&logger, step, eval_acc_norm / eval_loader.num_examples); + } + + if (multi_gpu_config.process_rank == 0 && sample_every > 0 && + (step > 0 && (step % sample_every) == 0 || last_step)) + { + NvtxRange generation_range("generation"); + unsigned long long sample_rng_state = 1337; + + int eot_token = tokenizer.eot_token; + for (int i = 0; i < B * T; ++i) + { + gen_tokens[i] = eot_token; + } + + printf("generating:\n---\n"); + for (int t = 1; t < genT; t++) + { + NvtxRange generation_range("Generation step", t); + + gpt2_forward(&model, gen_tokens, 1, CEIL_DIV(t, min(T, 256)) * min(T, 256)); + + floatX *logits = model.acts.output + (t - 1) * model.config.padded_vocab_size; + + cudaCheck(cudaMemcpy(cpu_logits_raw, logits, model.config.vocab_size * sizeof(floatX), cudaMemcpyDeviceToHost)); + + for (int i = 0; i < model.config.vocab_size; i++) + { + cpu_logits[i] = (float)cpu_logits_raw[i]; + } + + float coin = random_f32(&sample_rng_state); + int next_token = sample_softmax(cpu_logits, model.config.vocab_size, coin); + gen_tokens[t] = next_token; + + if (tokenizer.init_ok) + { + const char *token_str = tokenizer_decode(&tokenizer, next_token); + safe_printf(token_str); + } + else + { + + printf("%d ", next_token); + } + fflush(stdout); + } + printf("\n---\n"); + } + + if ((checkpoint_every 
> 0 && output_log_dir != NULL && resuming == 0) && + ((step > 0 && step % checkpoint_every == 0) || last_step)) + { + + write_checkpoint(output_log_dir, step, &model, &train_loader, &multi_gpu_config); + + int step_delete = step - checkpoints_keep * checkpoint_every; + if (checkpoints_keep > 0 && step_delete > 0 && + (major_checkpoint_every == 0 || step_delete % major_checkpoint_every != 0)) + { + delete_checkpoint(output_log_dir, step_delete, &multi_gpu_config); + } + } + resuming = 0; + + if (last_step) + { + break; + } + + if (overfit_single_batch == 1) + { + + dataloader_reset(&train_loader); + } + + cudaCheck(cudaEventRecord(start)); + + for (int micro_step = 0; micro_step < grad_accum_steps; micro_step++) + { + + dataloader_next_batch(&train_loader); + + gpt2_forward(&model, train_loader.inputs, B, T); + + gpt2_backward_and_reduce(&model, train_loader.inputs, train_loader.targets, grad_accum_steps, micro_step); + } + float zloss = (float)(update_detector(&loss_outlier_detector, (double)model.mean_loss)); + + float step_learning_rate = get_learning_rate(&lr_scheduler, step); + + float grad_norm = gpt2_calculate_grad_norm(&model, &multi_gpu_config); + float zgrad = (float)(update_detector(&grad_norm_outlier_detector, (double)grad_norm)); + + if (isfinite(zloss) && skip_update_lossz != 0.0f && zloss > skip_update_lossz) + { + printf0("skipping update due to loss z-score of %f\n", zloss); + } + else if (isfinite(zgrad) && skip_update_gradz != 0.0f && zgrad > skip_update_gradz) + { + printf0("skipping update due to grad z-score of %f\n", zgrad); + } + else + { + + float grad_clip = 1.0f; + float grad_scale = (grad_norm > grad_clip) ? 
grad_clip / grad_norm : 1.0f; + gpt2_update(&model, step_learning_rate, 0.9f, 0.95f, 1e-8f, weight_decay, grad_scale, step + 1, &multi_gpu_config); + } + cudaCheck(cudaEventRecord(end)); + cudaCheck(cudaEventSynchronize(end)); + + float time_elapsed_ms; + cudaCheck(cudaEventElapsedTime(&time_elapsed_ms, start, end)); + size_t tokens_processed = (size_t)multi_gpu_config.num_processes * B * T * grad_accum_steps; + float tokens_per_second = tokens_processed / time_elapsed_ms * 1000.0f; + float bias_corrected_ema_tokens_per_second = tokens_per_second; + if (step > 0) + { + total_sum_iteration_time_s += time_elapsed_ms / 1000.0f; + + ema_tokens_per_second = 0.95f * ema_tokens_per_second + 0.05f * tokens_per_second; + bias_corrected_ema_tokens_per_second = ema_tokens_per_second / (1.0f - powf(0.95f, step)); + } + float mfu = gpt2_estimate_mfu(&model, B * T * grad_accum_steps, time_elapsed_ms / 1000.0f); + printf0("step %4d/%d | loss %7.6f (%+.2fz)| norm %6.4f (%+.2fz)| lr %.2e | %.2f ms | %.1f%% bf16 MFU | %.0f tok/s\n", + step + 1, train_num_batches, model.mean_loss, zloss, grad_norm, zgrad, step_learning_rate, + time_elapsed_ms, 100 * mfu, bias_corrected_ema_tokens_per_second); + if (log_gpu_every > 0 && (step + 1) % log_gpu_every == 0) + { + GPUUtilInfo gpu_info = get_gpu_utilization_info(); + printf0(" compute %2.1f%% | memory: %2.1f%% | fan: %2d%% | %4d MHz / %4d MHz | %3d W / %3d W | %d°C / %d°C | %s\n", + gpu_info.gpu_utilization, gpu_info.mem_utilization, gpu_info.fan, gpu_info.clock, gpu_info.max_clock, gpu_info.power / 1000, gpu_info.power_limit / 1000, + gpu_info.temperature, gpu_info.temp_slowdown, gpu_info.throttle_reason); + } + logger_log_train(&logger, step, model.mean_loss, step_learning_rate, grad_norm); + + if (step == 3) + { + cudaProfilerStop(); + } + } + + printf0("total average iteration time: %f ms\n", total_sum_iteration_time_s / (train_num_batches - 1) * 1000); + + cudaCheck(cudaEventDestroy(end)); + cudaCheck(cudaEventDestroy(start)); + if 
(run_hellaswag) + { + evalloader_free(&eval_loader); + } + dataloader_free(&train_loader); + dataloader_free(&val_loader); + tokenizer_free(&tokenizer); + free(cpu_logits_raw); + free(cpu_logits); + free(gen_tokens); + multi_gpu_config_free(&multi_gpu_config); + gpt2_free(&model); + common_free(model); + return 0; +} +#endif \ No newline at end of file diff --git a/LICENSE b/LICENSE index 804d8ed..c3222c6 100644 --- a/LICENSE +++ b/LICENSE @@ -1,6 +1,6 @@ MIT License -Copyright (c) 2026 Eamon +Copyright(c) 2026 Eamon Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..ccb6702 --- /dev/null +++ b/Makefile @@ -0,0 +1,104 @@ +# ============================================================================= +# Quadtrix.cpp — Makefile (llama.cpp-style convenience targets) +# ============================================================================= + +.PHONY: all build clean run dev gpu train bench logs ps shell help + +SHELL := /bin/bash +SCRIPT := ./scripts/build.sh + +# ── Native C++ ─────────────────────────────────────────────────────────────── +CC := g++ +CFLAGS := -std=c++17 -O3 -march=native +IFLAGS := -I. 
-Iinclude +TARGET := quadtrix +SRCS := main.cpp + +all: $(TARGET) + +$(TARGET): $(SRCS) + $(CC) $(CFLAGS) $(IFLAGS) -o $@ $^ + @echo "✓ Built $(TARGET)" + +# Optimised release (same flags, explicit target) +release: $(SRCS) + $(CC) $(CFLAGS) $(IFLAGS) -DNDEBUG -o $(TARGET) $^ + strip $(TARGET) + +# Debug build +debug: $(SRCS) + $(CC) -std=c++17 -O0 -g -fsanitize=address,undefined \ + $(IFLAGS) -o $(TARGET)-debug $^ + +benchmark-bin: benchmark.cpp + $(CC) $(CFLAGS) $(IFLAGS) -o quadtrix-bench $^ + +clean-native: + rm -f $(TARGET) $(TARGET)-debug quadtrix-bench + +# ── Docker / Compose targets ───────────────────────────────────────────────── +build: + $(SCRIPT) up + +run: build + @echo "Stack already started." + +dev: + $(SCRIPT) dev + +gpu: + $(SCRIPT) gpu + +train-cpp: + $(SCRIPT) train-cpp + +train-torch: + $(SCRIPT) train-torch + +bench: + $(SCRIPT) bench + +logs: + $(SCRIPT) logs + +ps: + $(SCRIPT) ps + +shell: + $(SCRIPT) shell $(SERVICE) + +clean: + $(SCRIPT) clean + +# ── Misc ───────────────────────────────────────────────────────────────────── +format: + find . \( -name "*.cpp" -o -name "*.h" \) \ + ! 
-path "./build/*" \ + | xargs clang-format -i --style=LLVM + +lint-py: + ruff check backend/ engine/ + +help: + @echo "" + @echo " Quadtrix.cpp — make targets" + @echo "" + @echo " Native:" + @echo " make Build C++ binary (native)" + @echo " make release Stripped release binary" + @echo " make debug Debug binary with ASan/UBSan" + @echo " make clean-native Remove native build artifacts" + @echo " make format Run clang-format on all C++ files" + @echo "" + @echo " Docker:" + @echo " make build docker compose up --build (CPU)" + @echo " make dev Hot-reload dev stack" + @echo " make gpu CUDA GPU stack" + @echo " make train-cpp Train with C++ inside Docker" + @echo " make train-torch Train with PyTorch inside Docker" + @echo " make bench Run benchmark" + @echo " make logs Tail all logs" + @echo " make ps Show container status" + @echo " make shell Shell into backend (SERVICE=frontend to change)" + @echo " make clean Remove containers + volumes" + @echo "" diff --git a/README.md b/README.md index 0feeebe..56f99cc 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,9 @@ # Quadtrix.cpp +

+ image +

+ A local large language model with a modular, multi-path execution architecture. Train, run inference, and serve a chat interface — all from a single repository, across bare-metal C++, PyTorch, and a React frontend. > Full technical reference: [docs](https://eamon2009.github.io/LLMs/) diff --git a/config/config.h b/config/config.h index db053cb..844efeb 100644 --- a/config/config.h +++ b/config/config.h @@ -1,34 +1,18 @@ #pragma once -// ============================================================ -// config/config.h – Global constants (mirrors config/config.py) -// ============================================================ - #include <string> - -// ── Paths ──────────────────────────────────────────────────── -// Set CLEANED_PATH to your input text file before compiling, -// or override at runtime via the env-var GPT_DATA_PATH. static const std::string DEFAULT_CLEANED_PATH = "data/input.txt"; static const std::string DATA_PATH_ENV_VAR = "GPT_DATA_PATH"; - -// ── Reproducibility ────────────────────────────────────────── static const unsigned int SEED = 1337; - -// ── Data split ─────────────────────────────────────────────── static const double TRAIN_SPLIT = 0.9; // 90 % train, 10 % val - -// ── Hyper-parameters (identical to the Python script) ─────── static const int BATCH_SIZE = 4; static const int BLOCK_SIZE = 64; // context length -static const int MAX_ITERS = 3000; +static const int MAX_ITERS = 10000; static const int EVAL_INTERVAL = 20; static const float LEARNING_RATE = 3e-4f; -static const int EVAL_ITERS = 10; +static const int EVAL_ITERS = 1; static const int N_EMBD = 128; static const int N_HEAD = 4; static const int N_LAYER = 4; static const float DROPOUT = 0.2f; // applied during training only - -// ── Output paths ───────────────────────────────────────────── static const std::string BEST_MODEL_PATH = "best_model.bin"; static const std::string MODEL_PATH_ENV_VAR = "GPT_MODEL_PATH"; diff --git a/docker-compose.dev.yml 
b/docker-compose.dev.yml new file mode 
100644 index 0000000..a2e9a85 --- /dev/null +++ b/docker-compose.dev.yml @@ -0,0 +1,45 @@ +services: + frontend: + build: + context: . + dockerfile: .devops/Dockerfile.dev.frontend + ports: + - "5173:5173" + volumes: + - ./frontend:/app:delegated + - /app/node_modules + environment: + VITE_API_BASE_URL: "http://localhost:3001" + command: [ "npm", "run", "dev", "--", "--host", "0.0.0.0" ] + healthcheck: + test: [ "CMD", "wget", "-qO-", "http://localhost:5173/" ] + interval: 15s + timeout: 5s + retries: 5 + + backend: + volumes: + - ./backend:/app/backend:delegated + - ./engine:/app/engine:delegated + - models:/models + environment: + LOG_LEVEL: DEBUG + CORS_ORIGINS: "http://localhost:5173,http://localhost:3001" + command: + - python + - -m + - uvicorn + - main:app + - --host + - "0.0.0.0" + - --port + - "3001" + - --reload + - --reload-dir + - /app/backend + + redis: + ports: + - "6379:6379" +volumes: + models: diff --git a/docker-compose.gpu.yml b/docker-compose.gpu.yml new file mode 100644 index 0000000..abbd02e --- /dev/null +++ b/docker-compose.gpu.yml @@ -0,0 +1,32 @@ +services: + backend: + build: + args: + CUDA: "1" + image: quadtrix/backend-cuda:latest + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: all + capabilities: [ gpu ] + environment: + CUDA_VISIBLE_DEVICES: "0" + TORCH_CHECKPOINT_PATH: /models/best_model.pt + + train-torch: + build: + args: + CUDA: "1" + image: quadtrix/backend-cuda:latest + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: all + capabilities: [ gpu ] + environment: + CUDA_VISIBLE_DEVICES: "0" + QUADTRIX_TRAIN_DATA: /app/data/input.txt diff --git a/docker-compose.yml b/docker-compose.yml index 8191856..7bb3572 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -1,34 +1,173 @@ +name: quadtrix + +x-common-env: &common-env + TZ: UTC + PYTHONUNBUFFERED: "1" + services: - quadtrix: - image: ghcr.io/eamon2009/quadtrix.cpp:latest + + frontend: build: context: . 
- dockerfile: Dockerfile + dockerfile: .devops/Dockerfile.frontend args: - # for cuda - # BASE_IMAGE: nvidia/cuda:12.4.1-cudnn-runtime-ubuntu24.04 - BASE_IMAGE: ubuntu:24.04 - + VITE_API_BASE_URL: "" + image: quadtrix/frontend:latest + container_name: quadtrix-frontend + restart: unless-stopped ports: - - "3001:3001" # FastAPI backend - - "8080:8080" # React frontend - - volumes: - # Place best_model.pt and/or best_model.bin inside ./models/ - - ./models:/app/models + - "5173:80" + depends_on: + backend: + condition: service_healthy + networks: + - quadtrix-net + healthcheck: + test: [ "CMD", "wget", "-qO-", "http://localhost/" ] + interval: 30s + timeout: 5s + retries: 3 + backend: + build: + context: . + dockerfile: .devops/Dockerfile.backend + image: quadtrix/backend:latest + container_name: quadtrix-backend + restart: unless-stopped + ports: + - "3001:3001" environment: - TORCH_CHECKPOINT_PATH: /app/models/best_model.pt - GPT_MODEL_PATH: /app/models/best_model.bin - CORS_ORIGINS: http://localhost:8080 + <<: *common-env + API_PORT: "3001" + CORS_ORIGINS: "http://localhost:5173,http://frontend" + REDIS_URL: "redis://redis:6379/0" + TORCH_CHECKPOINT_PATH: /models/best_model.pt LOG_LEVEL: INFO - MAX_SESSIONS: 1000 - SESSION_TTL_HOURS: 24 - restart: unless-stopped - + MAX_SESSIONS: "500" + SESSION_TTL_HOURS: "24" + volumes: + - models:/models + - ./engine:/app/engine:ro + depends_on: + redis: + condition: service_healthy + networks: + - quadtrix-net healthcheck: test: [ "CMD", "curl", "-f", "http://localhost:3001/api/health" ] interval: 30s timeout: 10s - retries: 5 start_period: 20s + retries: 3 + + redis: + image: redis:7-alpine + container_name: quadtrix-redis + restart: unless-stopped + command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru + volumes: + - redis-data:/data + networks: + - quadtrix-net + healthcheck: + test: [ "CMD", "redis-cli", "ping" ] + interval: 10s + timeout: 5s + retries: 5 + expose: + - "6379" + + cpp: + build: + context: . 
+ dockerfile: .devops/Dockerfile.cpp + image: quadtrix/cpp:latest + container_name: quadtrix-cpp + + restart: "no" + stdin_open: true + tty: true + volumes: + - models:/models + - ./data:/app/data:ro + environment: + <<: *common-env + GPT_DATA_PATH: /app/data/input.txt + GPT_MODEL_PATH: /models/best_model.bin + networks: + - quadtrix-net + profiles: + - cpp + + train-cpp: + build: + context: . + dockerfile: .devops/Dockerfile.cpp + image: quadtrix/cpp:latest + container_name: quadtrix-train-cpp + restart: "no" + volumes: + - models:/models + - ./data:/app/data:ro + environment: + <<: *common-env + GPT_DATA_PATH: /app/data/input.txt + GPT_MODEL_PATH: /models/best_model.bin + command: [ "data/input.txt" ] # train mode (no --chat flag) + networks: + - quadtrix-net + profiles: + - train + + train-torch: + build: + context: . + dockerfile: .devops/Dockerfile.backend + image: quadtrix/backend:latest + container_name: quadtrix-train-torch + restart: "no" + volumes: + - models:/models + - ./engine:/app/engine + - ./data:/app/data:ro + environment: + <<: *common-env + QUADTRIX_TRAIN_DATA: /app/data/input.txt + working_dir: /app + command: [ "python", "engine/main.py" ] + networks: + - quadtrix-net + profiles: + - train + + benchmark: + build: + context: . 
+ dockerfile: .devops/Dockerfile.cpp + image: quadtrix/cpp:latest + container_name: quadtrix-benchmark + restart: "no" + volumes: + - models:/models + - ./data:/app/data:ro + - ./benchmark_results.csv:/app/benchmark_results.csv + environment: + <<: *common-env + GPT_MODEL_PATH: /models/best_model.bin + + command: [ "data/input.txt", "--generate" ] + networks: + - quadtrix-net + profiles: + - benchmark + +volumes: + models: + driver: local + redis-data: + driver: local + +networks: + quadtrix-net: + driver: bridge diff --git a/frontend/src/components/chat/EmptyState.tsx b/frontend/src/components/chat/EmptyState.tsx index ce75d9a..abf94ec 100644 --- a/frontend/src/components/chat/EmptyState.tsx +++ b/frontend/src/components/chat/EmptyState.tsx @@ -1,13 +1,95 @@ export function EmptyState() { return ( -
-
-
- Quadtrix.cpp icon +
+
+ {/* Icon */} +
+ + + +
-
-

Quadtrix.cpp

-

Minimal local chat interface. Start typing below to begin.

+ +
+

+ Quadtrix.cpp +

+

+ Local char-level language model. Start a conversation below. +

+
+ + {/* Hint chips */} +
+ {["Fast local inference", "C++ & PyTorch backends", "No cloud required"].map((chip) => ( + + {chip} + + ))}
diff --git a/frontend/src/components/chat/MessageAvatar.tsx b/frontend/src/components/chat/MessageAvatar.tsx index 25373d5..c606c9d 100644 --- a/frontend/src/components/chat/MessageAvatar.tsx +++ b/frontend/src/components/chat/MessageAvatar.tsx @@ -6,15 +6,48 @@ interface MessageAvatarProps { export function MessageAvatar({ role }: MessageAvatarProps) { const isUser = role === "user"; + + if (isUser) { + return ( +
+ U +
+ ); + } + return (
- {isUser ? "You" : "Q"} + Q
); } diff --git a/frontend/src/components/chat/MessageList.tsx b/frontend/src/components/chat/MessageList.tsx index e38a0af..5de6e62 100644 --- a/frontend/src/components/chat/MessageList.tsx +++ b/frontend/src/components/chat/MessageList.tsx @@ -1,5 +1,6 @@ -import { useAutoScroll } from "../../hooks/useAutoScroll"; +import { useRef } from "react"; import type { Message } from "../../types"; +import { useAutoScroll } from "../../hooks/useAutoScroll"; import { MessageRow } from "./MessageRow"; interface MessageListProps { @@ -7,13 +8,25 @@ interface MessageListProps { } export function MessageList({ messages }: MessageListProps) { - const scrollRef = useAutoScroll(messages.length); + const bottomRef = useRef(null); + useAutoScroll(bottomRef, messages); + return ( -
-
+
+
{messages.map((message) => ( ))} +
); diff --git a/frontend/src/components/chat/MessageRow.tsx b/frontend/src/components/chat/MessageRow.tsx index 372d585..8dd3910 100644 --- a/frontend/src/components/chat/MessageRow.tsx +++ b/frontend/src/components/chat/MessageRow.tsx @@ -27,37 +27,96 @@ export function MessageRow({ message }: MessageRowProps) { return ( {!isUser && } -
-
- {isUser ? "You" : "Quadtrix"} + +
+ {/* Meta row */} +
+ {isUser ? "You" : "Quadtrix"} {formatRelativeTime(message.created_at)} {!isUser && !message.pending && ( )}
+ + {/* Bubble */}
- {message.pending ? : {message.text}} + {message.pending ? ( + + ) : ( + {message.text} + )}
+ {isUser && } ); diff --git a/frontend/src/components/chat/ThinkingIndicator.tsx b/frontend/src/components/chat/ThinkingIndicator.tsx index e83d0f5..7ec4a6c 100644 --- a/frontend/src/components/chat/ThinkingIndicator.tsx +++ b/frontend/src/components/chat/ThinkingIndicator.tsx @@ -1,12 +1,28 @@ export function ThinkingIndicator() { return ( -
- Quadtrix is thinking - - - - +
+ Generating + + {[0, 120, 240].map((delay) => ( + + ))} +
+ ); } diff --git a/include/tensor.h b/include/tensor.h index f6ac4a5..c3526b6 100644 --- a/include/tensor.h +++ b/include/tensor.h @@ -1,8 +1,4 @@ #pragma once -// ============================================================ -// include/tensor.h – Lightweight 2-D / 3-D float tensor -// (CPU only – mirrors what PyTorch tensors do in the model) -// ============================================================ #include #include @@ -15,310 +11,557 @@ #include #include -// ------------------------------------------------------------------ -// Tensor (row-major, float32) -// shape is stored as {d0, d1} or {d0, d1, d2} -// ------------------------------------------------------------------ +#ifdef _OPENMP +#include <omp.h> +#endif + +#ifdef __AVX__ +#include <immintrin.h> +#endif + +#ifdef __SSE__ +#include <xmmintrin.h> +#endif + struct Tensor { - std::vector<int> shape; - std::vector<float> data; - - Tensor() = default; - - Tensor(std::vector<int> sh, float fill = 0.0f) - : shape(sh) - { - int total = 1; - for (int d : sh) - total *= d; - data.assign(total, fill); - } - - int numel() const - { - int n = 1; - for (int d : shape) - n *= d; - return n; - } - - int ndim() const { return (int)shape.size(); } - - // ---- element access helpers -------------------------------- - float &at(int i) - { - assert(i >= 0 && i < (int)data.size()); - return data[i]; - } - float at(int i) const - { - assert(i >= 0 && i < (int)data.size()); - return data[i]; - } - - // 2-D - float &at(int r, int c) - { - return data[r * shape[1] + c]; - } - float at(int r, int c) const - { - return data[r * shape[1] + c]; - } - - // 3-D - float &at(int b, int r, int c) - { - return data[b * shape[1] * shape[2] + r * shape[2] + c]; - } - float at(int b, int r, int c) const - { - return data[b * shape[1] * shape[2] + r * shape[2] + c]; - } - - // ---- factory helpers --------------------------------------- - static Tensor zeros(std::vector<int> sh) { return Tensor(sh, 0.0f); } - static Tensor 
ones(std::vector<int> sh) { return Tensor(sh, 1.0f); } - - static Tensor 
randn(std::vector<int> sh, float mean, float std, - std::mt19937 &rng) - { - std::normal_distribution<float> dist(mean, std); - Tensor t(sh); - for (auto &v : t.data) - v = dist(rng); - return t; - } - - void fill(float v) { std::fill(data.begin(), data.end(), v); } - - // ---- print shape ------------------------------------------- - void print_shape(const std::string &name = "") const - { - if (!name.empty()) - std::cout << name << ": "; - std::cout << "["; - for (int i = 0; i < (int)shape.size(); ++i) - { - std::cout << shape[i]; - if (i + 1 < (int)shape.size()) - std::cout << ", "; - } - std::cout << "]" << std::endl; - } -}; + std::vector<int> shape; + std::vector<float> data; + + Tensor() = default; + + Tensor(std::vector<int> sh, float fill = 0.0f) : shape(std::move(sh)) + { + int total = 1; + for (int d : shape) + total *= d; + data.reserve(total); + data.assign(total, fill); + } + + Tensor(const Tensor &) = default; + Tensor(Tensor &&) noexcept = default; + Tensor &operator=(const Tensor &) = default; + Tensor &operator=(Tensor &&) noexcept = default; + + int numel() const + { + int n = 1; + for (int d : shape) + n *= d; + return n; + } -// ------------------------------------------------------------------ -// Basic math ops (in-place and returning new tensors) -// ------------------------------------------------------------------ + int ndim() const { return (int)shape.size(); } + + float &at(int i) { return data[i]; } + float at(int i) const { return data[i]; } + + float &at(int r, int c) { return data[r * shape[1] + c]; } + float at(int r, int c) const { return data[r * shape[1] + c]; } + + float &at(int b, int r, int c) { return data[b * shape[1] * shape[2] + r * shape[2] + c]; } + float at(int b, int r, int c) const { return data[b * shape[1] * shape[2] + r * shape[2] + c]; } + + static Tensor zeros(std::vector<int> sh) { return Tensor(sh, 0.0f); } + static Tensor ones(std::vector<int> sh) { return Tensor(sh, 1.0f); } + + static Tensor 
randn(std::vector<int> sh, float mean, float std, 
std::mt19937 &rng) + { + std::normal_distribution dist(mean, std); + Tensor t(sh); + for (auto &v : t.data) + v = dist(rng); + return t; + } + + void fill(float v) { std::fill(data.begin(), data.end(), v); } + + void print_shape(const std::string &name = "") const + { + if (!name.empty()) + std::cout << name << ": "; + std::cout << "["; + for (int i = 0; i < (int)shape.size(); ++i) + { + std::cout << shape[i]; + if (i + 1 < (int)shape.size()) + std::cout << ", "; + } + std::cout << "]" << std::endl; + } +}; -// element-wise add (same shape) inline Tensor add(const Tensor &a, const Tensor &b) { - assert(a.data.size() == b.data.size()); - Tensor c(a.shape); - for (int i = 0; i < (int)a.data.size(); ++i) - c.data[i] = a.data[i] + b.data[i]; - return c; + assert(a.data.size() == b.data.size()); + Tensor c(a.shape); + size_t n = a.data.size(); + +#ifdef __AVX__ + size_t i = 0; + size_t vec_end = n & ~7ULL; + for (; i < vec_end; i += 8) + { + __m256 va = _mm256_loadu_ps(&a.data[i]); + __m256 vb = _mm256_loadu_ps(&b.data[i]); + __m256 vc = _mm256_add_ps(va, vb); + _mm256_storeu_ps(&c.data[i], vc); + } + for (; i < n; ++i) + c.data[i] = a.data[i] + b.data[i]; +#elif defined(__SSE__) + size_t i = 0; + size_t vec_end = n & ~3ULL; + for (; i < vec_end; i += 4) + { + __m128 va = _mm_loadu_ps(&a.data[i]); + __m128 vb = _mm_loadu_ps(&b.data[i]); + __m128 vc = _mm_add_ps(va, vb); + _mm_storeu_ps(&c.data[i], vc); + } + for (; i < n; ++i) + c.data[i] = a.data[i] + b.data[i]; +#else + for (size_t i = 0; i < n; ++i) + c.data[i] = a.data[i] + b.data[i]; +#endif + return c; +} + +inline void add_inplace(Tensor &a, const Tensor &b) +{ + assert(a.data.size() == b.data.size()); + size_t n = a.data.size(); + +#ifdef __AVX__ + size_t i = 0; + size_t vec_end = n & ~7ULL; + for (; i < vec_end; i += 8) + { + __m256 va = _mm256_loadu_ps(&a.data[i]); + __m256 vb = _mm256_loadu_ps(&b.data[i]); + __m256 vc = _mm256_add_ps(va, vb); + _mm256_storeu_ps(&a.data[i], vc); + } + for (; i < n; ++i) + 
a.data[i] += b.data[i]; +#elif defined(__SSE__) + size_t i = 0; + size_t vec_end = n & ~3ULL; + for (; i < vec_end; i += 4) + { + __m128 va = _mm_loadu_ps(&a.data[i]); + __m128 vb = _mm_loadu_ps(&b.data[i]); + __m128 vc = _mm_add_ps(va, vb); + _mm_storeu_ps(&a.data[i], vc); + } + for (; i < n; ++i) + a.data[i] += b.data[i]; +#else + for (size_t i = 0; i < n; ++i) + a.data[i] += b.data[i]; +#endif } -// scalar multiply inline Tensor scale(const Tensor &a, float s) { - Tensor c(a.shape); - for (int i = 0; i < (int)a.data.size(); ++i) - c.data[i] = a.data[i] * s; - return c; + Tensor c(a.shape); + size_t n = a.data.size(); + +#ifdef __AVX__ + size_t i = 0; + size_t vec_end = n & ~7ULL; + __m256 vs = _mm256_set1_ps(s); + for (; i < vec_end; i += 8) + { + __m256 va = _mm256_loadu_ps(&a.data[i]); + __m256 vc = _mm256_mul_ps(va, vs); + _mm256_storeu_ps(&c.data[i], vc); + } + for (; i < n; ++i) + c.data[i] = a.data[i] * s; +#elif defined(__SSE__) + size_t i = 0; + size_t vec_end = n & ~3ULL; + __m128 vs = _mm_set1_ps(s); + for (; i < vec_end; i += 4) + { + __m128 va = _mm_loadu_ps(&a.data[i]); + __m128 vc = _mm_mul_ps(va, vs); + _mm_storeu_ps(&c.data[i], vc); + } + for (; i < n; ++i) + c.data[i] = a.data[i] * s; +#else + for (size_t i = 0; i < n; ++i) + c.data[i] = a.data[i] * s; +#endif + return c; +} + +inline void scale_inplace(Tensor &a, float s) +{ + size_t n = a.data.size(); + +#ifdef __AVX__ + size_t i = 0; + size_t vec_end = n & ~7ULL; + __m256 vs = _mm256_set1_ps(s); + for (; i < vec_end; i += 8) + { + __m256 va = _mm256_loadu_ps(&a.data[i]); + __m256 vc = _mm256_mul_ps(va, vs); + _mm256_storeu_ps(&a.data[i], vc); + } + for (; i < n; ++i) + a.data[i] *= s; +#elif defined(__SSE__) + size_t i = 0; + size_t vec_end = n & ~3ULL; + __m128 vs = _mm_set1_ps(s); + for (; i < vec_end; i += 4) + { + __m128 va = _mm_loadu_ps(&a.data[i]); + __m128 vc = _mm_mul_ps(va, vs); + _mm_storeu_ps(&a.data[i], vc); + } + for (; i < n; ++i) + a.data[i] *= s; +#else + for (auto &v : 
a.data) + v *= s; +#endif } -// ReLU inline Tensor relu(const Tensor &a) { - Tensor c(a.shape); - for (int i = 0; i < (int)a.data.size(); ++i) - c.data[i] = std::max(0.0f, a.data[i]); - return c; + Tensor c(a.shape); + size_t n = a.data.size(); + +#ifdef __AVX__ + size_t i = 0; + size_t vec_end = n & ~7ULL; + __m256 zero = _mm256_setzero_ps(); + for (; i < vec_end; i += 8) + { + __m256 va = _mm256_loadu_ps(&a.data[i]); + __m256 vc = _mm256_max_ps(va, zero); + _mm256_storeu_ps(&c.data[i], vc); + } + for (; i < n; ++i) + c.data[i] = std::max(0.0f, a.data[i]); +#elif defined(__SSE__) + size_t i = 0; + size_t vec_end = n & ~3ULL; + __m128 zero = _mm_setzero_ps(); + for (; i < vec_end; i += 4) + { + __m128 va = _mm_loadu_ps(&a.data[i]); + __m128 vc = _mm_max_ps(va, zero); + _mm_storeu_ps(&c.data[i], vc); + } + for (; i < n; ++i) + c.data[i] = std::max(0.0f, a.data[i]); +#else + for (size_t i = 0; i < n; ++i) + c.data[i] = std::max(0.0f, a.data[i]); +#endif + return c; } -// Softmax along last dim for 3-D tensor [B, T, C] -inline Tensor softmax3d(const Tensor &a) +inline void relu_inplace(Tensor &a) { - int B = a.shape[0], T = a.shape[1], C = a.shape[2]; - Tensor out(a.shape); - for (int b = 0; b < B; ++b) - { - for (int t = 0; t < T; ++t) - { - float maxv = -1e30f; - for (int c = 0; c < C; ++c) - maxv = std::max(maxv, a.at(b, t, c)); - float sumv = 0.0f; - for (int c = 0; c < C; ++c) - { - float e = std::exp(a.at(b, t, c) - maxv); - out.at(b, t, c) = e; - sumv += e; - } - for (int c = 0; c < C; ++c) - out.at(b, t, c) /= sumv; - } - } - return out; + size_t n = a.data.size(); + +#ifdef __AVX__ + size_t i = 0; + size_t vec_end = n & ~7ULL; + __m256 zero = _mm256_setzero_ps(); + for (; i < vec_end; i += 8) + { + __m256 va = _mm256_loadu_ps(&a.data[i]); + __m256 vc = _mm256_max_ps(va, zero); + _mm256_storeu_ps(&a.data[i], vc); + } + for (; i < n; ++i) + a.data[i] = std::max(0.0f, a.data[i]); +#elif defined(__SSE__) + size_t i = 0; + size_t vec_end = n & ~3ULL; + __m128 zero 
= _mm_setzero_ps(); + for (; i < vec_end; i += 4) + { + __m128 va = _mm_loadu_ps(&a.data[i]); + __m128 vc = _mm_max_ps(va, zero); + _mm_storeu_ps(&a.data[i], vc); + } + for (; i < n; ++i) + a.data[i] = std::max(0.0f, a.data[i]); +#else + for (auto &v : a.data) + v = std::max(0.0f, v); +#endif } -// Softmax along last dim for 2-D tensor [T, C] -inline Tensor softmax2d(const Tensor &a) +inline Tensor softmax3d(const Tensor &a) { - int T = a.shape[0], C = a.shape[1]; - Tensor out(a.shape); - for (int t = 0; t < T; ++t) - { + int B = a.shape[0], T = a.shape[1], C = a.shape[2]; + Tensor out(a.shape); + +#ifdef _OPENMP +#pragma omp parallel for collapse(2) if (B * T > 64) +#endif + for (int b = 0; b < B; ++b) + { + for (int t = 0; t < T; ++t) + { float maxv = -1e30f; for (int c = 0; c < C; ++c) - maxv = std::max(maxv, a.at(t, c)); + maxv = std::max(maxv, a.at(b, t, c)); + float sumv = 0.0f; for (int c = 0; c < C; ++c) { - float e = std::exp(a.at(t, c) - maxv); - out.at(t, c) = e; - sumv += e; + float e = std::exp(a.at(b, t, c) - maxv); + out.at(b, t, c) = e; + sumv += e; } + + float inv_sum = 1.0f / sumv; for (int c = 0; c < C; ++c) - out.at(t, c) /= sumv; - } - return out; + out.at(b, t, c) *= inv_sum; + } + } + return out; } -// Layer-norm along last dim [B, T, C] → same shape -inline Tensor layer_norm(const Tensor &x, - const Tensor &gamma, // [C] - const Tensor &beta, // [C] - float eps = 1e-5f) +inline Tensor softmax2d(const Tensor &a) { - int B = x.shape[0], T = x.shape[1], C = x.shape[2]; - Tensor out(x.shape); - for (int b = 0; b < B; ++b) - { - for (int t = 0; t < T; ++t) + int T = a.shape[0], C = a.shape[1]; + Tensor out(a.shape); + +#ifdef _OPENMP +#pragma omp parallel for if (T > 128) +#endif + for (int t = 0; t < T; ++t) + { + float maxv = -1e30f; + for (int c = 0; c < C; ++c) + maxv = std::max(maxv, a.at(t, c)); + + float sumv = 0.0f; + for (int c = 0; c < C; ++c) + { + float e = std::exp(a.at(t, c) - maxv); + out.at(t, c) = e; + sumv += e; + } + + float 
inv_sum = 1.0f / sumv; + for (int c = 0; c < C; ++c) + out.at(t, c) *= inv_sum; + } + return out; +} + +inline Tensor layer_norm(const Tensor &x, const Tensor &gamma, const Tensor &beta, float eps = 1e-5f) +{ + int B = x.shape[0], T = x.shape[1], C = x.shape[2]; + Tensor out(x.shape); + +#ifdef _OPENMP +#pragma omp parallel for collapse(2) if (B * T > 64) +#endif + for (int b = 0; b < B; ++b) + { + for (int t = 0; t < T; ++t) + { + float mu = 0.0f; + for (int c = 0; c < C; ++c) + mu += x.at(b, t, c); + mu /= C; + + float var = 0.0f; + for (int c = 0; c < C; ++c) { - float mu = 0.0f; - for (int c = 0; c < C; ++c) - mu += x.at(b, t, c); - mu /= C; - float var = 0.0f; - for (int c = 0; c < C; ++c) - { - float d = x.at(b, t, c) - mu; - var += d * d; - } - var /= C; - float inv = 1.0f / std::sqrt(var + eps); - for (int c = 0; c < C; ++c) - out.at(b, t, c) = (x.at(b, t, c) - mu) * inv * gamma.at(c) + beta.at(c); + float d = x.at(b, t, c) - mu; + var += d * d; } - } - return out; + var /= C; + + float inv = 1.0f / std::sqrt(var + eps); + for (int c = 0; c < C; ++c) + out.at(b, t, c) = (x.at(b, t, c) - mu) * inv * gamma.at(c) + beta.at(c); + } + } + return out; } -// matmul: [B, T, D] x [D, E] → [B, T, E] inline Tensor matmul(const Tensor &a, const Tensor &w) { - // a: [B, T, D] or [B, T, D] - // w: [D, E] - assert(a.ndim() == 3 && w.ndim() == 2); - int B = a.shape[0], T = a.shape[1], D = a.shape[2]; - int E = w.shape[1]; - assert(w.shape[0] == D); - Tensor out({B, T, E}, 0.0f); - for (int b = 0; b < B; ++b) - for (int t = 0; t < T; ++t) - for (int e = 0; e < E; ++e) - { - float s = 0.0f; - for (int d = 0; d < D; ++d) - s += a.at(b, t, d) * w.at(d, e); - out.at(b, t, e) = s; - } - return out; + assert(a.ndim() == 3 && w.ndim() == 2); + int B = a.shape[0], T = a.shape[1], D = a.shape[2]; + int E = w.shape[1]; + assert(w.shape[0] == D); + + Tensor out({B, T, E}, 0.0f); + + const int TILE_T = 32; + const int TILE_E = 32; + const int TILE_D = 32; + +#ifdef _OPENMP +#pragma omp 
parallel for collapse(2) if (B * T * E * D > 100000) +#endif + for (int b = 0; b < B; ++b) + { + for (int t0 = 0; t0 < T; t0 += TILE_T) + { + int t_end = std::min(t0 + TILE_T, T); + for (int e0 = 0; e0 < E; e0 += TILE_E) + { + int e_end = std::min(e0 + TILE_E, E); + for (int d0 = 0; d0 < D; d0 += TILE_D) + { + int d_end = std::min(d0 + TILE_D, D); + for (int t = t0; t < t_end; ++t) + { + for (int e = e0; e < e_end; ++e) + { + float s = out.at(b, t, e); + for (int d = d0; d < d_end; ++d) + s += a.at(b, t, d) * w.at(d, e); + out.at(b, t, e) = s; + } + } + } + } + } + } + return out; } -// add bias [E] broadcast over [B, T, E] inline Tensor add_bias(const Tensor &x, const Tensor &bias) { - assert(x.shape.back() == bias.shape[0]); - Tensor out = x; - int E = bias.shape[0]; - int stride = E; - int n = x.numel() / E; - for (int i = 0; i < n; ++i) - for (int e = 0; e < E; ++e) - out.data[i * stride + e] += bias.data[e]; - return out; + assert(x.shape.back() == bias.shape[0]); + Tensor out = x; + int E = bias.shape[0]; + int stride = E; + int n = x.numel() / E; + +#ifdef _OPENMP +#pragma omp parallel for if (n * E > 10000) +#endif + for (int i = 0; i < n; ++i) + { + for (int e = 0; e < E; ++e) + out.data[i * stride + e] += bias.data[e]; + } + return out; } -// batched matmul: [B, T, D] x [B, D, T2] → [B, T, T2] inline Tensor bmm(const Tensor &a, const Tensor &b) { - assert(a.ndim() == 3 && b.ndim() == 3); - int B = a.shape[0], T = a.shape[1], D = a.shape[2]; - int T2 = b.shape[2]; - assert(b.shape[0] == B && b.shape[1] == D); - Tensor out({B, T, T2}, 0.0f); - for (int bb = 0; bb < B; ++bb) - for (int t = 0; t < T; ++t) - for (int t2 = 0; t2 < T2; ++t2) - { - float s = 0.0f; - for (int d = 0; d < D; ++d) - s += a.at(bb, t, d) * b.at(bb, d, t2); - out.at(bb, t, t2) = s; - } - return out; + assert(a.ndim() == 3 && b.ndim() == 3); + int B = a.shape[0], T = a.shape[1], D = a.shape[2]; + int T2 = b.shape[2]; + assert(b.shape[0] == B && b.shape[1] == D); + + Tensor out({B, T, 
T2}, 0.0f); + + const int TILE = 32; + +#ifdef _OPENMP +#pragma omp parallel for if (B * T * T2 * D > 100000) +#endif + for (int bb = 0; bb < B; ++bb) + { + for (int t0 = 0; t0 < T; t0 += TILE) + { + int t_end = std::min(t0 + TILE, T); + for (int t2_0 = 0; t2_0 < T2; t2_0 += TILE) + { + int t2_end = std::min(t2_0 + TILE, T2); + for (int d0 = 0; d0 < D; d0 += TILE) + { + int d_end = std::min(d0 + TILE, D); + for (int t = t0; t < t_end; ++t) + { + for (int t2 = t2_0; t2 < t2_end; ++t2) + { + float s = out.at(bb, t, t2); + for (int d = d0; d < d_end; ++d) + s += a.at(bb, t, d) * b.at(bb, d, t2); + out.at(bb, t, t2) = s; + } + } + } + } + } + } + return out; } -// transpose last two dims of 3-D tensor [B, T, D] → [B, D, T] inline Tensor transpose23(const Tensor &a) { - int B = a.shape[0], T = a.shape[1], D = a.shape[2]; - Tensor out({B, D, T}); - for (int b = 0; b < B; ++b) + int B = a.shape[0], T = a.shape[1], D = a.shape[2]; + Tensor out({B, D, T}); + +#ifdef _OPENMP +#pragma omp parallel for collapse(2) if (B * T * D > 10000) +#endif + for (int b = 0; b < B; ++b) + { + for (int d = 0; d < D; ++d) + { for (int t = 0; t < T; ++t) - for (int d = 0; d < D; ++d) - out.at(b, d, t) = a.at(b, t, d); - return out; + out.at(b, d, t) = a.at(b, t, d); + } + } + return out; } -// concat along last dim: [B,T,D1] + [B,T,D2] → [B,T,D1+D2] inline Tensor cat_last(const std::vector &ts) { - int B = ts[0].shape[0], T = ts[0].shape[1]; - int total = 0; - for (auto &t : ts) - total += t.shape[2]; - Tensor out({B, T, total}, 0.0f); - int offset = 0; - for (auto &t : ts) - { - int D = t.shape[2]; - for (int b = 0; b < B; ++b) - for (int tt = 0; tt < T; ++tt) - for (int d = 0; d < D; ++d) - out.at(b, tt, offset + d) = t.at(b, tt, d); - offset += D; - } - return out; + int B = ts[0].shape[0], T = ts[0].shape[1]; + int total = 0; + for (auto &t : ts) + total += t.shape[2]; + + Tensor out({B, T, total}, 0.0f); + + int offset = 0; + for (auto &t : ts) + { + int D = t.shape[2]; +#ifdef _OPENMP 
+#pragma omp parallel for collapse(2) if (B * T * D > 10000) +#endif + for (int b = 0; b < B; ++b) + { + for (int tt = 0; tt < T; ++tt) + { + for (int d = 0; d < D; ++d) + out.at(b, tt, offset + d) = t.at(b, tt, d); + } + } + offset += D; + } + return out; } -// dropout mask (applied only during training) inline Tensor dropout(const Tensor &x, float p, bool training, std::mt19937 &rng) { - if (!training || p == 0.0f) - return x; - std::bernoulli_distribution dist(1.0f - p); - Tensor out = x; - float scale_v = 1.0f / (1.0f - p); - for (auto &v : out.data) - v = dist(rng) ? v * scale_v : 0.0f; - return out; + if (!training || p == 0.0f) + return x; + + std::bernoulli_distribution dist(1.0f - p); + Tensor out = x; + float scale_v = 1.0f / (1.0f - p); + + for (auto &v : out.data) + v = dist(rng) ? v * scale_v : 0.0f; + + return out; } \ No newline at end of file diff --git a/main.cpp b/main.cpp index 006af20..7fc540c 100644 --- a/main.cpp +++ b/main.cpp @@ -103,6 +103,22 @@ static std::string choose_output_path(const std::string &requested_path, return exe_relative; } +// sample N tokens from the model and print them +static void sample_tokens(GPTLanguageModel &model, + DataLoader &dl, + int n_tokens) +{ + std::vector ctx = {0}; + for (int i = 0; i < n_tokens; ++i) + { + ctx = model.generate(ctx, 1); + std::cout << dl.decode({ctx.back()}) << std::flush; + if ((int)ctx.size() > BLOCK_SIZE) + ctx = std::vector(ctx.end() - BLOCK_SIZE, ctx.end()); + } + std::cout << "\n"; +} + // estimate loss — no gradients, training=false static float estimate_loss(GPTLanguageModel &model, DataLoader &dl, @@ -184,10 +200,7 @@ int main(int argc, char *argv[]) std::signal(SIGINT, sig_handler); // Banner - std::cout << std::string(60, '=') << "\n"; std::cout << " Quadtrix v1.0 (C++)\n"; - std::cout << std::string(60, '=') << "\n"; - std::cout << "\n[INFO] Starting at: " << now_str() << "\n"; std::string data_path = DEFAULT_CLEANED_PATH; const char *env_data_path = 
std::getenv(DATA_PATH_ENV_VAR.c_str()); @@ -219,17 +232,6 @@ int main(int argc, char *argv[]) data_path = choose_existing_path(data_path, argv[0]); model_path = choose_output_path(model_path, argv[0]); - // Config print - std::cout << "\n[CONFIG] Hyperparameters:\n"; - std::cout << " batch_size=" << BATCH_SIZE - << " block_size=" << BLOCK_SIZE << "\n"; - std::cout << " max_iters=" << MAX_ITERS - << " learning_rate=" << LEARNING_RATE << "\n"; - std::cout << " n_embd=" << N_EMBD - << " n_head=" << N_HEAD - << " n_layer=" << N_LAYER - << " dropout=" << DROPOUT << "\n"; - // Data DataLoader dl; try @@ -247,13 +249,12 @@ int main(int argc, char *argv[]) GPTLanguageModel model(dl.vocab_size, N_EMBD, N_HEAD, N_LAYER, BLOCK_SIZE, SEED); long n_params = model.num_params(); - std::cout << "[MODEL] Parameters : " - << std::fixed << std::setprecision(2) - << n_params / 1.0e6f << " M (" << n_params << " total)\n"; - std::cout << "[MODEL] Architecture: " - << N_LAYER << " layers x " - << N_HEAD << " heads x " - << N_EMBD << " embedding dim\n"; + std::cout << "max_seq_len: " << BLOCK_SIZE << "\n"; + std::cout << "vocab_size: " << dl.vocab_size << "\n"; + std::cout << "num_layers: " << N_LAYER << "\n"; + std::cout << "num_heads: " << N_HEAD << "\n"; + std::cout << "channels: " << N_EMBD << "\n"; + std::cout << "num_parameters: " << n_params << "\n"; // chat mode if (chat_mode) @@ -268,9 +269,8 @@ int main(int argc, char *argv[]) } model.load(model_path); - std::cout << "[CHAT] Weights loaded from " << model_path << "\n"; - std::cout << "[CHAT] Max tokens per reply: " << chat_tokens - << " (override with --chat-tokens N)\n"; + std::cout << "weights: " << model_path << "\n"; + std::cout << "max_tokens: " << chat_tokens << "\n"; run_chat(model, dl, chat_tokens); return 0; @@ -289,10 +289,7 @@ int main(int argc, char *argv[]) } model.load(model_path); - std::cout << "\n" - << std::string(60, '-') << "\n"; - std::cout << " Quadtrix OUTPUT (Ctrl+C to stop)\n"; - std::cout << 
std::string(60, '-') << "\n\n"; + std::cout << "\ngenerating:\n"; std::vector ctx = {0}; while (!g_interrupted) { @@ -301,7 +298,7 @@ int main(int argc, char *argv[]) if ((int)ctx.size() > BLOCK_SIZE) ctx = std::vector(ctx.end() - BLOCK_SIZE, ctx.end()); } - std::cout << "\n\n[Stopped by user]\n"; + std::cout << "\n"; return 0; } @@ -312,114 +309,78 @@ int main(int argc, char *argv[]) std::mt19937 rng(SEED); // training loop - std::cout << "\n" - << std::string(60, '-') << "\n"; - std::cout << " TRAINING (" - << MAX_ITERS << " iters, eval every " - << EVAL_INTERVAL << ")\n"; - std::cout << std::string(60, '-') << "\n"; float best_val_loss = 1e30f; + float last_val_loss = 0.0f; double train_start = wall_secs(); - double last_eval_time = train_start; // ← tracks time of previous eval - for (int iter = 0; iter <= MAX_ITERS && !g_interrupted; ++iter) + // compute initial val loss before training { + std::mt19937 init_rng(SEED); + last_val_loss = estimate_loss(model, dl, "val", init_rng); + } - // Periodic eval checkpoint - if (iter % EVAL_INTERVAL == 0 || iter == MAX_ITERS) - { - double now = wall_secs(); - double elapsed = now - train_start; - - // ms per training step since the last eval window - double window_secs = now - last_eval_time; - int steps_in_win = (iter == 0) ? 1 : EVAL_INTERVAL; - double ms_per_step = window_secs * 1000.0 / steps_in_win; - - // tokens processed per second - long toks_in_win = (long)BATCH_SIZE * BLOCK_SIZE * steps_in_win; - int tok_per_sec = (window_secs > 0.0) - ? 
(int)(toks_in_win / window_secs) - : 0; - - last_eval_time = now; // reset window - - float tl = estimate_loss(model, dl, "train", rng); - float vl = estimate_loss(model, dl, "val", rng); - - bool better = vl < best_val_loss; - if (better) - { - best_val_loss = vl; - model.save(model_path); - } - - // ── new log line ───────────────────────────────────────────── - std::cout - << "step " - << std::setw(5) << iter << "/" << MAX_ITERS - << " | loss " - << std::fixed << std::setprecision(6) << tl - << " | val " - << std::fixed << std::setprecision(6) << vl - << " | lr " - << std::scientific << std::setprecision(2) << (float)LEARNING_RATE - << " | " - << std::fixed << std::setprecision(2) << ms_per_step << " ms" - << " | " << tok_per_sec << " tok/s" - << (better ? " *best*" : "") - << "\n"; - std::cout.flush(); - - if (iter == MAX_ITERS) - break; - } + for (int iter = 1; iter <= MAX_ITERS && !g_interrupted; ++iter) + { + double step_start = wall_secs(); - // Sample training batch + // train step std::pair, std::vector> batch = dl.get_batch("train", BATCH_SIZE, BLOCK_SIZE, rng); - // Forward — saves all intermediate activations SavedForward saved = forward_save(model, batch.first, BATCH_SIZE, BLOCK_SIZE, batch.second, /*training=*/true); - // Backward — exact analytical gradients - Grads grads = backward(model, saved); + float batch_loss = model.forward(batch.first, BATCH_SIZE, BLOCK_SIZE, + batch.second, false) + .second; - // AdamW parameter update + Grads grads = backward(model, saved); apply_grads(model, grads, opt); - } - double total = wall_secs() - train_start; - std::cout << "\n[DONE] Training finished in " - << std::fixed << std::setprecision(1) << total << "s (" - << total / 60.0 << " min) | Best val loss: " - << std::setprecision(4) << best_val_loss << "\n"; - std::cout << "[SAVE] Best weights saved to " << model_path << "\n"; + double step_ms = (wall_secs() - step_start) * 1000.0; + int tok_per_sec = (step_ms > 0.0) + ? 
(int)((long)BATCH_SIZE * BLOCK_SIZE / (step_ms / 1000.0)) + : 0; - // Continuous generation - std::cout << "\n" - << std::string(60, '-') << "\n"; - std::cout << " MODEL OUTPUT (Ctrl+C to stop)\n"; - std::cout << std::string(60, '-') << "\n\n"; + // every EVAL_INTERVAL steps: compute val, save if best, sample + bool better = false; + if (iter % EVAL_INTERVAL == 0 || iter == MAX_ITERS) + { + last_val_loss = estimate_loss(model, dl, "val", rng); + if (last_val_loss < best_val_loss) + { + best_val_loss = last_val_loss; + model.save(model_path); + better = true; + } + } - model.load(model_path); - model.rng = std::mt19937(SEED + 42); + // print every step + std::cout + << "step" + << std::setw(5) << iter << "/" << MAX_ITERS + << " | loss " + << std::fixed << std::setprecision(6) << batch_loss + << " | val " + << std::fixed << std::setprecision(6) << last_val_loss + << " | lr " + << std::scientific << std::setprecision(2) << (float)LEARNING_RATE + << " | " + << std::fixed << std::setprecision(2) << step_ms << " ms" + << " | " << tok_per_sec << " tok/s" + << (better ? " *best*" : "") + << "\n"; + std::cout.flush(); - std::vector ctx = {0}; - while (!g_interrupted) - { - ctx = model.generate(ctx, 1); - std::cout << dl.decode({ctx.back()}) << std::flush; - if ((int)ctx.size() > BLOCK_SIZE) - ctx = std::vector(ctx.end() - BLOCK_SIZE, ctx.end()); + // sample after every eval window + if (iter % EVAL_INTERVAL == 0 || iter == MAX_ITERS) + { + std::cout << "generating:\n"; + sample_tokens(model, dl, iter == MAX_ITERS ? 
10000 : 150); + } } - std::cout << "\n\n[Stopped by user]\n"; - std::cout << "[TOTAL] Wall-clock: " - << std::fixed << std::setprecision(1) - << (wall_secs() - train_start) << "s\n"; return 0; } \ No newline at end of file diff --git a/run.md b/run.md deleted file mode 100644 index a2c0e65..0000000 --- a/run.md +++ /dev/null @@ -1,492 +0,0 @@ -# Quadtrix.cpp - -Quadtrix.cpp is a local GPT-style language model project with multiple runtime paths: - -- Native C++ inference and training through `Quadtrix.exe` / `main.cpp` -- PyTorch checkpoint inference through `engine/inference.py` and `engine/best_model .pt` -- FastAPI middleware in `backend/` -- React + TypeScript chat UI in `frontend/` - -The web interface can chat with both model backends: - -- `C++`: calls the C++ HTTP server on port `8080` -- `.pt`: loads the PyTorch checkpoint directly from `engine/best_model .pt` - -## Project Layout - -```text -Quadtrix.cpp/ - Quadtrix.exe - main.cpp - config/ - include/ - data/ - engine/ - inference.py - main.py - fine-tune/main.py - best_model .pt - fineweb_30mb.txt - backend/ - main.py - inference.py - requirements.txt - frontend/ - package.json - src/ -``` - -## Requirements - -- Python 3.10+ -- Node.js 18+ -- npm -- C++17 compiler if you want to rebuild the C++ executable - -## 1. Python Setup - -From the repo root: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp -python -m venv .venv -.\.venv\Scripts\python.exe -m pip install --upgrade pip -``` - -Install backend and PyTorch inference dependencies: - -```powershell -cd backend -..\.venv\Scripts\python.exe -m pip install -r requirements.txt -``` - -## 2. Frontend Setup - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend -npm.cmd install -npm.cmd run build -``` - -Run the frontend: - -```powershell -npm.cmd run dev -``` - -Frontend URL: - -```text -http://localhost:5173 -``` - -## Install as a Web App - -The frontend is configured as an installable PWA. 
It includes: - -- `frontend/manifest.webmanifest` -- `frontend/sw.js` -- `frontend/public/manifest.webmanifest` -- `frontend/public/sw.js` -- service worker registration in `frontend/src/registerServiceWorker.ts` - -For the clean installable version, build and preview the frontend: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend -npm.cmd run build -npm.cmd run preview -``` - -Open the preview URL, usually: - -```text -http://localhost:4173 -``` - -Then install from the browser: - -- Chrome / Edge: click the install icon in the address bar -- Or open browser menu -> Apps -> Install this site as an app - -The installed app still talks to the backend at: - -```text -http://localhost:3001 -``` - -So keep the FastAPI backend running when chatting. - -## 3. Run the PyTorch `.pt` Model in the Web UI - -The `.pt` model does not need a separate model server. The FastAPI backend loads it directly from: - -```text -engine/best_model .pt -``` - -Start the backend: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\backend -..\.venv\Scripts\python.exe -m uvicorn main:app --host 127.0.0.1 --port 3001 -``` - -Start the frontend in another terminal: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend -npm.cmd run dev -``` - -Open: - -```text -http://localhost:5173 -``` - -Select `.pt` in the top bar. - -## 4. Run the C++ Model in the Web UI - -Start the C++ inference server: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp -.\Quadtrix.exe data\input.txt --server --port 8080 -``` - -Start the backend: - -```powershell -cd backend -..\.venv\Scripts\python.exe -m uvicorn main:app --host 127.0.0.1 --port 3001 -``` - -Start the frontend: - -```powershell -cd ..\frontend -npm.cmd run dev -``` - -Open: - -```text -http://localhost:5173 -``` - -Select `C++` in the top bar. - -## 5. Run Both Backends Together - -Use three terminals. 
- -Terminal 1: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp -.\Quadtrix.exe data\input.txt --server --port 8080 -``` - -Terminal 2: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\backend -..\.venv\Scripts\python.exe -m uvicorn main:app --host 127.0.0.1 --port 3001 -``` - -Terminal 3: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend -npm.cmd run dev -``` - -Open: - -```text -http://localhost:5173 -``` - -Switch between `C++` and `.pt` from the model selector. - -## 6. Backend API - -Base URL: - -```text -http://localhost:3001 -``` - -Routes: - -```text -GET /api/health -GET /api/stats -POST /api/chat -GET /api/sessions -POST /api/sessions -DELETE /api/sessions/{id} -GET /api/sessions/{id}/messages -POST /api/feedback -``` - -Example `.pt` chat request: - -```powershell -Invoke-RestMethod ` - -Uri http://localhost:3001/api/chat ` - -Method Post ` - -ContentType "application/json" ` - -Body '{ - "session_id": null, - "prompt": "Once upon a time", - "max_tokens": 100, - "temperature": 1.0, - "stream": false, - "model_backend": "torch" - }' -``` - -Example C++ chat request: - -```powershell -Invoke-RestMethod ` - -Uri http://localhost:3001/api/chat ` - -Method Post ` - -ContentType "application/json" ` - -Body '{ - "session_id": null, - "prompt": "Once upon a time", - "max_tokens": 100, - "temperature": 1.0, - "stream": false, - "model_backend": "cpp" - }' -``` - -## 7. Environment Variables - -Backend defaults are in `backend/.env.example`: - -```text -API_PORT=3001 -CORS_ORIGINS=http://localhost:5173 -REDIS_URL= -LOG_LEVEL=INFO -MAX_SESSIONS=1000 -SESSION_TTL_HOURS=24 -CPP_SERVER_URL=http://localhost:8080 -TORCH_CHECKPOINT_PATH=../engine/best_model .pt -REQUEST_TIMEOUT_SECONDS=60 -``` - -Create `backend/.env` if you want overrides. - -Frontend defaults are in `frontend/.env.example`: - -```text -VITE_API_BASE_URL=http://localhost:3001 -``` - -## 8. 
PyTorch CLI Inference - -Interactive chat: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp -.\.venv\Scripts\python.exe engine\inference.py --checkpoint "engine\best_model .pt" -``` - -Generate once: - -```powershell -.\.venv\Scripts\python.exe engine\inference.py --checkpoint "engine\best_model .pt" --prompt "Hello" --max-new-tokens 100 --temperature 1.0 -``` - -## 9. PyTorch Training - -Main training: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp -.\.venv\Scripts\python.exe engine\main.py -``` - -Fine-tuning: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp -.\.venv\Scripts\python.exe engine\fine-tune\main.py -``` - -## 10. C++ Build and Run - -Build manually: - -```powershell -g++ -std=c++17 -O2 -I. -Iinclude -o Quadtrix.exe main.cpp -``` - -Train from scratch: - -```powershell -.\Quadtrix.exe data\input.txt -``` - -Terminal chat: - -```powershell -.\Quadtrix.exe data\input.txt --chat -``` - -Raw generation: - -```powershell -.\Quadtrix.exe data\input.txt --generate -``` - -HTTP server: - -```powershell -.\Quadtrix.exe data\input.txt --server --port 8080 -``` - -## 11. Health Checks - -Backend: - -```powershell -Invoke-RestMethod http://localhost:3001/api/health -``` - -C++ server: - -```powershell -Invoke-RestMethod http://localhost:8080/health -``` - -Frontend: - -```text -http://localhost:5173 -``` - -When only `.pt` is available, backend health should show: - -```json -{ - "status": "degraded", - "api": "ok", - "cpp_server": "unreachable", - "torch_model": "ok" -} -``` - -When both are available, backend health should show: - -```json -{ - "status": "ok", - "api": "ok", - "cpp_server": "ok", - "torch_model": "ok" -} -``` - -## 12. 
Troubleshooting - -### PowerShell blocks `npm` - -Use `npm.cmd`: - -```powershell -npm.cmd run dev -npm.cmd run build -``` - -### `.pt` model is unavailable - -Check that this file exists: - -```text -engine/best_model .pt -``` - -Then check Python dependencies: - -```powershell -cd backend -..\.venv\Scripts\python.exe -c "import torch, tiktoken; print(torch.__version__)" -``` - -### Backend cannot import FastAPI - -Install dependencies into the repo venv: - -```powershell -cd backend -..\.venv\Scripts\python.exe -m pip install -r requirements.txt -``` - -### C++ option is offline - -Start the C++ server: - -```powershell -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp -.\Quadtrix.exe data\input.txt --server --port 8080 -``` - -### Frontend cannot reach backend - -Check: - -```text -http://localhost:3001/api/health -``` - -Make sure frontend config points to: - -```text -VITE_API_BASE_URL=http://localhost:3001 -``` - -### Port already in use - -```powershell -Get-NetTCPConnection -LocalPort 3001 -Get-NetTCPConnection -LocalPort 5173 -Get-NetTCPConnection -LocalPort 8080 -``` - -## Recommended Daily Run - -```powershell -# Terminal 1 -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp -.\Quadtrix.exe data\input.txt --server --port 8080 -``` - -```powershell -# Terminal 2 -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\backend -..\.venv\Scripts\python.exe -m uvicorn main:app --host 127.0.0.1 --port 3001 -``` - -```powershell -# Terminal 3 -cd C:\Users\Admin\Documents\GitHub\Quadtrix.cpp\frontend -npm.cmd run dev -``` - -Open: - -```text -http://localhost:5173 -``` - -## License - -MIT diff --git a/scripts/build.sh b/scripts/build.sh new file mode 100644 index 0000000..e36678b --- /dev/null +++ b/scripts/build.sh @@ -0,0 +1,161 @@ + +# Quadtrix.cpp — build.sh +# Usage +# ./scripts/build.sh # full stack, CPU +# ./scripts/build.sh dev # hot-reload dev mode +# ./scripts/build.sh gpu # CUDA backend +# ./scripts/build.sh cpp-only # compile + run C++ engine +# 
./scripts/build.sh train-cpp # train with C++ backend +# ./scripts/build.sh train-torch # train with PyTorch backend +# ./scripts/build.sh bench # run benchmark +# ./scripts/build.sh clean # remove containers + volumes +# ./scripts/build.sh logs # tail all service logs + +set -euo pipefail + +BOLD="\033[1m" +GREEN="\033[0;32m" +CYAN="\033[0;36m" +YELLOW="\033[1;33m" +RED="\033[0;31m" +RESET="\033[0m" + +info() { echo -e "${CYAN}[quadtrix]${RESET} $*"; } +success() { echo -e "${GREEN}[quadtrix]${RESET} $*"; } +warn() { echo -e "${YELLOW}[quadtrix]${RESET} $*"; } +error() { echo -e "${RED}[quadtrix] ERROR:${RESET} $*" >&2; } + +COMPOSE_BASE="docker compose -f docker-compose.yml" +COMPOSE_DEV="${COMPOSE_BASE} -f docker-compose.dev.yml" +COMPOSE_GPU="${COMPOSE_BASE} -f docker-compose.gpu.yml" + +check_docker() { + if ! docker info &>/dev/null; then + error "Docker daemon is not running. Start Docker Desktop or the Docker service." + exit 1 + fi +} + +check_nvidia() { + if ! command -v nvidia-smi &>/dev/null; then + warn "nvidia-smi not found — GPU mode may not work." + else + info "GPU detected: $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)" + fi +} + +pull_cache() { + info "Pulling build cache images (if available)..." + $COMPOSE_BASE pull --ignore-pull-failures 2>/dev/null || true +} + +cmd_up() { + check_docker + info "Starting full stack (CPU)..." + $COMPOSE_BASE up --build -d + success "Stack is up." + echo "" + echo -e " ${BOLD}Frontend:${RESET} http://localhost:5173" + echo -e " ${BOLD}API:${RESET} http://localhost:3001/api/health" + echo -e " ${BOLD}Docs:${RESET} http://localhost:3001/docs" +} + +cmd_dev() { + check_docker + info "Starting in DEV mode (hot-reload)..." + $COMPOSE_DEV up --build +} + +cmd_gpu() { + check_docker + check_nvidia + info "Starting with CUDA GPU support..." + $COMPOSE_GPU up --build -d + success "GPU stack is up." +} + +cmd_cpp_only() { + check_docker + info "Compiling and running C++ engine..." 
+ $COMPOSE_BASE --profile cpp run --rm cpp "$@" +} + +cmd_train_cpp() { + check_docker + info "Training with C++ backend..." + $COMPOSE_BASE --profile train run --rm train-cpp + success "C++ training complete. Checkpoint saved in 'models' volume." +} + +cmd_train_torch() { + check_docker + info "Training with PyTorch backend..." + $COMPOSE_BASE --profile train run --rm train-torch + success "PyTorch training complete. Checkpoint saved in 'models' volume." +} + +cmd_bench() { + check_docker + info "Running benchmark..." + $COMPOSE_BASE --profile benchmark run --rm benchmark +} + +cmd_logs() { + check_docker + $COMPOSE_BASE logs -f --tail=100 +} + +cmd_clean() { + check_docker + warn "This will remove all containers and volumes (including saved models!)" + read -r -p "Are you sure? [y/N] " confirm + if [[ "${confirm,,}" == "y" ]]; then + $COMPOSE_BASE down -v --remove-orphans + docker image prune -f --filter "label=org.opencontainers.image.source=https://github.com/Eamon2009/Quadtrix.cpp" + success "Cleaned." + else + info "Aborted." + fi +} + +cmd_ps() { + $COMPOSE_BASE ps +} + +cmd_shell() { + service="${1:-backend}" + info "Opening shell in '${service}'..." 
+ $COMPOSE_BASE exec "${service}" /bin/sh +} +CMD="${1:-up}" +shift || true + +case "${CMD}" in + up) cmd_up "$@" ;; + dev) cmd_dev "$@" ;; + gpu) cmd_gpu "$@" ;; + cpp-only) cmd_cpp_only "$@" ;; + train-cpp) cmd_train_cpp "$@" ;; + train-torch) cmd_train_torch "$@" ;; + bench) cmd_bench "$@" ;; + logs) cmd_logs "$@" ;; + clean) cmd_clean "$@" ;; + ps) cmd_ps "$@" ;; + shell) cmd_shell "$@" ;; + *) + echo -e "Usage: ./scripts/build.sh ${BOLD}[command]${RESET}" + echo "" + echo "Commands:" + echo " up Full stack (CPU) — default" + echo " dev Hot-reload dev mode" + echo " gpu CUDA GPU stack" + echo " cpp-only Run C++ engine CLI" + echo " train-cpp Train with C++ backend" + echo " train-torch Train with PyTorch" + echo " bench Benchmark" + echo " logs Tail logs" + echo " ps Show container status" + echo " shell [svc] Shell into service (default: backend)" + echo " clean Remove all containers + volumes" + ;; +esac From 6facc3e7e450ef33072feaa427c42630a7a6a216 Mon Sep 17 00:00:00 2001 From: Eamon Date: Mon, 1 Jun 2026 00:54:48 +0530 Subject: [PATCH 02/45] ci: add manual PR checks workflow with slash command support --- .github/workflows/pr-check.yml | 59 ++++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) diff --git a/.github/workflows/pr-check.yml b/.github/workflows/pr-check.yml index 699b834..4824b9e 100644 --- a/.github/workflows/pr-check.yml +++ b/.github/workflows/pr-check.yml @@ -56,6 +56,23 @@ jobs: }); core.setOutput('sha', pr.head.sha); + - name: Set checks to pending + uses: actions/github-script@v7 + with: + script: | + const sha = '${{ steps.get-sha.outputs.sha }}'; + const checks = ['Lint', 'Build C++ (ubuntu-22.04)', 'Build C++ (ubuntu-24.04)', 'Build C++ (macos-14)', 'Validate']; + for (const check of checks) { + await github.rest.repos.createCommitStatus({ + owner: context.repo.owner, + repo: context.repo.repo, + sha, + state: 'pending', + context: check, + description: 'Waiting...', + }); + } + lint: name: Lint @@ -77,6 +94,20 @@ jobs: 
with: args: "check engine/ --ignore E501 --exit-zero" + - name: Report status + if: always() + uses: actions/github-script@v7 + with: + script: | + await github.rest.repos.createCommitStatus({ + owner: context.repo.owner, + repo: context.repo.repo, + sha: '${{ needs.slash-command.outputs.pr-sha }}', + state: '${{ job.status }}' === 'success' ? 'success' : 'failure', + context: 'Lint', + description: '${{ job.status }}', + }); + build-cpp: name: Build C++ (${{ matrix.os }}) @@ -125,6 +156,20 @@ jobs: path: quadtrix retention-days: 7 + - name: Report status + if: always() + uses: actions/github-script@v7 + with: + script: | + await github.rest.repos.createCommitStatus({ + owner: context.repo.owner, + repo: context.repo.repo, + sha: '${{ needs.slash-command.outputs.pr-sha }}', + state: '${{ job.status }}' === 'success' ? 'success' : 'failure', + context: 'Build C++ (${{ matrix.os }})', + description: '${{ job.status }}', + }); + validate: name: Validate @@ -171,6 +216,20 @@ jobs: dockerfile: .devops/Dockerfile.backend failure-threshold: error + - name: Report status + if: always() + uses: actions/github-script@v7 + with: + script: | + await github.rest.repos.createCommitStatus({ + owner: context.repo.owner, + repo: context.repo.repo, + sha: '${{ needs.slash-command.outputs.pr-sha }}', + state: '${{ job.status }}' === 'success' ? 'success' : 'failure', + context: 'Validate', + description: '${{ job.status }}', + }); + post-result: name: Post result From 40b8bd93fb776c075ba10cc3cf7b3b2e7f992843 Mon Sep 17 00:00:00 2001 From: Eamon Date: Mon, 1 Jun 2026 01:00:08 +0530 Subject: [PATCH 03/45] feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. 
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 Co-authored-by: Max * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- Co-authored-by: Max --- CUDA/includes/attention.cuh | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) create mode 100644 CUDA/includes/attention.cuh diff --git a/CUDA/includes/attention.cuh b/CUDA/includes/attention.cuh new file mode 100644 index 0000000..7feac08 --- /dev/null +++ b/CUDA/includes/attention.cuh @@ -0,0 +1,29 @@ +#pragma once + +#include "tensor.cuh" + +#include + +namespace quadtrix { +namespace cuda { + +Status attention_forward( + const TensorView& input_qkv, + TensorView preatt, + TensorView att, + TensorView output, + int num_heads, + cudaStream_t stream = nullptr); + +Status attention_backward( + const TensorView& grad_output, + const TensorView& input_qkv, + const TensorView& att, + TensorView grad_input_qkv, + TensorView grad_preatt, + TensorView grad_att, + int num_heads, + cudaStream_t stream = nullptr); + +} // namespace cuda +} // namespace quadtrix From 4aac832e725f1ec5b2136b3167bfa7028e714ee5 Mon Sep 17 00:00:00 2001 From: Eamon Date: Mon, 1 Jun 2026 22:30:58 +0530 Subject: [PATCH 04/45] feat(cuda): add checkpoint metadata struct and stub functions --- CUDA/includes/checkpoint.h | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) create mode 100644 CUDA/includes/checkpoint.h diff --git a/CUDA/includes/checkpoint.h b/CUDA/includes/checkpoint.h new file mode 
100644 index 0000000..ba91b0f --- /dev/null +++ b/CUDA/includes/checkpoint.h @@ -0,0 +1,25 @@ +#pragma once + +#include "tensor.cuh" + +namespace quadtrix { +namespace cuda { + +struct CheckpointMetadata { + int vocab_size = 0; + int max_sequence_length = 0; + int num_layers = 0; + int num_heads = 0; + int channels = 0; +}; + +inline bool load_checkpoint_metadata(const char*, CheckpointMetadata*) { + return false; +} + +inline bool save_tensor_checkpoint(const char*, const TensorView&) { + return false; +} + +} // namespace cuda +} // namespace quadtrix From 47696058b34c95c45e715fb7b25dcec5a28ea955 Mon Sep 17 00:00:00 2001 From: Eamon Date: Mon, 1 Jun 2026 22:34:04 +0530 Subject: [PATCH 05/45] feat(cuda): introduce core type definitions and error handling utilities - Defines the `DType` enum (F32, F16, BF16, I32, U8) and the `DeviceKind` enum (CPU, CUDA). - Implements the `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation, alongside `checked_mul` for overflow-checked allocation size computation. - Introduces the `check_cuda` and `abort_on_cuda` error-handling functions, exposed via the `QUADTRIX_CUDA_CHECK` and `QUADTRIX_CUDA_ABORT` macros.
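As a rough illustration of the overflow-checked size computation listed above, here is a minimal standalone sketch. `checked_mul` mirrors the helper added in this patch (the `std::size_t` template argument on `std::numeric_limits` is an assumption), while `allocation_bytes` is a hypothetical helper that is not part of the patch. Nothing here depends on the CUDA toolkit.

```cpp
#include <cstddef>
#include <limits>

// Mirrors checked_mul from CUDA/includes/common.h; the <std::size_t>
// template argument is assumed here.
inline bool checked_mul(std::size_t lhs, std::size_t rhs, std::size_t* out) {
    // lhs * rhs overflows exactly when rhs > max / lhs (for non-zero lhs).
    if (lhs != 0 && rhs > std::numeric_limits<std::size_t>::max() / lhs) {
        return false;
    }
    *out = lhs * rhs;
    return true;
}

// Hypothetical helper (not in the patch): byte count for a rows x cols
// buffer whose elements are element_size bytes each. Returns false and
// leaves *out untouched if any intermediate product would overflow.
inline bool allocation_bytes(std::size_t rows, std::size_t cols,
                             std::size_t element_size, std::size_t* out) {
    std::size_t count = 0;
    return checked_mul(rows, cols, &count) &&
           checked_mul(count, element_size, out);
}
```

With these definitions, `allocation_bytes(64, 1024, 4, &bytes)` succeeds and stores 262144, whereas a request such as `allocation_bytes(SIZE_MAX, 2, 4, &bytes)` returns false rather than wrapping around. Returning a flag instead of throwing matches the non-throwing `Status` style described in this commit.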
--- CUDA/includes/common.h | 120 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 CUDA/includes/common.h diff --git a/CUDA/includes/common.h b/CUDA/includes/common.h new file mode 100644 index 0000000..36df155 --- /dev/null +++ b/CUDA/includes/common.h @@ -0,0 +1,120 @@ +#pragma once + +#include + +#include +#include +#include +#include +#include + +namespace quadtrix { +namespace cuda { + +enum class DType : std::uint8_t { + F32, + F16, + BF16, + I32, + U8, +}; + +enum class DeviceKind : std::uint8_t { + CPU, + CUDA, +}; + +struct Status { + bool ok; + cudaError_t cuda_error; + const char* message; + + static Status success() { + return {true, cudaSuccess, "ok"}; + } + + static Status failure(cudaError_t error, const char* message) { + return {false, error, message}; + } +}; + +inline const char* dtype_name(DType dtype) { + switch (dtype) { + case DType::F32: + return "f32"; + case DType::F16: + return "f16"; + case DType::BF16: + return "bf16"; + case DType::I32: + return "i32"; + case DType::U8: + return "u8"; + } + return "unknown"; +} + +inline std::size_t dtype_size(DType dtype) { + switch (dtype) { + case DType::F32: + return 4; + case DType::F16: + return 2; + case DType::BF16: + return 2; + case DType::I32: + return 4; + case DType::U8: + return 1; + } + + std::fprintf(stderr, "Unknown CUDA dtype value %u\n", static_cast<unsigned>(dtype)); + std::abort(); +} + +inline bool checked_mul(std::size_t lhs, std::size_t rhs, std::size_t* out) { + if (lhs != 0 && rhs > std::numeric_limits<std::size_t>::max() / lhs) { + return false; + } + *out = lhs * rhs; + return true; +} + +inline Status check_cuda(cudaError_t error, const char* expression, const char* file, int line) { + if (error == cudaSuccess) { + return Status::success(); + } + + std::fprintf( + stderr, + "CUDA error at %s:%d: %s failed with %s\n", + file, + line, + expression, + cudaGetErrorString(error)); + return Status::failure(error, expression); +} + +inline void 
abort_on_cuda(cudaError_t error, const char* expression, const char* file, int line) { + if (error == cudaSuccess) { + return; + } + + std::fprintf( + stderr, + "Fatal CUDA error at %s:%d: %s failed with %s\n", + file, + line, + expression, + cudaGetErrorString(error)); + std::abort(); +} + +} // namespace cuda +} // namespace quadtrix + +#define QUADTRIX_CUDA_CHECK(expr) \ + ::quadtrix::cuda::check_cuda((expr), #expr, __FILE__, __LINE__) + +#define QUADTRIX_CUDA_ABORT(expr) \ + ::quadtrix::cuda::abort_on_cuda((expr), #expr, __FILE__, __LINE__) From 7c94958781dddc8d38a30d34dd343a00417c7fc7 Mon Sep 17 00:00:00 2001 From: Eamon Date: Mon, 1 Jun 2026 22:34:39 +0530 Subject: [PATCH 06/45] feat(cuda): add TokenBatchView struct and DataLoader stub class --- CUDA/includes/dataloader.h | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) create mode 100644 CUDA/includes/dataloader.h diff --git a/CUDA/includes/dataloader.h b/CUDA/includes/dataloader.h new file mode 100644 index 0000000..fd3c47d --- /dev/null +++ b/CUDA/includes/dataloader.h @@ -0,0 +1,29 @@ +#pragma once + +#include +#include + +namespace quadtrix { +namespace cuda { + +struct TokenBatchView { + const std::int32_t* inputs = nullptr; + const std::int32_t* targets = nullptr; + int batch_size = 0; + int sequence_length = 0; +}; + +class DataLoader { +public: + DataLoader() = default; + + bool next(TokenBatchView* batch) { + if (batch != nullptr) { + *batch = {}; + } + return false; + } +}; + +} // namespace cuda +} // namespace quadtrix From c62c869527bcf83ab494341b1667b7ac95e9af95 Mon Sep 17 00:00:00 2001 From: Eamon Date: Mon, 1 Jun 2026 22:35:34 +0530 Subject: [PATCH 07/45] feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. 
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. --- CUDA/includes/gelu.cuh | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) create mode 100644 CUDA/includes/gelu.cuh diff --git a/CUDA/includes/gelu.cuh b/CUDA/includes/gelu.cuh new file mode 100644 index 0000000..af87e64 --- /dev/null +++ b/CUDA/includes/gelu.cuh @@ -0,0 +1,31 @@ +#pragma once + +#include "tensor.cuh" + +#include + +#include + +namespace quadtrix { +namespace cuda { + +enum class GeluMode : std::uint8_t { + Exact, + Approximate, +}; + +Status gelu_forward( + const TensorView& input, + TensorView output, + GeluMode mode = GeluMode::Approximate, + cudaStream_t stream = nullptr); + +Status gelu_backward( + const TensorView& grad_output, + const TensorView& input, + TensorView grad_input, + GeluMode mode = GeluMode::Approximate, + cudaStream_t stream = nullptr); + +} // namespace cuda +} // namespace quadtrix From 28117dc6f6e5bb2be6544f0a9007043a943686c1 Mon Sep 17 00:00:00 2001 From: Eamon Date: Mon, 1 Jun 2026 22:47:36 +0530 Subject: [PATCH 08/45] feat(cuda): add gradient norm calculation and clipping interfaces --- CUDA/includes/global_norm.cuh | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 CUDA/includes/global_norm.cuh diff --git a/CUDA/includes/global_norm.cuh b/CUDA/includes/global_norm.cuh new file mode 100644 index 0000000..f418ab7 --- /dev/null +++ b/CUDA/includes/global_norm.cuh @@ -0,0 +1,26 @@ +#pragma once + +#include "tensor.cuh" + +#include + +namespace quadtrix { +namespace cuda { + +Status global_norm_squared( + const TensorView& grads, + TensorView partial_sums, + cudaStream_t stream = nullptr); + +Status clip_gradients_by_global_norm( + TensorView grads, + float global_norm, + float max_norm, + cudaStream_t stream = nullptr); + +inline float clip_scale(float global_norm, float max_norm) { + return global_norm > max_norm && global_norm > 0.0f ? 
max_norm / global_norm : 1.0f; +} + +} // namespace cuda +} // namespace quadtrix From 3bdf5bed6472cf21ed9904ebe99d50d98689e79d Mon Sep 17 00:00:00 2001 From: Eamon Date: Mon, 1 Jun 2026 22:48:31 +0530 Subject: [PATCH 09/45] feat(cuda): add LayerNorm forward and backward kernel declarations --- CUDA/includes/layernorm.cuh | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) create mode 100644 CUDA/includes/layernorm.cuh diff --git a/CUDA/includes/layernorm.cuh b/CUDA/includes/layernorm.cuh new file mode 100644 index 0000000..2645537 --- /dev/null +++ b/CUDA/includes/layernorm.cuh @@ -0,0 +1,32 @@ +#pragma once + +#include "tensor.cuh" + +#include + +namespace quadtrix { +namespace cuda { + +Status layernorm_forward( + const TensorView& input, + const TensorView& gamma, + const TensorView& beta, + TensorView output, + TensorView mean, + TensorView rstd, + float epsilon = 1.0e-5f, + cudaStream_t stream = nullptr); + +Status layernorm_backward( + const TensorView& grad_output, + const TensorView& input, + const TensorView& gamma, + const TensorView& mean, + const TensorView& rstd, + TensorView grad_input, + TensorView grad_gamma, + TensorView grad_beta, + cudaStream_t stream = nullptr); + +} // namespace cuda +} // namespace quadtrix From 3dba73a7650cdec838f5907f8fafcdcc5fd5cd65 Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Tue, 2 Jun 2026 22:34:54 +0530 Subject: [PATCH 10/45] refactor(ci): organize workflow into push-triggered QA and manual docker builds Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options. 
--- .github/workflows/ci.yml | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index bf49286..c30d16b 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -2,11 +2,11 @@ name: CI on: push: - branches: [master, dev] + branches: [master] workflow_dispatch: inputs: image: - description: "Which image to build?" + description: "Which image to build? (cpp=C++ engine, cpu=PyTorch CPU, cuda=PyTorch CUDA, all=all three)" required: true type: choice options: @@ -14,7 +14,7 @@ on: - cpu - cuda - all - push: + push_image: description: "Push to ghcr.io?" required: true default: "true" @@ -27,6 +27,7 @@ env: jobs: + file-integrity: name: File integrity if: github.event_name == 'push' @@ -86,8 +87,9 @@ jobs: run: ./quadtrix --help || true + build-cpp-image: - name: Build — cpp + name: "Build — cpp (C++ engine · linux/amd64 + arm64)" if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') runs-on: ubuntu-latest permissions: @@ -100,7 +102,7 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Login to GHCR - if: inputs.push == 'true' + if: inputs.push_image == 'true' uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} @@ -123,7 +125,7 @@ jobs: context: . 
file: .devops/Dockerfile.cpp platforms: linux/amd64,linux/arm64 - push: ${{ inputs.push == 'true' }} + push: ${{ inputs.push_image == 'true' }} tags: ${{ steps.meta.outputs.tags }} labels: ${{ steps.meta.outputs.labels }} cache-from: type=gha,scope=cpp @@ -131,7 +133,7 @@ jobs: build-cpu-image: - name: Build — cpu + name: "Build — cpu (PyTorch CPU · linux/amd64 + arm64)" if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all') runs-on: ubuntu-latest permissions: @@ -144,7 +146,7 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Login to GHCR - if: inputs.push == 'true' + if: inputs.push_image == 'true' uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} @@ -167,7 +169,7 @@ jobs: context: . file: .devops/Dockerfile platforms: linux/amd64,linux/arm64 - push: ${{ inputs.push == 'true' }} + push: ${{ inputs.push_image == 'true' }} tags: ${{ steps.meta.outputs.tags }} labels: ${{ steps.meta.outputs.labels }} cache-from: type=gha,scope=cpu @@ -175,7 +177,7 @@ jobs: build-cuda-image: - name: Build — cuda + name: "Build — cuda (PyTorch CUDA · linux/amd64 only)" if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all') runs-on: ubuntu-latest permissions: @@ -187,7 +189,7 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Login to GHCR - if: inputs.push == 'true' + if: inputs.push_image == 'true' uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} @@ -210,8 +212,8 @@ jobs: context: . 
file: .devops/Dockerfile.backend platforms: linux/amd64 - push: ${{ inputs.push == 'true' }} + push: ${{ inputs.push_image == 'true' }} tags: ${{ steps.meta.outputs.tags }} labels: ${{ steps.meta.outputs.labels }} cache-from: type=gha,scope=cuda - cache-to: type=gha,mode=max,scope=cuda \ No newline at end of file + cache-to: type=gha,mode=max,scope=cuda From 309183fdefb5ad77fd64828d147f1f04037f47ea Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Tue, 2 Jun 2026 22:45:48 +0530 Subject: [PATCH 11/45] Fix formatting and update CI workflow steps --- .github/workflows/ci.yml | 30 ++++++++++++++++++------------ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index c30d16b..0423bd2 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -45,9 +45,9 @@ jobs: failed=0 for f in "${files[@]}"; do if [ -f "$f" ]; then - echo "✅ $f" + echo "PASS: $f" else - echo "❌ $f — MISSING" + echo "FAIL: $f -- MISSING" failed=1 fi done @@ -86,11 +86,17 @@ jobs: - name: Smoke test run: ./quadtrix --help || true + - name: Upload binary + uses: actions/upload-artifact@v4 + with: + name: quadtrix-linux-amd64 + path: quadtrix + retention-days: 7 + - build-cpp-image: - name: "Build — cpp (C++ engine · linux/amd64 + arm64)" - if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') + name: "Build -- cpp (C++ engine - linux/amd64 + arm64)" + if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') }} runs-on: ubuntu-latest permissions: contents: read @@ -102,7 +108,7 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Login to GHCR - if: inputs.push_image == 'true' + if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} @@ -133,8 +139,8 @@ jobs: build-cpu-image: - name: "Build — cpu (PyTorch CPU · linux/amd64 + arm64)" - if: github.event_name == 'workflow_dispatch' && 
(inputs.image == 'cpu' || inputs.image == 'all') + name: "Build -- cpu (PyTorch CPU - linux/amd64 + arm64)" + if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all') }} runs-on: ubuntu-latest permissions: contents: read @@ -146,7 +152,7 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Login to GHCR - if: inputs.push_image == 'true' + if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} @@ -177,8 +183,8 @@ jobs: build-cuda-image: - name: "Build — cuda (PyTorch CUDA · linux/amd64 only)" - if: github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all') + name: "Build -- cuda (PyTorch CUDA - linux/amd64 only)" + if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all') }} runs-on: ubuntu-latest permissions: contents: read @@ -189,7 +195,7 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Login to GHCR - if: inputs.push_image == 'true' + if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} From ac398662e4e8bfab63840f3a709afd3fd0d6e5e9 Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Tue, 2 Jun 2026 22:49:14 +0530 Subject: [PATCH 12/45] Enhance CI with macOS binary build and release Added macOS binary build and release steps to CI workflow. 
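Two small expressions in the diff that follows carry most of the logic: the release job's `startsWith(github.ref, 'refs/tags/v')` guard and the bash expansion `${GITHUB_REPOSITORY_OWNER,,}` used to build the image prefix. A minimal Python sketch of both is given here for readers less familiar with those syntaxes. The function names are hypothetical, and the note on lowercasing reflects the general rule that container image references must be lowercase, not a reason stated in this commit.

```python
def is_release_ref(ref):
    """Approximate startsWith(github.ref, 'refs/tags/v')."""
    # GitHub's startsWith is documented as case-insensitive, hence lower().
    return ref.lower().startswith("refs/tags/v")


def image_prefix(owner):
    """Approximate IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix."""
    # ${VAR,,} lowercases the whole value; str.lower() does the same here.
    return f"ghcr.io/{owner.lower()}/quadtrix"
```

For example, an owner spelled with capital letters still yields an all-lowercase prefix, and a branch ref such as `refs/heads/master` does not count as a release ref.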
--- .github/workflows/ci.yml | 64 +++++++++++++++++++++++++++++++++++++--- 1 file changed, 60 insertions(+), 4 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 0423bd2..d0f158b 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -3,6 +3,8 @@ name: CI on: push: branches: [master] + tags: + - 'v*' workflow_dispatch: inputs: image: @@ -23,11 +25,10 @@ on: env: REGISTRY: ghcr.io - IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix jobs: - + file-integrity: name: File integrity if: github.event_name == 'push' @@ -67,8 +68,8 @@ jobs: args: "check engine/ --ignore E501 --exit-zero" - build-cpp: - name: C++ compile check + build-binary-linux: + name: Binary (ubuntu-latest) if: github.event_name == 'push' runs-on: ubuntu-latest steps: @@ -94,6 +95,52 @@ jobs: retention-days: 7 + build-binary-macos: + name: Binary (macos-14) + if: github.event_name == 'push' + runs-on: macos-14 + steps: + - uses: actions/checkout@v4 + + - name: Compile main.cpp + run: | + g++ -std=c++17 -O3 \ + -I. 
-Iinclude \ + -o quadtrix main.cpp + + - name: Smoke test + run: ./quadtrix --help || true + + - name: Package binary + run: tar -czf quadtrix-macos-arm64.tar.gz quadtrix + + - name: Upload binary + uses: actions/upload-artifact@v4 + with: + name: quadtrix-macos-arm64 + path: quadtrix-macos-arm64.tar.gz + retention-days: 7 + + release: + name: Publish release + if: startsWith(github.ref, 'refs/tags/v') + needs: [build-binary-linux, build-binary-macos] + runs-on: ubuntu-latest + permissions: + contents: write + steps: + - name: Download all artifacts + uses: actions/download-artifact@v4 + with: + path: dist/ + + - name: Publish GitHub release + uses: softprops/action-gh-release@v2 + with: + files: | + dist/quadtrix-linux-amd64/quadtrix + dist/quadtrix-macos-arm64/quadtrix-macos-arm64.tar.gz + build-cpp-image: name: "Build -- cpp (C++ engine - linux/amd64 + arm64)" if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') }} @@ -107,6 +154,9 @@ jobs: - uses: docker/setup-qemu-action@v3 - uses: docker/setup-buildx-action@v3 + - name: Set lowercase image prefix + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + - name: Login to GHCR if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 @@ -151,6 +201,9 @@ jobs: - uses: docker/setup-qemu-action@v3 - uses: docker/setup-buildx-action@v3 + - name: Set lowercase image prefix + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + - name: Login to GHCR if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 @@ -194,6 +247,9 @@ jobs: - uses: docker/setup-buildx-action@v3 + - name: Set lowercase image prefix + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + - name: Login to GHCR if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 From e38ff85eca57521327e30c8983512a2351855622 Mon Sep 17 00:00:00 2001 From: Eamon Date: Tue, 2 Jun 2026 
23:24:15 +0530 Subject: [PATCH 13/45] feat(docker): add Dockerfile for frontend application --- .devops/Dockerfile.dev.frontend | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 .devops/Dockerfile.dev.frontend diff --git a/.devops/Dockerfile.dev.frontend b/.devops/Dockerfile.dev.frontend new file mode 100644 index 0000000..de054d6 --- /dev/null +++ b/.devops/Dockerfile.dev.frontend @@ -0,0 +1,12 @@ +FROM node:20-alpine + +WORKDIR /app + +COPY frontend/package*.json ./ +RUN npm ci + +COPY frontend/ ./ + +EXPOSE 5173 + +CMD ["npm", "run", "dev", "--", "--host", "0.0.0.0"] From b120ffd1b2726e11a8fbb23eed07223d0ea3136b Mon Sep 17 00:00:00 2001 From: Eamon Date: Tue, 2 Jun 2026 23:24:32 +0530 Subject: [PATCH 14/45] feat(docker): add Dockerfile for frontend application --- .devops/Dockerfile.frontend | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) create mode 100644 .devops/Dockerfile.frontend diff --git a/.devops/Dockerfile.frontend b/.devops/Dockerfile.frontend new file mode 100644 index 0000000..70ca5aa --- /dev/null +++ b/.devops/Dockerfile.frontend @@ -0,0 +1,22 @@ +FROM node:20-alpine AS build + +WORKDIR /app + +COPY frontend/package*.json ./ +RUN npm ci + +COPY frontend/ ./ + +ARG VITE_API_BASE_URL=/api +ENV VITE_API_BASE_URL=${VITE_API_BASE_URL} + +RUN npm run build + +FROM nginx:1.27-alpine AS runtime + +COPY .devops/nginx.conf /etc/nginx/conf.d/default.conf +COPY --from=build /app/dist /usr/share/nginx/html + +EXPOSE 80 + +CMD ["nginx", "-g", "daemon off;"] From 9156bba064a027ed817156b2d5ab287977f999bd Mon Sep 17 00:00:00 2001 From: Eamon Date: Tue, 2 Jun 2026 23:25:34 +0530 Subject: [PATCH 15/45] refactor(ci): remove release job from GitHub actions --- .github/workflows/ci.yml | 22 ---------------------- 1 file changed, 22 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index d0f158b..e6502d0 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -3,8 +3,6 @@ name: CI on: push: 
branches: [master] - tags: - - 'v*' workflow_dispatch: inputs: image: @@ -121,26 +119,6 @@ jobs: path: quadtrix-macos-arm64.tar.gz retention-days: 7 - release: - name: Publish release - if: startsWith(github.ref, 'refs/tags/v') - needs: [build-binary-linux, build-binary-macos] - runs-on: ubuntu-latest - permissions: - contents: write - steps: - - name: Download all artifacts - uses: actions/download-artifact@v4 - with: - path: dist/ - - - name: Publish GitHub release - uses: softprops/action-gh-release@v2 - with: - files: | - dist/quadtrix-linux-amd64/quadtrix - dist/quadtrix-macos-arm64/quadtrix-macos-arm64.tar.gz - build-cpp-image: name: "Build -- cpp (C++ engine - linux/amd64 + arm64)" if: ${{ github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all') }} From 8898418b0722fa8178fc1b7738a56432f884448b Mon Sep 17 00:00:00 2001 From: Eamon Date: Tue, 2 Jun 2026 23:26:24 +0530 Subject: [PATCH 16/45] ci: add unified release and docker build workflow --- .github/workflows/docker-publish.yml | 214 ++++++++++++++++----------- 1 file changed, 128 insertions(+), 86 deletions(-) diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml index ca9493f..3a8fbae 100644 --- a/.github/workflows/docker-publish.yml +++ b/.github/workflows/docker-publish.yml @@ -1,66 +1,86 @@ -name: Release +name: Docker Images on: + push: + tags: + - "v*" workflow_dispatch: inputs: + image: + description: "Which image to build? (cpp=C++ engine, cpu=PyTorch CPU, cuda=PyTorch CUDA, all=all three)" + required: true + type: choice + options: + - cpp + - cpu + - cuda + - all version: - description: "Version tag (e.g. 1.2.3)" + description: "Optional image tag for manual runs" + required: false + push_image: + description: "Push to ghcr.io?" 
required: true + default: "true" + type: choice + options: ["true", "false"] env: REGISTRY: ghcr.io IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix jobs: - - build-binaries: - name: Binary (${{ matrix.os }}) - runs-on: ${{ matrix.os }} - strategy: - matrix: - os: [ubuntu-22.04, macos-14] - include: - - os: ubuntu-22.04 - artifact_name: quadtrix-linux-x64 - binary: quadtrix - - os: macos-14 - artifact_name: quadtrix-macos-arm64 - binary: quadtrix + build-cpp-image: + name: "Build -- cpp (C++ engine - linux/amd64 + arm64)" + if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all')) }} + runs-on: ubuntu-latest + permissions: + contents: read + packages: write steps: - uses: actions/checkout@v4 - - name: Compile (Linux) - if: runner.os == 'Linux' - run: | - sudo apt-get update && sudo apt-get install -y g++ - g++ -std=c++17 -O3 -march=native \ - -I. -Iinclude \ - -o ${{ matrix.binary }} main.cpp - strip ${{ matrix.binary }} - - - name: Compile (macOS) - if: runner.os == 'macOS' - run: | - g++ -std=c++17 -O3 -march=native \ - -I. -Iinclude \ - -o ${{ matrix.binary }} main.cpp - - - name: Package - run: | - mkdir dist - cp ${{ matrix.binary }} dist/ - cp README.md LICENSE dist/ - tar -czf ${{ matrix.artifact_name }}.tar.gz -C dist . 
- - - name: Upload to Release - uses: softprops/action-gh-release@v2 + - uses: docker/setup-qemu-action@v3 + - uses: docker/setup-buildx-action@v3 + + - name: Set lowercase image prefix + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + + - name: Login to GHCR + if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + uses: docker/login-action@v3 with: - tag_name: v${{ github.event.inputs.version }} - files: ${{ matrix.artifact_name }}.tar.gz - generate_release_notes: true + registry: ${{ env.REGISTRY }} + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} - publish-images: - name: Publish Docker images + - name: Extract metadata + id: meta + uses: docker/metadata-action@v5 + with: + images: ${{ env.IMAGE_PREFIX }}-cpp + tags: | + type=ref,event=branch + type=sha,prefix=sha- + type=raw,value=latest,enable={{is_default_branch}} + type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }} + type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }} + + - name: Build & push + uses: docker/build-push-action@v6 + with: + context: . 
+ file: .devops/Dockerfile.cpp + platforms: linux/amd64,linux/arm64 + push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + cache-from: type=gha,scope=cpp + cache-to: type=gha,mode=max,scope=cpp + + build-cpu-image: + name: "Build -- cpu (PyTorch CPU - linux/amd64 + arm64)" + if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all')) }} runs-on: ubuntu-latest permissions: contents: read @@ -71,62 +91,84 @@ jobs: - uses: docker/setup-qemu-action@v3 - uses: docker/setup-buildx-action@v3 + - name: Set lowercase image prefix + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + - name: Login to GHCR + if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} username: ${{ github.actor }} password: ${{ secrets.GITHUB_TOKEN }} - - name: Parse tag - id: tag - run: echo "VERSION=${{ github.event.inputs.version }}" >> $GITHUB_OUTPUT - - - name: Build & push backend - uses: docker/build-push-action@v6 + - name: Extract metadata + id: meta + uses: docker/metadata-action@v5 with: - context: . 
- file: .devops/Dockerfile.backend - platforms: linux/amd64,linux/arm64 - push: true + images: ${{ env.IMAGE_PREFIX }}-cpu tags: | - ${{ env.IMAGE_PREFIX }}-backend:latest - ${{ env.IMAGE_PREFIX }}-backend:${{ steps.tag.outputs.VERSION }} - cache-from: type=gha,scope=backend - cache-to: type=gha,mode=max,scope=backend + type=ref,event=branch + type=sha,prefix=sha- + type=raw,value=latest,enable={{is_default_branch}} + type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }} + type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }} - - name: Build & push frontend + - name: Build & push uses: docker/build-push-action@v6 with: context: . - file: .devops/Dockerfile.frontend + file: .devops/Dockerfile platforms: linux/amd64,linux/arm64 - push: true + push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + cache-from: type=gha,scope=cpu + cache-to: type=gha,mode=max,scope=cpu + + build-cuda-image: + name: "Build -- cuda (PyTorch CUDA - linux/amd64 only)" + if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all')) }} + runs-on: ubuntu-latest + permissions: + contents: read + packages: write + steps: + - uses: actions/checkout@v4 + + - uses: docker/setup-buildx-action@v3 + + - name: Set lowercase image prefix + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + + - name: Login to GHCR + if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + uses: docker/login-action@v3 + with: + registry: ${{ env.REGISTRY }} + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + + - name: Extract metadata + id: meta + uses: docker/metadata-action@v5 + with: + images: ${{ env.IMAGE_PREFIX }}-cuda tags: | - ${{ env.IMAGE_PREFIX }}-frontend:latest - ${{ env.IMAGE_PREFIX 
}}-frontend:${{ steps.tag.outputs.VERSION }} - cache-from: type=gha,scope=frontend - cache-to: type=gha,mode=max,scope=frontend + type=ref,event=branch + type=sha,prefix=sha- + type=raw,value=latest,enable={{is_default_branch}} + type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }} + type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }} - - name: Build & push cpp + - name: Build & push uses: docker/build-push-action@v6 with: context: . - file: .devops/Dockerfile.cpp - platforms: linux/amd64,linux/arm64 - push: true - tags: | - ${{ env.IMAGE_PREFIX }}-cpp:latest - ${{ env.IMAGE_PREFIX }}-cpp:${{ steps.tag.outputs.VERSION }} - cache-from: type=gha,scope=cpp - cache-to: type=gha,mode=max,scope=cpp - - - name: Create Release summary - run: | - echo "## Docker images published" >> $GITHUB_STEP_SUMMARY - echo "" >> $GITHUB_STEP_SUMMARY - echo "| Image | Tags |" >> $GITHUB_STEP_SUMMARY - echo "|-------|------|" >> $GITHUB_STEP_SUMMARY - echo "| \`quadtrix-backend\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY - echo "| \`quadtrix-frontend\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY - echo "| \`quadtrix-cpp\` | \`latest\`, \`${{ steps.tag.outputs.VERSION }}\` |" >> $GITHUB_STEP_SUMMARY + file: .devops/Dockerfile.backend + platforms: linux/amd64 + push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + cache-from: type=gha,scope=cuda + cache-to: type=gha,mode=max,scope=cuda From f4f3bf3daffe4f6786a085416dc48a3a4507c29f Mon Sep 17 00:00:00 2001 From: Eamon Date: Tue, 2 Jun 2026 23:26:40 +0530 Subject: [PATCH 17/45] ci: add unified release and docker build workflow --- .github/workflows/release.yml | 236 ++++++++++++++++++++++++++++++++++ 1 file changed, 236 insertions(+) create mode 100644 .github/workflows/release.yml diff 
--git a/.github/workflows/release.yml b/.github/workflows/release.yml new file mode 100644 index 0000000..219b56c --- /dev/null +++ b/.github/workflows/release.yml @@ -0,0 +1,236 @@ +name: Release + +on: + push: + tags: + - "v*" + workflow_dispatch: + inputs: + version: + description: "Release version, with or without a leading v" + required: true + +env: + ARTIFACT_ROOT: release-assets + +jobs: + prepare-release: + name: Prepare release metadata + runs-on: ubuntu-latest + outputs: + tag_name: ${{ steps.meta.outputs.tag_name }} + version: ${{ steps.meta.outputs.version }} + steps: + - id: meta + shell: bash + run: | + set -euo pipefail + if [ "${GITHUB_EVENT_NAME}" = "workflow_dispatch" ]; then + raw_version="${{ inputs.version }}" + else + raw_version="${GITHUB_REF_NAME}" + fi + + raw_version="${raw_version#v}" + tag_name="v${raw_version}" + + echo "tag_name=${tag_name}" >> "$GITHUB_OUTPUT" + echo "version=${raw_version}" >> "$GITHUB_OUTPUT" + + build-linux: + name: Linux ${{ matrix.arch }} CPU + needs: prepare-release + runs-on: ${{ matrix.runner }} + strategy: + fail-fast: false + matrix: + include: + - arch: x64 + runner: ubuntu-22.04 + compiler: g++ + packages: build-essential file + cxxflags: -std=c++17 -O3 -march=native + artifact: quadtrix-ubuntu-x64-cpu.tar.gz + - arch: arm64 + runner: ubuntu-24.04-arm + compiler: g++ + packages: build-essential file + cxxflags: -std=c++17 -O3 -march=native + artifact: quadtrix-ubuntu-arm64-cpu.tar.gz + - arch: s390x + runner: ubuntu-22.04 + compiler: s390x-linux-gnu-g++ + packages: g++-s390x-linux-gnu file + cxxflags: -std=c++17 -O3 + artifact: quadtrix-ubuntu-s390x-cpu.tar.gz + steps: + - uses: actions/checkout@v4 + + - name: Install toolchain + shell: bash + run: | + set -euo pipefail + sudo apt-get update + sudo apt-get install -y ${{ matrix.packages }} + + - name: Build binary + shell: bash + run: | + set -euo pipefail + ${{ matrix.compiler }} ${{ matrix.cxxflags }} \ + -I. 
-Iinclude \ + -o quadtrix main.cpp + file quadtrix + + - name: Smoke test + shell: bash + run: | + set +e + ./quadtrix --chat >/dev/null 2>&1 + exit 0 + + - name: Package artifact + shell: bash + run: | + set -euo pipefail + mkdir -p "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu" + cp quadtrix "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu/" + cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu/" + tar -czf "${{ matrix.artifact }}" -C "${ARTIFACT_ROOT}" "quadtrix-ubuntu-${{ matrix.arch }}-cpu" + + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: quadtrix-ubuntu-${{ matrix.arch }}-cpu + path: ${{ matrix.artifact }} + if-no-files-found: error + retention-days: 30 + + build-windows: + name: Windows ${{ matrix.arch }} CPU + needs: prepare-release + runs-on: ${{ matrix.runner }} + strategy: + fail-fast: false + matrix: + include: + - arch: x64 + runner: windows-latest + msvc_arch: x64 + artifact: quadtrix-windows-x64-cpu.zip + - arch: arm64 + runner: windows-11-arm + msvc_arch: arm64 + artifact: quadtrix-windows-arm64-cpu.zip + steps: + - uses: actions/checkout@v4 + + - name: Set up MSVC + uses: ilammy/msvc-dev-cmd@v1 + with: + arch: ${{ matrix.msvc_arch }} + + - name: Build binary + shell: cmd + run: | + cl /nologo /std:c++17 /O2 /EHsc /Iinclude /I. 
main.cpp /Fe:quadtrix.exe + + - name: Smoke test + shell: pwsh + run: | + $ErrorActionPreference = 'Continue' + & .\quadtrix.exe --chat | Out-Null + exit 0 + + - name: Package artifact + shell: pwsh + run: | + New-Item -ItemType Directory -Force "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu" | Out-Null + Copy-Item quadtrix.exe "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\" + Copy-Item README.md, LICENSE "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\" + Compress-Archive -Path "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\*" -DestinationPath "${{ matrix.artifact }}" -Force + + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: quadtrix-windows-${{ matrix.arch }}-cpu + path: ${{ matrix.artifact }} + if-no-files-found: error + retention-days: 30 + + build-macos: + name: macOS ${{ matrix.arch }} CPU + needs: prepare-release + runs-on: ${{ matrix.runner }} + strategy: + fail-fast: false + matrix: + include: + - arch: x64 + arch_flag: x86_64 + runner: macos-13 + artifact: quadtrix-macos-x64-cpu.tar.gz + - arch: arm64 + arch_flag: arm64 + runner: macos-14 + artifact: quadtrix-macos-arm64-cpu.tar.gz + steps: + - uses: actions/checkout@v4 + + - name: Build binary + shell: bash + run: | + set -euo pipefail + clang++ -std=c++17 -O3 -arch ${{ matrix.arch_flag }} \ + -I. 
-Iinclude \ + -o quadtrix main.cpp + file quadtrix + + - name: Smoke test + shell: bash + run: | + set +e + ./quadtrix --chat >/dev/null 2>&1 + exit 0 + + - name: Package artifact + shell: bash + run: | + set -euo pipefail + mkdir -p "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu" + cp quadtrix "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu/" + cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu/" + tar -czf "${{ matrix.artifact }}" -C "${ARTIFACT_ROOT}" "quadtrix-macos-${{ matrix.arch }}-cpu" + + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: quadtrix-macos-${{ matrix.arch }}-cpu + path: ${{ matrix.artifact }} + if-no-files-found: error + retention-days: 30 + + publish-release: + name: Publish GitHub release + needs: + - prepare-release + - build-linux + - build-windows + - build-macos + runs-on: ubuntu-latest + permissions: + contents: write + steps: + - name: Download all artifacts + uses: actions/download-artifact@v4 + with: + path: dist + merge-multiple: true + + - name: Publish release + uses: softprops/action-gh-release@v2 + with: + tag_name: ${{ needs.prepare-release.outputs.tag_name }} + target_commitish: ${{ github.sha }} + files: dist/* + generate_release_notes: true From af5a20756767eb7227d7b51ae8110ca1979f0a23 Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Wed, 3 Jun 2026 01:17:06 +0530 Subject: [PATCH 18/45] Refactor macOS build workflow for arm64 architecture --- .github/workflows/release.yml | 48 ++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 23 deletions(-) diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 219b56c..4f14816 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -158,22 +158,11 @@ jobs: if-no-files-found: error retention-days: 30 - build-macos: - name: macOS ${{ matrix.arch }} CPU + + build-macos-arm64: + name: macOS arm64 CPU needs: prepare-release - runs-on: ${{ matrix.runner }} - 
strategy: - fail-fast: false - matrix: - include: - - arch: x64 - arch_flag: x86_64 - runner: macos-13 - artifact: quadtrix-macos-x64-cpu.tar.gz - - arch: arm64 - arch_flag: arm64 - runner: macos-14 - artifact: quadtrix-macos-arm64-cpu.tar.gz + runs-on: macos-14 steps: - uses: actions/checkout@v4 @@ -181,7 +170,7 @@ jobs: shell: bash run: | set -euo pipefail - clang++ -std=c++17 -O3 -arch ${{ matrix.arch_flag }} \ + clang++ -std=c++17 -O3 -arch arm64 \ -I. -Iinclude \ -o quadtrix main.cpp file quadtrix @@ -197,16 +186,16 @@ jobs: shell: bash run: | set -euo pipefail - mkdir -p "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu" - cp quadtrix "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu/" - cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-macos-${{ matrix.arch }}-cpu/" - tar -czf "${{ matrix.artifact }}" -C "${ARTIFACT_ROOT}" "quadtrix-macos-${{ matrix.arch }}-cpu" + mkdir -p "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu" + cp quadtrix "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu/" + cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu/" + tar -czf "quadtrix-macos-arm64-cpu.tar.gz" -C "${ARTIFACT_ROOT}" "quadtrix-macos-arm64-cpu" - name: Upload artifact uses: actions/upload-artifact@v4 with: - name: quadtrix-macos-${{ matrix.arch }}-cpu - path: ${{ matrix.artifact }} + name: quadtrix-macos-arm64-cpu + path: quadtrix-macos-arm64-cpu.tar.gz if-no-files-found: error retention-days: 30 @@ -216,8 +205,21 @@ jobs: - prepare-release - build-linux - build-windows - - build-macos + - build-macos-x64 + - build-macos-arm64 runs-on: ubuntu-latest + + if: | + always() && + needs.prepare-release.result == 'success' && + needs.build-linux.result == 'success' && + needs.build-windows.result == 'success' && + needs.build-macos-x64.result == 'success' && + ( + needs.build-macos-arm64.result == 'success' || + needs.build-macos-arm64.result == 'cancelled' || + needs.build-macos-arm64.result == 'skipped' + ) permissions: contents: write steps: From 
58f89df8fe246569804a147df893e3e9ebd2262f Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Wed, 3 Jun 2026 01:20:45 +0530 Subject: [PATCH 19/45] Update release workflow to remove macOS x64 build Removed dependency on build-macos-x64 for the release job. --- .github/workflows/release.yml | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 4f14816..9d73bc0 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -158,7 +158,7 @@ jobs: if-no-files-found: error retention-days: 30 - + # Optional — cancelling this job will not block the release build-macos-arm64: name: macOS arm64 CPU needs: prepare-release @@ -205,16 +205,13 @@ jobs: - prepare-release - build-linux - build-windows - - build-macos-x64 - build-macos-arm64 runs-on: ubuntu-latest - if: | always() && needs.prepare-release.result == 'success' && needs.build-linux.result == 'success' && needs.build-windows.result == 'success' && - needs.build-macos-x64.result == 'success' && ( needs.build-macos-arm64.result == 'success' || needs.build-macos-arm64.result == 'cancelled' || From 1718c3df29b14e9b2398e7e6f02ddf0fe3f2cb19 Mon Sep 17 00:00:00 2001 From: Eamon Date: Wed, 3 Jun 2026 11:44:19 +0530 Subject: [PATCH 20/45] perf: update execution time benchmarks in csv Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-Authored-By: Eamon Sippy --- benchmark/results/python_benchmark.csv | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 benchmark/results/python_benchmark.csv diff --git a/benchmark/results/python_benchmark.csv b/benchmark/results/python_benchmark.csv new file mode 100644 index 0000000..c264086 --- /dev/null +++ b/benchmark/results/python_benchmark.csv @@ -0,0 +1,13 @@ +suite,name,backend,batch_size,sequence_length,tokens,avg_ms,median_ms,min_ms,max_ms,p90_ms,p95_ms,std_ms,tokens_per_sec,samples,loss,memory_mb,notes 
+data,tokenizer_encode,python,0,0,220975,169.76018999121152,164.62069997214712,124.44350001169369,211.44290000665933,204.09656001720577,207.76973001193255,29.81756930091779,1301689.165236207,10,,188.39453125, +data,batch_sample_to_device,python,4,32,128,0.34600000944919884,0.2575500402599573,0.2452000044286251,0.8668999653309584,0.48601999878883345,0.6764599820598955,0.1852238791693057,369942.18642873614,10,,189.2734375, +primitive,matmul_3d_1x16,python,1,16,16,0.028490001568570733,0.026749970857053995,0.024800014216452837,0.04350004019215703,0.03351001651026308,0.038505028351210044,0.005415998067507546,561600.5306805843,10,,181.234375, +primitive,matmul_3d_4x32,python,4,32,128,0.047069991705939174,0.043849984649568796,0.03890000516548753,0.07130001904442906,0.05185997579246759,0.06157999741844831,0.008578937902311791,2719354.632557738,10,,181.296875, +primitive,attention_scores_4x32,python,4,32,128,0.11958999675698578,0.10689999908208847,0.10410003596916795,0.20840001525357366,0.12946999049745497,0.16893500287551425,0.030181239200475163,1070323.6346774376,10,,181.93359375, +forward,batch1_seq8,python,1,8,8,16.073119995417073,15.318600024329498,14.594200009014457,20.715299993753433,17.489159997785464,19.102229995769445,1.798887105385644,497.7253950870173,10,10.797359466552734,166.43359375, +forward,batch1_seq32,python,1,32,32,21.528740011854097,21.653899981174618,20.405600022058934,22.147400013636798,22.095200035255402,22.1213000244461,0.548371285312407,1486.3851754622074,10,10.882255554199219,190.01171875, +forward,batch4_seq32,python,4,32,128,44.681840017437935,45.51370002445765,37.46199997840449,54.08870003884658,48.26489001279697,51.17679502582177,4.5654932173684655,2864.6984983171146,10,10.885703086853027,253.171875, +training,adamw_step_b4_s32,python,4,32,128,229.80256001465023,207.2436999878846,200.93890000134706,321.9230000395328,279.9100400414318,300.9165200404823,46.404669570312535,556.9998871720134,5,10.602718353271484,392.30078125, 
+generation,empty,python,1,1,32,563.3423800056335,548.9804000244476,466.00820001913235,704.8150000046007,643.7829400005285,674.2989700025645,72.57013670803387,56.80382150492566,10,,218.44140625, +generation,short,python,1,6,32,524.1239399998449,524.1038500098512,493.7280000303872,561.7482999805361,549.8817999905441,555.8150499855401,20.612269243289685,61.054261326070076,10,,218.47265625, +generation,long,python,1,32,32,561.3779200008139,560.0390000035986,545.9933000383899,574.2078999755904,570.0918399612419,572.1498699684162,7.699534842668483,57.00259817834233,10,,218.14453125, From 48971226c6cdc2f270e5f82b53781b116b1885e0 Mon Sep 17 00:00:00 2001 From: Eamon Date: Wed, 3 Jun 2026 22:30:41 +0530 Subject: [PATCH 21/45] ci(docker): refactor image build workflow and add frontend job --- .github/workflows/ci.yml | 2 -- 1 file changed, 2 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index e6502d0..992ef59 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -1,8 +1,6 @@ name: CI on: - push: - branches: [master] workflow_dispatch: inputs: image: From 275ecd12300dc8aec4d215fb19eab49cbb653982 Mon Sep 17 00:00:00 2001 From: Eamon Date: Wed, 3 Jun 2026 22:31:07 +0530 Subject: [PATCH 22/45] ci(docker): refactor image build workflow and add frontend job --- .github/workflows/docker-publish.yml | 116 +++++++++++++++++++-------- 1 file changed, 81 insertions(+), 35 deletions(-) diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml index 3a8fbae..b7c1584 100644 --- a/.github/workflows/docker-publish.yml +++ b/.github/workflows/docker-publish.yml @@ -1,38 +1,40 @@ name: Docker Images on: - push: - tags: - - "v*" workflow_dispatch: inputs: image: - description: "Which image to build? 
(cpp=C++ engine, cpu=PyTorch CPU, cuda=PyTorch CUDA, all=all three)" + description: "Image variant to build" required: true type: choice options: - cpp - cpu - cuda + - frontend - all version: description: "Optional image tag for manual runs" required: false push_image: - description: "Push to ghcr.io?" + description: "Push to ghcr.io" required: true default: "true" type: choice options: ["true", "false"] +concurrency: + group: docker-images-${{ github.ref }} + cancel-in-progress: true + env: REGISTRY: ghcr.io IMAGE_PREFIX: ghcr.io/${{ github.repository_owner }}/quadtrix jobs: build-cpp-image: - name: "Build -- cpp (C++ engine - linux/amd64 + arm64)" - if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cpp' || inputs.image == 'all')) }} + name: Docker cpp + if: ${{ inputs.image == 'cpp' || inputs.image == 'all' }} runs-on: ubuntu-latest permissions: contents: read @@ -44,10 +46,10 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Set lowercase image prefix - run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV" - name: Login to GHCR - if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} @@ -60,27 +62,26 @@ jobs: with: images: ${{ env.IMAGE_PREFIX }}-cpp tags: | - type=ref,event=branch + type=ref,event=tag type=sha,prefix=sha- - type=raw,value=latest,enable={{is_default_branch}} - type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }} - type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }} + type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }} + type=raw,value=latest,enable=${{ inputs.push_image == 'true' }} - - name: Build & push + - name: Build and push uses: 
docker/build-push-action@v6 with: context: . file: .devops/Dockerfile.cpp platforms: linux/amd64,linux/arm64 - push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + push: ${{ inputs.push_image == 'true' }} tags: ${{ steps.meta.outputs.tags }} labels: ${{ steps.meta.outputs.labels }} cache-from: type=gha,scope=cpp cache-to: type=gha,mode=max,scope=cpp build-cpu-image: - name: "Build -- cpu (PyTorch CPU - linux/amd64 + arm64)" - if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cpu' || inputs.image == 'all')) }} + name: Docker cpu + if: ${{ inputs.image == 'cpu' || inputs.image == 'all' }} runs-on: ubuntu-latest permissions: contents: read @@ -92,10 +93,10 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Set lowercase image prefix - run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV" - name: Login to GHCR - if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} @@ -108,27 +109,26 @@ jobs: with: images: ${{ env.IMAGE_PREFIX }}-cpu tags: | - type=ref,event=branch + type=ref,event=tag type=sha,prefix=sha- - type=raw,value=latest,enable={{is_default_branch}} - type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }} - type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }} + type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }} + type=raw,value=latest,enable=${{ inputs.push_image == 'true' }} - - name: Build & push + - name: Build and push uses: docker/build-push-action@v6 with: context: . 
file: .devops/Dockerfile platforms: linux/amd64,linux/arm64 - push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + push: ${{ inputs.push_image == 'true' }} tags: ${{ steps.meta.outputs.tags }} labels: ${{ steps.meta.outputs.labels }} cache-from: type=gha,scope=cpu cache-to: type=gha,mode=max,scope=cpu build-cuda-image: - name: "Build -- cuda (PyTorch CUDA - linux/amd64 only)" - if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && (inputs.image == 'cuda' || inputs.image == 'all')) }} + name: Docker cuda + if: ${{ inputs.image == 'cuda' || inputs.image == 'all' }} runs-on: ubuntu-latest permissions: contents: read @@ -139,10 +139,10 @@ jobs: - uses: docker/setup-buildx-action@v3 - name: Set lowercase image prefix - run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> $GITHUB_ENV + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV" - name: Login to GHCR - if: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + if: ${{ inputs.push_image == 'true' }} uses: docker/login-action@v3 with: registry: ${{ env.REGISTRY }} @@ -155,20 +155,66 @@ jobs: with: images: ${{ env.IMAGE_PREFIX }}-cuda tags: | - type=ref,event=branch + type=ref,event=tag type=sha,prefix=sha- - type=raw,value=latest,enable={{is_default_branch}} - type=raw,value=${{ inputs.version }},enable=${{ github.event_name == 'workflow_dispatch' && inputs.version != '' }} - type=semver,pattern={{version}},enable=${{ github.event_name == 'push' }} + type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }} + type=raw,value=latest,enable=${{ inputs.push_image == 'true' }} - - name: Build & push + - name: Build and push uses: docker/build-push-action@v6 with: context: . 
file: .devops/Dockerfile.backend platforms: linux/amd64 - push: ${{ github.event_name == 'push' || inputs.push_image == 'true' }} + push: ${{ inputs.push_image == 'true' }} tags: ${{ steps.meta.outputs.tags }} labels: ${{ steps.meta.outputs.labels }} cache-from: type=gha,scope=cuda cache-to: type=gha,mode=max,scope=cuda + + build-frontend-image: + name: Docker frontend + if: ${{ inputs.image == 'frontend' || inputs.image == 'all' }} + runs-on: ubuntu-latest + permissions: + contents: read + packages: write + steps: + - uses: actions/checkout@v4 + + - uses: docker/setup-qemu-action@v3 + - uses: docker/setup-buildx-action@v3 + + - name: Set lowercase image prefix + run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV" + + - name: Login to GHCR + if: ${{ inputs.push_image == 'true' }} + uses: docker/login-action@v3 + with: + registry: ${{ env.REGISTRY }} + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + + - name: Extract metadata + id: meta + uses: docker/metadata-action@v5 + with: + images: ${{ env.IMAGE_PREFIX }}-frontend + tags: | + type=ref,event=tag + type=sha,prefix=sha- + type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }} + type=raw,value=latest,enable=${{ inputs.push_image == 'true' }} + + - name: Build and push + uses: docker/build-push-action@v6 + with: + context: . 
+ file: .devops/Dockerfile.frontend + platforms: linux/amd64,linux/arm64 + push: ${{ inputs.push_image == 'true' }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + cache-from: type=gha,scope=frontend + cache-to: type=gha,mode=max,scope=frontend From 947c760b69d77a7a9ab0c0ab38314e0b5cfe5fec Mon Sep 17 00:00:00 2001 From: Eamon Date: Wed, 3 Jun 2026 22:31:15 +0530 Subject: [PATCH 23/45] ci(docker): refactor image build workflow and add frontend job --- .github/workflows/release.yml | 212 +++++++++++++++++----------------- 1 file changed, 104 insertions(+), 108 deletions(-) diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 9d73bc0..7b8e9ab 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -1,83 +1,79 @@ name: Release on: - push: - tags: - - "v*" workflow_dispatch: inputs: version: - description: "Release version, with or without a leading v" + description: "Release version, for example v1.2.3 or 1.2.3" required: true +concurrency: + group: release + cancel-in-progress: false + env: ARTIFACT_ROOT: release-assets jobs: - prepare-release: - name: Prepare release metadata + release-metadata: + name: Release metadata runs-on: ubuntu-latest outputs: - tag_name: ${{ steps.meta.outputs.tag_name }} - version: ${{ steps.meta.outputs.version }} + tag_name: ${{ steps.tag.outputs.tag_name }} steps: - - id: meta + - id: tag shell: bash run: | set -euo pipefail - if [ "${GITHUB_EVENT_NAME}" = "workflow_dispatch" ]; then - raw_version="${{ inputs.version }}" + raw_tag="${{ inputs.version }}" + if [[ "${raw_tag}" == v* ]]; then + tag_name="${raw_tag}" else - raw_version="${GITHUB_REF_NAME}" + tag_name="v${raw_tag}" fi - - raw_version="${raw_version#v}" - tag_name="v${raw_version}" - echo "tag_name=${tag_name}" >> "$GITHUB_OUTPUT" - echo "version=${raw_version}" >> "$GITHUB_OUTPUT" - build-linux: - name: Linux ${{ matrix.arch }} CPU - needs: prepare-release - runs-on: ${{ matrix.runner }} 
+ ubuntu-cpu: + name: Ubuntu ${{ matrix.build }} CPU + needs: release-metadata + runs-on: ${{ matrix.os }} strategy: fail-fast: false matrix: include: - - arch: x64 - runner: ubuntu-22.04 - compiler: g++ - packages: build-essential file - cxxflags: -std=c++17 -O3 -march=native - artifact: quadtrix-ubuntu-x64-cpu.tar.gz - - arch: arm64 - runner: ubuntu-24.04-arm - compiler: g++ - packages: build-essential file - cxxflags: -std=c++17 -O3 -march=native - artifact: quadtrix-ubuntu-arm64-cpu.tar.gz - - arch: s390x - runner: ubuntu-22.04 - compiler: s390x-linux-gnu-g++ - packages: g++-s390x-linux-gnu file - cxxflags: -std=c++17 -O3 - artifact: quadtrix-ubuntu-s390x-cpu.tar.gz + - build: x64 + os: ubuntu-22.04 + - build: arm64 + os: ubuntu-24.04-arm + - build: s390x + os: ubuntu-24.04-s390x steps: - - uses: actions/checkout@v4 + - name: Clone + uses: actions/checkout@v4 + with: + fetch-depth: 0 - - name: Install toolchain + - name: Dependencies shell: bash run: | set -euo pipefail sudo apt-get update - sudo apt-get install -y ${{ matrix.packages }} + sudo apt-get install -y build-essential file + + - name: Toolchain workaround + if: ${{ contains(matrix.os, 'ubuntu-24.04') }} + shell: bash + run: | + set -euo pipefail + sudo apt-get install -y gcc-14 g++-14 + echo "CC=gcc-14" >> "$GITHUB_ENV" + echo "CXX=g++-14" >> "$GITHUB_ENV" - - name: Build binary + - name: Build shell: bash run: | set -euo pipefail - ${{ matrix.compiler }} ${{ matrix.cxxflags }} \ + ${CXX:-g++} -std=c++17 -O3 -DNDEBUG \ -I. 
-Iinclude \ -o quadtrix main.cpp file quadtrix @@ -89,88 +85,97 @@ jobs: ./quadtrix --chat >/dev/null 2>&1 exit 0 - - name: Package artifact + - name: Pack artifacts shell: bash run: | set -euo pipefail - mkdir -p "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu" - cp quadtrix "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu/" - cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-ubuntu-${{ matrix.arch }}-cpu/" - tar -czf "${{ matrix.artifact }}" -C "${ARTIFACT_ROOT}" "quadtrix-ubuntu-${{ matrix.arch }}-cpu" + package="quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-ubuntu-${{ matrix.build }}-cpu" + mkdir -p "${ARTIFACT_ROOT}/${package}" + cp quadtrix README.md LICENSE "${ARTIFACT_ROOT}/${package}/" + tar -czf "${package}.tar.gz" -C "${ARTIFACT_ROOT}" "${package}" - - name: Upload artifact + - name: Upload artifacts uses: actions/upload-artifact@v4 with: - name: quadtrix-ubuntu-${{ matrix.arch }}-cpu - path: ${{ matrix.artifact }} + name: quadtrix-bin-ubuntu-${{ matrix.build }}-cpu + path: quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-ubuntu-${{ matrix.build }}-cpu.tar.gz if-no-files-found: error retention-days: 30 - build-windows: + windows-cpu: name: Windows ${{ matrix.arch }} CPU - needs: prepare-release - runs-on: ${{ matrix.runner }} + needs: release-metadata + runs-on: windows-2022 strategy: fail-fast: false matrix: include: - arch: x64 - runner: windows-latest - msvc_arch: x64 - artifact: quadtrix-windows-x64-cpu.zip + vcvars: x64 - arch: arm64 - runner: windows-11-arm - msvc_arch: arm64 - artifact: quadtrix-windows-arm64-cpu.zip + vcvars: amd64_arm64 steps: - - uses: actions/checkout@v4 - - - name: Set up MSVC - uses: ilammy/msvc-dev-cmd@v1 + - name: Clone + uses: actions/checkout@v4 with: - arch: ${{ matrix.msvc_arch }} + fetch-depth: 0 - - name: Build binary + - name: Build shell: cmd run: | - cl /nologo /std:c++17 /O2 /EHsc /Iinclude /I. 
main.cpp /Fe:quadtrix.exe + call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" ${{ matrix.vcvars }} + cl /nologo /std:c++17 /O2 /DNDEBUG /EHsc /Iinclude /I. main.cpp /Fe:quadtrix.exe - name: Smoke test + if: ${{ matrix.arch == 'x64' }} shell: pwsh run: | $ErrorActionPreference = 'Continue' & .\quadtrix.exe --chat | Out-Null exit 0 - - name: Package artifact + - name: Pack artifacts shell: pwsh run: | - New-Item -ItemType Directory -Force "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu" | Out-Null - Copy-Item quadtrix.exe "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\" - Copy-Item README.md, LICENSE "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\" - Compress-Archive -Path "${env:ARTIFACT_ROOT}\quadtrix-windows-${{ matrix.arch }}-cpu\*" -DestinationPath "${{ matrix.artifact }}" -Force + $package = "quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-windows-${{ matrix.arch }}-cpu" + New-Item -ItemType Directory -Force "${env:ARTIFACT_ROOT}\${package}" | Out-Null + Copy-Item quadtrix.exe "${env:ARTIFACT_ROOT}\${package}\" + Copy-Item README.md, LICENSE "${env:ARTIFACT_ROOT}\${package}\" + Compress-Archive -Path "${env:ARTIFACT_ROOT}\${package}\*" -DestinationPath "${package}.zip" -Force - - name: Upload artifact + - name: Upload artifacts uses: actions/upload-artifact@v4 with: - name: quadtrix-windows-${{ matrix.arch }}-cpu - path: ${{ matrix.artifact }} + name: quadtrix-bin-windows-${{ matrix.arch }}-cpu + path: quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-windows-${{ matrix.arch }}-cpu.zip if-no-files-found: error retention-days: 30 - # Optional — cancelling this job will not block the release - build-macos-arm64: - name: macOS arm64 CPU - needs: prepare-release - runs-on: macos-14 + macos-cpu: + name: macOS ${{ matrix.build }} CPU + needs: release-metadata + runs-on: ${{ matrix.os }} + strategy: + fail-fast: false + matrix: + include: + - build: arm64 + 
arch: arm64 + os: macos-14 + - build: x64 + arch: x86_64 + os: macos-13 steps: - - uses: actions/checkout@v4 + - name: Clone + uses: actions/checkout@v4 + with: + fetch-depth: 0 - - name: Build binary + - name: Build shell: bash run: | set -euo pipefail - clang++ -std=c++17 -O3 -arch arm64 \ + clang++ -std=c++17 -O3 -DNDEBUG -arch ${{ matrix.arch }} \ -I. -Iinclude \ -o quadtrix main.cpp file quadtrix @@ -182,45 +187,35 @@ jobs: ./quadtrix --chat >/dev/null 2>&1 exit 0 - - name: Package artifact + - name: Pack artifacts shell: bash run: | set -euo pipefail - mkdir -p "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu" - cp quadtrix "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu/" - cp README.md LICENSE "${ARTIFACT_ROOT}/quadtrix-macos-arm64-cpu/" - tar -czf "quadtrix-macos-arm64-cpu.tar.gz" -C "${ARTIFACT_ROOT}" "quadtrix-macos-arm64-cpu" + package="quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-macos-${{ matrix.build }}-cpu" + mkdir -p "${ARTIFACT_ROOT}/${package}" + cp quadtrix README.md LICENSE "${ARTIFACT_ROOT}/${package}/" + tar -czf "${package}.tar.gz" -C "${ARTIFACT_ROOT}" "${package}" - - name: Upload artifact + - name: Upload artifacts uses: actions/upload-artifact@v4 with: - name: quadtrix-macos-arm64-cpu - path: quadtrix-macos-arm64-cpu.tar.gz + name: quadtrix-bin-macos-${{ matrix.build }}-cpu + path: quadtrix-${{ needs.release-metadata.outputs.tag_name }}-bin-macos-${{ matrix.build }}-cpu.tar.gz if-no-files-found: error retention-days: 30 publish-release: name: Publish GitHub release needs: - - prepare-release - - build-linux - - build-windows - - build-macos-arm64 + - release-metadata + - ubuntu-cpu + - windows-cpu + - macos-cpu runs-on: ubuntu-latest - if: | - always() && - needs.prepare-release.result == 'success' && - needs.build-linux.result == 'success' && - needs.build-windows.result == 'success' && - ( - needs.build-macos-arm64.result == 'success' || - needs.build-macos-arm64.result == 'cancelled' || - needs.build-macos-arm64.result == 
'skipped' - ) permissions: contents: write steps: - - name: Download all artifacts + - name: Download artifacts uses: actions/download-artifact@v4 with: path: dist @@ -229,7 +224,8 @@ jobs: - name: Publish release uses: softprops/action-gh-release@v2 with: - tag_name: ${{ needs.prepare-release.outputs.tag_name }} + tag_name: ${{ needs.release-metadata.outputs.tag_name }} target_commitish: ${{ github.sha }} + prerelease: false files: dist/* generate_release_notes: true From 3b6555384f10e4d876866ea8f77e902fdcaa01c8 Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Wed, 3 Jun 2026 22:41:20 +0530 Subject: [PATCH 24/45] Remove frontend job from Docker Images workflow --- .github/workflows/docker-publish.yml | 48 ---------------------------- 1 file changed, 48 deletions(-) diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml index b7c1584..0986534 100644 --- a/.github/workflows/docker-publish.yml +++ b/.github/workflows/docker-publish.yml @@ -11,7 +11,6 @@ on: - cpp - cpu - cuda - - frontend - all version: description: "Optional image tag for manual runs" @@ -171,50 +170,3 @@ jobs: labels: ${{ steps.meta.outputs.labels }} cache-from: type=gha,scope=cuda cache-to: type=gha,mode=max,scope=cuda - - build-frontend-image: - name: Docker frontend - if: ${{ inputs.image == 'frontend' || inputs.image == 'all' }} - runs-on: ubuntu-latest - permissions: - contents: read - packages: write - steps: - - uses: actions/checkout@v4 - - - uses: docker/setup-qemu-action@v3 - - uses: docker/setup-buildx-action@v3 - - - name: Set lowercase image prefix - run: echo "IMAGE_PREFIX=ghcr.io/${GITHUB_REPOSITORY_OWNER,,}/quadtrix" >> "$GITHUB_ENV" - - - name: Login to GHCR - if: ${{ inputs.push_image == 'true' }} - uses: docker/login-action@v3 - with: - registry: ${{ env.REGISTRY }} - username: ${{ github.actor }} - password: ${{ secrets.GITHUB_TOKEN }} - - - name: Extract metadata - id: meta - uses: docker/metadata-action@v5 - with: - images: ${{ 
env.IMAGE_PREFIX }}-frontend - tags: | - type=ref,event=tag - type=sha,prefix=sha- - type=raw,value=${{ inputs.version }},enable=${{ inputs.version != '' }} - type=raw,value=latest,enable=${{ inputs.push_image == 'true' }} - - - name: Build and push - uses: docker/build-push-action@v6 - with: - context: . - file: .devops/Dockerfile.frontend - platforms: linux/amd64,linux/arm64 - push: ${{ inputs.push_image == 'true' }} - tags: ${{ steps.meta.outputs.tags }} - labels: ${{ steps.meta.outputs.labels }} - cache-from: type=gha,scope=frontend - cache-to: type=gha,mode=max,scope=frontend From 1d63e8b4775c1c2afb45e39d19113a11ef10456d Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Thu, 4 Jun 2026 00:07:57 +0530 Subject: [PATCH 25/45] Update release workflow to remove s390x and add notes Removed s390x build configurations and added a step to write detailed release notes. --- .github/workflows/release.yml | 44 ++++++++++++++++++++++++++++++----- 1 file changed, 38 insertions(+), 6 deletions(-) diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 7b8e9ab..e92b486 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -45,8 +45,6 @@ jobs: os: ubuntu-22.04 - build: arm64 os: ubuntu-24.04-arm - - build: s390x - os: ubuntu-24.04-s390x steps: - name: Clone uses: actions/checkout@v4 @@ -162,9 +160,6 @@ jobs: - build: arm64 arch: arm64 os: macos-14 - - build: x64 - arch: x86_64 - os: macos-13 steps: - name: Clone uses: actions/checkout@v4 @@ -221,11 +216,48 @@ jobs: path: dist merge-multiple: true + - name: Write release notes + shell: bash + run: | + cat > release-notes.md <<'EOF' + macOS/iOS: + + macOS Apple Silicon (arm64) + macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED + macOS Intel (x64) SKIPPED + iOS XCFramework DISABLED + + Linux: + + Ubuntu x64 (CPU) + Ubuntu arm64 (CPU) + Ubuntu s390x (CPU) SKIPPED + Ubuntu x64 (Vulkan) DISABLED + Ubuntu arm64 (Vulkan) DISABLED + Ubuntu x64 (ROCm 7.2) DISABLED + Ubuntu x64 
(OpenVINO) DISABLED + Ubuntu x64 (SYCL FP32) DISABLED + + Android: + + Android arm64 (CPU) DISABLED + + Windows: + + Windows x64 (CPU) + Windows arm64 (CPU) + Windows x64 (CUDA 12) - CUDA 12.4 DLLs DISABLED + Windows x64 (CUDA 13) - CUDA 13.3 DLLs DISABLED + Windows x64 (Vulkan) DISABLED + Windows x64 (SYCL) DISABLED + Windows x64 (HIP) DISABLED + EOF + - name: Publish release uses: softprops/action-gh-release@v2 with: tag_name: ${{ needs.release-metadata.outputs.tag_name }} target_commitish: ${{ github.sha }} prerelease: false + body_path: release-notes.md files: dist/* - generate_release_notes: true From e29f1bf3fb5336aab12b0523f00f5dc71700f069 Mon Sep 17 00:00:00 2001 From: Eamon Date: Thu, 4 Jun 2026 00:18:45 +0530 Subject: [PATCH 26/45] feat: add local orchestration script for frontend and backend servers Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend. - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants. - Verifies existence of the local PyTorch `.pt` model checkpoint before starting. - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite. - Handles cross-origin setups (CORS) linking ports interactively. - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals. - Automatically launches the frontend application in the system web browser. 
--- init.py | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 113 insertions(+) create mode 100644 init.py diff --git a/init.py b/init.py new file mode 100644 index 0000000..e870447 --- /dev/null +++ b/init.py @@ -0,0 +1,113 @@ +from __future__ import annotations + +import os +import signal +import subprocess +import sys +import time +import webbrowser +from pathlib import Path + + +ROOT = Path(__file__).resolve().parent +BACKEND = ROOT / "backend" +FRONTEND = ROOT / "frontend" +DEFAULT_CHECKPOINT = ROOT / "engine" / "best_model.pt" + + +def npm_command() -> str: + return "npm.cmd" if os.name == "nt" else "npm" + + +def python_command() -> str: + venv_python = ROOT / ".venv" / ("Scripts/python.exe" if os.name == "nt" else "bin/python") + return str(venv_python) if venv_python.exists() else sys.executable + + +def start_process(name: str, command: list[str], cwd: Path, env: dict[str, str]) -> subprocess.Popen: + print(f"[start] {name}: {' '.join(command)}") + return subprocess.Popen(command, cwd=str(cwd), env=env) + + +def stop_process(process: subprocess.Popen) -> None: + if process.poll() is not None: + return + if os.name == "nt": + process.terminate() + else: + process.send_signal(signal.SIGTERM) + try: + process.wait(timeout=8) + except subprocess.TimeoutExpired: + process.kill() + + +def main() -> int: + api_port = os.environ.get("API_PORT", "3001") + frontend_port = os.environ.get("FRONTEND_PORT", "5173") + checkpoint = Path(os.environ.get("TORCH_CHECKPOINT_PATH", str(DEFAULT_CHECKPOINT))).resolve() + + if not checkpoint.exists(): + print(f"[error] .pt checkpoint not found: {checkpoint}") + print(" Set TORCH_CHECKPOINT_PATH to your best_model.pt file.") + return 1 + + backend_env = os.environ.copy() + backend_env.update( + { + "API_PORT": api_port, + "CORS_ORIGINS": f"http://localhost:{frontend_port},http://127.0.0.1:{frontend_port}", + "TORCH_CHECKPOINT_PATH": str(checkpoint), + } + ) + + frontend_env = os.environ.copy() + 
frontend_env.update( + { + "VITE_API_BASE_URL": f"http://localhost:{api_port}", + "VITE_TORCH_ONLY": "1", + } + ) + + backend = start_process( + "backend (.pt)", + [python_command(), "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", api_port, "--reload"], + BACKEND, + backend_env, + ) + frontend = start_process( + "frontend", + [npm_command(), "run", "dev", "--", "--port", frontend_port], + FRONTEND, + frontend_env, + ) + + url = f"http://localhost:{frontend_port}" + print(f"[ready] frontend: {url}") + print(f"[ready] backend : http://localhost:{api_port}") + print("[mode] PyTorch .pt only") + print("[stop] Press Ctrl+C to stop both servers.") + + if os.environ.get("NO_BROWSER") != "1": + time.sleep(2) + webbrowser.open(url) + + try: + while True: + if backend.poll() is not None: + print(f"[exit] backend stopped with code {backend.returncode}") + return backend.returncode or 1 + if frontend.poll() is not None: + print(f"[exit] frontend stopped with code {frontend.returncode}") + return frontend.returncode or 1 + time.sleep(1) + except KeyboardInterrupt: + print("\n[stop] stopping servers...") + return 0 + finally: + stop_process(frontend) + stop_process(backend) + + +if __name__ == "__main__": + raise SystemExit(main()) From 5f95d0f218298a13bcacc3bc7bbc3c5249f20dd8 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu, 4 Jun 2026 10:28:52 +0530 Subject: [PATCH 27/45] chore(deps): bump actions/github-script from 7 to 9 (#71) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](https://github.com/actions/github-script/compare/v7...v9) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... 
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- .github/workflows/pr-check.yml | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/.github/workflows/pr-check.yml b/.github/workflows/pr-check.yml index 4824b9e..b50de6b 100644 --- a/.github/workflows/pr-check.yml +++ b/.github/workflows/pr-check.yml @@ -15,7 +15,7 @@ jobs: pr-sha: ${{ steps.get-sha.outputs.sha }} steps: - name: Check commenter permission - uses: actions/github-script@v7 + uses: actions/github-script@v9 with: script: | const { data } = await github.rest.repos.getCollaboratorPermissionLevel({ @@ -34,7 +34,7 @@ jobs: } - name: React with rocket - uses: actions/github-script@v7 + uses: actions/github-script@v9 with: script: | await github.rest.reactions.createForIssueComment({ @@ -46,7 +46,7 @@ jobs: - name: Get PR head SHA id: get-sha - uses: actions/github-script@v7 + uses: actions/github-script@v9 with: script: | const { data: pr } = await github.rest.pulls.get({ @@ -57,7 +57,7 @@ jobs: core.setOutput('sha', pr.head.sha); - name: Set checks to pending - uses: actions/github-script@v7 + uses: actions/github-script@v9 with: script: | const sha = '${{ steps.get-sha.outputs.sha }}'; @@ -96,7 +96,7 @@ jobs: - name: Report status if: always() - uses: actions/github-script@v7 + uses: actions/github-script@v9 with: script: | await github.rest.repos.createCommitStatus({ @@ -158,7 +158,7 @@ jobs: - name: Report status if: always() - uses: actions/github-script@v7 + uses: actions/github-script@v9 with: script: | await github.rest.repos.createCommitStatus({ @@ -218,7 +218,7 @@ jobs: - name: Report status if: always() - uses: actions/github-script@v7 + uses: actions/github-script@v9 with: script: | await github.rest.repos.createCommitStatus({ @@ -237,7 +237,7 @@ jobs: runs-on: ubuntu-latest if: always() steps: - - uses: actions/github-script@v7 + - uses: actions/github-script@v9 with: script: | const jobs = ${{ 
toJSON(needs) }}; From e4d340985734637061d6615619cae8f7a8d861be Mon Sep 17 00:00:00 2001 From: Eamon Date: Thu, 4 Jun 2026 10:41:34 +0530 Subject: [PATCH 28/45] feat(cuda): introduce log_message utility and LogLevel enum --- CUDA/includes/logger.h | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) create mode 100644 CUDA/includes/logger.h diff --git a/CUDA/includes/logger.h b/CUDA/includes/logger.h new file mode 100644 index 0000000..219c50f --- /dev/null +++ b/CUDA/includes/logger.h @@ -0,0 +1,37 @@ +#pragma once + +#include <cstdarg> +#include <cstdio> + +namespace quadtrix { +namespace cuda { + +enum class LogLevel { + Info, + Warn, + Error, +}; + +inline const char* log_level_name(LogLevel level) { + switch (level) { + case LogLevel::Info: + return "info"; + case LogLevel::Warn: + return "warn"; + case LogLevel::Error: + return "error"; + } + return "unknown"; +} + +inline void log_message(LogLevel level, const char* format, ...) { + std::fprintf(level == LogLevel::Error ? stderr : stdout, "[cuda:%s] ", log_level_name(level)); + va_list args; + va_start(args, format); + std::vfprintf(level == LogLevel::Error ? stderr : stdout, format, args); + va_end(args); + std::fprintf(level == LogLevel::Error ? 
stderr : stdout, "\n"); +} + +} // namespace cuda +} // namespace quadtrix From 71e9abea4ec5477f07dbf096a551b1634828982a Mon Sep 17 00:00:00 2001 From: Eamon Date: Thu, 4 Jun 2026 10:42:49 +0530 Subject: [PATCH 29/45] feat(cuda): add cuBLAS handle wrapper and matmul operations --- CUDA/includes/matmul.cuh | 99 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 99 insertions(+) create mode 100644 CUDA/includes/matmul.cuh diff --git a/CUDA/includes/matmul.cuh b/CUDA/includes/matmul.cuh new file mode 100644 index 0000000..12dd4b2 --- /dev/null +++ b/CUDA/includes/matmul.cuh @@ -0,0 +1,99 @@ +#pragma once + +#include "tensor.cuh" + +#include +#include + +#include + +namespace quadtrix { +namespace cuda { + +enum class MatmulTranspose : std::uint8_t { + None, + Transpose, +}; + +struct BlasStatus { + bool ok; + cublasStatus_t cublas_status; + const char* message; + + static BlasStatus success() { + return {true, CUBLAS_STATUS_SUCCESS, "ok"}; + } + + static BlasStatus failure(cublasStatus_t status, const char* message) { + return {false, status, message}; + } +}; + +const char* cublas_status_name(cublasStatus_t status); + +class BlasHandle { +public: + explicit BlasHandle(int device_id = 0); + ~BlasHandle(); + + BlasHandle(const BlasHandle&) = delete; + BlasHandle& operator=(const BlasHandle&) = delete; + + BlasHandle(BlasHandle&& other) noexcept; + BlasHandle& operator=(BlasHandle&& other) noexcept; + + cublasHandle_t get() const { + return handle_; + } + + int device_id() const { + return device_id_; + } + + BlasStatus set_stream(cudaStream_t stream); + +private: + cublasHandle_t handle_ = nullptr; + int device_id_ = 0; +}; + +BlasStatus matmul( + BlasHandle& handle, + const TensorView& a, + MatmulTranspose op_a, + const TensorView& b, + MatmulTranspose op_b, + TensorView c, + float alpha = 1.0f, + float beta = 0.0f, + cudaStream_t stream = nullptr); + +BlasStatus matmul_forward( + BlasHandle& handle, + const TensorView& input, + const TensorView& weight, + 
TensorView output, + cudaStream_t stream = nullptr, + float alpha = 1.0f, + float beta = 0.0f); + +BlasStatus matmul_backward_input( + BlasHandle& handle, + const TensorView& grad_output, + const TensorView& weight, + TensorView grad_input, + cudaStream_t stream = nullptr, + float alpha = 1.0f, + float beta = 0.0f); + +BlasStatus matmul_backward_weight( + BlasHandle& handle, + const TensorView& input, + const TensorView& grad_output, + TensorView grad_weight, + cudaStream_t stream = nullptr, + float alpha = 1.0f, + float beta = 0.0f); + +} // namespace cuda +} // namespace quadtrix From 7c9db4e009998859ecd1f30fdb7a749340a89c12 Mon Sep 17 00:00:00 2001 From: Eamon Date: Thu, 4 Jun 2026 10:43:39 +0530 Subject: [PATCH 30/45] feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions --- CUDA/includes/tensor.cuh | 168 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 168 insertions(+) create mode 100644 CUDA/includes/tensor.cuh diff --git a/CUDA/includes/tensor.cuh b/CUDA/includes/tensor.cuh new file mode 100644 index 0000000..c61d77e --- /dev/null +++ b/CUDA/includes/tensor.cuh @@ -0,0 +1,168 @@ +#pragma once + +#include "common.h" +#include "memory.cuh" + +#include +#include +#include + +namespace quadtrix { +namespace cuda { + +constexpr int kMaxTensorDims = 8; + +struct TensorShape { + int rank = 0; + std::array dims{}; + std::array strides{}; + + static TensorShape contiguous(const std::int64_t* sizes, int ndim) { + if (ndim < 1 || ndim > kMaxTensorDims) { + std::fprintf(stderr, "Tensor rank %d is outside supported range [1, %d]\n", ndim, kMaxTensorDims); + std::abort(); + } + + TensorShape shape; + shape.rank = ndim; + for (int i = 0; i < ndim; ++i) { + if (sizes[i] <= 0) { + std::fprintf(stderr, "Tensor dimension %d must be positive, got %lld\n", i, static_cast(sizes[i])); + std::abort(); + } + shape.dims[i] = sizes[i]; + } + + std::int64_t stride = 1; + for (int i = ndim - 1; i >= 0; --i) { + shape.strides[i] = stride; + stride *= 
shape.dims[i]; + } + return shape; + } + + std::size_t numel() const { + std::size_t total = 1; + for (int i = 0; i < rank; ++i) { + if (dims[i] <= 0) { + return 0; + } + std::size_t next = 0; + if (!checked_mul(total, static_cast(dims[i]), &next)) { + return 0; + } + total = next; + } + return rank == 0 ? 0 : total; + } + + bool is_contiguous() const { + std::int64_t expected = 1; + for (int i = rank - 1; i >= 0; --i) { + if (strides[i] != expected) { + return false; + } + expected *= dims[i]; + } + return true; + } +}; + +struct TensorView { + void* data = nullptr; + TensorShape shape; + DType dtype = DType::F32; + DeviceKind device = DeviceKind::CUDA; + int device_id = 0; + + std::size_t numel() const { + return shape.numel(); + } + + std::size_t bytes() const { + std::size_t out = 0; + if (!checked_mul(numel(), dtype_size(dtype), &out)) { + return 0; + } + return out; + } + + template + T* data_as() { + return static_cast(data); + } + + template + const T* data_as() const { + return static_cast(data); + } +}; + +class Tensor { +public: + Tensor() = default; + + Tensor(const std::int64_t* dims, int rank, DType dtype, int device_id = 0) + : shape_(TensorShape::contiguous(dims, rank)), dtype_(dtype), device_id_(device_id) { + allocate(); + } + + Tensor(const Tensor&) = delete; + Tensor& operator=(const Tensor&) = delete; + Tensor(Tensor&&) noexcept = default; + Tensor& operator=(Tensor&&) noexcept = default; + + TensorView view() { + return {storage_.data(), shape_, dtype_, DeviceKind::CUDA, device_id_}; + } + + TensorView view() const { + return {const_cast(storage_.data()), shape_, dtype_, DeviceKind::CUDA, device_id_}; + } + + const TensorShape& shape() const { + return shape_; + } + + DType dtype() const { + return dtype_; + } + + int device_id() const { + return device_id_; + } + + std::size_t numel() const { + return shape_.numel(); + } + + std::size_t bytes() const { + return storage_.bytes(); + } + + void* data() { + return storage_.data(); + } + + const 
void* data() const { + return storage_.data(); + } + +private: + void allocate() { + std::size_t bytes = 0; + if (!checked_mul(shape_.numel(), dtype_size(dtype_), &bytes)) { + std::fprintf(stderr, "Tensor allocation size overflow\n"); + std::abort(); + } + storage_.allocate(bytes, device_id_); + } + + TensorShape shape_; + DType dtype_ = DType::F32; + int device_id_ = 0; + DeviceBuffer storage_; +}; + +} // namespace cuda +} // namespace quadtrix From dbf79df0dba61b4e0ab6fc92dc1dc5656bf80fc4 Mon Sep 17 00:00:00 2001 From: Eamon Date: Thu, 4 Jun 2026 11:04:01 +0530 Subject: [PATCH 31/45] refactor: untie embedding and lm_head weights and to quadtrix --- CUDA/includes/memory.cuh | 120 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 CUDA/includes/memory.cuh diff --git a/CUDA/includes/memory.cuh b/CUDA/includes/memory.cuh new file mode 100644 index 0000000..e08fa4a --- /dev/null +++ b/CUDA/includes/memory.cuh @@ -0,0 +1,120 @@ +#pragma once + +#include "common.h" +#include "runtime.cuh" + +#include + +#include +#include + +namespace quadtrix { +namespace cuda { + +class DeviceBuffer { +public: + DeviceBuffer() = default; + + explicit DeviceBuffer(std::size_t bytes, int device_id = -1) { + allocate(bytes, device_id); + } + + ~DeviceBuffer() { + release(); + } + + DeviceBuffer(const DeviceBuffer&) = delete; + DeviceBuffer& operator=(const DeviceBuffer&) = delete; + + DeviceBuffer(DeviceBuffer&& other) noexcept { + swap(other); + } + + DeviceBuffer& operator=(DeviceBuffer&& other) noexcept { + if (this != &other) { + release(); + swap(other); + } + return *this; + } + + void allocate(std::size_t bytes, int device_id = -1) { + release(); + if (bytes == 0) { + return; + } + if (device_id >= 0) { + device_id_ = device_id; + DeviceGuard guard(device_id); + QUADTRIX_CUDA_ABORT(cudaMalloc(&ptr_, bytes)); + } else { + device_id_ = current_device(); + QUADTRIX_CUDA_ABORT(cudaMalloc(&ptr_, bytes)); + } + bytes_ = bytes; + } + + void 
release() { + if (ptr_ != nullptr) { + if (device_id_ >= 0) { + DeviceGuard guard(device_id_); + cudaFree(ptr_); + } else { + cudaFree(ptr_); + } + ptr_ = nullptr; + bytes_ = 0; + device_id_ = -1; + } + } + + void* data() { + return ptr_; + } + + const void* data() const { + return ptr_; + } + + std::size_t bytes() const { + return bytes_; + } + + bool empty() const { + return ptr_ == nullptr || bytes_ == 0; + } + + int device_id() const { + return device_id_; + } + + void swap(DeviceBuffer& other) noexcept { + std::swap(ptr_, other.ptr_); + std::swap(bytes_, other.bytes_); + std::swap(device_id_, other.device_id_); + } + +private: + void* ptr_ = nullptr; + std::size_t bytes_ = 0; + int device_id_ = -1; +}; + +inline Status copy_h2d(void* dst_device, const void* src_host, std::size_t bytes, cudaStream_t stream = nullptr) { + return QUADTRIX_CUDA_CHECK(cudaMemcpyAsync(dst_device, src_host, bytes, cudaMemcpyHostToDevice, stream)); +} + +inline Status copy_d2h(void* dst_host, const void* src_device, std::size_t bytes, cudaStream_t stream = nullptr) { + return QUADTRIX_CUDA_CHECK(cudaMemcpyAsync(dst_host, src_device, bytes, cudaMemcpyDeviceToHost, stream)); +} + +inline Status copy_d2d(void* dst_device, const void* src_device, std::size_t bytes, cudaStream_t stream = nullptr) { + return QUADTRIX_CUDA_CHECK(cudaMemcpyAsync(dst_device, src_device, bytes, cudaMemcpyDeviceToDevice, stream)); +} + +inline Status memset_device(void* dst_device, int value, std::size_t bytes, cudaStream_t stream = nullptr) { + return QUADTRIX_CUDA_CHECK(cudaMemsetAsync(dst_device, value, bytes, stream)); +} + +} // namespace cuda +} // namespace quadtrix From 7c461b8e36084249a9b89e247ef47d6e4fc59b31 Mon Sep 17 00:00:00 2001 From: Eamon Date: Thu, 4 Jun 2026 11:04:49 +0530 Subject: [PATCH 32/45] feat(cuda): add NCCL communicator wrapper and all-reduce primitives --- CUDA/includes/nccl_all_reduce.cuh | 96 +++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) create mode 100644 
CUDA/includes/nccl_all_reduce.cuh diff --git a/CUDA/includes/nccl_all_reduce.cuh b/CUDA/includes/nccl_all_reduce.cuh new file mode 100644 index 0000000..c712a6a --- /dev/null +++ b/CUDA/includes/nccl_all_reduce.cuh @@ -0,0 +1,96 @@ +#pragma once + +#include "tensor.cuh" + +#include + +#ifdef QUADTRIX_ENABLE_NCCL +#include +#else +typedef struct { + char internal[128]; +} ncclUniqueId; +typedef struct ncclComm* ncclComm_t; +typedef enum { + ncclSuccess = 0, + ncclUnhandledCudaError = 1, + ncclSystemError = 2, + ncclInternalError = 3, + ncclInvalidArgument = 4, + ncclInvalidUsage = 5, + ncclNumResults = 6 +} ncclResult_t; +#endif + +namespace quadtrix { +namespace cuda { + +struct NcclStatus { + bool ok; + ncclResult_t nccl_status; + const char* message; + + static NcclStatus success() { + return {true, ncclSuccess, "ok"}; + } + + static NcclStatus failure(ncclResult_t status, const char* message) { + return {false, status, message}; + } +}; + +const char* nccl_status_name(ncclResult_t status); + +class NcclCommunicator { +public: + NcclCommunicator() = default; + NcclCommunicator(ncclUniqueId unique_id, int world_size, int rank, int device_id); + ~NcclCommunicator(); + + NcclCommunicator(const NcclCommunicator&) = delete; + NcclCommunicator& operator=(const NcclCommunicator&) = delete; + + NcclCommunicator(NcclCommunicator&& other) noexcept; + NcclCommunicator& operator=(NcclCommunicator&& other) noexcept; + + ncclComm_t get() const { + return comm_; + } + + int world_size() const { + return world_size_; + } + + int rank() const { + return rank_; + } + + int device_id() const { + return device_id_; + } + + bool valid() const { + return comm_ != nullptr; + } + +private: + ncclComm_t comm_ = nullptr; + int world_size_ = 1; + int rank_ = 0; + int device_id_ = 0; +}; + +NcclStatus create_unique_id(ncclUniqueId* unique_id); + +NcclStatus all_reduce_sum( + NcclCommunicator& communicator, + TensorView tensor, + cudaStream_t stream = nullptr); + +NcclStatus 
all_reduce_average( + NcclCommunicator& communicator, + TensorView tensor, + cudaStream_t stream = nullptr); + +} // namespace cuda +} // namespace quadtrix From c5d06b6f3b8f70e2af6da338d7eeb1ebaf4bd94b Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Thu, 4 Jun 2026 22:04:35 +0530 Subject: [PATCH 33/45] Update README.md with workflow badges Added badges for release, package, and CI workflows. --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 56f99cc..d8a6ca1 100644 --- a/README.md +++ b/README.md @@ -2,6 +2,9 @@

image + + [![Release](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml) [![Package](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml) + [![CI](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml)

A local large language model with a modular, multi-path execution architecture. Train, run inference, and serve a chat interface — all from a single repository, across bare-metal C++, PyTorch, and a React frontend. From e0400256d1c0f3540c708e340812079a12da4318 Mon Sep 17 00:00:00 2001 From: Eamon Date: Sat, 6 Jun 2026 18:04:07 +0530 Subject: [PATCH 34/45] kernels: add AdamW optimization kernel with stochastic rounding Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling. --- CUDA/llmcpp/adamw.cuh | 98 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+) create mode 100644 CUDA/llmcpp/adamw.cuh diff --git a/CUDA/llmcpp/adamw.cuh b/CUDA/llmcpp/adamw.cuh new file mode 100644 index 0000000..4453576 --- /dev/null +++ b/CUDA/llmcpp/adamw.cuh @@ -0,0 +1,98 @@ +/* +AdamW kernel +*/ + +// llmc internal imports +#include "cuda_common.h" +#include "cuda_utils.cuh" + +// ---------------------------------------------------------------------------- +// CUDA kernels + +// Implements linear interpolation using only two floating-point operations (as opposed to three in a naive implementation). 
+// Reference: https://developer.nvidia.com/blog/lerp-faster-cuda +__device__ float lerp(float start, float end, float weight) { + return fma(weight, end, fma(-weight, start, start)); +} + +template +__device__ void adamw_update(Tp* params_memory, float* master_params_memory, Tg* grads_memory, float* m_memory, float* v_memory, size_t num_parameters, + float learning_rate, float beta1, float beta2, float beta1_correction, float beta2_correction, float eps, float weight_decay, + float grad_scale, unsigned int seed) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= num_parameters) { return; } // guard + + // get the gradient, m, and v for this parameter + float grad = grad_scale * (float)grads_memory[idx]; + float m = m_memory[idx]; + float v = v_memory[idx]; + // update the first moment (momentum) + m = lerp(grad, m, beta1); + m_memory[idx] = m; + // update the second moment (RMSprop) + v = lerp(grad * grad, v, beta2); + v_memory[idx] = v; + m /= beta1_correction; // m_hat + v /= beta2_correction; // v_hat + // fetch the old value of this parameter as a float, from either source + float old_param = (master_params_memory != NULL) ? 
master_params_memory[idx] : (float)params_memory[idx]; + // update this parameter + float param = old_param - (learning_rate * (m / (sqrtf(v) + eps) + weight_decay * old_param)); + // update our low precision version of the parameters using stochastic rounding + // this will be used in the next forward pass + stochastic_rounding(param, &params_memory[idx], seed); + // write the full, float version of the param into our master copy, if we maintain one + // this will be used in the next update + if (master_params_memory != NULL) { master_params_memory[idx] = param; } +} + +template +__global__ void adamw_kernel3(Tp* params_memory, float* master_params_memory, Tg* grads_memory, float* m_memory, float* v_memory, size_t num_parameters, + ptrdiff_t w_stride, ptrdiff_t g_stride, ptrdiff_t s_stride, + float learning_rate, float beta1, float beta2, float beta1_correction, float beta2_correction, float eps, float weight_decay, + float grad_scale, unsigned int seed) { + adamw_update(params_memory + blockIdx.y * w_stride, + master_params_memory ? 
master_params_memory + blockIdx.y * s_stride : NULL, + grads_memory + blockIdx.y * g_stride, + m_memory + blockIdx.y * s_stride, + v_memory + blockIdx.y * s_stride, + num_parameters, learning_rate, beta1, beta2, beta1_correction, beta2_correction, eps, weight_decay, grad_scale, + seed + ); +} + +template +__global__ void init_from_master_kernel(Tp* params_memory, float* master_params_memory, size_t num_parameters, + ptrdiff_t w_stride, ptrdiff_t s_stride, unsigned int seed) { + size_t idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= num_parameters) { return; } + params_memory += blockIdx.y * w_stride; // adjust for layer offset + master_params_memory += blockIdx.y * s_stride; + stochastic_rounding(master_params_memory[idx], &params_memory[idx], seed); +} + +template +void adamw_update(Tp* params_memory, float* master_params_memory, Tg* grads_memory, float* m_memory, float* v_memory, size_t num_parameters, + ptrdiff_t w_stride, ptrdiff_t g_stride, ptrdiff_t s_stride, int num_slices, float learning_rate, float beta1, float beta2, int t, float eps, float weight_decay, + float grad_scale, unsigned int seed, cudaStream_t stream) { + // AdamW update + int block_size = 512; + int num_blocks = CEIL_DIV(num_parameters, block_size); + float beta1_correction = 1.0f - powf(beta1, t); + float beta2_correction = 1.0f - powf(beta2, t); + adamw_kernel3<<>>(params_memory, master_params_memory, grads_memory, + m_memory, v_memory, num_parameters, w_stride, g_stride, s_stride, + learning_rate, beta1, beta2, beta1_correction, beta2_correction, eps, weight_decay, + grad_scale, seed); + cudaCheck(cudaGetLastError()); +} + +template +void init_from_master(Tp* params_memory, float* master_params_memory, size_t num_parameters, + ptrdiff_t w_stride, ptrdiff_t s_stride, int num_slices, unsigned int seed, cudaStream_t stream) { + int block_size = 512; // must match block size of adamw_update so that RNG also matches + int num_blocks = CEIL_DIV(num_parameters, block_size); + 
init_from_master_kernel<<>> + (params_memory, master_params_memory, num_parameters, w_stride, s_stride, seed); + cudaCheck(cudaGetLastError()); +} From c3dc5ae4c37280839ebf95feb4ce8acd51da8b1c Mon Sep 17 00:00:00 2001 From: Eamon Date: Sat, 6 Jun 2026 18:07:43 +0530 Subject: [PATCH 35/45] cudnn: implement cached SDPA forward graph using cuDNN frontend --- CUDA/llmcpp/cudnn_att.cpp | 297 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 297 insertions(+) create mode 100644 CUDA/llmcpp/cudnn_att.cpp diff --git a/CUDA/llmcpp/cudnn_att.cpp b/CUDA/llmcpp/cudnn_att.cpp new file mode 100644 index 0000000..0330abe --- /dev/null +++ b/CUDA/llmcpp/cudnn_att.cpp @@ -0,0 +1,297 @@ +// all cudnn-related functions are in this file, so that they don't need to be recompiled every time +// we change some unrelated piece of the code. +// TODO this currently duplicates some of the utilities from the main file + +#define NOMINMAX +#include +#include "cudnn_att.h" +#include + +namespace fe = cudnn_frontend; + +// Specific configurations based on the enabled precision +#if defined(ENABLE_FP32) +static_assert(false, "cuDNN is not supported in FP32 mode.") +// use fp16 (note: this may require gradient scaler, currently not implemented!) +#elif defined(ENABLE_FP16) +#define CUDNN_16BIT fe::DataType_t::HALF +#else // Default to bfloat16 +#define CUDNN_16BIT fe::DataType_t::BFLOAT16 +#endif + +static cudnnHandle_t cudnn_handle; +static size_t cudnn_workspace_size = 0; // dynamically allocated as needed (up to 256MiB!) 
+static void* cudnn_workspace = NULL; + +static void cuDNNCheck(cudnnStatus_t error, const char *file, int line) { + if (error != CUDNN_STATUS_SUCCESS) { + printf("[CUDNN ERROR] at file %s:%d:\n%s\n", file, line, cudnnGetErrorString(error)); + exit(EXIT_FAILURE); + } +}; +#define cuDNNCheck(err) (cuDNNCheck(err, __FILE__, __LINE__)) + +static void checkCudnnFE(const fe::error_object& e, const char *file, int line) { + if(!e.is_good()) { + printf("[CUDNN ERROR] at file %s:%d:\n%s\n", file, line, e.err_msg.c_str()); + exit(EXIT_FAILURE); + } +} +#define checkCudnnFE(err) checkCudnnFE(err, __FILE__, __LINE__) + +enum UIDs { + Q_UID, + K_UID, + V_UID, + Attn_scale_UID, + O_UID, + Stats_UID, + dO_UID, + dQ_UID, + dK_UID, + dV_UID +}; + +// Need a cache because graph->build_operation_graph() is slow but everything else seems fast +using cache_type_fwd = std::map, std::shared_ptr>; +using cache_type_bwd = std::map, std::shared_ptr>; + +// Loosely based on cuDNN frontend samples functions and massively simplified +auto lookup_cache_or_build_graph_fwd(int B,int H,int T,int HS, int is_inference_only) { + + static cache_type_fwd user_maintained_cache_fwd; + + auto key = std::make_tuple(B, H, T, HS, is_inference_only); + + auto it = user_maintained_cache_fwd.find(key); + if (it != user_maintained_cache_fwd.end()) { + return it->second; + } + + auto graph = std::make_shared(); + graph->set_io_data_type(CUDNN_16BIT) + .set_intermediate_data_type(fe::DataType_t::FLOAT) + .set_compute_data_type(fe::DataType_t::FLOAT); + + // QKV is (B, T, 3, NH, HS) which cuDNN can handle directly without an external permute + auto Q = graph->tensor(fe::graph::Tensor_attributes().set_name("Q") + .set_dim({B, H, T, HS}) + .set_uid(Q_UID) + .set_stride({3 * H * HS * T, HS, 3 * H * HS, 1})); + auto K = graph->tensor(fe::graph::Tensor_attributes().set_name("K") + .set_dim({B, H, T, HS}) + .set_uid(K_UID) + .set_stride({3 * H * HS * T, HS, 3 * H * HS, 1})); + auto V = 
graph->tensor(fe::graph::Tensor_attributes().set_name("V") + .set_dim({B, H, T, HS}) + .set_uid(V_UID) + .set_stride({3 * H * HS * T, HS, 3 * H * HS, 1})); + auto attn_scale = graph->tensor(fe::graph::Tensor_attributes().set_name("attn_scale") + .set_dim({1, 1, 1, 1}) + .set_stride({1, 1, 1, 1}) + .set_uid(Attn_scale_UID) + .set_is_pass_by_value(true) + .set_data_type(fe::DataType_t::FLOAT)); + + auto sdpa_options = fe::graph::SDPA_attributes().set_name("flash_attention"); + sdpa_options.set_is_inference(is_inference_only); + sdpa_options.set_attn_scale(attn_scale); + sdpa_options.set_causal_mask(true); + + // Create the graph operation and get the output tensors back + auto [O, stats] = graph->sdpa(Q, K, V, sdpa_options); + + // Output is (B, T, NH, HS) BF16/FP16 and stats for backward pass is (B, NH, T) FP32 + O->set_output(true).set_dim({B, H, T, HS}).set_stride({H * HS * T, HS, H * HS, 1}).set_uid(O_UID); + + assert(stats == nullptr || is_inference_only == false); + if (is_inference_only == false) { + stats->set_output(true).set_data_type(fe::DataType_t::FLOAT) + .set_dim({B, H, T, 1}) + .set_stride({H * T, T, 1, 1}) + .set_uid(Stats_UID); + } + + checkCudnnFE(graph->validate()); + + // Build the operation graph and execution part (this is the VERY SLOW PART) + checkCudnnFE(graph->build_operation_graph(cudnn_handle)); + auto plans = graph->create_execution_plans({fe::HeurMode_t::A}); + checkCudnnFE(graph->check_support(cudnn_handle)); + checkCudnnFE(graph->build_plans(cudnn_handle)); + // Reallocate the workspace if the required size is greater than the current workspace + // In H100 this may be around 16B + if (graph->get_workspace_size() > cudnn_workspace_size) { + if (cudnn_workspace_size > 0) { + cudaCheck(cudaFree(cudnn_workspace)); + } + cudnn_workspace_size = graph->get_workspace_size(); + cudaCheck(cudaMalloc(&cudnn_workspace, cudnn_workspace_size)); + } + + user_maintained_cache_fwd.insert({key, graph}); + + return graph; +} + +auto 
lookup_cache_or_build_graph_bwd(int B, int NH, int T, int HS) { + static cache_type_bwd user_maintained_cache_bwd; + + auto key = std::make_tuple(B, NH, T, HS); + + auto it = user_maintained_cache_bwd.find(key); + if (it != user_maintained_cache_bwd.end()) { + return it->second; + } + + auto graph = std::make_shared(); + graph->set_io_data_type(CUDNN_16BIT) + .set_intermediate_data_type(fe::DataType_t::FLOAT) + .set_compute_data_type(fe::DataType_t::FLOAT); + + // (B, N, 3, NH, HS) + // must come from inp (which means we also need to convert THAT to FP16) + auto Q = graph->tensor(fe::graph::Tensor_attributes().set_name("Q") + .set_dim({B, NH, T, HS}) + .set_uid(Q_UID) + .set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1})); + auto K = graph->tensor(fe::graph::Tensor_attributes().set_name("K") + .set_dim({B, NH, T, HS}) + .set_uid(K_UID) + .set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1})); + auto V = graph->tensor(fe::graph::Tensor_attributes().set_name("V") + .set_dim({B, NH, T, HS}) + .set_uid(V_UID) + .set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1})); + auto O = graph->tensor(fe::graph::Tensor_attributes().set_name("O") + .set_dim({B, NH, T, HS}) + .set_uid(O_UID) + .set_stride({NH * HS * T, HS, NH * HS, 1})); + auto dO = graph->tensor(fe::graph::Tensor_attributes().set_name("dO") + .set_dim({B, NH, T, HS}) + .set_uid(dO_UID) + .set_stride({NH * HS * T, HS, NH * HS, 1})); + + auto stats = graph->tensor(fe::graph::Tensor_attributes().set_name("stats") + .set_dim({B, NH, T, 1}) + .set_uid(Stats_UID) + .set_stride({NH * T, T, 1, 1}) + .set_data_type(fe::DataType_t::FLOAT)); + auto attn_scale = graph->tensor(fe::graph::Tensor_attributes().set_name("attn_scale") + .set_dim({1, 1, 1, 1}) + .set_stride({1, 1, 1, 1}) + .set_is_pass_by_value(true) + .set_uid(Attn_scale_UID) + .set_data_type(fe::DataType_t::FLOAT)); + auto sdpa_backward_options = fe::graph::SDPA_backward_attributes().set_name("flash_attention_backward") +#if CUDNN_FRONTEND_MAJOR_VERSION > 1 || 
CUDNN_FRONTEND_MINOR_VERSION >= 5 + .set_deterministic_algorithm(true) // 1.5+ needs this for determinism +#endif + .set_causal_mask(true) + .set_attn_scale(attn_scale); + + // Create the graph operation and get the output tensors back + auto [dQ, dK, dV] = graph->sdpa_backward(Q, K, V, O, dO, stats, sdpa_backward_options); + + dQ->set_output(true).set_dim({B, NH, T, HS}).set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}).set_uid(dQ_UID); + dK->set_output(true).set_dim({B, NH, T, HS}).set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}).set_uid(dK_UID); + dV->set_output(true).set_dim({B, NH, T, HS}).set_stride({3 * NH * HS * T, HS, 3 * NH * HS, 1}).set_uid(dV_UID); + + checkCudnnFE(graph->validate()); + + // Build the operation graph and execution part (this is the VERY SLOW PART) + checkCudnnFE(graph->build_operation_graph(cudnn_handle)); + auto plans = graph->create_execution_plans({fe::HeurMode_t::A}); + checkCudnnFE(graph->check_support(cudnn_handle)); + checkCudnnFE(graph->build_plans(cudnn_handle)); + + // Reallocate the workspace if the required size is greater than the current workspace + // By default, cuDNN uses up to 256MiB of workspace, so we don't want to just allocate the maximum + if (graph->get_workspace_size() > cudnn_workspace_size) { + if (cudnn_workspace_size > 0) { + cudaCheck(cudaFree(cudnn_workspace)); + } + cudnn_workspace_size = graph->get_workspace_size(); + cudaCheck(cudaMalloc(&cudnn_workspace, cudnn_workspace_size)); + } + + user_maintained_cache_bwd.insert({key, graph}); + return graph; +} + +void attention_forward_cudnn(floatX* out, // output: (B, T, NH, HS) + float* stats, // output for backward pass: (B, NH, T) + floatX* inp, // input: (B, T, 3, NH, HS) QKV + int B, int T, int NH, int C, cudaStream_t stream) { + NVTX_RANGE_FN(); + int HS = C / NH; // number of features per head + bool is_inference_only = (stats == nullptr); + + cuDNNCheck(cudnnSetStream(cudnn_handle, stream)); + + // Get graph and tensors from cache (or generate it on 
first use) + auto graph = lookup_cache_or_build_graph_fwd(B, NH, T, HS, is_inference_only); + + // Prepare all the tensor pointers for executing the graph + void* devPtrQ = inp; + void* devPtrK = (inp + C); + void* devPtrV = (inp + 2 * C); + float attn_scale_cpu = 1.0 / sqrtf(HS); + void* devPtrO = out; + + // Build variant pack + std::unordered_map variant_pack = { + {Q_UID, devPtrQ}, {K_UID, devPtrK}, {V_UID, devPtrV}, {Attn_scale_UID, &attn_scale_cpu}, {O_UID, devPtrO}}; + + // Add the stats tensor unless we are only doing inference (only needed for backward pass) + if (is_inference_only == false) { + variant_pack[Stats_UID] = stats; + } + + // Execute graph + checkCudnnFE(graph->execute(cudnn_handle, variant_pack, cudnn_workspace)); + cudaCheck(cudaGetLastError()); +} + +void attention_backward_cudnn(floatX* dqkvr, // output + floatX* dout, floatX* qkvr, floatX* o, float* stats, // inputs + int B, int T, int NH, int C, cudaStream_t stream) { + NVTX_RANGE_FN(); + int HS = C / NH; // number of features per head + + // Get graph and tensors from cache (or generate it on first use) + auto graph = lookup_cache_or_build_graph_bwd(B, NH, T, HS); + + // Prepare all the tensor pointers for executing the graph + void* devPtrQ = qkvr; + void* devPtrK = (qkvr + NH * HS); + void* devPtrV = (qkvr + 2 * NH * HS); + void* devPtrO = o; + void* devPtrdO = dout; + void* devPtrStats = stats; + float attn_scale_cpu = 1.0 / sqrtf(HS); + + void* devPtrdQ = dqkvr; + void* devPtrdK = (dqkvr + NH * HS); + void* devPtrdV = (dqkvr + 2 * NH * HS); + + // Build variant pack that links each tensor to its data pointer + std::unordered_map variant_pack = { + {Q_UID, devPtrQ}, {K_UID, devPtrK}, {V_UID, devPtrV}, {O_UID, devPtrO}, {dO_UID, devPtrdO}, {Stats_UID, devPtrStats}, + {dQ_UID, devPtrdQ}, {dK_UID, devPtrdK}, {dV_UID, devPtrdV}, + {Attn_scale_UID, &attn_scale_cpu}}; + + // Execute graph + cuDNNCheck(cudnnSetStream(cudnn_handle, stream)); + checkCudnnFE(graph->execute(cudnn_handle, 
variant_pack, cudnn_workspace)); + cudaCheck(cudaGetLastError()); +} + +void create_cudnn() { + cuDNNCheck(cudnnCreate(&cudnn_handle)); +} + +void destroy_cudnn() { + if (cudnn_workspace != NULL) { cudaCheck(cudaFree(cudnn_workspace)); } + cuDNNCheck(cudnnDestroy(cudnn_handle)); +} \ No newline at end of file From 49099aeb88a87ed8a1f95493afd1827ed5507257 Mon Sep 17 00:00:00 2001 From: Eamon Date: Sat, 6 Jun 2026 18:08:37 +0530 Subject: [PATCH 36/45] feat(cuda): implement Packed128 memory vectorization utilities --- CUDA/llmcpp/cuda_utils.cuh | 286 +++++++++++++++++++++++++++++++++++++ 1 file changed, 286 insertions(+) create mode 100644 CUDA/llmcpp/cuda_utils.cuh diff --git a/CUDA/llmcpp/cuda_utils.cuh b/CUDA/llmcpp/cuda_utils.cuh new file mode 100644 index 0000000..030ec07 --- /dev/null +++ b/CUDA/llmcpp/cuda_utils.cuh @@ -0,0 +1,286 @@ +// Utilities for use in __device__ code + +#ifndef CUDA_UTILS_CUH +#define CUDA_UTILS_CUH + +#include "cuda_common.h" + +// ---------------------------------------------------------------------------- +// Packed128 data structure that forces the compiler to use 128-bit loads/stores +// in GPUs that support (the LDG.128 and STS.128 instructions) +// This is a bit similar to the use of float4 in the case of 32-bit floats, but +// supports arbitrary precision. 
+ +template +struct alignas(16) Packed128 { + Packed128() = default; + __device__ explicit Packed128(int4 bits) { + static_assert(sizeof(bits) == sizeof(payload), "Size mismatch."); + memcpy(&payload, &bits, sizeof(bits)); + } + + __device__ static Packed128 constant(ElementType value) { + Packed128 result; + for(int k = 0; k < size; ++k) { + result.payload[k] = value; + } + return result; + } + __device__ static Packed128 zeros() { + return constant(0.f); + } + __device__ static Packed128 ones() { + return constant(1.f); + } + + __device__ ElementType& operator[](int index) { + return payload[index]; + } + __device__ const ElementType& operator[](int index) const { + return payload[index]; + } + __device__ int4 get_bits() const { + int4 bits; + static_assert(sizeof(bits) == sizeof(payload), "Size mismatch."); + memcpy(&bits, &payload, sizeof(bits)); + return bits; + } + static constexpr const size_t size = sizeof(int4) / sizeof(ElementType); + ElementType payload[size]; +}; + +// load a Packed128 from an aligned memory address +template +__device__ Packed128 load128(const ElementType* address) { + return Packed128{*reinterpret_cast(address)}; +} +// load a Packed128 from an aligned memory address with streaming cache hint +template +__device__ Packed128 load128cs(const ElementType* address) { + return Packed128{__ldcs(reinterpret_cast(address))}; +} +// store a Packed128 to an aligned memory address +template +__device__ void store128(ElementType* target, Packed128 value) { + *reinterpret_cast(target) = value.get_bits(); +} +// store a Packed128 to an aligned memory address with streaming cache hint +template +__device__ void store128cs(ElementType* target, Packed128 value) { + __stcs(reinterpret_cast(target), value.get_bits()); +} +// store a Packed128 to an aligned memory address while caching in L2 but bypassing L1 +template +__device__ void store128cg(ElementType* target, Packed128 value) { + __stcg(reinterpret_cast(target), value.get_bits()); +} + +// 
short-form typedefs +typedef Packed128 f128; +typedef Packed128 x128; + +// ---------------------------------------------------------------------------- +// DType support + +// enumerator to identify the datatype of a tensor. +enum class DType : uint8_t { + FP32, FP16, BF16 +}; + +// Given a datatype enum, returns the underlying number of bytes +// for a scalar of that type +size_t sizeof_dtype(DType type) { + switch (type) { + case DType::FP32: + return sizeof(float); + case DType::FP16: + return sizeof(half); + case DType::BF16: + return sizeof(nv_bfloat16); + default: // handle or get compiler warning + fprintf(stderr, "Unknown datatype\n"); + exit(EXIT_FAILURE); + } +} + +DType dtype_of(float* f) { return DType::FP32; } +DType dtype_of(nv_bfloat16 * f) { return DType::BF16; } +DType dtype_of(half * f) { return DType::FP16; } + + + +// ---------------------------------------------------------------------------- +// Copy, cast functions + +// device functions and the kernel to cast data between types +template +__device__ Td cast_value(Ts val); + +template<> +__device__ float cast_value(float val) { + return val; +} + +template<> +__device__ float cast_value(half val) { + return __half2float(val); +} + +template<> +__device__ float cast_value(__nv_bfloat16 val) { + return __bfloat162float(val); +} + +template +__global__ void copy_and_cast_kernel(Td* dst, const Ts* src, size_t n, ptrdiff_t stride_dst, ptrdiff_t stride_src) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + // need to try grid stride looping for more perf later + if (idx < n) { + dst[idx + stride_dst * blockIdx.y] = cast_value(src[idx + stride_src * blockIdx.y]); + } +} + +// ---------------------------------------------------------------------------- +// Warp/Block communication primitives + +// warp-level reduction for summing values +__device__ inline float warpReduceSum(float val) { + for (int offset = 16; offset > 0; offset /= 2) { + val += __shfl_xor_sync(0xFFFFFFFF, val, offset); + } + 
return val; +} +// warp-level reduction for finding the maximum value +__device__ inline float warpReduceMax(float val) { + for (int offset = 16; offset > 0; offset /= 2) { + val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, offset)); + } + return val; +} +// requires all 32 threads in the warp to be active, but should work for any block size +// uses non-dynamic shared memory so every call increases shared memory requirements by 128 bytes +// the fact it's unique shared memory allows us to avoid an extra __syncthreads() call at the end +// but if called inside a loop, the shared memory will be implicitly reused, so set final_sync to 1 +using reduction_func_t = float (*) (float); +template<reduction_func_t warp_reduction> +__device__ inline float blockReduce(float val, bool final_sync=false, float out_of_bounds=0.0f) { + // two reductions of up to 1024 threads: + // 1) inside warp (shuffle), 2) cross-warp (shared memory), 3) inside warp (shuffle) + __shared__ float shared_val[WARP_SIZE]; + const int lane_id = threadIdx.x % WARP_SIZE; + const int warp_id = threadIdx.x / WARP_SIZE; + const int num_warps = blockDim.x / WARP_SIZE; + + float warp_val = warp_reduction(val); + if (lane_id == 0) { shared_val[warp_id] = warp_val; } + __syncthreads(); + warp_val = (lane_id < num_warps) ? shared_val[lane_id] : out_of_bounds; + float block_val = warp_reduction(warp_val); + + if (final_sync) { + __syncthreads(); // only needed in loops when effectively reusing shared memory etc. + } + return block_val; +} + +// Performs a _deterministic_ sum reduction. determinism is achieved by requiring that only +// a single block be used. +template<class Float> +__global__ void global_sum_single_block_kernel(float* result, const Float* values, size_t count) { + assert(gridDim.x == 1); // only a single block! 
+ float thread_sum = 0; + for(size_t index = threadIdx.x; index < count; index += blockDim.x) { + thread_sum += (float)values[index]; + } + + float reduction = blockReduce<warpReduceSum>(thread_sum, true); + if(threadIdx.x == 0) { + *result = reduction; + } +} + +template<class Float> +void global_sum_deterministic(float* result, const Float* values, int count, cudaStream_t stream) { + global_sum_single_block_kernel<<<1, 1024, 0, stream>>>(result, values, count); + cudaCheck(cudaGetLastError()); +} + +// ---------------------------------------------------------------------------- +// memory management + +// allocate memory, preferably on the device +// returns a status code. 0 = OK, 1 = fell back to managed memory +int cudaMallocConditionallyManaged(void** out, size_t bytes, const char *file, int line) { + // try to allocate + cudaError_t err = cudaMalloc(out, bytes); + if(err == cudaErrorMemoryAllocation) { + // if we OOM, fall back to a managed allocation. slower but at least won't crash. + cudaGetLastError(); // reset the error before the next API call + cudaCheck_(cudaMallocManaged(out, bytes), file, line); + cudaCheck_(cudaMemAdvise(*out, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId), file, line); + return 1; + } else { + cudaCheck_(err, file, line); + return 0; + } +} + +#define cudaMallocConditionallyManaged(out, bytes)\ +(cudaMallocConditionallyManaged((void**)out, bytes, __FILE__, __LINE__)) + +// ---------------------------------------------------------------------------- +// Random Number Generation used in Stochastic Rounding + +// SquirrelNoise5 - Squirrel's Raw Noise utilities (version 5) +// This gives us a random number from threadIdx/blockIdx + a single seed for the entire GPU +// todo - possibly overkill and we don't need such high quality random numbers? 
(tbd) +// http://eiserloh.net/noise/SquirrelNoise5.hpp +__device__ __host__ constexpr unsigned int SquirrelNoise5(unsigned int positionX, unsigned int seed) +{ + constexpr unsigned int SQ5_BIT_NOISE1 = 0xd2a80a3f; // 11010010101010000000101000111111 + constexpr unsigned int SQ5_BIT_NOISE2 = 0xa884f197; // 10101000100001001111000110010111 + constexpr unsigned int SQ5_BIT_NOISE3 = 0x6C736F4B; // 01101100011100110110111101001011 + constexpr unsigned int SQ5_BIT_NOISE4 = 0xB79F3ABB; // 10110111100111110011101010111011 + constexpr unsigned int SQ5_BIT_NOISE5 = 0x1b56c4f5; // 00011011010101101100010011110101 + unsigned int mangledBits = positionX; + mangledBits *= SQ5_BIT_NOISE1; + mangledBits += seed; + mangledBits ^= (mangledBits >> 9); + mangledBits += SQ5_BIT_NOISE2; + mangledBits ^= (mangledBits >> 11); + mangledBits *= SQ5_BIT_NOISE3; + mangledBits ^= (mangledBits >> 13); + mangledBits += SQ5_BIT_NOISE4; + mangledBits ^= (mangledBits >> 15); + mangledBits *= SQ5_BIT_NOISE5; + mangledBits ^= (mangledBits >> 17); + return mangledBits; +} +__device__ __host__ constexpr unsigned int Get2dNoiseUint(int indexX, int indexY, unsigned int seed) +{ + constexpr unsigned int PRIME_NUMBER = 198491317u; // Large prime number with non-boring bits + unsigned int x = static_cast<unsigned int>(indexX); + unsigned int y = static_cast<unsigned int>(indexY); + + return SquirrelNoise5(x + (PRIME_NUMBER * y), seed); +} + +// stochastic rounding built on top of Squirrel Noise above (with seed updated per step via xorshift) +__device__ __forceinline__ void stochastic_rounding(float in, __nv_bfloat16 *out, unsigned int seed) { + // todo - is this stochastic rounding *too good*? can we cut any corners? 
+ // makes sure each thread gets a different random number + unsigned int random = Get2dNoiseUint(threadIdx.x, blockIdx.x * blockDim.x + blockIdx.y, seed); + unsigned int threshold = random & 0xFFFF; + unsigned int float_bits = __float_as_uint(in); + unsigned int rounded_bits = float_bits & 0x0000FFFF; + float_bits = (rounded_bits > threshold) ? (float_bits | 0xFFFF) : (float_bits & ~0xFFFF); + *out = __float2bfloat16_rn(__uint_as_float(float_bits)); +} +__device__ __forceinline__ void stochastic_rounding(float in, half *out, unsigned int random) { + *out = (float)in; // todo - implement this... +} +__device__ __forceinline__ void stochastic_rounding(float in, float *out, unsigned int random) { + *out = in; // dummy function for when floatX is float (FP32 mode) +} + +#endif \ No newline at end of file From b41b9892eebc894445e19b34027bf1e6a4d5b2d6 Mon Sep 17 00:00:00 2001 From: Eamon Date: Sat, 6 Jun 2026 18:09:59 +0530 Subject: [PATCH 37/45] feat: add distributed sharded DataLoader for binary token files --- CUDA/llmcpp/dataloader.h | 496 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 496 insertions(+) create mode 100644 CUDA/llmcpp/dataloader.h diff --git a/CUDA/llmcpp/dataloader.h b/CUDA/llmcpp/dataloader.h new file mode 100644 index 0000000..0ee0588 --- /dev/null +++ b/CUDA/llmcpp/dataloader.h @@ -0,0 +1,496 @@ +#ifndef DATALOADER_H +#define DATALOADER_H + +#include <stdio.h> +#include <stdlib.h> +#include <stddef.h> +#include <stdint.h> +#include <assert.h> +#include <string.h> +#include "utils.h" +#include "rand.h" +#ifndef _WIN32 +#include <glob.h> +#endif +#define HEADER_SIZE 256 + +typedef struct +{ + + int process_rank; + int num_processes; + + size_t B; + size_t T; + size_t num_tokens; + size_t shard_num_samples; + + glob_t glob_result; + size_t current_shard_idx; + size_t current_sample_idx; + + FILE *tokens_file; + + uint16_t *buffer; + int *inputs; + int *targets; + + mt19937_state shuffle_rng; + int should_shuffle; + int *shard_indices; + int *intra_shard_indices; + + size_t total_batch_size_bytes; + size_t 
local_batch_offset_bytes; + size_t header_bytes; + int64_t file_size_bytes; +} DataLoader; + +int64_t dataloader_load_shard_(DataLoader *loader, int shard_index) +{ + if (loader->should_shuffle) + { + shard_index = loader->shard_indices[shard_index]; + } + + const char *filename = loader->glob_result.gl_pathv[shard_index]; + + if (loader->tokens_file != NULL) + { + fcloseCheck(loader->tokens_file); + } + loader->tokens_file = fopenCheck(filename, "rb"); + + int header[HEADER_SIZE]; + freadCheck(header, sizeof(int), HEADER_SIZE, loader->tokens_file); + if (header[0] != 20240520) + { + + printf("---> HINT: Are you passing in a correct file?\n"); + printf("---> HINT: The data encoding may have changed, re-run data prepro or refer again to README.\n"); + exit(EXIT_FAILURE); + } + if (header[1] != 1) + { + printf("Bad version in data file\n"); + exit(EXIT_FAILURE); + } + int64_t ntok = header[2]; // + assert(ntok > 0); + fseekCheck(loader->tokens_file, 0, SEEK_END); + loader->file_size_bytes = ftell(loader->tokens_file); + fseekCheck(loader->tokens_file, 0, SEEK_SET); + int64_t expected_file_size = HEADER_SIZE * sizeof(int) + ntok * sizeof(uint16_t); + if (loader->file_size_bytes != expected_file_size) + { + printf("Error: file size is not as expected\n"); + exit(EXIT_FAILURE); + } + + loader->shard_num_samples = (ntok * sizeof(uint16_t) - sizeof(uint16_t)) / loader->total_batch_size_bytes; + return ntok; +} + +void prepare_intra_shard_indices_(DataLoader *loader) +{ + + if (loader->intra_shard_indices != NULL) + { + + free(loader->intra_shard_indices); + } + loader->intra_shard_indices = (int *)mallocCheck(loader->shard_num_samples * sizeof(int)); + init_identity_permutation(loader->intra_shard_indices, (int)loader->shard_num_samples); + random_permutation(loader->intra_shard_indices, (int)loader->shard_num_samples, &loader->shuffle_rng); +} + +void dataloader_reset(DataLoader *loader) +{ + loader->current_shard_idx = 0; + loader->current_sample_idx = 0; + + if 
(loader->should_shuffle) + { + random_permutation(loader->shard_indices, (int)loader->glob_result.gl_pathc, &loader->shuffle_rng); + } + + dataloader_load_shard_(loader, (int)loader->current_shard_idx); + + if (loader->should_shuffle) + { + prepare_intra_shard_indices_(loader); + } +} + +void dataloader_advance_(DataLoader *loader) +{ + if (loader->current_shard_idx == loader->glob_result.gl_pathc - 1) + { + + dataloader_reset(loader); + return; + } + + loader->current_shard_idx = (loader->current_shard_idx + 1) % loader->glob_result.gl_pathc; + loader->current_sample_idx = 0; + dataloader_load_shard_(loader, (int)loader->current_shard_idx); + + if (loader->should_shuffle) + { + prepare_intra_shard_indices_(loader); + } +} + +void dataloader_init(DataLoader *loader, + const char *filename_pattern, + size_t B, + size_t T, + int process_rank, + int num_processes, + int should_shuffle) +{ + loader->process_rank = process_rank; + loader->num_processes = num_processes; + loader->B = B; + loader->T = T; + loader->tokens_file = NULL; + loader->should_shuffle = should_shuffle; + loader->header_bytes = HEADER_SIZE * sizeof(int); + loader->total_batch_size_bytes = ((loader->num_processes * (loader->B * loader->T)) * sizeof(uint16_t)); + loader->local_batch_offset_bytes = loader->process_rank * loader->B * loader->T * sizeof(uint16_t); + + int glob_status = glob(filename_pattern, 0, NULL, &loader->glob_result); + if (glob_status != 0) + { + printf("Error: failed to glob pattern: %s\n", filename_pattern); + exit(EXIT_FAILURE); + } + if (loader->glob_result.gl_pathc == 0) + { + printf("Error: no files found matching the pattern: %s\n", filename_pattern); + exit(EXIT_FAILURE); + } + + if (should_shuffle) + { + mt19937_state shuffle_rng; + manual_seed(&shuffle_rng, 42 + process_rank); + loader->shuffle_rng = shuffle_rng; + loader->shard_indices = (int *)mallocCheck(loader->glob_result.gl_pathc * sizeof(int)); + init_identity_permutation(loader->shard_indices, 
(int)loader->glob_result.gl_pathc); + loader->intra_shard_indices = NULL; + } + + int64_t ntok_total = 0; + for (int shard_index = 0; shard_index < loader->glob_result.gl_pathc; shard_index++) + { + int64_t shard_ntok = dataloader_load_shard_(loader, shard_index); + + assert(shard_ntok >= (int64_t)(num_processes * B * T + 1)); + ntok_total += shard_ntok; + } + + loader->buffer = (uint16_t *)mallocCheck((B * T + 1) * sizeof(uint16_t)); + loader->inputs = (int *)mallocCheck(B * T * sizeof(int)); + loader->targets = (int *)mallocCheck(B * T * sizeof(int)); + loader->num_tokens = ntok_total; + + dataloader_reset(loader); +} + +void dataloader_load_batch(DataLoader *loader) +{ + assert(!loader->should_shuffle || (loader->should_shuffle && loader->intra_shard_indices != NULL)); + assert(loader->current_sample_idx < loader->shard_num_samples); + size_t idx = loader->should_shuffle ? loader->intra_shard_indices[loader->current_sample_idx] : loader->current_sample_idx; + size_t global_batch_offset_bytes = idx * loader->total_batch_size_bytes; + int64_t current_offset = loader->header_bytes + global_batch_offset_bytes + loader->local_batch_offset_bytes; + + size_t B = loader->B; + size_t T = loader->T; + + fseekCheck(loader->tokens_file, (int)current_offset, SEEK_SET); + freadCheck(loader->buffer, sizeof(uint16_t), B * T + 1, loader->tokens_file); + + for (int i = 0; i < B * T; i++) + { + loader->inputs[i] = (int)loader->buffer[i]; + loader->targets[i] = (int)loader->buffer[i + 1]; + } +} + +void dataloader_next_batch(DataLoader *loader) +{ + + if (loader->current_sample_idx >= loader->shard_num_samples) + { + dataloader_advance_(loader); + } + dataloader_load_batch(loader); + loader->current_sample_idx += 1; +} + +void dataloader_resume(DataLoader *loader, size_t current_shard_idx, size_t current_sample_idx) +{ + + loader->current_shard_idx = current_shard_idx; + loader->current_sample_idx = current_sample_idx; + dataloader_load_shard_(loader, 
(int)loader->current_shard_idx); +} + +void dataloader_free(DataLoader *loader) +{ + free(loader->buffer); + free(loader->inputs); + free(loader->targets); + if (loader->should_shuffle) + { + free(loader->shard_indices); + free(loader->intra_shard_indices); + } + fcloseCheck(loader->tokens_file); + globfree(&loader->glob_result); +} + +#define ASSUMED_NUM_COMPLETIONS 4 +#define CEIL_DIV(M, N) (((M) + (N) - 1) / (N)) + +typedef struct +{ + + int process_rank; + int num_processes; + + size_t B; + size_t T; + FILE *eval_file; + uint16_t *buffer; + int num_examples; + int num_batches; + int start_example_index; + int end_example_index; + int current_example_index; + int *inputs; + int *targets; + char *mask; + int *label; + int num_completions; +} EvalLoader; + +void evalloader_reset(EvalLoader *loader) +{ + int examples_per_process = CEIL_DIV(loader->num_examples, loader->num_processes); + int can_fit_examples = (int)(loader->B / ASSUMED_NUM_COMPLETIONS); + if (can_fit_examples == 0) + { + + printf("HellaSwag EvalLoader: batch size %zu is < %d\n", loader->B, ASSUMED_NUM_COMPLETIONS); + printf("---> HINT: Disable HellaSwag eval with -h 0, or increase batch size with -b\n"); + exit(EXIT_FAILURE); + } + loader->num_batches = CEIL_DIV(examples_per_process, can_fit_examples); + + loader->start_example_index = examples_per_process * loader->process_rank; + loader->end_example_index = examples_per_process * (loader->process_rank + 1); + + if (loader->end_example_index > loader->num_examples) + { + loader->end_example_index = loader->num_examples; + } + + int64_t header_bytes = HEADER_SIZE * sizeof(int); + fseekCheck(loader->eval_file, (int)header_bytes, SEEK_SET); + for (int i = 0; i < loader->start_example_index; i++) + { + uint16_t example_header[3]; + // read 3 uint16_t values: , , + freadCheck(&example_header[0], sizeof(uint16_t), 3, loader->eval_file); + // validate the delimiter + assert(example_header[0] == 65535); // delimiter + // validate the + 
assert(example_header[2] == i); // should match the loop index + // skip to the next example, keeping in mind that we already read the header + size_t remaining_bytes = example_header[1] - sizeof(uint16_t) * 3; + assert(remaining_bytes > 0); // we expect some bytes in the example + fseekCheck(loader->eval_file, (int)remaining_bytes, SEEK_CUR); + } + // now we are at the start of the example we want to start at, pointing at + loader->current_example_index = loader->start_example_index; +} + +void evalloader_init(EvalLoader *loader, + const char *filename, + size_t B, + size_t T, + int process_rank, + int num_processes) +{ + loader->process_rank = process_rank; + loader->num_processes = num_processes; + loader->B = B; + loader->T = T; + + // open the file and validate the header + loader->eval_file = fopenCheck(filename, "rb"); + // validate the header + int header[HEADER_SIZE]; + freadCheck(header, sizeof(int), HEADER_SIZE, loader->eval_file); + if (header[0] != 20240522) + { + printf("Bad magic in eval file\n"); + exit(EXIT_FAILURE); + } + if (header[1] != 1) + { + printf("Bad version in data file\n"); + exit(EXIT_FAILURE); + } + loader->num_examples = header[2]; // number of examples in the file + assert(loader->num_examples >= num_processes); // avoid headaches for now + size_t longest_example_bytes = header[3]; // longest example in the file + // basic sensibility check we could relax later. but roughly each example + // contains the prompt (or "context") and 4 completions, all of these have to be + // up to T tokens, and their tokens are uint16_t (so 2 bytes/token). + // There's a few more things in each example but they are minor. + // So longest example should be roughly this. Just trying to make sure it's sensible. 
+ assert(longest_example_bytes > 0 && longest_example_bytes < (1 + ASSUMED_NUM_COMPLETIONS) * T * 2); + + // allocate all the space we'll need + int can_fit_examples = (int)(B / ASSUMED_NUM_COMPLETIONS); + loader->buffer = (uint16_t *)mallocCheck(longest_example_bytes); + loader->inputs = (int *)calloc(B * T, sizeof(int)); + loader->targets = (int *)calloc(B * T, sizeof(int)); + loader->mask = (char *)mallocCheck(B * T * sizeof(char)); + loader->label = (int *)mallocCheck(can_fit_examples * sizeof(int)); + + // reset the loader, to initialize it + evalloader_reset(loader); +} + +void evalloader_next_example_(EvalLoader *loader, int example_batch_index) +{ + size_t B = loader->B; + size_t T = loader->T; + int batch_dim_offset = example_batch_index * ASSUMED_NUM_COMPLETIONS; + uint16_t example_header[3]; + freadCheck(&example_header[0], sizeof(uint16_t), 3, loader->eval_file); + assert(example_header[0] == 65535); + assert(example_header[2] == loader->current_example_index); + assert(example_header[2] >= loader->start_example_index && example_header[2] < loader->end_example_index); + + size_t example_bytes = example_header[1] - sizeof(uint16_t) * 3; + freadCheck(loader->buffer, sizeof(char), example_bytes, loader->eval_file); + int label = (int)loader->buffer[0]; + int can_fit_examples = (int)(loader->B / ASSUMED_NUM_COMPLETIONS); + assert(label >= 0 && label < ASSUMED_NUM_COMPLETIONS); + assert(example_batch_index >= 0 && example_batch_index < can_fit_examples); + loader->label[example_batch_index] = label; + int num_completions = (int)loader->buffer[1]; + assert(num_completions == ASSUMED_NUM_COMPLETIONS); + assert(batch_dim_offset + num_completions <= B); + loader->num_completions = num_completions; + + int context_length = (int)loader->buffer[2]; + uint16_t *context_tokens_start = &loader->buffer[3]; + assert(context_length > 0 && context_length < T); + for (int b = 0; b < num_completions; b++) + { + for (int i = 0; i < context_length; i++) + { + int boff = 
batch_dim_offset + b; + int tok_cur = (int)context_tokens_start[i]; + loader->inputs[boff * T + i] = tok_cur; + } + } + uint16_t *completions_iter = loader->buffer + 3 + context_length; + for (int c = 0; c < num_completions; c++) + { + int coff = batch_dim_offset + c; + int completion_length = (int)completions_iter[0]; + uint16_t *completion_tokens_start = completions_iter + 1; + assert(completion_length > 0 && context_length + completion_length < T); + for (int i = 0; i < completion_length; i++) + { + int tok_cur = (int)completion_tokens_start[i]; + loader->inputs[coff * T + context_length + i] = tok_cur; + loader->targets[coff * T + context_length + i - 1] = tok_cur; + + loader->mask[coff * T + context_length + i - 1] = 1; + } + completions_iter += 1 + completion_length; + } + // advance once per example, after all completions are copied + loader->current_example_index += 1; + } + + void evalloader_next_batch(EvalLoader * loader) + { + size_t B = loader->B; + size_t T = loader->T; + memset(loader->mask, 0, B * T * sizeof(char)); + int can_fit_examples = (int)(B / ASSUMED_NUM_COMPLETIONS); + for (int i = 0; i < can_fit_examples; i++) + { + if (loader->current_example_index >= loader->end_example_index) + { + break; + } + evalloader_next_example_(loader, i); + } + } + + int evalloader_stat_losses(EvalLoader * loader, float *losses) + { + int correct = 0; + size_t B = loader->B; + size_t T = loader->T; + int can_fit_examples = (int)(B / ASSUMED_NUM_COMPLETIONS); + for (int i = 0; i < can_fit_examples; i++) + { + float min_loss = 0.0f; + int min_loss_index = -1; + char active = 0; + for (int b = 0; b < ASSUMED_NUM_COMPLETIONS; b++) + { + int boff = i * ASSUMED_NUM_COMPLETIONS + b; + float average_loss = 0.0f; + int count = 0; + for (int t = 0; t < T; t++) + { + char mask = loader->mask[boff * T + t]; + if (mask == 1) + { + active = 1; + average_loss += losses[boff * T + t]; + count++; + } + } + if (count > 0) + { + average_loss /= count; + } + if (b == 0 || average_loss < min_loss) + { + min_loss = average_loss; + min_loss_index 
= b; + } + } + if (active && (min_loss_index == loader->label[i])) + { + correct += 1; + } + } + return correct; + } + + void evalloader_free(EvalLoader * loader) + { + free(loader->buffer); + free(loader->inputs); + free(loader->targets); + free(loader->mask); + free(loader->label); + fcloseCheck(loader->eval_file); + } + +#endif \ No newline at end of file From 58ab6040f8ecb8d7ba7111971ec88a7bec16379b Mon Sep 17 00:00:00 2001 From: Eamon Date: Sun, 7 Jun 2026 17:47:09 +0530 Subject: [PATCH 38/45] feat(multi-gpu): add foundational utilities for ZeRO sharding --- CUDA/llmcpp/zero.cuh | 597 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 597 insertions(+) create mode 100644 CUDA/llmcpp/zero.cuh diff --git a/CUDA/llmcpp/zero.cuh b/CUDA/llmcpp/zero.cuh new file mode 100644 index 0000000..e6c5b6e --- /dev/null +++ b/CUDA/llmcpp/zero.cuh @@ -0,0 +1,597 @@ +/* +Utilities for ZeRO sharding +*/ + +#ifndef LLMC_ZERO_CUH +#define LLMC_ZERO_CUH + +#include +#include +#include +#include +#include + +#ifdef MULTI_GPU +#include +#ifdef USE_MPI +#include +#endif +#endif + +// defines: fcloseCheck, fwriteCheck, scloseCheck, sclosesocketCheck +#include "utils.h" + +// ---------------------------------------------------------------------------- +// Multi-GPU related +#ifdef MULTI_GPU + +#if defined(ENABLE_FP32) +const ncclDataType_t ncclFloatX = ncclFloat; +#elif defined(ENABLE_FP16) +const ncclDataType_t ncclFloatX = ncclHalf; +#else // Default to bfloat16 +const ncclDataType_t ncclFloatX = ncclBfloat16; +#endif + +void nccl_check(ncclResult_t status, const char *file, int line) { + if (status != ncclSuccess) { + printf("[NCCL ERROR] at file %s:%d:\n%s\n", file, line, ncclGetErrorString(status)); + exit(EXIT_FAILURE); + } +} +#define ncclCheck(err) (nccl_check(err, __FILE__, __LINE__)) + +#ifdef USE_MPI +void mpi_check(int status, const char *file, int line) { + if (status != MPI_SUCCESS) { + char mpi_error[4096]; + int mpi_error_len = 0; + 
assert(MPI_Error_string(status, &mpi_error[0], &mpi_error_len) == MPI_SUCCESS); + printf("[MPI ERROR] at file %s:%d:\n%.*s\n", file, line, mpi_error_len, mpi_error); + exit(EXIT_FAILURE); + } +} +#define mpiCheck(err) (mpi_check(err, __FILE__, __LINE__)) +#endif + +#endif // MULTI_GPU + +// ---------------------------------------------------------------------------- +// Parameters specific to training on multiple GPUs. +typedef struct { + int process_rank; // Rank of this process among all processes. 0 if no multi-GPU. + int num_processes; // Total number of processes. 1 if no multi-GPU. + int local_device_idx; // This process GPU index on current machine. 0 if no multi-GPU. + + // Zero Redundancy Optimizer stage - https://fairscale.readthedocs.io/en/stable/deep_dive/oss_sdp_fsdp.html + // 0-Disabled + // 1-Optimizer State Sharding (OSS) + // 2-Optimizer + Gradient State Sharding (SDP) + // 3-Optimizer + Gradient + Horizontal Model Sharding (FSDP) + int zero_stage; + size_t shard_num_parameters; +#ifdef MULTI_GPU + ncclComm_t nccl_comm; // NCCL communication primitive, used for collective multi-GPU work. + cudaStream_t nccl_stream; // CUDA Stream to perform NCCL operations. 
+ cudaEvent_t compute_nccl_sync; // Event used to synchronize NCCL with the compute + float* unified_buffer; +#endif +} MultiGpuConfig; + +// one global variable to hold the multi-GPU configuration for this process +// inline, so we can include this header multiple times without getting multiple definitions +inline MultiGpuConfig multi_gpu_config; + +#ifdef MULTI_GPU + +#ifdef _WIN32 +void send_nccl_id_to_clients_windows(ncclUniqueId *nccl_id, SOCKET client_sockets[], int num_clients) { + for (int i = 0; i < num_clients; ++i) { + if (send(client_sockets[i], (const char *)nccl_id, sizeof(*nccl_id), 0) == SOCKET_ERROR) { + printf("Failed to send nccl_id"); + WSACleanup(); + exit(EXIT_FAILURE); + } + closesocketCheck(client_sockets[i]); + } +} +#else +void send_nccl_id_to_clients(ncclUniqueId *nccl_id, int client_sockets[], int num_clients) { + for (int i = 0; i < num_clients; ++i) { + if (send(client_sockets[i], nccl_id, sizeof(*nccl_id), 0) == -1) { + printf("Failed to send nccl_id"); + exit(EXIT_FAILURE); + } + scloseCheck(client_sockets[i]); + } +} +#endif + +#ifdef _WIN32 +// Same as get_nccl_id_via_tcp but for Windows +ncclUniqueId get_nccl_id_via_tcp_windows(MultiGpuConfig* result, const char* server_ip) { + ncclUniqueId nccl_id; + + int SERVER_PORT = 12345; // hardcoded an arbitrary port number between 1024 and 49151 (registered ports) + WSADATA wsaData; + if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0) { + printf("WSAStartup failed"); + exit(EXIT_FAILURE); + } + + if (result->process_rank == 0) { + ncclCheck(ncclGetUniqueId(&nccl_id)); + + int MAX_CLIENTS = result->num_processes - 1; + SOCKET client_sockets[MAX_CLIENTS]; + int num_clients = 0; + SOCKET server_socket, new_socket; + struct sockaddr_in address; + int addrlen = sizeof(address); + + // Step 1) create a server TCP socket + if ((server_socket = socket(AF_INET, SOCK_STREAM, 0)) == INVALID_SOCKET) { + printf("Socket failed"); + WSACleanup(); + exit(EXIT_FAILURE); + } + + // Step 2) set the server 
address and port + address.sin_family = AF_INET; // IPv4 + address.sin_addr.s_addr = inet_addr(server_ip); + address.sin_port = htons(SERVER_PORT); + + // Step 3) bind the socket to the address and port + if (bind(server_socket, (struct sockaddr *)&address, sizeof(address)) == SOCKET_ERROR) { + printf("Bind failed"); + closesocketCheck(server_socket); + WSACleanup(); + exit(EXIT_FAILURE); + } + + // Step 4) MAX_CLIENTS specifies the maximum number of clients that can be queued for this server + if (listen(server_socket, MAX_CLIENTS) == SOCKET_ERROR) { + printf("Listen failed"); + closesocketCheck(server_socket); + WSACleanup(); + exit(EXIT_FAILURE); + } + + // Step 5) accept connections from clients + printf("Waiting for clients to connect...\n"); + while (num_clients < MAX_CLIENTS) { + if ((new_socket = accept(server_socket, (struct sockaddr *)&address, &addrlen)) == INVALID_SOCKET) { + printf("Accept failed"); + closesocketCheck(server_socket); + WSACleanup(); + exit(EXIT_FAILURE); + } + client_sockets[num_clients++] = new_socket; + printf("Client %d connected\n", num_clients); + } + + // Step 6) send the NCCL ID to all clients + send_nccl_id_to_clients_windows(&nccl_id, client_sockets, num_clients); + printf("NCCL ID sent to all clients\n"); + + closesocketCheck(server_socket); + } else { + int num_connection_attempts = 5; + int time_to_sleep = 2; + SOCKET client_socket; + struct sockaddr_in serv_addr; + + // Step 1) create a client TCP socket + if ((client_socket = socket(AF_INET, SOCK_STREAM, 0)) == INVALID_SOCKET) { + printf("Socket creation error"); + WSACleanup(); + exit(EXIT_FAILURE); + } + + // Step 2) set the server address and port + serv_addr.sin_family = AF_INET; + serv_addr.sin_port = htons(SERVER_PORT); + if (inet_pton(AF_INET, server_ip, &serv_addr.sin_addr) <= 0) { + printf("Invalid address or address not supported"); + closesocketCheck(client_socket); + WSACleanup(); + exit(EXIT_FAILURE); + } + + // Step 3) Try to connect to the server - retry up 
to `num_connection_attempts` times if the connection fails + while (connect(client_socket, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) == SOCKET_ERROR) { + printf("%d Connection failed, retrying in %d seconds\n", result->process_rank, time_to_sleep); + if (--num_connection_attempts == 0) { + printf("Failed to connect to the server\n"); + closesocketCheck(client_socket); + WSACleanup(); + exit(EXIT_FAILURE); + } + Sleep(time_to_sleep * 1000); + } + + // Step 4) receive the NCCL ID from the server + if (recv(client_socket, (char *)&nccl_id, sizeof(nccl_id), 0) <= 0) { + printf("Failed to receive nccl_id"); + closesocketCheck(client_socket); + WSACleanup(); + exit(EXIT_FAILURE); + } + + printf("Received NCCL ID\n"); + closesocketCheck(client_socket); + } + + WSACleanup(); + return nccl_id; +} +#else +ncclUniqueId get_nccl_id_via_tcp(MultiGpuConfig* result, const char* server_ip) { + ncclUniqueId nccl_id; + + int SERVER_PORT = 12345; // hardcoded an arbitrary port number between 1024 and 49151 (registered ports) + if (result->process_rank == 0) { + ncclCheck(ncclGetUniqueId(&nccl_id)); + + int MAX_CLIENTS = result->num_processes - 1; + int client_sockets[MAX_CLIENTS]; + int num_clients = 0; + int server_socket, new_socket; + struct sockaddr_in address; + int addrlen = sizeof(address); + int opt = 1; + + // Step 1) create a server TCP socket + if ((server_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { + printf("Socket failed"); + exit(EXIT_FAILURE); + } + + // Step 2) set socket options + // SOL_SOCKET - means that option is configured at socket level + // SO_REUSEADDR - allows to bind to an address which is in a TIME_WAIT state (already used by another socket) - useful when restarting the server + // SO_REUSEPORT - allows to bind to the same port multiple times + if (setsockopt(server_socket, SOL_SOCKET, SO_REUSEADDR | SO_REUSEPORT, &opt, sizeof(opt)) < 0) { + printf("Setsockopt failed"); + exit(EXIT_FAILURE); + } + + // Step 3) set the server address and port 
+ address.sin_family = AF_INET; // IPv4 + address.sin_addr.s_addr = inet_addr(server_ip); // alternatively use INADDR_ANY to bind to all interfaces, currently we only allow ethernet + address.sin_port = htons(SERVER_PORT); + + // Step 4) bind the socket to the address and port + if (bind(server_socket, (struct sockaddr *)&address, sizeof(address)) < 0) { + printf("Bind failed"); + exit(EXIT_FAILURE); + } + + // Step 5) MAX_CLIENTS specifies the maximum number of clients that can be queued for this server + if (listen(server_socket, MAX_CLIENTS) < 0) { + printf("Listen failed"); + exit(EXIT_FAILURE); + } + + // Step 6) accept connections from clients + printf("Waiting for clients to connect...\n"); + while (num_clients < MAX_CLIENTS) { + if ((new_socket = accept(server_socket, (struct sockaddr *)&address, (socklen_t*)&addrlen)) < 0) { + printf("Accept failed"); + exit(EXIT_FAILURE); + } + client_sockets[num_clients++] = new_socket; + printf("Client %d connected\n", num_clients); + } + + // Step 7) send the NCCL ID to all clients + send_nccl_id_to_clients(&nccl_id, client_sockets, num_clients); + printf("NCCL ID sent to all clients\n"); + + scloseCheck(server_socket); + } else { + int num_connection_attempts = 5; + int time_to_sleep = 2; + int client_socket; + struct sockaddr_in serv_addr; + + // Step 1) create a client TCP socket + if ((client_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { + printf("Socket creation error"); + exit(EXIT_FAILURE); + } + + // Step 2) set the server address and port + serv_addr.sin_family = AF_INET; + serv_addr.sin_port = htons(SERVER_PORT); + if (inet_pton(AF_INET, server_ip, &serv_addr.sin_addr) <= 0) { + printf("Invalid address or address not supported"); + exit(EXIT_FAILURE); + } + + // Step 3) Try to connect to the server - retry up to `num_connection_attempts` times if the connection fails + while (connect(client_socket, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) { + printf("%d Connection failed, retrying in %d 
seconds\n", result->process_rank, time_to_sleep); + if (--num_connection_attempts == 0) { + printf("Failed to connect to the server\n"); + exit(EXIT_FAILURE); + } + sleep(time_to_sleep); + } + + // Step 4) receive the NCCL ID from the server + if (recv(client_socket, &nccl_id, sizeof(nccl_id), 0) <= 0) { + printf("Failed to receive nccl_id\n"); + exit(EXIT_FAILURE); + } + + printf("Received NCCL ID\n"); + scloseCheck(client_socket); + } + + return nccl_id; +} +#endif + +ncclUniqueId get_nccl_id_via_fs(MultiGpuConfig* result, char* fs_path) { + // Works assuming that the filesystem is shared among all processes + ncclUniqueId nccl_id; + FILE* idFile; + static char filename[1024]; + snprintf(filename, sizeof(filename), "%s/ncclUniqueId.sync", fs_path); + + if (result->process_rank != 0) { // client processes should wait for the server to write to the file + // This is a naive and not 100% robust way to synchronize the processes but it should work almost always + sleep(2); + } + + if (result->process_rank == 0) { + ncclCheck(ncclGetUniqueId(&nccl_id)); + idFile = fopen(filename, "wb"); + assert(idFile != NULL); + fwriteCheck(&nccl_id, sizeof(nccl_id), 1, idFile); + fcloseCheck(idFile); + } else { + // Other ranks wait until the file is available and read the unique ID + do { + sleep(1); // 1 second + idFile = fopen(filename, "rb"); + if (idFile != NULL) break; + } while (idFile == NULL); + freadCheck(&nccl_id, sizeof(nccl_id), 1, idFile); + fcloseCheck(idFile); + } + + return nccl_id; +} + +#ifdef USE_MPI +// Determine which GPU this process should use. +// Processes on the same machine use different GPU indices. Processes on other machines don't. 
+// Copied from NCCL examples: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-2-one-device-per-process-or-thread +int multi_gpu_get_local_device_idx(int process_rank, int num_processes) { + char hostname[1024]; + hostname[1023] = '\0'; + // All processes on the same machine will share the same hostname. + gethostname(hostname, 1023); + for (int i=0; i < 1024; i++) { + if (hostname[i] == '.') { + hostname[i] = '\0'; + break; + } + } + uint64_t hostname_hash = 5381u; + for (int c = 0; hostname[c] != '\0'; c++){ hostname_hash = ((hostname_hash << 5u) + hostname_hash) ^ hostname[c]; } + + // Distribute all hostname hashes to all processes. + uint64_t* all_hostsname_hashes = (uint64_t*)malloc(num_processes * sizeof(uint64_t)); + all_hostsname_hashes[process_rank] = hostname_hash; + mpiCheck(MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, all_hostsname_hashes, sizeof(uint64_t), MPI_BYTE, MPI_COMM_WORLD)); + + // Identify which GPU we need to use. + int local_device_idx = 0; + for (int current_process = 0; current_process < num_processes; ++current_process) { + if (current_process == process_rank) { + // Found my gpu, local_device_idx now has my target GPU index. 
+ break; + } + if (all_hostsname_hashes[current_process] == all_hostsname_hashes[process_rank]) { + // This process ID runs on the same machine, but it's not me, skip this GPU + local_device_idx++; + } + } + + free(all_hostsname_hashes); + return local_device_idx; +} +#endif + +#endif + +MultiGpuConfig multi_gpu_config_init(int num_processes, int process_rank, int gpus_per_node, char* server_ip, char* fs_path, char* init_method) { +#ifdef MULTI_GPU + MultiGpuConfig result; + ncclUniqueId nccl_id; + // Get nccl_id using MPI, TCP, or FS (file system synchronization) methods + // On newer slurm versions (slurm-wlm package) PMIx is disabled so we can not use MPI for NCCL init in multi node setup + if (strcmp(init_method, "mpi") == 0) { + #ifdef USE_MPI + mpiCheck(MPI_Init(NULL, NULL)); + mpiCheck(MPI_Comm_rank(MPI_COMM_WORLD, &result.process_rank)); + mpiCheck(MPI_Comm_size(MPI_COMM_WORLD, &result.num_processes)); + result.local_device_idx = multi_gpu_get_local_device_idx(result.process_rank, result.num_processes); + if (result.process_rank == 0) { + ncclCheck(ncclGetUniqueId(&nccl_id)); + } + mpiCheck(MPI_Bcast(&nccl_id, sizeof(nccl_id), MPI_BYTE, 0, MPI_COMM_WORLD)); + #else + printf("MPI support is disabled. 
Please enable MPI support to use MPI-based NCCL-init method.\n"); + exit(EXIT_FAILURE); + #endif + } else { + result.process_rank = process_rank; + result.num_processes = num_processes; + result.local_device_idx = process_rank % gpus_per_node; + if (strcmp(init_method, "tcp") == 0) { + #ifdef _WIN32 + nccl_id = get_nccl_id_via_tcp_windows(&result, server_ip); + #else + nccl_id = get_nccl_id_via_tcp(&result, server_ip); + #endif + } else if (strcmp(init_method, "fs") == 0) { + nccl_id = get_nccl_id_via_fs(&result, fs_path); + } else { + printf("Invalid NCCL-init method\n"); + exit(EXIT_FAILURE); + } + } + cudaCheck(cudaSetDevice(result.local_device_idx)); + ncclCheck(ncclCommInitRank(&result.nccl_comm, result.num_processes, nccl_id, result.process_rank)); + cudaCheck(cudaStreamCreate(&result.nccl_stream)); + // event without timing for maximum performance + cudaCheck(cudaEventCreate(&result.compute_nccl_sync, cudaEventDisableTiming)); + nvtxNameCudaStreamA(result.nccl_stream, "nccl stream"); + nvtxNameCudaEventA(result.compute_nccl_sync, "nccl compute sync"); + cudaCheck(cudaMallocManaged(&result.unified_buffer, sizeof(float))); + return result; +#else + printf("Multi-GPU support is disabled. 
Using a single GPU.\n"); + cudaCheck(cudaSetDevice(0)); + MultiGpuConfig result; + result.process_rank = 0; + result.num_processes = 1; + result.local_device_idx = 0; + return result; +#endif +} + +void multi_gpu_config_free(MultiGpuConfig* config) { +#ifdef MULTI_GPU + ncclCheck(ncclCommDestroy(config->nccl_comm)); + cudaCheck(cudaStreamDestroy(config->nccl_stream)); + cudaCheck(cudaEventDestroy(config->compute_nccl_sync)); + cudaCheck(cudaFree(config->unified_buffer)); + #ifdef USE_MPI + mpiCheck(MPI_Finalize()); + #endif +#endif +} + +void multi_gpu_barrier(const MultiGpuConfig* config) { +#ifdef MULTI_GPU + if (config->num_processes > 1) { + // the count argument is a number of elements, not bytes; unified_buffer holds a single float + ncclCheck(ncclAllReduce(config->unified_buffer, config->unified_buffer, 1, ncclFloat, ncclSum, config->nccl_comm, config->nccl_stream)); + } + cudaCheck(cudaDeviceSynchronize()); +#endif +} + +// Offset and size of a tensor shard +typedef struct { + ptrdiff_t offset; + size_t size; +} ShardInfo; + +// Get sharding info for a tensor that holds `elements` values +ShardInfo multi_gpu_get_shard_offset(size_t elements, const MultiGpuConfig* config, int shard_at_stage) { + const int nproc = config->num_processes; + if(config->zero_stage >= shard_at_stage) { + if (elements % nproc != 0) { + fprintf(stderr, "Number of elements %zu must be a multiple of the number of processes %d\n", elements, nproc); + exit(EXIT_FAILURE); + } + return {(ptrdiff_t) (config->process_rank * (elements / nproc)), elements / nproc}; + } else { + return {0, elements}; + } +} + +// Block NCCL stream until computations on compute_stream are done, then aggregate multiple pointers in an NCCL group. +// This can work either as an all-reduce (i.e., no ZeRO), or a reduce-scatter (ZeRO 1). +// The awkward `(&pointers)[N]` syntax ensures we are capturing the parameters as sized arrays, so that it becomes impossible +// to call this function if pointers and pointers_sizes do not match. 
+template<int N> +void multi_gpu_async_reduce_gradient( + floatX* const (&pointers)[N], const size_t (&pointers_sizes)[N], + MultiGpuConfig* config, cudaStream_t compute_stream) { + if (config->num_processes == 1) { + return; // no multi-GPU, just exit. + } + +#ifdef MULTI_GPU + NVTX_RANGE_FN(); + // mark an event on the compute stream, and immediately wait on this in the nccl stream + // this means that the nccl stream won't start executing before all compute kernels that + // have been submitted before this point have finished. + // by using an event instead of cudaStreamSynchronize, we avoid having to synchronize the host, and + // can enqueue new work to the GPU right away. + cudaCheck(cudaEventRecord(config->compute_nccl_sync, compute_stream)); + cudaCheck(cudaStreamWaitEvent(config->nccl_stream, config->compute_nccl_sync)); + ncclCheck(ncclGroupStart()); // NCCL group: aggregate all pointers in a single NCCL GPU kernel. + for (int i = 0; i < N; ++i) { + if(config->zero_stage == 0) { + ncclCheck(ncclAllReduce( + pointers[i], pointers[i], + pointers_sizes[i], + ncclFloatX, ncclAvg, + config->nccl_comm, config->nccl_stream + )); + } else if(config->zero_stage == 1) { + assert(pointers_sizes[i] % config->num_processes == 0); + size_t shard_size = pointers_sizes[i] / config->num_processes; + ptrdiff_t shard_offset = (ptrdiff_t)shard_size * config->process_rank; + ncclCheck(ncclReduceScatter( + pointers[i], pointers[i] + shard_offset, + shard_size, + ncclFloatX, ncclAvg, + config->nccl_comm, config->nccl_stream + )); + } + } + ncclCheck(ncclGroupEnd()); +#endif +} + +// convenience macro that only prints if the process rank is zero +#define printf0(...) 
if (::multi_gpu_config.process_rank == 0) { printf(__VA_ARGS__); } + +void set_zero_configs(MultiGpuConfig* config, int zero_stage, size_t total_parameters) { + config->zero_stage = 0; + config->shard_num_parameters = total_parameters; + // Check the Zero Stage and define sharding parameters + if (zero_stage == 0) { + printf0("| Zero Optimization is disabled |\n"); + } + else if (zero_stage == 1) { + if (total_parameters % config->num_processes != 0) { + printf0("| Zero Optimization is disabled, Can't equally partition parameters |\n"); + config->zero_stage = 0; + } + else { + config->zero_stage = 1; + config->shard_num_parameters = total_parameters / config->num_processes; + } + } + else{ + printf0("| Disabling Zero Optimization, Zero Stage2 and Stage3 are not yet supported |\n"); + config->zero_stage = 0; + } +} + +// Compute sum of a single CPU value across all GPU processes. No-op when multi-GPU is disabled. +float multi_gpu_cpu_float_sum(float value, MultiGpuConfig* config) { +#ifdef MULTI_GPU + if (config->num_processes == 1) return value; + + float* unified_buffer = config->unified_buffer; + *unified_buffer = value; + // the count argument is a number of elements, not bytes; we reduce exactly one float + ncclCheck(ncclAllReduce(unified_buffer, unified_buffer, 1, ncclFloat, ncclSum, config->nccl_comm, config->nccl_stream)); + cudaCheck(cudaDeviceSynchronize()); + return *unified_buffer; +#else + return value; +#endif +} + +#endif + From b91f867b826f8b74772d7f7c236a544f84e0c083 Mon Sep 17 00:00:00 2001 From: Eamon Date: Sun, 7 Jun 2026 17:49:34 +0530 Subject: [PATCH 39/45] feat(utils): add I/O and memory error-checking wrappers --- CUDA/llmcpp/layernorm.cuh | 505 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 505 insertions(+) create mode 100644 CUDA/llmcpp/layernorm.cuh diff --git a/CUDA/llmcpp/layernorm.cuh b/CUDA/llmcpp/layernorm.cuh new file mode 100644 index 0000000..9777d06 --- /dev/null +++ b/CUDA/llmcpp/layernorm.cuh @@ -0,0 +1,505 @@ +/* +LayerNorm CUDA kernel, and also Residual, because sometimes they are fused + 
+Note in llm.c we try to be clever in the backward pass to conserve memory. +All parameters use a += in the backward pass, so we can do gradient accumulation. +But all activations have = instead of += because these are faster (just write, no read). +This is okay for all activations except for those in the residual stream, where the +gradients have to add. We make sure that we do a += as necessary. +E.g., the layernorms are connected to the residuals so we += in layernorm backward. +*/ + +#include <assert.h> +// llmc internal imports +#include "cuda_common.h" +#include "cuda_utils.cuh" + +// ---------------------------------------------------------------------------- +// CUDA kernels + +__global__ void layernorm_forward_kernel3(floatX* __restrict__ out, float* __restrict__ mean, float* __restrict__ rstd, + const floatX* __restrict__ inp, const floatX* __restrict__ weight, + const floatX* __restrict__ bias, int N, int C) { + int lane_id = threadIdx.x % WARP_SIZE; + int warp_id = threadIdx.x / WARP_SIZE; + int num_warps = blockDim.x / WARP_SIZE; + + int idx = blockIdx.x * num_warps + warp_id; + if(idx >= N) { return; } // guard + + // the row of input that this group of threads is responsible for + const floatX* x = inp + idx * C; + + // mean + float sum = 0.0f; + for (int i = lane_id; i < C; i += WARP_SIZE) { + sum += (float)x[i]; + } + sum = warpReduceSum(sum); + float m = sum / C; + if(lane_id == 0 && mean != nullptr) { + __stcs(mean + idx, m); + } + + // rstd + sum = 0.0f; + for (int i = lane_id; i < C; i += WARP_SIZE) { + float diff = (float)x[i] - m; + sum += diff * diff; + } + sum = warpReduceSum(sum); + float s = rsqrtf(sum / C + 1e-5f); + if(lane_id == 0 && rstd != nullptr) { + __stcs(rstd + idx, s); + } + + // final normalization and scaling by weight/bias + floatX* o = out + idx * C; + for (int c = lane_id; c < C; c += WARP_SIZE) { + // load and store using the .cs "streaming" hint to the compiler, + // indicating that this data will not be reused soon, and can be 
streamed through the caches + // this allows the threads to get more cache-hits for the (shared) weight and bias parameters + float n = s * ((float)__ldcs(x+c) - m); + __stcs(o+c, (floatX)(n * (float)weight[c] + (float)bias[c])); + } +} + +__global__ void layernorm_forward_kernel6(floatX* __restrict__ out, float* __restrict__ mean, float* __restrict__ rstd, + const floatX* __restrict__ inp, const floatX* __restrict__ weight, + const floatX* __restrict__ bias, int N, int C) { + assert(blockDim.x == WARP_SIZE); + + // load weights and biases into shared memory + // do this before we allow any threads to exit! + extern __shared__ char* params[]; + // load128/store128 sometimes generated multiple instructions when the types here were floatX*, so + // let's keep everything as x128 + x128* s_weight = reinterpret_cast<x128*>(params); + x128* s_bias = reinterpret_cast<x128*>(params) + (C / x128::size); + x128* s_in = reinterpret_cast<x128*>(params) + ((2 + threadIdx.y) * C / x128::size); + + int sidx = (threadIdx.x + WARP_SIZE * threadIdx.y) * x128::size; + for(int i = sidx; i < C; i += blockDim.y * WARP_SIZE * x128::size) { + s_weight[i/x128::size] = load128(weight + i); + s_bias[i/x128::size] = load128(bias + i); + } + __syncthreads(); + + int idx = blockIdx.x * blockDim.y + threadIdx.y; + if(idx >= N) { return; } // guard + + // adjust pointers to current token + inp += idx * C; + out += idx * C; + + const float eps = 1e-5f; + float sum = 0.0f; + for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) { + const x128 in_data = load128cs(inp + c); + for(int k = 0; k < x128::size; ++k) { + sum += (float)in_data[k]; + } + s_in[c / x128::size] = in_data; + } + + sum = warpReduceSum(sum); + float m = sum / C; + float v = 0.f; + + for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) { + const x128 in_data = s_in[c / x128::size]; + for(int k = 0; k < x128::size; ++k) { + v += ((float)in_data[k] - m) * ((float)in_data[k] - m); + } + } + + v = warpReduceSum(v) / 
C; + float s = rsqrtf(v + eps); + + for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) { + const x128 in_data = s_in[c / x128::size]; + const x128 w = s_weight[c / x128::size]; + const x128 b = s_bias[c / x128::size]; + x128 out_data; + for(int k = 0; k < x128::size; ++k) { + float n = s * ((float)in_data[k] - m); // normalized output + float o = n * (float)w[k] + (float)b[k]; // scale and shift it + out_data[k] = (floatX)o; + } + + store128cs(out + c, out_data); + } + // cache the mean and rstd for the backward pass later + if(threadIdx.x == 0 && mean != nullptr) { + __stcs(mean + idx, m); + } + // store the rstd, no need to cache it + if(threadIdx.x == 0 && rstd != nullptr) { + __stcs(rstd + idx, s); + } +} + +__global__ void fused_residual_forward_kernel5(floatX* residual, floatX* normed, float* mean, float* rstd, + const floatX* inp1, const floatX* inp2, + const floatX* weight, const floatX* bias, + int N, int C) { + assert(blockDim.x == WARP_SIZE); + + // load weights and biases into shared memory + // do this before we allow any threads to exit! 
+ extern __shared__ char* params[]; + // load128/store128 sometimes generated multiple instructions when the types here were floatX*, so + // let's keep everything as x128 + x128* s_weight = reinterpret_cast<x128*>(params); + x128* s_bias = reinterpret_cast<x128*>(params) + (C / x128::size); + x128* s_res = reinterpret_cast<x128*>(params) + ((2 + threadIdx.y) * C / x128::size); + + int sidx = (threadIdx.x + WARP_SIZE * threadIdx.y) * x128::size; + for(int i = sidx; i < C; i += blockDim.y * WARP_SIZE * x128::size) { + s_weight[i/x128::size] = load128(weight + i); + s_bias[i/x128::size] = load128(bias + i); + } + __syncthreads(); + + int idx = blockIdx.x * blockDim.y + threadIdx.y; + if(idx >= N) return; // guard: valid rows are 0..N-1 + + // adjust pointers to current token + residual += C * idx; + normed += C * idx; + inp1 += C * idx; + inp2 += C * idx; + + const float eps = 1e-5f; + float sum = 0.0f; + for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) { + const x128 in1 = load128cs(inp1 + c); + const x128 in2 = load128cs(inp2 + c); + x128 out; + for(int k = 0; k < x128::size; ++k) { + out[k] = (float)in1[k] + (float)in2[k]; + sum += (float)out[k]; + } + store128cs(residual + c, out); + s_res[c / x128::size] = out; + } + + sum = warpReduceSum(sum); + float m = sum / C; + float v = 0.f; + + for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) { + const x128 res = s_res[c / x128::size]; + for(int k = 0; k < x128::size; ++k) { + v += ((float)res[k] - m) * ((float)res[k] - m); + } + } + + v = warpReduceSum(v) / C; + float s = rsqrtf(v + eps); + + for(int c = threadIdx.x * x128::size; c < C; c += WARP_SIZE * x128::size) { + const x128 res = s_res[c / x128::size]; + const x128 w = s_weight[c / x128::size]; + const x128 b = s_bias[c / x128::size]; + x128 out; + for(int k = 0; k < x128::size; ++k) { + float n = s * ((float)res[k] - m); // normalized output + float o = n * (float)w[k] + (float)b[k]; // scale and shift it + out[k] = o; + } + + store128cs(normed + c, out); + } + // 
cache the mean and rstd for the backward pass later + if(threadIdx.x == 0) { + mean[idx] = m; + rstd[idx] = s; + } +} + +__global__ void residual_forward_kernel(floatX* out, const floatX* inp1, const floatX* inp2) { + int idx = (blockIdx.x * blockDim.x + threadIdx.x) * x128::size; + + x128 packed_out; + x128 packed_inp1 = load128cs(inp1 + idx); + x128 packed_inp2 = load128cs(inp2 + idx); + for (int k = 0; k < packed_inp1.size; k++) { + packed_out[k] = (floatX)((float)packed_inp1[k] + (float)packed_inp2[k]); + } + store128(out + idx, packed_out); +} + +__global__ void __launch_bounds__(512, 2) // todo - any warnings on Turing with only 1024 threads? + layernorm_backward_kernel10(floatX* dinp, floatX* dweight, floatX* dbias, float* scratch, + const floatX* dout, const floatX* inp, const floatX* weight, + const float* mean, const float* rstd, + int B, int T, int C) { + int BLOCK_SIZE = blockDim.x; + int warpsInBlock = BLOCK_SIZE / WARP_SIZE; //number of warps in block + extern __shared__ float shared[]; + + int warpId = threadIdx.x / WARP_SIZE; // warp index within a block + int baseIdx = blockIdx.x * warpsInBlock + warpId; + int warpThreadIdx = threadIdx.x % WARP_SIZE; // Thread index within the warp + int warpsInGrid = gridDim.x * warpsInBlock; + int C_per_iteration = WARP_SIZE * x128::size; + int iterations_C = CEIL_DIV(C, C_per_iteration); // + 2; + + // the first half of shared memory is bias, second is weight + size_t rounded_C = CEIL_DIV(C, (32 * x128::size)) * (32 * x128::size); + float* dbias_shared = shared; + float* dweight_shared = shared + rounded_C; + // warp zero doesn't actually write to the _tmp_shared memory locations, so we don't need to reserve memory + // the obvious solution is to change the addressing below to use (threadId.x-32) as offset, but that causes + // register spills, so instead we mess with the base pointer here, which doesn't increase register usage. 
+ float* dbias_tmp_shared = shared + 2 * rounded_C - WARP_SIZE * f128::size; + float* dweight_tmp_shared = shared + 2 * rounded_C + f128::size * BLOCK_SIZE - 2 * WARP_SIZE * f128::size; + + // init shared memory to zero + for(int i = threadIdx.x * f128::size; i < rounded_C; i += BLOCK_SIZE * f128::size) { + store128(dbias_shared + i, f128::zeros()); + store128(dweight_shared + i, f128::zeros()); + } + __syncthreads(); + + for (int bt = baseIdx; bt < B * T; bt += warpsInGrid) { + const floatX* dout_bt = dout + bt * C; + const floatX* inp_bt = inp +bt * C; + floatX* dinp_bt = dinp + bt * C; + + // first: two reduce operations + float dnorm_mean = 0.0f; + float dnorm_norm_mean = 0.0f; + for (int i = warpThreadIdx * x128::size; i < C; i += WARP_SIZE * x128::size) { + x128 dout128_i = load128(dout_bt + i); + x128 inp128_i = load128(inp_bt + i); + x128 weight128_i = load128(weight + i); + for (int k = 0; k < x128::size; k++) { + float dnorm_i = (float)weight128_i[k] * (float)dout128_i[k]; + dnorm_mean += dnorm_i; + dnorm_norm_mean += dnorm_i * (float)inp128_i[k]; + } + } + + const float mean_bt = mean[bt]; + const float rstd_bt = rstd[bt]; + dnorm_mean = warpReduceSum(dnorm_mean) / C; + dnorm_norm_mean = warpReduceSum(dnorm_norm_mean) / C * rstd_bt - dnorm_mean * mean_bt * rstd_bt; + + for (int c = 0; c < iterations_C; c++) { + int global_index = (warpThreadIdx * x128::size) + (c * C_per_iteration); + + x128 dout128 = x128::zeros(); + x128 inp128 = x128::zeros(); + x128 dinp128 = x128::zeros(); + x128 weight128 = x128::zeros(); + + if(global_index < C) { + dout128 = load128cs(dout_bt + global_index); + inp128 = load128cs(inp_bt + global_index); + dinp128 = load128(dinp_bt + global_index); + weight128 = load128(weight + global_index); + } + + for(int o = 0; o < x128::size / f128::size; ++o) { + f128 dbias_f; + f128 dweight_f; + for(int i = 0; i < f128::size; ++i) { + int x = o * f128::size + i; + float dout_i = (float)dout128[x]; + float norm_bti = ((float)inp128[x] - 
mean_bt) * rstd_bt; + dbias_f[i] = dout_i; + dweight_f[i] = norm_bti * dout_i; + + float dval = 0.0f; + dval += (float) weight128[x] * (float)dout128[x]; // term 1 + dval -= dnorm_mean; // term 2 + dval -= norm_bti * dnorm_norm_mean; // term 3 + dval *= rstd_bt; // final scale + dinp128[x] = (floatX) ((float) dinp128[x] + dval); + } + + if (warpId != 0) { + store128(dbias_tmp_shared + threadIdx.x * f128::size, dbias_f); + // this seems to generate a 64-bit store, instead of 128-bit. + // however, forcing 128-bit (e.g., using inline ptx), results in register + // spilling and much worse performance, so we'll keep it like this for now + // but ideally, we could reduce the register pressure a little. + store128(dweight_tmp_shared + threadIdx.x * f128::size, dweight_f); + } + __syncthreads(); + if (warpId == 0) { + for (int j = 1; j < warpsInBlock; j++) { + f128 dbias_tmp = load128(dbias_tmp_shared + f128::size * (threadIdx.x + j * WARP_SIZE)); + f128 dweight_tmp = load128(dweight_tmp_shared + f128::size * (threadIdx.x + j * WARP_SIZE)); + for(int i = 0; i < f128::size; ++i) { + dbias_f[i] += dbias_tmp[i]; + dweight_f[i] += dweight_tmp[i]; + } + } + } + __syncthreads(); + if (warpId == 0) { + f128 db_old = load128(dbias_shared + global_index + f128::size * o); + f128 dw_old = load128(dweight_shared + global_index + f128::size * o); + for(int i = 0; i < f128::size; ++i) { + dbias_f[i] += db_old[i]; + dweight_f[i] += dw_old[i]; + } + store128(dbias_shared + global_index + f128::size * o, dbias_f); + store128(dweight_shared + global_index + f128::size * o, dweight_f); + } + } + if(global_index < C) { + // cache in L2 as this is read by the next kernel, but bypass L1 to minimise thrashing + store128cg(dinp_bt + global_index, dinp128); + } + } + } + __syncthreads(); + // Each block writes its partial sum to global memory + // The last block to finish becomes responsible for summing up all the partial sums + // This is done by atomically incrementing a flag (cleared to 0 
before launching the kernel) + unsigned int* scratchFlag = (unsigned int*)(scratch); + // Increment scratch pointer by a full cacheline so that everything remains cacheline aligned + scratch += 32; + float* scratch_dbias = scratch; + float* scratch_dweight = scratch + C; + for(int i = threadIdx.x * f128::size; i < C; i += BLOCK_SIZE * f128::size) { + // Write to global memory in the same "shared memory banking friendly" order + store128(scratch_dbias + i + 2*C*blockIdx.x, load128(dbias_shared + i)); + store128(scratch_dweight + i + 2*C*blockIdx.x, load128(dweight_shared + i)); + } + __syncthreads(); + // that portion of shared memory is no longer used, so we can repurpose it for the scratch flag. + unsigned int *tmp_flag = (unsigned int*)(shared + 2*rounded_C); + if (threadIdx.x == 0) { + *tmp_flag = atomicInc(scratchFlag, gridDim.x); + } + __syncthreads(); + if (*tmp_flag == gridDim.x-1) { + // Reduction of the partial sums by the final block + // todo - there isn't enough parallelism even inside that single SM... + // ==> so could maybe split into another kernel with YET ANOTHER level of reduction?! + for(int i = threadIdx.x * f128::size; i < C; i += BLOCK_SIZE * f128::size) { + f128 dbias_accum = f128::zeros(); + f128 dweight_accum = f128::zeros(); + + for (int read_block_idx = 0; read_block_idx < gridDim.x; read_block_idx++) { + int offset = i + 2*C*read_block_idx; + f128 dbias128 = load128(scratch_dbias + offset); + f128 dweight128 = load128(scratch_dweight + offset); + for(int k = 0; k < f128::size; k++) { + dbias_accum[k] += dbias128[k]; + dweight_accum[k] += dweight128[k]; + } + } + store128(dbias_shared + i, dbias_accum); + store128(dweight_shared + i, dweight_accum); + } + __syncthreads(); + + // convert from float/FP32 to floatX/BF16 for the final write + // this is separate because it cannot use as many warps as the above (f128 vs x128) + // todo - if we split this code into another kernel, we could maybe do it at the same time? 
+ for (int c = warpId; c < iterations_C; c += warpsInBlock) { + int global_index = (warpThreadIdx * x128::size) + (c * C_per_iteration); + if (global_index >= C) { + break; + } + + x128 dbias128 = load128(dbias + global_index); + x128 dweight128 = load128(dweight + global_index); + for(int o = 0; o < x128::size / f128::size; ++o) { + f128 s_db = load128(dbias_shared + global_index + o * f128::size); + f128 s_dw = load128(dweight_shared + global_index + o * f128::size); + for(int i = 0; i < f128::size; ++i) { + int x = o * f128::size + i; + dbias128[x] = (floatX)(s_db[i] + (float)dbias128[x]); + dweight128[x] = (floatX)(s_dw[i] + (float)dweight128[x]); + } + } + store128(dbias + global_index, dbias128); + store128(dweight + global_index, dweight128); + } + } +} + +// ---------------------------------------------------------------------------- +// kernel launchers + +// similar to `fused_residual_forward5` +void layernorm_forward(floatX* out, float* mean, float* rstd, + floatX* inp, const floatX* weight, const floatX* bias, + int B, int T, int C, cudaStream_t stream) { + NVTX_RANGE_FN(); + const int block_size = 256; + int block_y = block_size / WARP_SIZE; + const int N = B * T; + const int grid_size = CEIL_DIV(N, block_y); + size_t smem = (2 + block_y) * C * sizeof(floatX); + + // in order to use more than 48 KiB of smem, need to call cudaFuncSetAttribute + // this may fail, in which case we fall back to the smem free implementation. 
+ cudaCheck(cudaGetLastError()); + auto status = cudaFuncSetAttribute(layernorm_forward_kernel6, cudaFuncAttributeMaxDynamicSharedMemorySize, smem); + cudaCheck(cudaGetLastError()); + if (status == cudaSuccess) { + layernorm_forward_kernel6<<<grid_size, dim3(WARP_SIZE, block_y), smem, stream>>>(out, mean, rstd, inp, weight, bias, N, C); + } else { + // fall back to the version without shared memory + const int grid_size_fb = CEIL_DIV(N * WARP_SIZE, block_size); + layernorm_forward_kernel3<<<grid_size_fb, block_size, 0, stream>>>(out, mean, rstd, inp, weight, bias, N, C); + } + cudaCheck(cudaGetLastError()); +} + +void residual_forward(floatX* out, const floatX* inp1, const floatX* inp2, int N, cudaStream_t stream) { + NVTX_RANGE_FN(); + const int block_size = 256; + assert(N % (block_size * x128::size) == 0); + const int grid_size = CEIL_DIV(N, block_size * x128::size); + residual_forward_kernel<<<grid_size, block_size, 0, stream>>>(out, inp1, inp2); + cudaCheck(cudaGetLastError()); +} + +void fused_residual_forward5(floatX* residual, floatX* normed, float* mean, float* rstd, + const floatX* inp1, const floatX* inp2, + const floatX* weight, const floatX* bias, + int N, int C, cudaStream_t stream) { + const int block_size = 256; + int block_y = block_size / WARP_SIZE; + const int grid_size = CEIL_DIV(N, block_y); + size_t smem = (2 + block_y) * C * sizeof(floatX); + + // in order to use more than 48 KiB of smem, need to call cudaFuncSetAttribute + // this may fail, in which case we fall back to the smem free implementation. 
+ cudaCheck(cudaGetLastError()); + auto status = cudaFuncSetAttribute(fused_residual_forward_kernel5, cudaFuncAttributeMaxDynamicSharedMemorySize, smem); + cudaCheck(cudaGetLastError()); + if(status == cudaSuccess) { + fused_residual_forward_kernel5<<<grid_size, dim3(WARP_SIZE, block_y), smem, stream>>>(residual, normed, + mean, rstd, inp1, inp2, + weight, bias, N, C); + } else { + residual_forward(residual, inp1, inp2, N*C, stream); + layernorm_forward(normed, mean, rstd, residual, weight, bias, N, 1, C, stream); + } + cudaCheck(cudaGetLastError()); +} + +void layernorm_backward(floatX* dinp, floatX* dweight, floatX* dbias, float* scratch, + const floatX* dout, const floatX* inp, const floatX* weight, const float* mean, const float* rstd, + int B, int T, int C, cudaStream_t stream) { + NVTX_RANGE_FN(); + const int block_size = 512; + const int blocks_per_sm = 2; // supported on every architecture and less cache thrashing than 3 + const int grid_size = blocks_per_sm * deviceProp.multiProcessorCount; + size_t rounded_C = CEIL_DIV(C, (32 * x128::size)) * (32 * x128::size); + size_t shared_mem_size = (2 * rounded_C + 2 * (block_size - 32) * f128::size) * sizeof(float); + + cudaCheck(cudaMemsetAsync(scratch, 0, 1 * sizeof(float), stream)); // only need to reset the flag to 0 + layernorm_backward_kernel10<<<grid_size, block_size, shared_mem_size, stream>>>(dinp, dweight, dbias, scratch, dout, inp, weight, mean, rstd, B, T, C); + cudaCheck(cudaGetLastError()); +} From 811018613fdf9579daaaa35fbe324c0f6cbaa109 Mon Sep 17 00:00:00 2001 From: Eamon Date: Sun, 7 Jun 2026 17:50:59 +0530 Subject: [PATCH 40/45] feat : add PyTorch-compatible Mersenne Twister random utilities --- CUDA/llmcpp/rand.h | 240 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 240 insertions(+) create mode 100644 CUDA/llmcpp/rand.h diff --git a/CUDA/llmcpp/rand.h b/CUDA/llmcpp/rand.h new file mode 100644 index 0000000..b66aa04 --- /dev/null +++ b/CUDA/llmcpp/rand.h @@ -0,0 +1,240 @@ +/* +Mersenne Twister implementation, numerically identical to torch. 
+ +Example usage: + + mt19937_state state; + manual_seed(&state, 137); + printf("%u\n", randint32(&state)); + printf("%u\n", randint32(&state)); + printf("%u\n", randint32(&state)); + printf("%u\n", randint32(&state)); + printf("%u\n", randint32(&state)); + + float t8[8]; + normal_(t8, 8, 0, 1, &state); + for (int i = 0; i < 8; i++) { + printf("%f\n", t8[i]); + } + printf("%u\n", randint32(&state)); + + float t16[16]; + normal_(t16, 16, 0, 1, &state); + for (int i = 0; i < 16; i++) { + printf("%f\n", t16[i]); + } + printf("%u\n", randint32(&state)); + +PyTorch reference (producing identical results): + + import torch + torch.manual_seed(137) + print(torch.randint(0, 0xFFFFFFFF, [1]).item()) + print(torch.randint(0, 0xFFFFFFFF, [1]).item()) + print(torch.randint(0, 0xFFFFFFFF, [1]).item()) + print(torch.randint(0, 0xFFFFFFFF, [1]).item()) + print(torch.randint(0, 0xFFFFFFFF, [1]).item()) + t = torch.zeros(8) + t.normal_() + for i in range(len(t)): + print(t[i].item()) + print(torch.randint(0, 0xFFFFFFFF, [1]).item()) + t = torch.zeros(16) + t.normal_() + for i in range(len(t)): + print(t[i].item()) + print(torch.randint(0, 0xFFFFFFFF, [1]).item()) + +Both output: + + 4053805790 + 2173880614 + 380293709 + 1237255315 + 2986595568 + 0.7947664260864258 + 1.4369317293167114 + -0.2292192131280899 + 0.47556325793266296 + -0.6334410905838013 + -0.5791953802108765 + -0.0925704762339592 + -0.8659197092056274 + 2186503452 + -1.2813878059387207 + -2.646395683288574 + -0.06569503247737885 + 0.2180829495191574 + -0.46536165475845337 + -0.33108410239219666 + 2.5485482215881348 + 0.10425379872322083 + 0.8460659980773926 + 0.9462448358535767 + -0.2913765013217926 + 0.34313806891441345 + -1.1186704635620117 + -0.18305328488349915 + -2.3153159618377686 + 0.3961987793445587 + 2756748748 +*/ + +#ifndef RAND_H +#define RAND_H + +#include <math.h> + +#define MERSENNE_STATE_M 397u +#define MERSENNE_STATE_N 624u + +#define LMASK 0x7ffffffful +#define UMASK 0x80000000ul + +// 
Copyright(c) Makoto Matsumoto and Takuji Nishimura + +// This implementation follows PyTorch so that we are numerically identical when running verification tests. + +typedef struct { + unsigned long long seed_; + int left_; + unsigned int next_; + unsigned int state_[MERSENNE_STATE_N]; + unsigned int MATRIX_A[2]; +} mt19937_state; + +void manual_seed(mt19937_state* state, unsigned int seed) { + state->MATRIX_A[0] = 0x0u; + state->MATRIX_A[1] = 0x9908b0df; + state->state_[0] = seed & 0xffffffff; + for (unsigned int j = 1; j < MERSENNE_STATE_N; j++) { + state->state_[j] = 1812433253 * (state->state_[j - 1] ^ (state->state_[j - 1] >> 30)) + j; + state->state_[j] &= 0xffffffff; + } + state->left_ = 1; + state->next_ = 0; +} + +void next_state(mt19937_state* state) { + state->left_ = MERSENNE_STATE_N; + state->next_ = 0; + unsigned int y, j; + for (j = 0; j < MERSENNE_STATE_N - MERSENNE_STATE_M; j++) { + y = (state->state_[j] & UMASK) | (state->state_[j + 1] & LMASK); + state->state_[j] = state->state_[j + MERSENNE_STATE_M] ^ (y >> 1) ^ state->MATRIX_A[y & 0x1]; + } + for (; j < MERSENNE_STATE_N - 1; j++) { + y = (state->state_[j] & UMASK) | (state->state_[j + 1] & LMASK); + state->state_[j] = state->state_[j + (MERSENNE_STATE_M - MERSENNE_STATE_N)] ^ (y >> 1) ^ state->MATRIX_A[y & 0x1]; + } + y = (state->state_[MERSENNE_STATE_N - 1] & UMASK) | (state->state_[0] & LMASK); + state->state_[MERSENNE_STATE_N - 1] = state->state_[MERSENNE_STATE_M - 1] ^ (y >> 1) ^ state->MATRIX_A[y & 0x1]; +} + +unsigned int randint32(mt19937_state* state) { + if (!state) return 0; + if (state->MATRIX_A[0] != 0 || state->MATRIX_A[1] != 0x9908b0df) manual_seed(state, 5489); // auto-initialize + if (--state->left_ <= 0) { + next_state(state); + } + unsigned int y = state->state_[state->next_++]; + y ^= y >> 11; + y ^= (y << 7) & 0x9d2c5680; + y ^= (y << 15) & 0xefc60000; + y ^= y >> 18; + return y; +} + +inline unsigned long long randint64(mt19937_state* state) { + return (((unsigned long 
long)(randint32(state)) << 32) | randint32(state)); +} + +inline float randfloat32(mt19937_state* state) { + return (randint32(state) & ((1ull << 24) - 1)) * (1.0f / (1ull << 24)); +} + +inline double randfloat64(mt19937_state* state) { + return (randint64(state) & ((1ull << 53) - 1)) * (1.0 / (1ull << 53)); +} + +void uniform_(float* data, unsigned int numel, float from, float to, mt19937_state* state) { + for (unsigned int t = 0; t < numel; t++) { + data[t] = randfloat32(state) * (to - from) + from; + } +} + +// Box-Muller transform: maps uniform random numbers to Gaussian distributed numbers +// https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform +void normal_fill_16(float* data, float mean, float std) { + #define EPSILONE 1e-12f + for (unsigned int t = 0; t < 8; t++) { + float u1 = 1 - data[t]; + float u2 = data[t + 8]; + float radius = sqrtf(-2 * logf(u1 + EPSILONE)); + float theta = (float) (2.0 * M_PI * u2); + data[t] = (radius * cosf(theta) * std + mean); + data[t + 8] = (radius * sinf(theta) * std + mean); + } +} + +void normal_fill(float* data, unsigned int numel, float mean, float std, mt19937_state* state) { + for (unsigned int t = 0; t < numel; t++) { + data[t] = randfloat32(state); + } + for (unsigned int i = 0; i < numel - 15; i += 16) { + normal_fill_16(data + i, mean, std); + } + if (numel % 16 != 0) { + // recompute the last 16 values + data = data + numel - 16; + for (unsigned int i = 0; i < 16; i++) { + data[i] = randfloat32(state); + } + normal_fill_16(data, mean, std); + } +} + +void normal_(float* data, unsigned int numel, float mean, float std, mt19937_state* state) { + #define EPSILONE 1e-12f + if (numel >= 16) { + normal_fill(data, numel, mean, std, state); + } + else { + double next_double_normal_sample = 0.0; // make compiler warning happy, won't be used + int has_next_double_normal_sample = 0; + for (unsigned int t = 0; t < numel; t++) { + if (has_next_double_normal_sample) { + data[t] = (float)(next_double_normal_sample * std + 
mean); + has_next_double_normal_sample = 0; + continue; + } + // for numel < 16 we draw a double (float64) + float u1 = (float) randfloat64(state); + float u2 = (float) randfloat64(state); + float radius = sqrtf(-2 * logf(1 - u2 + EPSILONE)); + float theta = (float) (2.0 * M_PI * u1); + next_double_normal_sample = radius * sinf(theta); + has_next_double_normal_sample = 1; + data[t] = (radius * cosf(theta) * std + mean); + } + } +} + +void init_identity_permutation(int *data, int numel) { + for (int i = 0; i < numel; i++) { + data[i] = i; + } +} + +void random_permutation(int* data, int numel, mt19937_state* state) { + for (int i = numel - 1; i > 0; i--) { + // pick an index j in [0, i] with equal probability + int j = randint32(state) % (i + 1); + // swap i <-> j + int tmp = data[i]; + data[i] = data[j]; + data[j] = tmp; + } +} + +#endif \ No newline at end of file From 54b727bcfaaac00e6476d5326adbd8c3b64df022 Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Sun, 7 Jun 2026 17:59:57 +0530 Subject: [PATCH 41/45] README : Enhance README with header and workflow badges Updated README to include a header and badges for release, package, and CI workflows. --- README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d8a6ca1..6a0931c 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,13 @@ # Quadtrix.cpp -

+

image +


+ [![Release](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml) [![Package](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml) [![CI](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml) -

+ A local large language model with a modular, multi-path execution architecture. Train, run inference, and serve a chat interface — all from a single repository, across bare-metal C++, PyTorch, and a React frontend. From 9b34e36e0b5f22d1bef168b18cad3185146f532f Mon Sep 17 00:00:00 2001 From: Eamon Date: Sun, 7 Jun 2026 18:06:49 +0530 Subject: [PATCH 42/45] utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows. --- CUDA/llmcpp/utils.h | 223 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 223 insertions(+) create mode 100644 CUDA/llmcpp/utils.h diff --git a/CUDA/llmcpp/utils.h b/CUDA/llmcpp/utils.h new file mode 100644 index 0000000..775534c --- /dev/null +++ b/CUDA/llmcpp/utils.h @@ -0,0 +1,223 @@ +/* + This file contains utilities shared between the different training scripts. + In particular, we define a series of macros xxxCheck that call the corresponding + C standard library function and check its return code. If an error was reported, + the program prints some debug information and exits. 
+*/ +#ifndef UTILS_H +#define UTILS_H + +#include <unistd.h> +#include <string.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/stat.h> +// implementation of dirent for Windows is in dev/unistd.h +#ifndef _WIN32 +#include <dirent.h> +#include <arpa/inet.h> +#endif + +// ---------------------------------------------------------------------------- +// fread convenience utils, with nice handling of error checking using macros +// simple replace fopen, fread, fclose, fseek +// with fopenCheck, freadCheck, fcloseCheck, fseekCheck + +extern inline FILE *fopen_check(const char *path, const char *mode, const char *file, int line) { + FILE *fp = fopen(path, mode); + if (fp == NULL) { + fprintf(stderr, "Error: Failed to open file '%s' at %s:%d\n", path, file, line); + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + fprintf(stderr, " Path: %s\n", path); + fprintf(stderr, " Mode: %s\n", mode); + fprintf(stderr, "---> HINT 1: dataset files/code have moved to dev/data recently (May 20, 2024). You may have to mv them from the legacy data/ dir to dev/data/(dataset), or re-run the data preprocessing script. Refer back to the main README\n"); + fprintf(stderr, "---> HINT 2: possibly try to re-run `python train_gpt2.py`\n"); + exit(EXIT_FAILURE); + } + return fp; +} + +#define fopenCheck(path, mode) fopen_check(path, mode, __FILE__, __LINE__) + +extern inline void fread_check(void *ptr, size_t size, size_t nmemb, FILE *stream, const char *file, int line) { + size_t result = fread(ptr, size, nmemb, stream); + if (result != nmemb) { + if (feof(stream)) { + fprintf(stderr, "Error: Unexpected end of file at %s:%d\n", file, line); + } else if (ferror(stream)) { + fprintf(stderr, "Error: File read error at %s:%d\n", file, line); + } else { + fprintf(stderr, "Error: Partial read at %s:%d. 
Expected %zu elements, read %zu\n", + file, line, nmemb, result); + } + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + fprintf(stderr, " Expected elements: %zu\n", nmemb); + fprintf(stderr, " Read elements: %zu\n", result); + exit(EXIT_FAILURE); + } +} + +#define freadCheck(ptr, size, nmemb, stream) fread_check(ptr, size, nmemb, stream, __FILE__, __LINE__) + +extern inline void fclose_check(FILE *fp, const char *file, int line) { + if (fclose(fp) != 0) { + fprintf(stderr, "Error: Failed to close file at %s:%d\n", file, line); + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + exit(EXIT_FAILURE); + } +} + +#define fcloseCheck(fp) fclose_check(fp, __FILE__, __LINE__) + +extern inline void sclose_check(int sockfd, const char *file, int line) { + if (close(sockfd) != 0) { + fprintf(stderr, "Error: Failed to close socket at %s:%d\n", file, line); + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + exit(EXIT_FAILURE); + } +} + +#define scloseCheck(sockfd) sclose_check(sockfd, __FILE__, __LINE__) + +#ifdef _WIN32 +extern inline void closesocket_check(int sockfd, const char *file, int line) { + if (closesocket(sockfd) != 0) { + fprintf(stderr, "Error: Failed to close socket at %s:%d\n", file, line); + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + exit(EXIT_FAILURE); + } +} + +#define closesocketCheck(sockfd) closesocket_check(sockfd, __FILE__, __LINE__) +#endif + +extern inline void fseek_check(FILE *fp, long off, int whence, const char *file, int line) { + if (fseek(fp, off, whence) != 0) { + fprintf(stderr, "Error: Failed to seek in file at %s:%d\n", file, line); + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " Offset: %ld\n", off); + fprintf(stderr, " Whence: %d\n", whence); + 
fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + exit(EXIT_FAILURE); + } +} + +#define fseekCheck(fp, off, whence) fseek_check(fp, off, whence, __FILE__, __LINE__) + +extern inline void fwrite_check(void *ptr, size_t size, size_t nmemb, FILE *stream, const char *file, int line) { + size_t result = fwrite(ptr, size, nmemb, stream); + if (result != nmemb) { + if (feof(stream)) { + fprintf(stderr, "Error: Unexpected end of file at %s:%d\n", file, line); + } else if (ferror(stream)) { + fprintf(stderr, "Error: File write error at %s:%d\n", file, line); + } else { + fprintf(stderr, "Error: Partial write at %s:%d. Expected %zu elements, wrote %zu\n", + file, line, nmemb, result); + } + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + fprintf(stderr, " Expected elements: %zu\n", nmemb); + fprintf(stderr, " Written elements: %zu\n", result); + exit(EXIT_FAILURE); + } +} + +#define fwriteCheck(ptr, size, nmemb, stream) fwrite_check(ptr, size, nmemb, stream, __FILE__, __LINE__) + +// ---------------------------------------------------------------------------- +// malloc error-handling wrapper util + +extern inline void *malloc_check(size_t size, const char *file, int line) { + void *ptr = malloc(size); + if (ptr == NULL) { + fprintf(stderr, "Error: Memory allocation failed at %s:%d\n", file, line); + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + fprintf(stderr, " Size: %zu bytes\n", size); + exit(EXIT_FAILURE); + } + return ptr; +} + +#define mallocCheck(size) malloc_check(size, __FILE__, __LINE__) + + +// ---------------------------------------------------------------------------- +// check that all tokens are within range +extern inline void token_check(const int* tokens, int token_count, int vocab_size, const char *file, int line) { + for(int i = 0; i < token_count; i++) { + if(!(0 <= tokens[i] && 
tokens[i] < vocab_size)) { + fprintf(stderr, "Error: Token out of vocabulary at %s:%d\n", file, line); + fprintf(stderr, "Error details:\n"); + fprintf(stderr, " File: %s\n", file); + fprintf(stderr, " Line: %d\n", line); + fprintf(stderr, " Token: %d\n", tokens[i]); + fprintf(stderr, " Position: %d\n", i); + fprintf(stderr, " Vocab: %d\n", vocab_size); + exit(EXIT_FAILURE); + } + } +} +#define tokenCheck(tokens, count, vocab) token_check(tokens, count, vocab, __FILE__, __LINE__) + +// ---------------------------------------------------------------------------- +// I/O ops + +extern inline void create_dir_if_not_exists(const char *dir) { + if (dir == NULL) { return; } + struct stat st = {0}; + if (stat(dir, &st) == -1) { + if (mkdir(dir, 0700) == -1) { + printf("ERROR: could not create directory: %s\n", dir); + exit(EXIT_FAILURE); + } + printf("created directory: %s\n", dir); + } +} + +extern inline int find_max_step(const char* output_log_dir) { + // find the DONE file in the log dir with highest step count + if (output_log_dir == NULL) { return -1; } + DIR* dir; + struct dirent* entry; + int max_step = -1; + dir = opendir(output_log_dir); + if (dir == NULL) { return -1; } + while ((entry = readdir(dir)) != NULL) { + if (strncmp(entry->d_name, "DONE_", 5) == 0) { + int step = atoi(entry->d_name + 5); + if (step > max_step) { + max_step = step; + } + } + } + closedir(dir); + return max_step; +} + +extern inline int ends_with_bin(const char* str) { + // checks if str ends with ".bin". could be generalized in the future. 
+ if (str == NULL) { return 0; } + size_t len = strlen(str); + const char* suffix = ".bin"; + size_t suffix_len = strlen(suffix); + if (len < suffix_len) { return 0; } + int suffix_matches = strncmp(str + len - suffix_len, suffix, suffix_len) == 0; + return suffix_matches; +} + +#endif \ No newline at end of file From a89ab1c819c08cc2e78217b3dc0ae0a70e990591 Mon Sep 17 00:00:00 2001 From: Eamon Date: Sun, 7 Jun 2026 18:08:44 +0530 Subject: [PATCH 43/45] mfu: add GPU specifications database and utilities for MFU estimation --- CUDA/llmcpp/mfu.h | 244 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 244 insertions(+) create mode 100644 CUDA/llmcpp/mfu.h diff --git a/CUDA/llmcpp/mfu.h b/CUDA/llmcpp/mfu.h new file mode 100644 index 0000000..1c40b7b --- /dev/null +++ b/CUDA/llmcpp/mfu.h @@ -0,0 +1,244 @@ +#ifndef MFU_H +#define MFU_H + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#if __has_include(<nvml.h>) +#define USE_NVML 1 +#include <nvml.h> +#else +#define USE_NVML 0 +#endif + +// tied to enum PrecisionMode, in a future refactor make them the same +#define MFUH_PRECISION_FP32 0 +#define MFUH_PRECISION_FP16 1 +#define MFUH_PRECISION_BF16 2 + +#if USE_NVML +inline void nvml_check(nvmlReturn_t status, const char *file, int line) { + if (status != NVML_SUCCESS) { + printf("[NVML ERROR] at file %s:%d:\n%s\n", file, line, nvmlErrorString(status)); + exit(EXIT_FAILURE); + } +}; +#define nvmlCheck(err) (nvml_check(err, __FILE__, __LINE__)) +#endif + + +typedef struct { + float TF_32; // tensor-core performance 32 bit + float BF_16_32; // bf16 with 32 bit accumulate + float FP_16_32; // fp16 with 32 bit accumulate + float FP_16_16; // fp16 with 16 bit accumulate + float FP_8_32; // and so on + float FP_8_16; + float CLOCK; // clock frequency from the spec sheet + float CORES; // #TCs from the spec sheet +} PerfData; + +// basic default data from the nvidia whitepapers +static const PerfData VOLTA = {125.0f, -1.f, 125.f, -1.f, -1.f, -1.f, 1530.f, 640.f}; +static const PerfData 
AMPERE_DATACENTER = {156.f, 312.f, 312.f, 312.f, -1.f, -1.f, 1410.f, 432.f}; +static const PerfData AMPERE_CONSUMER = {40.f, 80.f, 80.f, 160.f, -1.f, -1.f, 1860.f, 336.f}; +static const PerfData HOPPER = {378.f, 756.f, 756.f, 756.f, 1513.f, 1513.f, 1620.f, 456.f}; +static const PerfData ADA = {82.6f, 165.2f, 165.2f, 330.3f, 330.3f, 660.6f, 2520.f, 512.f}; + +typedef struct { + const char* name; + const PerfData* perf_data; + float new_cores; + float new_mhz; +} GPUEntry; + +// the overrides for each specific GPU +static GPUEntry gpu_db[] = { + {"Tesla V100-SXM2-16GB", &VOLTA, 640, 1530}, + {"Tesla V100-PCIE-32GB", &VOLTA, 640, 1530}, + {"NVIDIA A100-PCIE-40GB", &AMPERE_DATACENTER, 432, 1410}, + {"NVIDIA A100-PCIE-80GB", &AMPERE_DATACENTER, 432, 1410}, + {"NVIDIA A100-SXM4-40GB", &AMPERE_DATACENTER, 432, 1410}, + {"NVIDIA A100-SXM4-80GB", &AMPERE_DATACENTER, 432, 1410}, + {"NVIDIA RTX A2000", &AMPERE_CONSUMER, 104, 1200}, + {"NVIDIA RTX A4000", &AMPERE_CONSUMER, 192, 1560}, + {"NVIDIA RTX A4500", &AMPERE_CONSUMER, 224, 1650}, + {"NVIDIA RTX A5000", &AMPERE_CONSUMER, 256, 1695}, + {"NVIDIA RTX A5500", &AMPERE_CONSUMER, 320, 1770}, + {"NVIDIA RTX A6000", &AMPERE_CONSUMER, 336, 1800}, + {"NVIDIA GeForce RTX 3090 Ti", &AMPERE_CONSUMER, 336, 1860}, + {"NVIDIA GeForce RTX 3090", &AMPERE_CONSUMER, 328, 1695}, + {"NVIDIA GeForce RTX 3080 Ti", &AMPERE_CONSUMER, 320, 1665}, + {"NVIDIA GeForce RTX 3080", &AMPERE_CONSUMER, 272, 1710}, + {"NVIDIA GeForce RTX 3070 Ti", &AMPERE_CONSUMER, 192, 1770}, + {"NVIDIA GeForce RTX 3070", &AMPERE_CONSUMER, 184, 1725}, + {"NVIDIA GeForce RTX 3060 Ti", &AMPERE_CONSUMER, 152, 1665}, + {"NVIDIA GeForce RTX 3060", &AMPERE_CONSUMER, 112, 1777}, + {"NVIDIA RTX A2000 ADA", &ADA, 88, 2130}, + {"NVIDIA RTX A4000 ADA", &ADA, 192, 2175}, + {"NVIDIA RTX A4500 ADA", &ADA, 224, 2580}, + {"NVIDIA RTX A5000 ADA", &ADA, 400, 2550}, + {"NVIDIA RTX A5880 ADA", &ADA, 440, 2460}, + {"NVIDIA RTX A6000 ADA", &ADA, 568, 2505}, + {"NVIDIA GeForce RTX 4090", &ADA, 512, 2520}, + {"NVIDIA GeForce RTX 4080 SUPER", &ADA, 
320, 2550}, + {"NVIDIA GeForce RTX 4080", &ADA, 304, 2505}, + {"NVIDIA GeForce RTX 4070 Ti SUPER", &ADA, 264, 2610}, + {"NVIDIA GeForce RTX 4070 Ti", &ADA, 240, 2610}, + {"NVIDIA GeForce RTX 4070 SUPER", &ADA, 224, 2475}, + {"NVIDIA GeForce RTX 4070", &ADA, 184, 2475}, + {"NVIDIA GeForce RTX 4070", &ADA, 184, 2475}, + {"NVIDIA GeForce RTX 4060 Ti", &ADA, 136, 2535}, + {"NVIDIA GeForce RTX 4060", &ADA, 96, 2460}, + {"NVIDIA H100 PCIe", &HOPPER, 456, 1620}, + {"NVIDIA H100 80GB HBM3", &HOPPER, 528, 1830}, // HBM3 = SXM5 +}; + +float get_flops_promised(const char* device, int precision_mode) { + /* + This function is used to estimate the Model Flops Utilization (MFU). + Basically we have to figure out how many flops the GPU can do per second. + Note that this is not a simple endeavor and may well go wrong! The details are tricky. + The returned value is in units of 1e12. + + For the non-top models, actual performance numbers aren't that easy to find, e.g., + at https://www.techpowerup.com/gpu-specs/rtx-a4000.c3756 the "Theoretical Performance" + figure seems to be without tensor cores. + + So instead we rely on all these cards using the same type of tensor cores, just in different + numbers and at different frequencies. Then we only need to look up these two easily accessible + numbers for all the other GPUs. + Linear scaling seems to work; comparing spec sheet and calculation: + 4080: 304 TCs, 2505 MHz; 97.5 TFlops = 165.2/512*304 /2520 * 2505 + + Original numbers for the top GPUs are from:
+ https://resources.nvidia.com/en-us-tensor-core + https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf + */ + + // validate the precision mode as one of the three possible values + if (!(precision_mode == MFUH_PRECISION_FP32 || precision_mode == MFUH_PRECISION_FP16 || precision_mode == MFUH_PRECISION_BF16)) { + fprintf(stderr, "Invalid precision mode: %d\n", precision_mode); + return -1.0f; + } + + // do a linear search until you find our GPU, then calculate the flops promised + int num_gpu_entries = sizeof(gpu_db) / sizeof(gpu_db[0]); + for (int i = 0; i < num_gpu_entries; i++) { + if (strcmp(gpu_db[i].name, device) == 0) { + const PerfData* perf_data = gpu_db[i].perf_data; + + // look up the default flops value for the given precision mode + float value = -1.0f; + if (precision_mode == MFUH_PRECISION_BF16) { value = perf_data->BF_16_32; } + if (precision_mode == MFUH_PRECISION_FP32) { value = perf_data->TF_32; } + if (precision_mode == MFUH_PRECISION_FP16) { value = perf_data->FP_16_32; } + + // we'd get here if we're e.g. trying to use BF16 on Volta GPU or something... 
+ if (value < 0.0f) { + fprintf(stderr, "No data for GPU %s and precision mode %d\n", device, precision_mode); + return -1.0f; + } + + // adjust flops based on the specific core count and clock frequency of this GPU + float new_cores = gpu_db[i].new_cores; + float new_mhz = gpu_db[i].new_mhz; + float adjusted = value * (new_cores / perf_data->CORES) * (new_mhz / perf_data->CLOCK); + return adjusted; + } + } + + return -1.0f; // ¯\_(ツ)_/¯ +} + +struct GPUUtilInfo { + unsigned int clock; + unsigned int max_clock; + unsigned int power; + unsigned int power_limit; + unsigned int fan; + unsigned int temperature; + unsigned int temp_slowdown; + + float gpu_utilization; + float mem_utilization; + const char* throttle_reason; +}; + +// lazily initialize nvml and generate a handle to the GPU +#if USE_NVML +nvmlDevice_t nvml_get_device() { + static bool needs_init = true; + static nvmlDevice_t device; + if(needs_init) { + needs_init = false; + nvmlCheck(nvmlInit()); + nvmlCheck(nvmlDeviceGetHandleByIndex_v2(0, &device)); + } + return device; +} + +// convert throttle reason bitfield into a text reason. 
+// this is a lossy conversion; we just want to give some idea of what is happening +const char* get_throttle_reason(unsigned long long bits) { + if(bits & (nvmlClocksThrottleReasonSwPowerCap | nvmlClocksThrottleReasonHwPowerBrakeSlowdown)) { + return "power cap"; + } else if (bits & (nvmlClocksThrottleReasonSwThermalSlowdown | nvmlClocksThrottleReasonHwThermalSlowdown)) { + return "thermal cap"; + } else if (bits & (nvmlClocksThrottleReasonAll)) { + return "other cap"; + } else { + return "no cap"; + } +} + +// gather data for a GPUUtilInfo object +GPUUtilInfo get_gpu_utilization_info() { + GPUUtilInfo info; + nvmlDevice_t device = nvml_get_device(); + // query different infos directly + nvmlCheck(nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &info.clock)); + nvmlCheck(nvmlDeviceGetMaxClockInfo(device, NVML_CLOCK_SM, &info.max_clock)); + nvmlCheck(nvmlDeviceGetPowerManagementLimit(device, &info.power_limit)); + nvmlCheck(nvmlDeviceGetPowerUsage(device, &info.power)); + nvmlCheck(nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &info.temperature)); + nvmlCheck(nvmlDeviceGetTemperatureThreshold(device, NVML_TEMPERATURE_THRESHOLD_SLOWDOWN, &info.temp_slowdown)); + unsigned long long throttle; + nvmlCheck(nvmlDeviceGetCurrentClocksThrottleReasons(device, &throttle)); + info.throttle_reason = get_throttle_reason(throttle); + nvmlCheck(nvmlDeviceGetFanSpeed(device, &info.fan)); + + // for "utilization", we look at recorded samples. In principle, we could query the driver for how many samples + // to request, but then we'd need to dynamically allocate sufficient space. 
Let's just hard-code a limit of 128, + // and have no memory management required + constexpr const int BUFFER_LIMIT = 128; + nvmlSample_t buffer[BUFFER_LIMIT]; + nvmlValueType_t v_type; + unsigned int sample_count = BUFFER_LIMIT; + nvmlCheck(nvmlDeviceGetSamples(device, NVML_GPU_UTILIZATION_SAMPLES, 0, &v_type, &sample_count, buffer)); + float gpu_utilization = 0.f; + for(unsigned i = 0; i < sample_count; ++i) { + gpu_utilization += (float)buffer[i].sampleValue.uiVal; + } + gpu_utilization /= (float)sample_count; + + // sample count may have been modified by the query above; reset back to buffer size + sample_count = BUFFER_LIMIT; + nvmlCheck(nvmlDeviceGetSamples(device, NVML_MEMORY_UTILIZATION_SAMPLES, 0, &v_type, &sample_count, buffer)); + float mem_utilization = 0.f; + for(unsigned i = 0; i < sample_count; ++i) { + mem_utilization += (float)buffer[i].sampleValue.uiVal; + } + mem_utilization /= (float)sample_count; + + info.gpu_utilization = gpu_utilization; + info.mem_utilization = mem_utilization; + return info; +} +#else +GPUUtilInfo get_gpu_utilization_info() { + fprintf(stderr, "Error: Compiled without nvml support. Cannot perform additional GPU state tracking."); + exit(EXIT_FAILURE); +} +#endif +#endif // MFU_H From fd41e1b0e916eec812068f9a422ac07cf70dd6b7 Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Mon, 8 Jun 2026 01:06:55 +0530 Subject: [PATCH 44/45] Modify project title in README.md Changed the project title to include 'llm.cpp' for clarity. --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 6a0931c..b544f73 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# Quadtrix.cpp +# Quadtrix.cpp (llm.cpp)

image From d5cadb603f562f7db4c9a0fe5560a7884a256d63 Mon Sep 17 00:00:00 2001 From: Eamon Sippy Date: Mon, 8 Jun 2026 13:08:20 +0530 Subject: [PATCH 45/45] Update README to remove image and clean up content Removed image from README and adjusted formatting. --- README.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/README.md b/README.md index 888604e..b441cf4 100644 --- a/README.md +++ b/README.md @@ -8,11 +8,6 @@ [![Release](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/release.yml) [![Package](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/docker-publish.yml) [![CI](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml/badge.svg)](https://github.com/Eamon2009/Quadtrix.cpp/actions/workflows/ci.yml) - -

- image -

- A local large language model with a modular, multi-path execution architecture. Train, run inference, and serve a chat interface — all from a single repository, across bare-metal C++, PyTorch, and a React frontend. > Full technical reference: [docs](https://eamon2009.github.io/LLMs/)
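
Note on `get_flops_promised` in `CUDA/llmcpp/mfu.h` (patch 43/45): the per-GPU adjustment is a plain linear scaling of a reference throughput by tensor-core count and clock frequency. The sketch below isolates that arithmetic and reproduces the RTX 4080 check quoted in the function's comment. `scale_flops` and `rtx4080_bf16_tflops` are hypothetical names introduced only for illustration and are not part of the patch; the numeric inputs are taken from the `ADA` row and the `gpu_db` entry for the RTX 4080.

```c
#include <assert.h>

/* Hypothetical helper, not part of mfu.h: scales a reference peak-TFLOPS
   figure linearly by tensor-core count and clock frequency, mirroring the
   "adjusted" computation in get_flops_promised. */
static float scale_flops(float ref_tflops, float ref_cores, float ref_mhz,
                         float cores, float mhz) {
    /* reference values come from a PerfData row and must be positive */
    assert(ref_cores > 0.0f && ref_mhz > 0.0f);
    return ref_tflops * (cores / ref_cores) * (mhz / ref_mhz);
}

/* Worked example from the comment in get_flops_promised: the ADA reference
   row (165.2 TFLOPS bf16, 512 tensor cores, 2520 MHz) scaled to the
   RTX 4080 entry (304 tensor cores, 2505 MHz). */
static float rtx4080_bf16_tflops(void) {
    return scale_flops(165.2f, 512.0f, 2520.0f, 304.0f, 2505.0f);
}
```

With these inputs the result is roughly 97.5 TFLOPS, which agrees with the spec-sheet comparison in the source comment. The scaling assumes every card in a family uses the same tensor-core type, so it is an estimate rather than a measured peak.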