Optimize CI workflow and Docker configurations with refactoring by Eamon2009 · Pull Request #72 · Eamon2009/Quadtrix.cpp

Eamon2009 · 2026-06-08T07:27:46Z

No description provided.

…ions

## Summary Structural Changes and Technical Rationale ### 1. Fixed Broken Multi-Stage Conditional Branching * **The Issue:** The original Dockerfile attempted to resolve the final stage dynamically using `FROM deps-${CUDA:-1}` or `FROM deps-${CUDA}`. Standard Docker builders evaluate `FROM` lines at the very beginning of parsing, before stage targets are explicitly created or named dynamically via evaluated shell variables. Furthermore, the conditional default fallback syntax (`:-1`) does not resolve cleanly within base target selectors. * **The Solution:** Standardized stage resolution by implementing explicitly mapped, strict literal intermediate target stages (`deps-cpu` and `deps-cuda`). A unified `final` target is then cleanly resolved using an un-nested `ARG TARGET_ENV` selector. ### 2. Multi-Stage Build Layer Separations * **The Issue:** The previous implementation left heavy development headers, tooling binaries (`build-essential`, `git`), and cached installer metadata embedded directly into the final execution layers. * **The Solution:** Extracted all shared, system-level compilation utilities into isolated intermediate compiler stages. The final target images inherit exclusively from minimal runtime bases, stripping unnecessary build-essential tooling away from the final deliverable. ### 3. PyTorch Dependency Layer Pinning and Caching Controls * **The Issue:** Bundling `requirements.txt` alongside massive framework downloads (`torch`, `torchvision`) within a single dense execution command limits layer caching capability. Modifying a minor backend requirement would discard the entire layer, forcing a complete download of PyTorch's multi-gigabyte files every build cycle. * **The Solution:** Isolated the massive PyTorch setup operations into distinct cache-pinned execution sequences. This separates transient python packages from heavy ML frameworks, minimizing build time and pipeline failures. ### 4. Consolidated Python Binary and Symlinking Alignment * **The Issue:** Operating system discrepancies across `python:3.11-slim` (which uses a default global `python` and `pip` command) and `ubuntu22.04` (which requires explicit `python3.11` version tags and manual `python3-pip` mappings) caused path collisions, missing dependencies, and system-package management errors. * **The Solution:** Uniformly aligned execution runtimes. The CUDA target environment now establishes automated symlinks mapping local references (`python` -> `python3.11` and `pip` -> `pip3`) ensuring standard execution profiles function uniformly across environments. ### 5. Automated Build Layer Cleanup * **The Issue:** Minor storage leaks accumulated via system package manager footprints (`/var/lib/apt/lists/*`) and internal user pip caching structures (`~/.cache/pip`). * **The Solution:** Implemented zero-cache directives (`--no-cache-dir`) on all pip installation pipelines and appended file cleanup hooks natively onto system installation scripts.

…strategy to reduce duplication

…keyframes

…implementation

…nent

Added badges for release, package, and CI workflows.

…duces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

Updated README to include a header and badges for release, package, and CI workflows.

…fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

Changed the project title to include 'llm.cpp' for clarity.

Eamon2009 · 2026-06-08T07:31:46Z

/run-checks

github-actions · 2026-06-08T07:32:50Z

✅ All checks passed!

Removed image from README and adjusted formatting.

* exp(#58) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * refactor(ui): complete inline-style migration and update auto-scroll implementation * refactor(ui): complete inline-style migration for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. * Update README.md with new banner for qudtrix.cpp --------- * ci: add manual PR checks workflow with slash command support * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces * feat(cuda): add LayerNorm forward and backward kernel declarations * refactor(ci): organize workflow into push-triggered QA and manual docker builds Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options. * Fix formatting and update CI workflow steps * Enhance CI with macOS binary build and release Added macOS binary build and release steps to CI workflow. * feat(docker): add Dockerfile for frontend application * feat(docker): add Dockerfile for frontend application * refactor(ci): remove release job from GitHub actions * ci: add unified release and docker build workflow * ci: add unified release and docker build workflow * Refactor macOS build workflow for arm64 architecture * Update release workflow to remove macOS x64 build Removed dependency on build-macos-x64 for the release job. * perf: update execution time benchmarks in csv * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * Remove frontend job from Docker Images workflow * Update release workflow to remove s390x and add notes Removed s390x build configurations and added a step to write detailed release notes. * feat: add local orchestration script for frontend and backend servers Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend. - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants. - Verifies existence of the local PyTorch `.pt` model checkpoint before starting. - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite. - Handles cross-origin setups (CORS) linking ports interactively. - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals. - Automatically launches the frontend application in the system web browser. * chore(deps): bump actions/github-script from 7 to 9 (#71) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](actions/github-script@v7...v9) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... * feat(cuda): introduce log_message utility and LogLevel enum * feat(cuda): add cuBLAS handle wrapper and matmul operations * feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions * refactor: untie embedding and lm_head weights and to quadtrix * feat(cuda): add NCCL communicator wrapper and all-reduce primitives * Update README.md with workflow badges Added badges for release, package, and CI workflows. * kernels: add AdamW optimization kernel with stochastic rounding Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling. * cudnn: implement cached SDPA forward graph using cuDNN frontend * feat(cuda): implement Packed128 memory vectorization utilities * feat: add distributed sharded DataLoader for binary token files * feat(multi-gpu): add foundational utilities for ZeRO sharding * feat(utils): add I/O and memory error-checking wrappers * feat : add PyTorch-compatible Mersenne Twister random utilities * README : Enhance README with header and workflow badges Updated README to include a header and badges for release, package, and CI workflows. * utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows. * mfu: add GPU specifications database and utilities for MFU estimation * Modify project title in README.md Changed the project title to include 'llm.cpp' for clarity. * Update README to remove image and clean up content Removed image from README and adjusted formatting. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Eamon Sippy <eamon112009@gmail.com> Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. * CUDA header declarations for (LayerNorm) forward and backward (#66) * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces --------- Co-authored-by: Max <eamon5174@gmail.com> * Add CUDA attention kernels, gradient norms, and CI improvements (#69) * exp(#58) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * refactor(ui): complete inline-style migration and update auto-scroll implementation * refactor(ui): complete inline-style migration for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. * Update README.md with new banner for qudtrix.cpp --------- Co-authored-by: Max <eamon5174@gmail.com> * ci: add manual PR checks workflow with slash command support * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- Co-authored-by: Max <eamon5174@gmail.com> * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces * feat(cuda): add LayerNorm forward and backward kernel declarations * refactor(ci): organize workflow into push-triggered QA and manual docker builds Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options. * Fix formatting and update CI workflow steps * Enhance CI with macOS binary build and release Added macOS binary build and release steps to CI workflow. * feat(docker): add Dockerfile for frontend application * feat(docker): add Dockerfile for frontend application * refactor(ci): remove release job from GitHub actions * ci: add unified release and docker build workflow * ci: add unified release and docker build workflow * Refactor macOS build workflow for arm64 architecture * Update release workflow to remove macOS x64 build Removed dependency on build-macos-x64 for the release job. * perf: update execution time benchmarks in csv Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-Authored-By: Eamon Sippy <eamon112009@gmail.com> * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * Remove frontend job from Docker Images workflow * Update release workflow to remove s390x and add notes Removed s390x build configurations and added a step to write detailed release notes. * feat: add local orchestration script for frontend and backend servers Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend. - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants. - Verifies existence of the local PyTorch `.pt` model checkpoint before starting. - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite. - Handles cross-origin setups (CORS) linking ports interactively. - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals. - Automatically launches the frontend application in the system web browser. * chore(deps): bump actions/github-script from 7 to 9 (#71) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](actions/github-script@v7...v9) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * feat(cuda): introduce log_message utility and LogLevel enum * feat(cuda): add cuBLAS handle wrapper and matmul operations * feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions * refactor: untie embedding and lm_head weights and to quadtrix * feat(cuda): add NCCL communicator wrapper and all-reduce primitives * Update README.md with workflow badges Added badges for release, package, and CI workflows. * kernels: add AdamW optimization kernel with stochastic rounding Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling. * cudnn: implement cached SDPA forward graph using cuDNN frontend * feat(cuda): implement Packed128 memory vectorization utilities * feat: add distributed sharded DataLoader for binary token files * feat(multi-gpu): add foundational utilities for ZeRO sharding * feat(utils): add I/O and memory error-checking wrappers * feat : add PyTorch-compatible Mersenne Twister random utilities * README : Enhance README with header and workflow badges Updated README to include a header and badges for release, package, and CI workflows. * utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows. * mfu: add GPU specifications database and utilities for MFU estimation * Modify project title in README.md Changed the project title to include 'llm.cpp' for clarity. * Update README to remove image and clean up content Removed image from README and adjusted formatting. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Max <eamon5174@gmail.com> Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Eamon <eamon112009@gmail.com> Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…#76) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * refactor(ui): complete inline-style migration and update auto-scroll implementation * refactor(ui): complete inline-style migration for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. * Update README.md with new banner for qudtrix.cpp * docs:report [run_20260530_165216](~791 tok/s) (#60) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs: report [run_20260530_165216] (~791 tok/s) (#62) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 --------- * chore: clang-format configuration file based on LLVM (#63) * ci: add manual PR checks workflow with slash command support * ci: add manual PR checks workflow with slash command support * ci: add manual PR checks workflow with slash command support * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces * feat(cuda): add LayerNorm forward and backward kernel declarations * refactor(ci): organize workflow into push-triggered QA and manual docker builds Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options. * Fix formatting and update CI workflow steps * Enhance CI with macOS binary build and release Added macOS binary build and release steps to CI workflow. * feat(docker): add Dockerfile for frontend application * feat(docker): add Dockerfile for frontend application * refactor(ci): remove release job from GitHub actions * ci: add unified release and docker build workflow * ci: add unified release and docker build workflow * Refactor macOS build workflow for arm64 architecture * Update release workflow to remove macOS x64 build Removed dependency on build-macos-x64 for the release job. * perf: update execution time benchmarks in csv * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * Remove frontend job from Docker Images workflow * Update release workflow to remove s390x and add notes Removed s390x build configurations and added a step to write detailed release notes. * feat: add local orchestration script for frontend and backend servers Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend. - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants. - Verifies existence of the local PyTorch `.pt` model checkpoint before starting. - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite. - Handles cross-origin setups (CORS) linking ports interactively. - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals. - Automatically launches the frontend application in the system web browser. * chore(deps): bump actions/github-script from 7 to 9 (#71) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](actions/github-script@v7...v9) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... * feat(cuda): introduce log_message utility and LogLevel enum * feat(cuda): add cuBLAS handle wrapper and matmul operations * feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions * refactor: untie embedding and lm_head weights and to quadtrix * feat(cuda): add NCCL communicator wrapper and all-reduce primitives * Update README.md with workflow badges Added badges for release, package, and CI workflows. * kernels: add AdamW optimization kernel with stochastic rounding Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling. * cudnn: implement cached SDPA forward graph using cuDNN frontend * feat(cuda): implement Packed128 memory vectorization utilities * feat: add distributed sharded DataLoader for binary token files * feat(multi-gpu): add foundational utilities for ZeRO sharding * feat(utils): add I/O and memory error-checking wrappers * feat : add PyTorch-compatible Mersenne Twister random utilities * README : Enhance README with header and workflow badges Updated README to include a header and badges for release, package, and CI workflows. * utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows. * mfu: add GPU specifications database and utilities for MFU estimation * Modify project title in README.md Changed the project title to include 'llm.cpp' for clarity. * Update README to remove image and clean up content Removed image from README and adjusted formatting. * Refactor core architecture and optimize CUDA features (#75) * exp(#58) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * refactor(ui): complete inline-style migration and update auto-scroll implementation * refactor(ui): complete inline-style migration for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. * Update README.md with new banner for qudtrix.cpp --------- * ci: add manual PR checks workflow with slash command support * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces * feat(cuda): add LayerNorm forward and backward kernel declarations * refactor(ci): organize workflow into push-triggered QA and manual docker builds Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options. * Fix formatting and update CI workflow steps * Enhance CI with macOS binary build and release Added macOS binary build and release steps to CI workflow. * feat(docker): add Dockerfile for frontend application * feat(docker): add Dockerfile for frontend application * refactor(ci): remove release job from GitHub actions * ci: add unified release and docker build workflow * ci: add unified release and docker build workflow * Refactor macOS build workflow for arm64 architecture * Update release workflow to remove macOS x64 build Removed dependency on build-macos-x64 for the release job. * perf: update execution time benchmarks in csv * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * Remove frontend job from Docker Images workflow * Update release workflow to remove s390x and add notes Removed s390x build configurations and added a step to write detailed release notes. * feat: add local orchestration script for frontend and backend servers Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend. - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants. - Verifies existence of the local PyTorch `.pt` model checkpoint before starting. - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite. - Handles cross-origin setups (CORS) linking ports interactively. - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals. - Automatically launches the frontend application in the system web browser. * chore(deps): bump actions/github-script from 7 to 9 (#71) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](actions/github-script@v7...v9) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... * feat(cuda): introduce log_message utility and LogLevel enum * feat(cuda): add cuBLAS handle wrapper and matmul operations * feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions * refactor: untie embedding and lm_head weights and to quadtrix * feat(cuda): add NCCL communicator wrapper and all-reduce primitives * Update README.md with workflow badges Added badges for release, package, and CI workflows. * kernels: add AdamW optimization kernel with stochastic rounding Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling. * cudnn: implement cached SDPA forward graph using cuDNN frontend * feat(cuda): implement Packed128 memory vectorization utilities * feat: add distributed sharded DataLoader for binary token files * feat(multi-gpu): add foundational utilities for ZeRO sharding * feat(utils): add I/O and memory error-checking wrappers * feat : add PyTorch-compatible Mersenne Twister random utilities * README : Enhance README with header and workflow badges Updated README to include a header and badges for release, package, and CI workflows. * utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows. * mfu: add GPU specifications database and utilities for MFU estimation * Modify project title in README.md Changed the project title to include 'llm.cpp' for clarity. * Update README to remove image and clean up content Removed image from README and adjusted formatting. --------- * Add CUDA kernels, optimize CI, and update documentation (#74) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. * CUDA header declarations for (LayerNorm) forward and backward (#66) * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces --------- * Add CUDA attention kernels, gradient norms, and CI improvements (#69) * exp(#58) * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * feat(ci): optimize workflow pipeline and update docker configurations * refactor(ci): optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * refactor : optimize workflow pipeline and update docker configurations * Added MIT LICENSE to this project Quadtrix.cpp * Refactor Dockerfile to use ARG for CUDA version * Refactor Dockerfile for backend dependencies * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * Delete .devops/Dockerfile.frontend * Delete .devops/Dockerfile.dev.frontend * refactor : Dockerfile.backend optimize workflow pipeline * refactor : Dockerfile.backend optimize workflow pipeline * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication * refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes * refactor : message bubble layout to use inline styles * refactor(ui): complete inline-style migration and update auto-scroll implementation * refactor(ui): complete inline-style migration for MessageAvatar component * refactor(ui): rewrite EmptyState component using pure inline styles * refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations. - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions. - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout. - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`. * refactor(main): redesign training loop to log per-step and sample during evaluation - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`). - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline. - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows. - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints. * feat: implement GPT training loop with multi-GPU and memory optimizations - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU. - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling. - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options. - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes. * Update README.md with new banner for qudtrix.cpp --------- * ci: add manual PR checks workflow with slash command support * feat(cuda): add attention forward backward kernel declarations (#64) * docs: report [run_20260530_165216] (~791 tok/s) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * docs:report [run_20260530_165216](~791 tok/s) (#61) Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms. Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900 * feat(cuda): add attention forward and backward kernel declarations Introduces the header declarations for `attention_forward` and `attention_backward` operations inside the `quadtrix::cuda` namespace. Configured with support for custom CUDA streams and head partitioning. --------- * feat(cuda): add checkpoint metadata struct and stub functions * feat(cuda): introduce core type definitions and error handling utilities - Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8). - Implements `dtype_name` and `dtype_size` metadata helper functions. - Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation. - Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro. * feat(cuda): add TokenBatchView struct and DataLoader stub class * feat(cuda): add GeLU activation forward and backward declarations - Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants. - Declares the `gelu_forward` and `gelu_backward` kernel entrypoints. - Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`. * feat(cuda): add gradient norm calculation and clipping interfaces * feat(cuda): add LayerNorm forward and backward kernel declarations * refactor(ci): organize workflow into push-triggered QA and manual docker builds Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options. * Fix formatting and update CI workflow steps * Enhance CI with macOS binary build and release Added macOS binary build and release steps to CI workflow. * feat(docker): add Dockerfile for frontend application * feat(docker): add Dockerfile for frontend application * refactor(ci): remove release job from GitHub actions * ci: add unified release and docker build workflow * ci: add unified release and docker build workflow * Refactor macOS build workflow for arm64 architecture * Update release workflow to remove macOS x64 build Removed dependency on build-macos-x64 for the release job. * perf: update execution time benchmarks in csv * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * ci(docker): refactor image build workflow and add frontend job * Remove frontend job from Docker Images workflow * Update release workflow to remove s390x and add notes Removed s390x build configurations and added a step to write detailed release notes. * feat: add local orchestration script for frontend and backend servers Introduces a central Python execution script to concurrently manage and orchestrate the development environment for both the frontend and backend. - Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants. - Verifies existence of the local PyTorch `.pt` model checkpoint before starting. - Configures environment variables dynamically for Uvicorn (FastAPI) and Vite. - Handles cross-origin setups (CORS) linking ports interactively. - Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals. - Automatically launches the frontend application in the system web browser. * chore(deps): bump actions/github-script from 7 to 9 (#71) Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9. - [Release notes](https://github.com/actions/github-script/releases) - [Commits](actions/github-script@v7...v9) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '9' dependency-type: direct:production update-type: version-update:semver-major ... * feat(cuda): introduce log_message utility and LogLevel enum * feat(cuda): add cuBLAS handle wrapper and matmul operations * feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions * refactor: untie embedding and lm_head weights and to quadtrix * feat(cuda): add NCCL communicator wrapper and all-reduce primitives * Update README.md with workflow badges Added badges for release, package, and CI workflows. * kernels: add AdamW optimization kernel with stochastic rounding Introduces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling. * cudnn: implement cached SDPA forward graph using cuDNN frontend * feat(cuda): implement Packed128 memory vectorization utilities * feat: add distributed sharded DataLoader for binary token files * feat(multi-gpu): add foundational utilities for ZeRO sharding * feat(utils): add I/O and memory error-checking wrappers * feat : add PyTorch-compatible Mersenne Twister random utilities * README : Enhance README with header and workflow badges Updated README to include a header and badges for release, package, and CI workflows. * utils:`fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows. * mfu: add GPU specifications database and utilities for MFU estimation * Modify project title in README.md Changed the project title to include 'llm.cpp' for clarity. * Update README to remove image and clean up content Removed image from README and adjusted formatting. --------- --------- --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Max <eamon5174@gmail.com> Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Eamon2009 and others added 30 commits May 29, 2026 11:35

feat(ci): optimize workflow pipeline and update docker configurations

de5c112

feat(ci): optimize workflow pipeline and update docker configurations

e91fff5

feat(ci): optimize workflow pipeline and update docker configurations

576c2b8

feat(ci): optimize workflow pipeline and update docker configurations

f1e4bb8

feat(ci): optimize workflow pipeline and update docker configurations

43d51c8

feat(ci): optimize workflow pipeline and update docker configurations

d96de27

feat(ci): optimize workflow pipeline and update docker configurations

c7d669b

feat(ci): optimize workflow pipeline and update docker configurations

a652ec8

refactor(ci): optimize workflow pipeline and update docker configurat…

ebd8e20

…ions

refactor : optimize workflow pipeline and update docker configurations

c2d78c8

refactor : optimize workflow pipeline and update docker configurations

ff173b4

refactor : optimize workflow pipeline and update docker configurations

8db6cd1

Added MIT LICENSE to this project Quadtrix.cpp

07288d9

Refactor Dockerfile to use ARG for CUDA version

d01af15

Refactor Dockerfile for backend dependencies

ed37774

refactor : Dockerfile.backend optimize workflow pipeline

068cdb7

refactor : Dockerfile.backend optimize workflow pipeline

07826e1

refactor : Dockerfile.backend optimize workflow pipeline

7f4d25a

refactor : Dockerfile.backend optimize workflow pipeline

dcd14e1

Delete .devops/Dockerfile.frontend

2139c1d

Delete .devops/Dockerfile.dev.frontend

74b46ec

refactor : Dockerfile.backend optimize workflow pipeline

0770909

refactor : Dockerfile.backend optimize workflow pipeline

f0ae40b

refactored (CI): consolidated manual Docker build jobs into a matrix …

9f909f4

…strategy to reduce duplication

refactored (CI): consolidated manual Docker build jobs into a matrix …

0ff70b2

…strategy to reduce duplication

refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS …

31e960d

…keyframes

refactor : message bubble layout to use inline styles

31ef90d

refactor(ui): complete inline-style migration and update auto-scroll …

9b19a92

…implementation

refactor(ui): complete inline-style migration for MessageAvatar compo…

250b2fc

…nent

Eamon2009 added 15 commits June 4, 2026 11:04

feat(cuda): add NCCL communicator wrapper and all-reduce primitives

7c461b8

Update README.md with workflow badges

c5d06b6

Added badges for release, package, and CI workflows.

kernels: add AdamW optimization kernel with stochastic rounding Intro…

e040025

…duces the AdamW fused CUDA kernel including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

cudnn: implement cached SDPA forward graph using cuDNN frontend

c3dc5ae

feat(cuda): implement Packed128 memory vectorization utilities

49099ae

feat: add distributed sharded DataLoader for binary token files

b41b989

feat(multi-gpu): add foundational utilities for ZeRO sharding

58ab604

feat(utils): add I/O and memory error-checking wrappers

b91f867

feat : add PyTorch-compatible Mersenne Twister random utilities

8110186

README : Enhance README with header and workflow badges

54b727b

Updated README to include a header and badges for release, package, and CI workflows.

utils:fopenCheck, freadCheck, fwriteCheck, fcloseCheck, and `…

9b34e36

…fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

Merge branch 'master' of https://github.com/Eamon2009/Quadtrix.cpp

d45dcb9

mfu: add GPU specifications database and utilities for MFU estimation

a89ab1c

Modify project title in README.md

fd41e1b

Changed the project title to include 'llm.cpp' for clarity.

Merge branch 'exp' into master

a20fc03

Eamon2009 self-assigned this Jun 8, 2026

Eamon2009 requested a review from codeaddict-119 June 8, 2026 07:31

Eamon2009 temporarily deployed to github-pages June 8, 2026 07:31 — with GitHub Pages Inactive

Merge branch 'Eamon2009-patch-1' into master

da9168e

Eamon2009 temporarily deployed to github-pages June 8, 2026 07:34 — with GitHub Pages Inactive

Update README to remove image and clean up content

d5cadb6

Removed image from README and adjusted formatting.

Eamon2009 temporarily deployed to github-pages June 8, 2026 07:38 — with GitHub Pages Inactive

Eamon2009 temporarily deployed to github-pages June 8, 2026 07:48 — with GitHub Pages Inactive

Eamon2009 temporarily deployed to github-pages June 8, 2026 08:06 — with GitHub Pages Inactive

Eamon2009 merged commit e2946cb into exp Jun 8, 2026
3 of 4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize CI workflow and Docker configurations with refactoring#72

Optimize CI workflow and Docker configurations with refactoring#72
Eamon2009 merged 90 commits into
expfrom
master

Eamon2009 commented Jun 8, 2026

Uh oh!

Eamon2009 commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Eamon2009 commented Jun 8, 2026

Uh oh!

Eamon2009 commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants