feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism#501

Open
y-coffee-dev wants to merge 5 commits into Light-Heart-Labs:main from y-coffee-dev:feat/multi-gpu

Conversation

@y-coffee-dev

Summary

Adds end-to-end multi-GPU support for NVIDIA systems. The installer now automatically detects the multi-GPU topology, assigns GPUs to services based on interconnect quality and VRAM capacity, and configures services for multi-GPU use, all without manual intervention. A custom assignment TUI is also available for advanced users.

Architecture

Topology Detection (nvidia-topo.sh)

Parses the nvidia-smi topo -m matrix to extract GPU-to-GPU link types and assigns each link type a numerical rank.
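
nvidia-topo.sh itself is a shell library, so the following Python sketch is an illustration only. The rank values below are assumptions chosen to stay consistent with the thresholds quoted later in this PR (NVLink >= 80, same-NUMA PCIe 11-79, cross-NUMA <= 10); the library's actual ranks and parsing details may differ.

```python
# Hypothetical rank table -- illustrative values, not the library's own.
LINK_RANKS = {
    "NV":   90,   # NVLink (NV1..NV12 all land in this bucket)
    "PIX":  60,   # at most a single PCIe bridge
    "PXB":  40,   # multiple PCIe bridges, same host bridge
    "PHB":  30,   # PCIe host bridge, same NUMA node
    "NODE": 15,   # crosses host bridges within a NUMA node
    "SYS":   5,   # cross-NUMA interconnect (e.g. QPI/UPI)
}

def link_rank(label: str) -> int:
    """Map a topo-matrix cell such as 'NV12' or 'PHB' to a numeric rank."""
    return LINK_RANKS["NV"] if label.startswith("NV") else LINK_RANKS.get(label, 0)

def parse_topo_matrix(text: str) -> dict[tuple[str, str], int]:
    """Extract GPU-to-GPU link ranks from `nvidia-smi topo -m` output,
    skipping NIC rows (e.g. mlx5_0) and the trailing legend."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()   # GPU0 GPU1 ... [NIC columns] CPU Affinity ...
    ranks: dict[tuple[str, str], int] = {}
    for row in lines[1:]:
        cells = row.split()
        if not cells[0].startswith("GPU"):
            continue            # NIC row or legend line
        for col, cell in zip(header, cells[1:]):
            if col.startswith("GPU") and cell != "X":
                ranks[(cells[0], col)] = link_rank(cell)
    return ranks
```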

GPU Assignment Algorithm (assign_gpus.py)

Four-phase pipeline (a Python sketch of phases 2-4 follows this list):

  1. Topology Analysis — Parse GPUs and links, build rank matrix
  2. Subset Enumeration — Generate all GPU subsets, sorted by min link rank (desc), size (asc), VRAM (desc). Find the best subset that fits the model; if none fits, greedily span across GPUs
  3. Service Assignment — Allocate remaining GPUs to whisper/comfyui/embeddings based on availability:
    • 0 remaining: colocate all services on llama's last GPU
    • 1 remaining: all auxiliary services share that GPU
    • 2 remaining: whisper gets one, comfyui+embeddings share the other
    • 3+ remaining: dedicated GPUs; extras go back to llama
  4. Parallelism Selection — Based on GPU count and min link rank:
    • NVLink/XGMI (rank >= 80): tensor parallel (<=3 GPUs) or hybrid (>3 GPUs)
    • Same-NUMA PCIe (rank 11-79): pipeline (<=3 GPUs) or hybrid if rank >= 40
    • Cross-NUMA (rank <= 10): pipeline only
    • Heterogeneous VRAM: proportional tensor split weights
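
A hedged sketch of phases 2-4 in Python. Function names, data shapes, and tie-breaking details are assumptions; the thresholds mirror the ones listed above, and the real assign_gpus.py may differ.

```python
from itertools import combinations

def best_subset(gpus, model_size_mb, rank):
    """Phase 2: smallest, best-connected GPU subset that fits the model.

    gpus: list of (gpu_id, vram_mb); rank(a, b): numeric link rank."""
    scored = []
    for n in range(1, len(gpus) + 1):
        for combo in combinations(gpus, n):
            min_rank = min((rank(a[0], b[0]) for a, b in combinations(combo, 2)),
                           default=100)        # single GPU: no link to degrade
            scored.append((combo, min_rank, sum(v for _, v in combo)))
    # min link rank desc, size asc, total VRAM desc
    scored.sort(key=lambda s: (-s[1], len(s[0]), -s[2]))
    for combo, _, vram in scored:
        if vram >= model_size_mb:
            return combo
    return tuple(gpus)                          # nothing fits: span everything

def assign_services(remaining, llama_gpus):
    """Phase 3: hand leftover GPUs to whisper / comfyui / embeddings."""
    if not remaining:
        g = llama_gpus[-1]                      # colocate on llama's last GPU
        return {"whisper": g, "comfyui": g, "embeddings": g}
    if len(remaining) == 1:
        g = remaining[0]                        # all auxiliary services share
        return {"whisper": g, "comfyui": g, "embeddings": g}
    if len(remaining) == 2:
        return {"whisper": remaining[0], "comfyui": remaining[1],
                "embeddings": remaining[1]}
    return {"whisper": remaining[0], "comfyui": remaining[1],
            "embeddings": remaining[2]}         # extras return to llama

def parallelism(gpu_count, min_rank):
    """Phase 4: pick a mode from GPU count and the worst link in the subset."""
    if gpu_count <= 1:
        return "none"
    if min_rank >= 80:                          # NVLink/XGMI
        return "tensor" if gpu_count <= 3 else "hybrid"
    if min_rank >= 11:                          # same-NUMA PCIe
        return "hybrid" if gpu_count > 3 and min_rank >= 40 else "pipeline"
    return "pipeline"                           # cross-NUMA

def tensor_split(gpus):
    """Heterogeneous VRAM: proportional tensor split weights."""
    total = sum(v for _, v in gpus)
    return [round(v / total, 3) for _, v in gpus]
```

For example, parallelism(2, 90) returns "tensor" for an NVLinked pair, and tensor_split([("GPU0", 24576), ("GPU1", 12288)]) returns [0.667, 0.333].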

Compose Layering

When GPU_COUNT > 1, the stack adds:

  • docker-compose.multigpu.yml — llama-server GPU pinning + split mode
  • extensions/services/*/compose.multigpu.yaml — per-service GPU pinning
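
compose-select.sh and resolve-compose-stack.sh are shell scripts; purely to illustrate the layering order, here is the equivalent logic in Python (the base file name and discovery glob are assumptions):

```python
from pathlib import Path

def compose_files(gpu_count: int) -> list[str]:
    """Assemble the -f arguments for docker compose, mirroring the
    layering described above."""
    files = ["docker-compose.yml"]
    if gpu_count > 1:
        files.append("docker-compose.multigpu.yml")
        # Per-service overlays discovered under extensions/services/
        files += sorted(str(p) for p in
                        Path("extensions/services").glob("*/compose.multigpu.yaml"))
    return files
```

Under docker compose's merge rules, later -f files override earlier ones, so the per-service pins win over the generic overlay.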

Interactive TUI

Multi-GPU systems get a configuration prompt:

  • [1] Automatic — runs assign_gpus.py with detected topology
  • [2] Custom — manual GPU-to-service assignment

Non-interactive installs default to automatic assignment.

Test coverage

Automated tests

  • tests/test-nvidia-topo.sh — Tests topology matrix parsing against 7 fixture files covering 1-GPU through 8-GPU configurations, NVLink/PCIe/NUMA topologies, and edge cases like NIC rows in the matrix
  • tests/test-assign-gpus.py — Comprehensive pytest suite (one case sketched after this list) covering:
    • Single GPU: strategy, service sharing, parallelism mode, model-too-large error
    • 2-GPU PHB: colocated strategy, pipeline parallelism
    • 4-GPU SOC (cross-NUMA): pipeline mode, dedicated strategy
    • 4-GPU SYS + NV pairs: mixed topology handling
    • 5-GPU NV12 + MLX5: NVLink with NIC filtering
    • 8-GPU NV12 full mesh: tensor/hybrid parallelism selection
    • 8-GPU NV1/NV2 partial mesh: degraded NVLink handling
    • VRAM overflow / span subset scenarios
    • Heterogeneous GPU tensor split proportions
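
As a shape illustration only, one such case might look like this. The fixture filename and the assign() entry point are hypothetical, invented for this sketch; the real suite's API may differ.

```python
import json
from pathlib import Path

import assign_gpus  # scripts/assign_gpus.py on the import path

def test_8gpu_nv12_full_mesh_prefers_tensor_or_hybrid():
    # Hypothetical fixture name and entry point, for illustration only.
    topo = json.loads(
        Path("tests/fixtures/topology_json/8gpu_nv12_full_mesh.json").read_text()
    )
    result = assign_gpus.assign(topo, model_size_mb=48500)
    assert result["parallelism"] in ("tensor", "hybrid")
```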

Manual hardware testing

Thoroughly tested on several multi-GPU machines with various configurations, including (non-exhaustive):

  • 2x NVIDIA RTX 3060
  • 4x NVIDIA RTX 4080
  • 4x NVIDIA RTX 5060 Ti

All tests confirmed correct topology detection, appropriate strategy selection, and proper compose overlay application.

What changed

New files

  • installers/lib/nvidia-topo.sh — NVIDIA topology detection library: parses the nvidia-smi topo -m matrix into structured JSON with link types, ranks, and labels
  • scripts/assign_gpus.py — GPU assignment algorithm: 4-phase pipeline of topology analysis, subset enumeration, service assignment, and parallelism selection
  • docker-compose.multigpu.yml — Compose overlay for llama-server with NVIDIA_VISIBLE_DEVICES, LLAMA_ARG_SPLIT_MODE, and LLAMA_ARG_TENSOR_SPLIT
  • extensions/services/comfyui/compose.multigpu.yaml — per-service GPU pinning overlay for ComfyUI
  • extensions/services/whisper/compose.multigpu.yaml — per-service GPU pinning overlay for Whisper
  • extensions/services/embeddings/compose.multigpu.yaml — per-service GPU pinning overlay for Embeddings
  • tests/test-nvidia-topo.sh — shell tests for topology parsing against fixture matrices
  • tests/test-assign-gpus.py — Python tests covering single-GPU, 2-GPU colocated, 4-GPU NVLink/SYS, 5-GPU NVLink, and 8-GPU full-mesh/partial-mesh topologies
  • tests/fixtures/topology_json/*.json (8 files) — JSON topology fixtures: 1-GPU PCIe, 2-GPU PHB, 4-GPU SOC, 4-GPU SYS+NV pairs, 5-GPU NV12+MLX5, 8-GPU NV12 full mesh, 8-GPU NV12+NUMA, 8-GPU NV1/NV2 partial mesh
  • tests/fixtures/topology_matrix/*.txt (7 files) — raw nvidia-smi topo -m output fixtures for shell-level testing

Modified files

  • installers/phases/01-preflight.sh — adds jq and python3 to preflight dependency checks (required by topology detection and assignment)
  • installers/phases/02-detection.sh — integrates detect_nvidia_topo(); populates GPU_TOPOLOGY_JSON, GPU_HAS_NVLINK, GPU_TOTAL_VRAM, LLM_MODEL_SIZE_MB
  • installers/phases/03-features.sh — major expansion: multi-GPU configuration TUI with automatic and custom assignment modes, parallelism selection, env var extraction
  • installers/phases/04-requirements.sh — adds the multi-GPU compose overlay to requirements
  • installers/phases/06-directories.sh — persists GPU_ASSIGNMENT_JSON and per-service GPU UUIDs to .env
  • installers/lib/constants.sh — adds multi-GPU-related constants
  • installers/lib/tier-map.sh — adds multi-GPU tier mappings
  • installers/lib/compose-select.sh — includes docker-compose.multigpu.yml when GPU_COUNT > 1
  • scripts/resolve-compose-stack.sh — accepts a --gpu-count flag; discovers and merges compose.multigpu.yaml files from extensions
  • scripts/detect-hardware.sh — sources nvidia-topo.sh for topology detection
  • scripts/build-capability-profile.sh — reads the actual gpu.count from the capability profile instead of hardcoding 1
  • .env.schema.json — adds new env vars: GPU_ASSIGNMENT_JSON_B64, LLAMA_SERVER_GPU_UUIDS, LLAMA_ARG_SPLIT_MODE, LLAMA_ARG_TENSOR_SPLIT, EMBEDDINGS_GPU_UUID, COMFYUI_GPU_UUID, WHISPER_GPU_UUID, N_GPU_LAYERS

@Lightheartdevs (Collaborator) left a comment

Review: Needs Work

Strong algorithm and good test coverage (561 lines of pytest), but a few issues need resolving before merge:

1. jq promoted from optional to required (breaking)

01-preflight.sh now hard-requires jq. This will fail installs on minimal systems (e.g., fresh Debian/Alpine containers) that previously worked fine. Either:

  • Auto-install jq (like Docker is auto-installed in phase 05), or
  • Keep it optional with graceful degradation when absent

2. No CI checks have run

This branch has zero CI results. Please push a commit or re-trigger CI so we can see if it passes the test matrix.

3. Docker Compose GPU reservation conflict

docker-compose.multigpu.yml sets both NVIDIA_VISIBLE_DEVICES env var AND deploy.resources.reservations.devices without device_ids. The reservation block will reserve ALL GPUs while the env var tries to limit visibility. These two mechanisms conflict — pick one or wire device_ids dynamically.

4. Minor: duplicate comment line

constants.sh has INSTALL_START_EPOCH listed twice in the "Provides" header comment.

What's good

  • The topology detection with nvidia-smi topo -m fallback is well-handled
  • assign_gpus.py algorithm is correct and the O(2^N) subset enumeration is fine for realistic GPU counts
  • Single-GPU path is preserved (gated on GPU_COUNT > 1)
  • Graceful degradation when nvidia-smi is absent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Enhanced multi-GPU tier assignment based on topology
- Implemented robust GPU topology detection for NVIDIA
- Implemented GPU link ranking from fastest to slowest for optimal strategy selection in later phases
- Implemented gathering of detailed per-GPU information
- Data structures for GPU information storage
- Robust and comprehensive test suite for NVIDIA topology detection
- Multi-GPU strategy selection algorithm
- Careful handling of edge cases and subtle bugs in strategy selection
- Robust test suite for multi-GPU strategy selection algorithm

GPU assignment and parallelization strategy selection algorithm that clusters GPUs by topology links to find the optimal setup; multi-GPU configuration TUI; Docker Compose overlays for multi-GPU setups

Adjust env schema validation

Fixed inconsistencies in GPU count, JSON escaping issues, etc.

fix issue with writing multigpu overlay

fix resolve-compose-stack.sh multi gpu overlay

fix gpu device id

Refactors + less convoluted docker compose setup

N_GPU_LAYERS validation

fix multi-gpu overlay
@y-coffee-dev (Author)

@Lightheartdevs Thanks for the thorough review! I adjusted the PR.

1. jq auto-install - Good catch. I've added auto-install logic for jq.
2. CI - Pushed an adjustments commit; this should trigger the CI pipeline.
3. Docker Compose GPU reservation - This is not actually a conflict; the current setup is intentional and correct. device_ids in the deploy.resources.reservations.devices block can't be set dynamically: Docker Compose variable interpolation only produces scalar strings, and since device_ids expects a YAML sequence, there's no way to inject a list like ['0', '2'] from an environment variable.
The two mechanisms layer rather than conflict: the deploy reservation makes all GPUs available at the Docker level, and the NVIDIA container runtime then uses NVIDIA_VISIBLE_DEVICES to scope which GPUs are actually visible inside the container.
This is a common approach when you need dynamic per-container GPU assignment in Compose.

4. INSTALL_START_EPOCH duplication - Fixed!

I appreciate the detailed feedback!

@Lightheartdevs (Collaborator)

Review Update — Rebase Required Before Merge

Hey @y-coffee-dev, great work addressing the previous review items. The code itself is solid and we want to get this merged. However, we found a critical issue that needs attention first.

🚨 Silent merge bug: LLM_MODEL_SIZE_MB will be dropped

Since you branched, we merged #572/#573/#574, which rewrote the model names and URLs in tier-map.sh (Qwen 3 → Qwen 3.5). Your branch adds LLM_MODEL_SIZE_MB to each tier in that same file.

Git reports a clean merge — no conflicts — but the result silently drops all 11 of your LLM_MODEL_SIZE_MB additions. This happens because git sees main's rewrites and your additions as non-overlapping changes within each tier block, and resolves by taking main's version (which has no LLM_MODEL_SIZE_MB).

What breaks: assign_gpus.py gets called with --model-size "", so float("") raises ValueError and multi-GPU assignment fails on every install. Single-GPU installs are fine (early return guard), but the entire multi-GPU feature would be DOA.
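
For reference, the failure is just Python's float() on an empty string. A minimal reproduction with a defensive guard (a sketch, not the actual assign_gpus.py code):

```python
raw = ""                           # what a dropped LLM_MODEL_SIZE_MB produces
try:
    model_size_mb = float(raw)     # float("") raises ValueError
except ValueError:
    raise SystemExit("assign_gpus.py: --model-size is empty; "
                     "is LLM_MODEL_SIZE_MB set in tier-map.sh?")
```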

What's needed

  1. Rebase onto current main (commit 5a932e9)
  2. Re-add LLM_MODEL_SIZE_MB to each tier. The new Qwen 3.5 model sizes (update as needed):
CLOUD:      LLM_MODEL_SIZE_MB=0
ARC:        LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
ARC_LITE:   LLM_MODEL_SIZE_MB=2870    # Qwen3.5-4B-Q4_K_M
NV_ULTRA:   LLM_MODEL_SIZE_MB=48500   # Qwen3-Coder-Next-Q4_K_M (unchanged)
SH_LARGE:   LLM_MODEL_SIZE_MB=48500   # Qwen3-Coder-Next-Q4_K_M (unchanged)
SH_COMPACT: LLM_MODEL_SIZE_MB=18600   # Qwen3-30B-A3B-Q4_K_M (unchanged)
Tier 0:     LLM_MODEL_SIZE_MB=1500    # Qwen3.5-2B-Q4_K_M
Tier 1:     LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
Tier 2:     LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
Tier 3:     LLM_MODEL_SIZE_MB=16400   # Qwen3.5-27B-Q4_K_M
Tier 4:     LLM_MODEL_SIZE_MB=18600   # Qwen3-30B-A3B-Q4_K_M (unchanged)

⚠️ Double-check these against the actual GGUF file sizes on HuggingFace — the Qwen 3.5 models are new and some sizes may differ from the Qwen 3 equivalents you had before.

  3. Push — this should also trigger CI, which hasn't run yet on this branch.

Everything else looks good

We did a full merge simulation and traced every touched installer file. The single-GPU path is completely safe — your guards in 02-detection.sh (GPU_COUNT -gt 1) and 03-features.sh (GPU_COUNT -le 1 → return) are clean. The compose layering, hardware detection additions, and .env generation all use safe defaults. No behavioral changes for existing single-GPU installs on any backend.

Two minor suggestions for a follow-up (non-blocking):

  • Add trap "rm -f $TOPOLOGY_FILE" EXIT after the mktemp in 03-features.sh to clean up on early exit
  • Add a # NOTE: keep in sync with assign_gpus.py comment in the custom TUI parallelism logic in 03-features.sh, since it duplicates the threshold logic from the Python script

Looking forward to the rebase — this is a great feature and we want to ship it. 🚀

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
