feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism#501

Open
y-coffee-dev wants to merge 5 commits into Light-Heart-Labs:main from y-coffee-dev:feat/multi-gpu

Conversation

@y-coffee-dev

Summary

Adds end-to-end multi-GPU support for NVIDIA systems. The installer now automatically detects the multi-GPU topology, assigns GPUs to services based on interconnect quality and VRAM capacity, and configures services for multi-GPU use, all without manual intervention. A custom assignment TUI is also available for advanced users.

Architecture

Topology Detection (nvidia-topo.sh)

Parses the nvidia-smi topo -m matrix to extract GPU-to-GPU link types and assigns each link type a numerical rank.
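
nvidia-topo.sh itself is a shell library, so the following Python sketch is an illustration only. The rank values below are assumptions chosen to stay consistent with the thresholds quoted later in this PR (NVLink >= 80, same-NUMA PCIe 11-79, cross-NUMA <= 10); the library's actual ranks and parsing details may differ.

```python
# Hypothetical rank table -- illustrative values, not the library's own.
LINK_RANKS = {
    "NV":   90,   # NVLink (NV1..NV12 all land in this bucket)
    "PIX":  60,   # at most a single PCIe bridge
    "PXB":  40,   # multiple PCIe bridges, same host bridge
    "PHB":  30,   # PCIe host bridge, same NUMA node
    "NODE": 15,   # crosses host bridges within a NUMA node
    "SYS":   5,   # cross-NUMA interconnect (e.g. QPI/UPI)
}

def link_rank(label: str) -> int:
    """Map a topo-matrix cell such as 'NV12' or 'PHB' to a numeric rank."""
    return LINK_RANKS["NV"] if label.startswith("NV") else LINK_RANKS.get(label, 0)

def parse_topo_matrix(text: str) -> dict[tuple[str, str], int]:
    """Extract GPU-to-GPU link ranks from `nvidia-smi topo -m` output,
    skipping NIC rows (e.g. mlx5_0) and the trailing legend."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()   # GPU0 GPU1 ... [NIC columns] CPU Affinity ...
    ranks: dict[tuple[str, str], int] = {}
    for row in lines[1:]:
        cells = row.split()
        if not cells[0].startswith("GPU"):
            continue            # NIC row or legend line
        for col, cell in zip(header, cells[1:]):
            if col.startswith("GPU") and cell != "X":
                ranks[(cells[0], col)] = link_rank(cell)
    return ranks
```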

GPU Assignment Algorithm (assign_gpus.py)

Four-phase pipeline (a Python sketch of phases 2-4 follows this list):

  1. Topology Analysis — Parse GPUs and links, build rank matrix
  2. Subset Enumeration — Generate all GPU subsets, sorted by min link rank (desc), size (asc), VRAM (desc). Find the best subset that fits the model; if none fits, greedily span across GPUs
  3. Service Assignment — Allocate remaining GPUs to whisper/comfyui/embeddings based on availability:
    • 0 remaining: colocate all services on llama's last GPU
    • 1 remaining: all auxiliary services share that GPU
    • 2 remaining: whisper gets one, comfyui+embeddings share the other
    • 3+ remaining: dedicated GPUs; extras go back to llama
  4. Parallelism Selection — Based on GPU count and min link rank:
    • NVLink/XGMI (rank >= 80): tensor parallel (<=3 GPUs) or hybrid (>3 GPUs)
    • Same-NUMA PCIe (rank 11-79): pipeline (<=3 GPUs) or hybrid if rank >= 40
    • Cross-NUMA (rank <= 10): pipeline only
    • Heterogeneous VRAM: proportional tensor split weights
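
A hedged sketch of phases 2-4 in Python. Function names, data shapes, and tie-breaking details are assumptions; the thresholds mirror the ones listed above, and the real assign_gpus.py may differ.

```python
from itertools import combinations

def best_subset(gpus, model_size_mb, rank):
    """Phase 2: smallest, best-connected GPU subset that fits the model.

    gpus: list of (gpu_id, vram_mb); rank(a, b): numeric link rank."""
    scored = []
    for n in range(1, len(gpus) + 1):
        for combo in combinations(gpus, n):
            min_rank = min((rank(a[0], b[0]) for a, b in combinations(combo, 2)),
                           default=100)        # single GPU: no link to degrade
            scored.append((combo, min_rank, sum(v for _, v in combo)))
    # min link rank desc, size asc, total VRAM desc
    scored.sort(key=lambda s: (-s[1], len(s[0]), -s[2]))
    for combo, _, vram in scored:
        if vram >= model_size_mb:
            return combo
    return tuple(gpus)                          # nothing fits: span everything

def assign_services(remaining, llama_gpus):
    """Phase 3: hand leftover GPUs to whisper / comfyui / embeddings."""
    if not remaining:
        g = llama_gpus[-1]                      # colocate on llama's last GPU
        return {"whisper": g, "comfyui": g, "embeddings": g}
    if len(remaining) == 1:
        g = remaining[0]                        # all auxiliary services share
        return {"whisper": g, "comfyui": g, "embeddings": g}
    if len(remaining) == 2:
        return {"whisper": remaining[0], "comfyui": remaining[1],
                "embeddings": remaining[1]}
    return {"whisper": remaining[0], "comfyui": remaining[1],
            "embeddings": remaining[2]}         # extras return to llama

def parallelism(gpu_count, min_rank):
    """Phase 4: pick a mode from GPU count and the worst link in the subset."""
    if gpu_count <= 1:
        return "none"
    if min_rank >= 80:                          # NVLink/XGMI
        return "tensor" if gpu_count <= 3 else "hybrid"
    if min_rank >= 11:                          # same-NUMA PCIe
        return "hybrid" if gpu_count > 3 and min_rank >= 40 else "pipeline"
    return "pipeline"                           # cross-NUMA

def tensor_split(gpus):
    """Heterogeneous VRAM: proportional tensor split weights."""
    total = sum(v for _, v in gpus)
    return [round(v / total, 3) for _, v in gpus]
```

For example, parallelism(2, 90) returns "tensor" for an NVLinked pair, and tensor_split([("GPU0", 24576), ("GPU1", 12288)]) returns [0.667, 0.333].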

Compose Layering

When GPU_COUNT > 1, the stack adds:

  • docker-compose.multigpu.yml — llama-server GPU pinning + split mode
  • extensions/services/*/compose.multigpu.yaml — per-service GPU pinning
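
compose-select.sh and resolve-compose-stack.sh are shell scripts; purely to illustrate the layering order, here is the equivalent logic in Python (the base file name and discovery glob are assumptions):

```python
from pathlib import Path

def compose_files(gpu_count: int) -> list[str]:
    """Assemble the -f arguments for docker compose, mirroring the
    layering described above."""
    files = ["docker-compose.yml"]
    if gpu_count > 1:
        files.append("docker-compose.multigpu.yml")
        # Per-service overlays discovered under extensions/services/
        files += sorted(str(p) for p in
                        Path("extensions/services").glob("*/compose.multigpu.yaml"))
    return files
```

Under docker compose's merge rules, later -f files override earlier ones, so the per-service pins win over the generic overlay.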

Interactive TUI

Multi-GPU systems get a configuration prompt:

  • [1] Automatic — runs assign_gpus.py with detected topology
  • [2] Custom — manual GPU-to-service assignment

Non-interactive installs default to automatic assignment.

Test coverage

Automated tests

  • tests/test-nvidia-topo.sh — Tests topology matrix parsing against 7 fixture files covering 1-GPU through 8-GPU configurations, NVLink/PCIe/NUMA topologies, and edge cases like NIC rows in the matrix
  • tests/test-assign-gpus.py — Comprehensive pytest suite (one case sketched after this list) covering:
    • Single GPU: strategy, service sharing, parallelism mode, model-too-large error
    • 2-GPU PHB: colocated strategy, pipeline parallelism
    • 4-GPU SOC (cross-NUMA): pipeline mode, dedicated strategy
    • 4-GPU SYS + NV pairs: mixed topology handling
    • 5-GPU NV12 + MLX5: NVLink with NIC filtering
    • 8-GPU NV12 full mesh: tensor/hybrid parallelism selection
    • 8-GPU NV1/NV2 partial mesh: degraded NVLink handling
    • VRAM overflow / span subset scenarios
    • Heterogeneous GPU tensor split proportions
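
As a shape illustration only, one such case might look like this. The fixture filename and the assign() entry point are hypothetical, invented for this sketch; the real suite's API may differ.

```python
import json
from pathlib import Path

import assign_gpus  # scripts/assign_gpus.py on the import path

def test_8gpu_nv12_full_mesh_prefers_tensor_or_hybrid():
    # Hypothetical fixture name and entry point, for illustration only.
    topo = json.loads(
        Path("tests/fixtures/topology_json/8gpu_nv12_full_mesh.json").read_text()
    )
    result = assign_gpus.assign(topo, model_size_mb=48500)
    assert result["parallelism"] in ("tensor", "hybrid")
```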

Manual hardware testing

Thoroughly tested on several multi-GPU machines with various configurations, including (non-exhaustive):

  • 2x NVIDIA RTX 3060
  • 4x NVIDIA RTX 4080
  • 4x NVIDIA RTX 5060 Ti

All tests confirmed correct topology detection, appropriate strategy selection, and proper compose overlay application.

What changed

New files

  • installers/lib/nvidia-topo.sh — NVIDIA topology detection library: parses the nvidia-smi topo -m matrix into structured JSON with link types, ranks, and labels
  • scripts/assign_gpus.py — GPU assignment algorithm: 4-phase pipeline of topology analysis, subset enumeration, service assignment, and parallelism selection
  • docker-compose.multigpu.yml — Compose overlay for llama-server with NVIDIA_VISIBLE_DEVICES, LLAMA_ARG_SPLIT_MODE, and LLAMA_ARG_TENSOR_SPLIT
  • extensions/services/comfyui/compose.multigpu.yaml — per-service GPU pinning overlay for ComfyUI
  • extensions/services/whisper/compose.multigpu.yaml — per-service GPU pinning overlay for Whisper
  • extensions/services/embeddings/compose.multigpu.yaml — per-service GPU pinning overlay for Embeddings
  • tests/test-nvidia-topo.sh — shell tests for topology parsing against fixture matrices
  • tests/test-assign-gpus.py — Python tests covering single-GPU, 2-GPU colocated, 4-GPU NVLink/SYS, 5-GPU NVLink, and 8-GPU full-mesh/partial-mesh topologies
  • tests/fixtures/topology_json/*.json (8 files) — JSON topology fixtures: 1-GPU PCIe, 2-GPU PHB, 4-GPU SOC, 4-GPU SYS+NV pairs, 5-GPU NV12+MLX5, 8-GPU NV12 full mesh, 8-GPU NV12+NUMA, 8-GPU NV1/NV2 partial mesh
  • tests/fixtures/topology_matrix/*.txt (7 files) — raw nvidia-smi topo -m output fixtures for shell-level testing

Modified files

  • installers/phases/01-preflight.sh — adds jq and python3 to preflight dependency checks (required by topology detection and assignment)
  • installers/phases/02-detection.sh — integrates detect_nvidia_topo(); populates GPU_TOPOLOGY_JSON, GPU_HAS_NVLINK, GPU_TOTAL_VRAM, LLM_MODEL_SIZE_MB
  • installers/phases/03-features.sh — major expansion: multi-GPU configuration TUI with automatic and custom assignment modes, parallelism selection, env var extraction
  • installers/phases/04-requirements.sh — adds the multi-GPU compose overlay to requirements
  • installers/phases/06-directories.sh — persists GPU_ASSIGNMENT_JSON and per-service GPU UUIDs to .env
  • installers/lib/constants.sh — adds multi-GPU-related constants
  • installers/lib/tier-map.sh — adds multi-GPU tier mappings
  • installers/lib/compose-select.sh — includes docker-compose.multigpu.yml when GPU_COUNT > 1
  • scripts/resolve-compose-stack.sh — accepts a --gpu-count flag; discovers and merges compose.multigpu.yaml files from extensions
  • scripts/detect-hardware.sh — sources nvidia-topo.sh for topology detection
  • scripts/build-capability-profile.sh — reads the actual gpu.count from the capability profile instead of hardcoding 1
  • .env.schema.json — adds new env vars: GPU_ASSIGNMENT_JSON_B64, LLAMA_SERVER_GPU_UUIDS, LLAMA_ARG_SPLIT_MODE, LLAMA_ARG_TENSOR_SPLIT, EMBEDDINGS_GPU_UUID, COMFYUI_GPU_UUID, WHISPER_GPU_UUID, N_GPU_LAYERS

@Lightheartdevs (Collaborator) left a comment

Review: Needs Work

Strong algorithm and good test coverage (561 lines of pytest), but a few issues need resolving before merge:

1. jq promoted from optional to required (breaking)

01-preflight.sh now hard-requires jq. This will fail installs on minimal systems (e.g., fresh Debian/Alpine containers) that previously worked fine. Either:

  • Auto-install jq (like Docker is auto-installed in phase 05), or
  • Keep it optional with graceful degradation when absent

2. No CI checks have run

This branch has zero CI results. Please push a commit or re-trigger CI so we can see if it passes the test matrix.

3. Docker Compose GPU reservation conflict

docker-compose.multigpu.yml sets both NVIDIA_VISIBLE_DEVICES env var AND deploy.resources.reservations.devices without device_ids. The reservation block will reserve ALL GPUs while the env var tries to limit visibility. These two mechanisms conflict — pick one or wire device_ids dynamically.

4. Minor: duplicate comment line

constants.sh has INSTALL_START_EPOCH listed twice in the "Provides" header comment.

What's good

  • The topology detection with nvidia-smi topo -m fallback is well-handled
  • assign_gpus.py algorithm is correct and the O(2^N) subset enumeration is fine for realistic GPU counts
  • Single-GPU path is preserved (gated on GPU_COUNT > 1)
  • Graceful degradation when nvidia-smi is absent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Enhanced multi-GPU tier assignment based on topology
- Implemented robust GPU topology detection for NVIDIA
- Implemented GPU link ranking from fastest to slowest for optimal strategy selection in later phases
- Implemented gathering of detailed per-GPU information
- Data structures for GPU information storage
- Robust and comprehensive test suite for NVIDIA topology detection
- Multi-GPU strategy selection algorithm
- Careful handling of edge cases and subtle bugs in strategy selection
- Robust test suite for multi-GPU strategy selection algorithm

GPU assignment and parallelization strategy selection algorithm that clusters GPUs by topology links to find the optimal setup; multi-GPU configuration TUI; Docker Compose overlays for multi-GPU setups

Adjust env schema validation

Fixed inconsistencies in GPU count, JSON escaping issues, etc.

fix issue with writing multigpu overlay

fix resolve-compose-stack.sh multi gpu overlay

fix gpu device id

Refactors + less convoluted docker compose setup

N_GPU_LAYERS validation

fix multi-gpu overlay
@y-coffee-dev (Author)

@Lightheartdevs Thanks for the thorough review! I adjusted the PR.

1. jq auto-install - Good catch. I've added auto-install logic for jq.
2. CI - Pushed an adjustments commit; this should trigger the CI pipeline.
3. Docker Compose GPU reservation - This is not actually a conflict; the current setup is intentional and correct. device_ids in the deploy.resources.reservations.devices block can't be set dynamically: Docker Compose variable interpolation only produces scalar strings, and since device_ids expects a YAML sequence, there's no way to inject a list like ['0', '2'] from an environment variable.
The two mechanisms layer rather than conflict: the deploy reservation makes all GPUs available at the Docker level, and the NVIDIA container runtime then uses NVIDIA_VISIBLE_DEVICES to scope which GPUs are actually visible inside the container.
This is a common approach when you need dynamic per-container GPU assignment in Compose.

4. INSTALL_START_EPOCH duplication - Fixed!

I appreciate the detailed feedback!

@Lightheartdevs (Collaborator)

Review Update — Rebase Required Before Merge

Hey @y-coffee-dev, great work addressing the previous review items. The code itself is solid and we want to get this merged. However, we found a critical issue that needs attention first.

🚨 Silent merge bug: LLM_MODEL_SIZE_MB will be dropped

Since you branched, we merged #572/#573/#574, which rewrote the model names and URLs in tier-map.sh (Qwen 3 → Qwen 3.5). Your branch adds LLM_MODEL_SIZE_MB to each tier in that same file.

Git reports a clean merge — no conflicts — but the result silently drops all 11 of your LLM_MODEL_SIZE_MB additions. This happens because git sees main's rewrites and your additions as non-overlapping changes within each tier block, and resolves by taking main's version (which has no LLM_MODEL_SIZE_MB).

What breaks: assign_gpus.py gets called with --model-size "", so float("") raises ValueError and multi-GPU assignment fails on every install. Single-GPU installs are fine (early return guard), but the entire multi-GPU feature would be DOA.
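
For reference, the failure is just Python's float() on an empty string. A minimal reproduction with a defensive guard (a sketch, not the actual assign_gpus.py code):

```python
raw = ""                           # what a dropped LLM_MODEL_SIZE_MB produces
try:
    model_size_mb = float(raw)     # float("") raises ValueError
except ValueError:
    raise SystemExit("assign_gpus.py: --model-size is empty; "
                     "is LLM_MODEL_SIZE_MB set in tier-map.sh?")
```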

What's needed

  1. Rebase onto current main (commit 5a932e9)
  2. Re-add LLM_MODEL_SIZE_MB to each tier. The new Qwen 3.5 model sizes (update as needed):
CLOUD:      LLM_MODEL_SIZE_MB=0
ARC:        LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
ARC_LITE:   LLM_MODEL_SIZE_MB=2870    # Qwen3.5-4B-Q4_K_M
NV_ULTRA:   LLM_MODEL_SIZE_MB=48500   # Qwen3-Coder-Next-Q4_K_M (unchanged)
SH_LARGE:   LLM_MODEL_SIZE_MB=48500   # Qwen3-Coder-Next-Q4_K_M (unchanged)
SH_COMPACT: LLM_MODEL_SIZE_MB=18600   # Qwen3-30B-A3B-Q4_K_M (unchanged)
Tier 0:     LLM_MODEL_SIZE_MB=1500    # Qwen3.5-2B-Q4_K_M
Tier 1:     LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
Tier 2:     LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
Tier 3:     LLM_MODEL_SIZE_MB=16400   # Qwen3.5-27B-Q4_K_M
Tier 4:     LLM_MODEL_SIZE_MB=18600   # Qwen3-30B-A3B-Q4_K_M (unchanged)

⚠️ Double-check these against the actual GGUF file sizes on HuggingFace — the Qwen 3.5 models are new and some sizes may differ from the Qwen 3 equivalents you had before.

  3. Push — this should also trigger CI, which hasn't run yet on this branch.

Everything else looks good

We did a full merge simulation and traced every touched installer file. The single-GPU path is completely safe — your guards in 02-detection.sh (GPU_COUNT -gt 1) and 03-features.sh (GPU_COUNT -le 1 → return) are clean. The compose layering, hardware detection additions, and .env generation all use safe defaults. No behavioral changes for existing single-GPU installs on any backend.

Two minor suggestions for a follow-up (non-blocking):

  • Add trap "rm -f $TOPOLOGY_FILE" EXIT after the mktemp in 03-features.sh to clean up on early exit
  • Add a # NOTE: keep in sync with assign_gpus.py comment in the custom TUI parallelism logic in 03-features.sh, since it duplicates the threshold logic from the Python script

Looking forward to the rebase — this is a great feature and we want to ship it. 🚀

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
