feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism #501
y-coffee-dev wants to merge 5 commits into Light-Heart-Labs:main
Conversation
Lightheartdevs left a comment
Review: Needs Work
Strong algorithm and good test coverage (561 lines of pytest), but a few issues need resolving before merge:
1. `jq` promoted from optional to required (breaking)

`01-preflight.sh` now hard-requires `jq`. This will fail installs on minimal systems (e.g., fresh Debian/Alpine containers) that previously worked fine. Either:

- Auto-install `jq` (like Docker is auto-installed in phase 05), or
- Keep it optional with graceful degradation when absent
2. No CI checks have run
This branch has zero CI results. Please push a commit or re-trigger CI so we can see if it passes the test matrix.
3. Docker Compose GPU reservation conflict
`docker-compose.multigpu.yml` sets both the `NVIDIA_VISIBLE_DEVICES` env var AND `deploy.resources.reservations.devices` without `device_ids`. The reservation block will reserve ALL GPUs while the env var tries to limit visibility. These two mechanisms conflict — pick one or wire `device_ids` dynamically.
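A possible shape for the fix, as a sketch: derive `device_ids` for the reservation block from a single GPU list, so there is one source of truth instead of two conflicting mechanisms. The dict mirrors Compose's `deploy.resources.reservations.devices` schema; the helper function is hypothetical, not code from this PR.

```python
# Sketch: build the Compose GPU reservation from an explicit device list,
# rather than combining NVIDIA_VISIBLE_DEVICES with an unbounded reservation.
# make_gpu_reservation is a hypothetical helper, not code from this PR.
def make_gpu_reservation(device_ids):
    return {
        "deploy": {
            "resources": {
                "reservations": {
                    "devices": [
                        {
                            "driver": "nvidia",
                            "device_ids": list(device_ids),  # GPU indices or UUIDs
                            "capabilities": ["gpu"],
                        }
                    ]
                }
            }
        }
    }

service = make_gpu_reservation(["0", "1"])
```

Because `device_ids` is set explicitly, the reservation no longer claims every GPU, and the env var becomes redundant rather than contradictory.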
4. Minor: duplicate comment line
`constants.sh` has `INSTALL_START_EPOCH` listed twice in the "Provides" header comment.
What's good
- The topology detection with `nvidia-smi topo -m` fallback is well-handled
- The `assign_gpus.py` algorithm is correct and the O(2^N) subset enumeration is fine for realistic GPU counts
- Single-GPU path is preserved (gated on `GPU_COUNT > 1`)
- Graceful degradation when `nvidia-smi` is absent
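For intuition, the subset search referred to above can be sketched like this; the link ranks, scoring rule, and function name are illustrative assumptions rather than the PR's actual code:

```python
from itertools import combinations

# Illustrative sketch of the O(2^N) subset search: pick the k-GPU subset
# whose worst internal link is fastest, breaking ties by total VRAM.
# Link ranks, scoring, and the function name are assumptions, not PR code.
def best_gpu_subset(link_rank, vram_mb, k, model_size_mb):
    best, best_key = None, None
    for subset in combinations(range(len(vram_mb)), k):
        total_vram = sum(vram_mb[g] for g in subset)
        if total_vram < model_size_mb:
            continue  # subset cannot hold the model
        worst_link = max(
            (link_rank[a][b] for a, b in combinations(subset, 2)), default=0
        )
        key = (worst_link, -total_vram)  # lower rank = faster link
        if best_key is None or key < best_key:
            best, best_key = subset, key
    return best

# GPUs 0-1 share an NVLink (rank 0); every other pair is plain PCIe (rank 3).
link_rank = [
    [0, 0, 3, 3],
    [0, 0, 3, 3],
    [3, 3, 0, 3],
    [3, 3, 3, 0],
]
best = best_gpu_subset(link_rank, [24000] * 4, 2, model_size_mb=30000)
```

With at most 8 GPUs, the 2^N subsets number at most 256, which is why the exhaustive search is fine for realistic GPU counts.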
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Enhanced multi-GPU tier assignment based on topology
- Implemented robust GPU topology detection for NVIDIA
- Implemented GPU link ranking from the fastest to the slowest, for optimal strategy selection in future phases
- Implemented gathering of detailed per-GPU information
- Data structures for GPU information storage
- Robust and comprehensive test suite for NVIDIA topology detection
- Multi-GPU strategy selection algorithm
- Careful handling of edge cases and subtle bugs in strategy selection
- Robust test suite for the multi-GPU strategy selection algorithm
- GPU assignment and parallelization strategy selection algorithm, clustering GPUs by topology links to find the optimal setup
- Multi-GPU configuration TUI; Docker Compose overlays for multi-GPU setups
- Adjust env schema validation
- Fixed inconsistencies in GPU count, JSON escaping issues, etc.
- Fix issue with writing the multi-GPU overlay
- Fix resolve-compose-stack.sh multi-GPU overlay
- Fix GPU device id
- Refactors + less convoluted Docker Compose setup
- N_GPU_LAYERS validation
- Fix multi-GPU overlay
a49941c to 7ddc533
@Lightheartdevs Thanks for the thorough review! I adjusted the PR.

1. jq auto-install - Good catch. I've added auto-install logic for `jq`.
4. `INSTALL_START_EPOCH` duplication - Fixed!

I appreciate the detailed feedback!
Review Update — Rebase Required Before Merge

Hey @y-coffee-dev, great work addressing the previous review items. The code itself is solid and we want to get this merged. However, we found a critical issue that needs attention first.

🚨 Silent merge bug:
feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism
Summary
Adds end-to-end multi-GPU support for NVIDIA systems. The installer now automatically detects the multi-GPU topology, assigns GPUs to services based on interconnect quality and VRAM capacity, and configures services for multi-GPU use, all without manual intervention. A custom assignment TUI is also available for advanced users.
Architecture
Topology Detection (`nvidia-topo.sh`)

Parses the `nvidia-smi topo -m` matrix to extract GPU-to-GPU link types and assigns numerical ranks.

GPU Assignment Algorithm (`assign_gpus.py`)

Four-phase pipeline:
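As an illustration of the topology-detection stage described above, a minimal parsing sketch; the link labels (X, NV#, PIX, PXB, PHB, NODE, SYS) are nvidia-smi's own, while the numeric rank values and the function name are assumptions:

```python
# Sketch of the topology-detection stage: parse `nvidia-smi topo -m` output
# into per-pair link ranks. The labels (X, NV#, PIX, PXB, PHB, NODE, SYS)
# are nvidia-smi's; the rank values and function name are assumptions.
LINK_RANK = {"X": 0, "NV": 1, "PIX": 2, "PXB": 3, "PHB": 4, "NODE": 5, "SYS": 6}

def parse_topo_matrix(text):
    links = {}
    rows = [line.split() for line in text.strip().splitlines() if line.split()]
    header = rows[0]  # column labels, e.g. ['GPU0', 'GPU1', ...]
    for row in rows[1:]:
        if not row[0].startswith("GPU"):
            continue  # skip NIC / CPU-affinity rows
        for dst, label in zip(header, row[1:]):
            if dst.startswith("GPU"):
                key = "NV" if label.startswith("NV") else label
                links[(row[0], dst)] = LINK_RANK.get(key, LINK_RANK["SYS"])
    return links

sample = """\
      GPU0  GPU1
GPU0   X    NV2
GPU1  NV2    X
"""
links = parse_topo_matrix(sample)
```

Skipping non-`GPU` rows is what handles the NIC-rows edge case that the fixture tests mention.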
Compose Layering
When `GPU_COUNT > 1`, the stack adds:

- `docker-compose.multigpu.yml` — llama-server GPU pinning + split mode
- `extensions/services/*/compose.multigpu.yaml` — per-service GPU pinning

Interactive TUI
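The layering described above might be assembled roughly like this (a sketch only; the helper name is hypothetical, and the real `compose-select.sh` / `resolve-compose-stack.sh` also verify that each extension overlay exists before merging it):

```python
from pathlib import Path

# Sketch of the compose-file layering: base file always, multi-GPU overlays
# only when GPU_COUNT > 1. select_compose_files is a hypothetical helper;
# the real scripts also check that each extension overlay exists on disk.
def select_compose_files(gpu_count, extension_dirs):
    files = ["docker-compose.yml"]
    if gpu_count > 1:
        files.append("docker-compose.multigpu.yml")
        for ext in extension_dirs:
            files.append(str(Path(ext) / "compose.multigpu.yaml"))
    return files

single = select_compose_files(1, ["extensions/services/whisper"])
multi = select_compose_files(2, ["extensions/services/whisper"])
```

Ordering matters: overlays listed later override matching keys from earlier files when passed as repeated `-f` arguments to `docker compose`.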
Multi-GPU systems get a configuration prompt:
- Runs `assign_gpus.py` with detected topology

Non-interactive installs default to automatic assignment.
Test coverage
Automated tests
- `tests/test-nvidia-topo.sh` — Tests topology matrix parsing against 7 fixture files covering 1-GPU through 8-GPU configurations, NVLink/PCIe/NUMA topologies, and edge cases like NIC rows in the matrix
- `tests/test-assign-gpus.py` — Comprehensive pytest suite

Manual hardware testing
Thoroughly tested on several multi-GPU machines with various configurations including (non-exhaustive):
All tests confirmed correct topology detection, appropriate strategy selection, and proper compose overlay generation.
What changed
New files
- `installers/lib/nvidia-topo.sh` — parses the `nvidia-smi topo -m` matrix into structured JSON with link types, ranks, and labels
- `scripts/assign_gpus.py`
- `docker-compose.multigpu.yml` — `NVIDIA_VISIBLE_DEVICES`, `LLAMA_ARG_SPLIT_MODE`, and `LLAMA_ARG_TENSOR_SPLIT`
- `extensions/services/comfyui/compose.multigpu.yaml`
- `extensions/services/whisper/compose.multigpu.yaml`
- `extensions/services/embeddings/compose.multigpu.yaml`
- `tests/test-nvidia-topo.sh`
- `tests/test-assign-gpus.py`
- `tests/fixtures/topology_json/*.json` (8 files)
- `tests/fixtures/topology_matrix/*.txt` (7 files) — `nvidia-smi topo -m` output fixtures for shell-level testing

Modified files
- `installers/phases/01-preflight.sh` — adds `jq` and `python3` to preflight dependency checks (required by topology detection and assignment)
- `installers/phases/02-detection.sh` — `detect_nvidia_topo()` populates `GPU_TOPOLOGY_JSON`, `GPU_HAS_NVLINK`, `GPU_TOTAL_VRAM`, `LLM_MODEL_SIZE_MB`
- `installers/phases/03-features.sh`
- `installers/phases/04-requirements.sh`
- `installers/phases/06-directories.sh` — writes `GPU_ASSIGNMENT_JSON` and per-service GPU UUIDs to `.env`
- `installers/lib/constants.sh`
- `installers/lib/tier-map.sh`
- `installers/lib/compose-select.sh` — adds `docker-compose.multigpu.yml` when `GPU_COUNT > 1`
- `scripts/resolve-compose-stack.sh` — `--gpu-count` flag; discovers and merges `compose.multigpu.yaml` from extensions
- `scripts/detect-hardware.sh` — calls `nvidia-topo.sh` for topology detection
- `scripts/build-capability-profile.sh` — reads `gpu.count` from the capability profile instead of hardcoding `1`
- `.env.schema.json` — `GPU_ASSIGNMENT_JSON_B64`, `LLAMA_SERVER_GPU_UUIDS`, `LLAMA_ARG_SPLIT_MODE`, `LLAMA_ARG_TENSOR_SPLIT`, `EMBEDDINGS_GPU_UUID`, `COMFYUI_GPU_UUID`, `WHISPER_GPU_UUID`, `N_GPU_LAYERS`
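For context on the llama-server variables listed above, one plausible derivation, assuming the tensor split is proportional to per-GPU VRAM (as with llama.cpp's `--tensor-split`); the helper name and exact value formats are assumptions, not taken from this PR:

```python
# Sketch: derive the llama-server multi-GPU env values from an assignment,
# assuming the tensor split is proportional to per-GPU VRAM (as with
# llama.cpp's --tensor-split). render_llama_env is a hypothetical helper;
# the exact value formats used by the PR may differ.
def render_llama_env(gpu_uuids, vram_mb):
    return {
        "LLAMA_SERVER_GPU_UUIDS": ",".join(gpu_uuids),
        "LLAMA_ARG_SPLIT_MODE": "layer" if len(gpu_uuids) > 1 else "none",
        "LLAMA_ARG_TENSOR_SPLIT": ",".join(str(v) for v in vram_mb),
    }

env = render_llama_env(["GPU-aaaa", "GPU-bbbb"], [24576, 11264])
```

A VRAM-proportional split keeps a 24 GB + 12 GB pair from overcommitting the smaller card, since llama.cpp normalizes the proportions rather than treating them as absolute sizes.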