This repository was archived by the owner on Jan 27, 2026. It is now read-only.
detect all gpus during hw check #194
Closed
JannikSt wants to merge 97 commits into
…etting Improvement/shm size setting
* feature: ability to disable node ejection
* improve the resiliency of validator
* add improvement to metrics reporting
…enge very basic validator functionality
…ode-wallet PRI-1097: Create generate-node-wallet command to skip generating provider wallet if not needed
* Improve the synthetic data validation code to ensure we preserve file structure * added a temporary s3 access until we're fully decentralized as the validator needs access to the file sha mapping
* add timestamp to latest status change for node * add status change when node is ejected but provider becomes healthy again
* rewrite orchestrator discovery sync to make it testable * fix & test discovery state changes, add logs when state changes
* fix edge case in node status updater where node is active on chain and sends heartbeats, but dead / discovered state locally. This caused the orchestrator to send invites (which fail on worker side)
…fix google cloud testing bug (#175)
* add missing abi files, adjust work validation * directly fetch work validation contract from domain
* fix chainsync can break on old data * cleanup
* add proper 404 response when pool is not found
* ability to set log level for orchestrator + validator * log lvl fallback * set blocktime to 2s for docker-compose
* Optimize toploc server interaction * add redis cache to validator * add toploc grace interval, adjust Dockerfile, adjust docker-compose, adjust env var setup
* load validator addresses from contract * ignore blockchain test * Update shared/src/web3/contracts/implementations/prime_network_contract.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * adjust compose --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Overview
This PR updates the hardware check logic to detect all GPUs instead of only the first one, enabling better support for multi-GPU systems.
- In worker/src/checks/hardware/hardware_check.rs, the function now collects all GPUs and selects the primary GPU based on a max_by_key filter.
- In worker/src/checks/hardware/gpu.rs, the GPU detection logic has been refactored to return a vector of GPUs and merges devices with the same name.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| worker/src/checks/hardware/hardware_check.rs | Updated to collect multiple GPU specs and select a main GPU from the list. |
| worker/src/checks/hardware/gpu.rs | Refactored GPU detection to gather all devices, convert from GB to MB, and aggregate same-named GPUs. |
Comments suppressed due to low confidence (2)
worker/src/checks/hardware/hardware_check.rs:148
- [nitpick] Selecting the main GPU solely based on 'count' may not accurately represent the best GPU in terms of performance. Consider evaluating other attributes, such as memory_mb, to determine the primary GPU.
.max_by_key(|gpu| gpu.count.unwrap_or(0));
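One way to act on this nitpick is to keep `count` as the primary key and break ties with memory. The following is only a sketch: `GpuSpecs` here is a hypothetical stand-in for the worker's real type (only `count` and `memory_mb` are named in this review; the `model` field and the function name `select_primary_gpu` are assumptions), and whether count or memory should take priority is a design choice left to the PR author.

```rust
/// Hypothetical stand-in for the worker's GPU spec type.
/// `count` and `memory_mb` are named in the review; `model` is assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuSpecs {
    pub model: Option<String>,
    pub count: Option<u32>,
    pub memory_mb: Option<u32>,
}

/// Pick the GPU group with the highest count, breaking ties by memory.
/// Tuples compare lexicographically, so `count` is compared first and
/// `memory_mb` only matters when the counts are equal.
pub fn select_primary_gpu(gpus: &[GpuSpecs]) -> Option<&GpuSpecs> {
    gpus.iter()
        .max_by_key(|gpu| (gpu.count.unwrap_or(0), gpu.memory_mb.unwrap_or(0)))
}
```

Swapping the tuple order would prefer the largest-memory card over the most numerous one, if that matches the pool requirements better.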
worker/src/checks/hardware/gpu.rs:91
- [nitpick] Merging GPU devices based solely on name may inadvertently combine devices with different specs. Consider verifying that merging is appropriate or handling discrepancies in attributes like memory and driver_version.
existing_device.count += 1;
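A possible way to address the merging concern is to key the map on every attribute that must match, not on the name alone. This is a sketch under assumptions: the `GpuDevice` fields mirror the snippet under review, but the concrete field types, the input shape (a list of `(name, memory, driver_version)` readings) and the function name `group_devices` are hypothetical.

```rust
use std::collections::HashMap;

/// Mirrors the fields in the reviewed snippet; the field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuDevice {
    pub name: String,
    pub memory: u64,
    pub driver_version: String,
    pub count: u32,
}

/// Group raw per-device readings, merging only devices whose name, memory
/// and driver version all match. Input shape is hypothetical.
pub fn group_devices(readings: Vec<(String, u64, String)>) -> Vec<GpuDevice> {
    let mut device_map: HashMap<(String, u64, String), GpuDevice> = HashMap::new();
    for (name, memory, driver_version) in readings {
        let key = (name.clone(), memory, driver_version.clone());
        device_map
            .entry(key)
            .and_modify(|device| device.count += 1)
            .or_insert(GpuDevice { name, memory, driver_version, count: 1 });
    }
    // HashMap iteration order is unspecified, so sort for a stable result.
    let mut devices: Vec<GpuDevice> = device_map.into_values().collect();
    devices.sort_by(|a, b| {
        (&a.name, a.memory, &a.driver_version).cmp(&(&b.name, b.memory, &b.driver_version))
    });
    devices
}
```

With this key, two cards that share a name but report different memory end up as separate entries instead of inflating a single `count`.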
mattdf
suggested changes
Mar 30, 2025
Comment on lines +90 to +103
        if let Some(existing_device) = device_map.get_mut(&name) {
            existing_device.count += 1;
        } else {
            device_map.insert(
                name.clone(),
                GpuDevice {
                    name,
                    memory,
                    driver_version,
                    count: 1,
                },
            );
        }
    }
It would be useful to change this data structure to hold the CUDA indices of these GPUs. Then, when creating a Docker container, instead of passing --gpus all you would have the option to run docker run --gpus '"device=3,4,5"', which removes the need to handle arbitrary indices for mixed card types from inside the worker containers.
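A rough sketch of what this suggestion could look like: the `indices` field and the `docker_gpus_value` helper are hypothetical names, not part of the PR, and the field types are assumed. The helper builds the value for Docker's `--gpus` flag in the `"device=…"` form quoted in the comment, including the inner double quotes that the shell example wraps in single quotes.

```rust
/// Variant of the reviewed struct that records device indices instead of a
/// bare count. `indices` is a hypothetical field; the field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuDevice {
    pub name: String,
    pub memory: u64,
    pub driver_version: String,
    pub indices: Vec<u32>,
}

impl GpuDevice {
    /// The count is derived from the indices, so the two cannot drift apart.
    pub fn count(&self) -> usize {
        self.indices.len()
    }

    /// Value for `docker run --gpus`, e.g. `"device=3,4,5"` with literal
    /// double quotes. Returns `None` when no indices were recorded.
    pub fn docker_gpus_value(&self) -> Option<String> {
        if self.indices.is_empty() {
            return None;
        }
        let ids = self
            .indices
            .iter()
            .map(|index| index.to_string())
            .collect::<Vec<_>>()
            .join(",");
        Some(format!("\"device={}\"", ids))
    }
}
```

When the container is started without a shell (for example through an API client or `std::process::Command`), the returned string would be passed as-is; the single quotes in the comment's example are only shell quoting.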
A machine might have multiple GPUs installed. Previously, the hardware check used only the first GPU and simply increased its count.