This repository was archived by the owner on Jan 27, 2026. It is now read-only.

detect all gpus during hw check #194

Closed
JannikSt wants to merge 97 commits into develop from fix/gpu-detection-when-diff-gpu-types

JannikSt (Member) commented Mar 30, 2025

A machine might have multiple GPUs of different types installed. Previously, the hardware check only looked at the first GPU and simply increased its count for every additional device.

manveerxyz and others added 21 commits March 25, 2025 14:12
…ode-wallet

PRI-1097: Create generate-node-wallet command to skip generating provider wallet if not needed
* Improve the synthetic data validation code to ensure we preserve file structure 

* added a temporary s3 access until we're fully decentralized as the validator needs access to the file sha mapping
* add timestamp to latest status change for node

* add status change when node is ejected but provider becomes healthy again
* rewrite orchestrator discovery sync to make it testable

* fix & test discovery state changes, add logs when state changes
* fix edge case in node status updater where node is active on chain and sends heartbeats, but dead / discovered state locally. This caused the orchestrator to send invites (which fail on worker side)
* add missing abi files, adjust work validation

* directly fetch work validation contract from domain
* fix chainsync can break on old data

* cleanup
* add proper 404 response when pool is not found
* ability to set log level for orchestrator + validator

* log lvl fallback

* set blocktime to 2s for docker-compose
* Optimize toploc server interaction 

* add redis cache to validator 

* add toploc grace interval, adjust Dockerfile, adjust docker-compose, adjust env var setup
* load validator addresses from contract

* ignore blockchain test

* Update shared/src/web3/contracts/implementations/prime_network_contract.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* adjust compose

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@JannikSt JannikSt requested review from Copilot and mattdf March 30, 2025 01:03

Copilot AI left a comment


Pull Request Overview

This PR updates the hardware check logic to detect all GPUs instead of only the first one, enabling better support for multi-GPU systems.

  • In worker/src/checks/hardware/hardware_check.rs, the function now collects all GPUs and selects the primary GPU with a max_by_key call.
  • In worker/src/checks/hardware/gpu.rs, the GPU detection logic has been refactored to return a vector of GPUs and merges devices with the same name.
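The two changes summarized above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from this PR: the `GpuDevice` fields mirror the snippet quoted later in this thread, while the field types and the helper names `aggregate_by_name` and `primary_gpu` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the worker's GPU record. Field names follow the
/// snippet quoted in this PR; the types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuDevice {
    pub name: String,
    pub memory: u64, // assumed to be in MB
    pub driver_version: String,
    pub count: u32,
}

/// Collapse per-device records into one entry per GPU model name,
/// summing the counts of devices that share a name.
pub fn aggregate_by_name(devices: Vec<GpuDevice>) -> Vec<GpuDevice> {
    let mut device_map: HashMap<String, GpuDevice> = HashMap::new();
    for device in devices {
        if let Some(existing) = device_map.get_mut(&device.name) {
            existing.count += device.count;
        } else {
            device_map.insert(device.name.clone(), device);
        }
    }
    // HashMap iteration order is unspecified, so sort for a stable result.
    let mut gpus: Vec<GpuDevice> = device_map.into_values().collect();
    gpus.sort_by(|a, b| a.name.cmp(&b.name));
    gpus
}

/// Pick the entry with the highest count as the primary GPU.
/// Returns None when no GPU was detected.
pub fn primary_gpu(gpus: &[GpuDevice]) -> Option<&GpuDevice> {
    gpus.iter().max_by_key(|gpu| gpu.count)
}
```

With two A100 records and one T4 record as input, `aggregate_by_name` yields two entries and `primary_gpu` returns the A100 entry, since it has the larger count.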

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| worker/src/checks/hardware/hardware_check.rs | Updated to collect multiple GPU specs and select a main GPU from the list. |
| worker/src/checks/hardware/gpu.rs | Refactored GPU detection to gather all devices, convert from GB to MB, and aggregate same-named GPUs. |
Comments suppressed due to low confidence (2)

worker/src/checks/hardware/hardware_check.rs:148

  • [nitpick] Selecting the main GPU solely based on 'count' may not accurately represent the best GPU in terms of performance. Consider evaluating other attributes, such as memory_mb, to determine the primary GPU.
.max_by_key(|gpu| gpu.count.unwrap_or(0));
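One way to act on this nitpick is to keep `max_by_key` but compare on a tuple, so that memory breaks ties between groups with the same count. The sketch below assumes a `GpuSpecs` type with optional `count` and `memory_mb` fields, as the quoted line and the comment suggest; the exact struct layout and the name `select_primary` are hypothetical.

```rust
/// Hypothetical shape of the specs record; only `count` and `memory_mb`
/// are implied by the review comment, the rest is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuSpecs {
    pub model: Option<String>,
    pub count: Option<u32>,
    pub memory_mb: Option<u32>,
}

/// Select the primary GPU by count first and by memory second.
/// Tuples compare lexicographically, so memory only matters on equal counts.
pub fn select_primary(gpus: &[GpuSpecs]) -> Option<&GpuSpecs> {
    gpus.iter()
        .max_by_key(|gpu| (gpu.count.unwrap_or(0), gpu.memory_mb.unwrap_or(0)))
}
```

Whether count should stay the leading criterion is a design decision. Ranking by total memory (count times memory per device) would be another option, at the cost of letting a single large card outrank several smaller ones.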

worker/src/checks/hardware/gpu.rs:91

  • [nitpick] Merging GPU devices based solely on name may inadvertently combine devices with different specs. Consider verifying that merging is appropriate or handling discrepancies in attributes like memory and driver_version.
existing_device.count += 1;
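A possible way to avoid merging devices that share a name but differ in specs is to group on a composite key. The sketch below keys on name and memory; the `GpuDevice` field types and the function name `aggregate_by_name_and_memory` are assumptions for illustration, and `driver_version` is left out of the key on the assumption that all devices on one host report the same driver.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the worker's GPU record; types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuDevice {
    pub name: String,
    pub memory: u64, // assumed to be in MB
    pub driver_version: String,
    pub count: u32,
}

/// Group devices by (name, memory) so that same-named cards with different
/// memory sizes end up in separate entries. Each input record counts as one
/// physical device.
pub fn aggregate_by_name_and_memory(devices: Vec<GpuDevice>) -> Vec<GpuDevice> {
    let mut groups: BTreeMap<(String, u64), GpuDevice> = BTreeMap::new();
    for device in devices {
        let key = (device.name.clone(), device.memory);
        groups
            .entry(key)
            .and_modify(|group| group.count += 1)
            .or_insert(GpuDevice { count: 1, ..device });
    }
    // BTreeMap yields entries ordered by key, i.e. by name and then memory.
    groups.into_values().collect()
}
```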

@JannikSt JannikSt requested a review from manveerxyz March 30, 2025 01:04
Comment thread worker/src/checks/hardware/gpu.rs Outdated
Comment on lines +90 to +103
    if let Some(existing_device) = device_map.get_mut(&name) {
        existing_device.count += 1;
    } else {
        device_map.insert(
            name.clone(),
            GpuDevice {
                name,
                memory,
                driver_version,
                count: 1,
            },
        );
    }
}

It would be useful to change this data structure to also contain the CUDA indices of these GPUs. Then, when creating a Docker container, instead of passing --gpus all you would have the option to run docker run --gpus '"device=3,4,5"', which removes the need to handle arbitrary indices for mixed card types from inside the worker containers.
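A rough sketch of what this suggestion could look like: each group keeps the indices of its devices instead of a bare count, and a helper turns those indices into the value for `--gpus`. The types `DetectedGpu` and `GpuGroup` and both function names are hypothetical. The embedded double quotes in the returned string mirror the reviewer's example after shell unquoting, which is an assumption about how the value would be handed to the Docker CLI.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-device record as it might come out of detection.
#[derive(Debug, Clone, PartialEq)]
pub struct DetectedGpu {
    pub index: u32,
    pub name: String,
    pub memory: u64, // assumed to be in MB
    pub driver_version: String,
}

/// Hypothetical grouped record that keeps device indices instead of a count.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuGroup {
    pub name: String,
    pub memory: u64,
    pub driver_version: String,
    pub indices: Vec<u32>,
}

impl GpuGroup {
    /// The count is derived from the stored indices.
    pub fn count(&self) -> usize {
        self.indices.len()
    }
}

/// Group detected devices by name, remembering the index of every device.
pub fn group_by_name(detected: Vec<DetectedGpu>) -> Vec<GpuGroup> {
    let mut groups: BTreeMap<String, GpuGroup> = BTreeMap::new();
    for gpu in detected {
        let DetectedGpu { index, name, memory, driver_version } = gpu;
        groups
            .entry(name.clone())
            .or_insert_with(|| GpuGroup { name, memory, driver_version, indices: Vec::new() })
            .indices
            .push(index);
    }
    groups.into_values().collect()
}

/// Build the value for `docker run --gpus`, e.g. "device=3,4,5" wrapped in
/// literal double quotes. Returns None for an empty index list.
pub fn docker_gpus_value(indices: &[u32]) -> Option<String> {
    if indices.is_empty() {
        return None;
    }
    let ids: Vec<String> = indices.iter().map(|i| i.to_string()).collect();
    Some(format!("\"device={}\"", ids.join(",")))
}
```

If the worker talks to the Docker daemon through an API client rather than the CLI, the same index list would presumably be passed as device IDs in the container's device request instead of as a formatted string.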

@JannikSt JannikSt closed this Apr 18, 2025

7 participants