detect all gpus during hw check by JannikSt · Pull Request #194 · PrimeIntellect-ai/protocol

JannikSt · 2025-03-30T00:47:47Z

A machine might have several GPUs installed, possibly of different types. Previously, the hardware check used only the first GPU's details and simply increased the count for each additional device.


* load validator addresses from contract
* ignore blockchain test
* Update shared/src/web3/contracts/implementations/prime_network_contract.rs
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* adjust compose

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull Request Overview

This PR updates the hardware check logic to detect all GPUs instead of only the first one, enabling better support for multi-GPU systems.

In worker/src/checks/hardware/hardware_check.rs, the function now collects all GPUs and selects the primary GPU based on a max_by_key filter.
In worker/src/checks/hardware/gpu.rs, the GPU detection logic has been refactored to return a vector of GPUs and merges devices with the same name.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
worker/src/checks/hardware/hardware_check.rs	Updated to collect multiple GPU specs and select a main GPU from the list.
worker/src/checks/hardware/gpu.rs	Refactored GPU detection to gather all devices, convert from GB to MB, and aggregate same-named GPUs.

Comments suppressed due to low confidence (2)

worker/src/checks/hardware/hardware_check.rs:148

[nitpick] Selecting the main GPU solely based on 'count' may not accurately represent the best GPU in terms of performance. Consider evaluating other attributes, such as memory_mb, to determine the primary GPU.

.max_by_key(|gpu| gpu.count.unwrap_or(0));
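One way to address this nitpick is to keep the count as the primary criterion and use memory as a tie-breaker. The sketch below is illustrative only: `GpuSpec` and `select_primary_gpu` are hypothetical names, and the field types are assumptions based on the `count` and `memory_mb` attributes mentioned in the review, not the PR's actual definitions.

```rust
/// Hypothetical stand-in for the GPU spec used by the hardware check.
/// Field names follow the review comment; the types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuSpec {
    pub model: String,
    pub count: Option<u32>,
    pub memory_mb: Option<u32>,
}

/// Pick the GPU group with the highest count, breaking ties by memory.
/// Tuples compare lexicographically, so `count` dominates and
/// `memory_mb` only matters when two groups have the same count.
pub fn select_primary_gpu(gpus: &[GpuSpec]) -> Option<&GpuSpec> {
    gpus.iter()
        .max_by_key(|gpu| (gpu.count.unwrap_or(0), gpu.memory_mb.unwrap_or(0)))
}
```

With this key, two groups of equal size no longer depend on iteration order alone; the one reporting more memory is chosen. Whether count or memory should dominate is a policy decision for the project.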

worker/src/checks/hardware/gpu.rs:91

[nitpick] Merging GPU devices based solely on name may inadvertently combine devices with different specs. Consider verifying that merging is appropriate or handling discrepancies in attributes like memory and driver_version.

existing_device.count += 1;
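A possible way to avoid combining devices that share a name but differ in specs is to key the map on all compared attributes rather than on the name alone. This is a minimal sketch under assumptions: `GpuGroup` and `merge_devices` are hypothetical names, the input is modelled as plain `(name, memory, driver_version)` tuples, and the memory unit is left to the caller.

```rust
use std::collections::HashMap;

/// Hypothetical aggregated device entry; mirrors the fields visible in the
/// diff (`name`, `memory`, `driver_version`, `count`) with assumed types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GpuGroup {
    pub name: String,
    pub memory: u64,
    pub driver_version: String,
    pub count: u32,
}

/// Group raw readings so that only devices with identical name, memory,
/// and driver version are merged into one entry.
pub fn merge_devices(readings: &[(String, u64, String)]) -> Vec<GpuGroup> {
    let mut device_map: HashMap<(String, u64, String), GpuGroup> = HashMap::new();
    for (name, memory, driver_version) in readings {
        let key = (name.clone(), *memory, driver_version.clone());
        device_map
            .entry(key)
            .and_modify(|group| group.count += 1)
            .or_insert_with(|| GpuGroup {
                name: name.clone(),
                memory: *memory,
                driver_version: driver_version.clone(),
                count: 1,
            });
    }
    // HashMap iteration order is unspecified, so sort for a stable result.
    let mut groups: Vec<GpuGroup> = device_map.into_values().collect();
    groups.sort_by(|a, b| {
        a.name
            .cmp(&b.name)
            .then(a.memory.cmp(&b.memory))
            .then_with(|| a.driver_version.cmp(&b.driver_version))
    });
    groups
}
```

Two cards with the same name but different memory sizes then end up in separate groups instead of inflating a single count.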

mattdf · 2025-03-30T08:55:09Z

+                if let Some(existing_device) = device_map.get_mut(&name) {
+                    existing_device.count += 1;
+                } else {
+                    device_map.insert(
+                        name.clone(),
+                        GpuDevice {
+                            name,
+                            memory,
+                            driver_version,
+                            count: 1,
+                        },
+                    );
+                }
+            }


It would be useful to change this data structure to also contain the CUDA indices of these GPUs. Then, when creating a Docker container, instead of passing --gpus all you would have the option to run docker run --gpus '"device=3,4,5"', which removes the need to handle arbitrary indices for mixed card types from inside the worker containers.
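The suggestion could be sketched roughly as follows. `IndexedGpuGroup` and `docker_gpus_value` are hypothetical names, not part of the PR, and the sketch assumes the stored indices are the same device indices Docker expects in its `device=` list.

```rust
/// Hypothetical variant of the aggregated device entry that records the
/// index of every GPU in the group instead of only a count.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IndexedGpuGroup {
    pub name: String,
    pub indices: Vec<u32>,
}

impl IndexedGpuGroup {
    /// The count is derived from the recorded indices.
    pub fn count(&self) -> usize {
        self.indices.len()
    }

    /// Build the value for `docker run --gpus`, including the inner double
    /// quotes, e.g. `"device=3,4,5"`. Returns `None` for an empty group so
    /// the caller can decide on a fallback such as `all`.
    pub fn docker_gpus_value(&self) -> Option<String> {
        if self.indices.is_empty() {
            return None;
        }
        let ids: Vec<String> = self.indices.iter().map(|i| i.to_string()).collect();
        Some(format!("\"device={}\"", ids.join(",")))
    }
}
```

When the value is passed as a process argument without a shell, the outer single quotes from the command-line form are not needed; only the inner double quotes are part of the value.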

JannikSt and others added 30 commits February 6, 2025 01:51

ability to automatically set shm size based on sys memory

c42793f

clippy

f1f7538

Merge pull request #107 from PrimeIntellect-ai/improvement/shm-size-s…

6d4aa01

…etting Improvement/shm size setting

bump version

ca57e74

Merge pull request #108 from PrimeIntellect-ai/improvement/shm-size-s…

faf4375

…etting

Feature/disable ejection (#111)

ad90427

* feature: ability to disable node ejection

support ability to restore metrics via orchestrator (#112)

bf52611

bump version

8e8808a

merge

535e0c3

improve resiliency of validator (#114)

b698e32

* improve the resiliency of validator

bump version

7ddf448

resolve conflicts

bbabc26

add improvement to metrics reporting (#119)

b7adeeb

* add improvement to metrics reporting

resolve conflicts

519307a

fmt

0d3d5cd

very basic validator functionality

8cb71b4

fmt

bf88c80

add signature ...

c58574b

use sign_request

92629cd

use custom serializer for consistency

75e055d

add validator arg to miner, implement rounding robust partial eq

61a482b

fmt, clippy fix

36349d6

fix remote makefile entry

7a47331

misc makefile fixes

9a85c57

fix clippy

58f97ba

fix fmt...

9e801d3

make app_state unused

7255a74

Merge pull request #96 from PrimeIntellect-ai/feature/validator-chall…

e172fbd

…enge very basic validator functionality

minor readme adjustment

7350607

remove redundant files

c3480c9

manveerxyz and others added 21 commits March 25, 2025 14:12

Run cargo fmt

4312d39

Merge pull request #169 from PrimeIntellect-ai/improvement/generate-n…

5a61192

…ode-wallet PRI-1097: Create generate-node-wallet command to skip generating provider wallet if not needed

Preserve folder structure for toploc-validator (#168)

0ec478a

* Improve the synthetic data validation code to ensure we preserve file structure
* added a temporary s3 access until we're fully decentralized as the validator needs access to the file sha mapping

add timestamp to latest status change for node (#170)

0185496

* add timestamp to latest status change for node
* add status change when node is ejected but provider becomes healthy again

Discovery state change tests (#171)

840c7e8

* rewrite orchestrator discovery sync to make it testable
* fix & test discovery state changes, add logs when state changes

adjust beta versioning (#172)

e07e886

Improvement/dev release versioning - fix indentation (#173)

1b7ba47

Fix: node recovery edge case (#174)

f66444e

* fix edge case in node status updater where node is active on chain and sends heartbeats, but dead / discovered state locally. This caused the orchestrator to send invites (which fail on worker side)

add creation and update timestamps, add sorting to discovery output, …

f21bcb8

…fix google cloud testing bug (#175)

Fix: work validation improvements & cleanup (#176)

6ab8c18

* add missing abi files, adjust work validation
* directly fetch work validation contract from domain

fix install script with new dev tagging (#177)

3974b92

whitelist provider info in doc

4f295cd

add info reg. whitelist provider

5d53823

Fix: chainsync can break on old data (#178)

8934a12

* fix chainsync can break on old data
* cleanup

Fix: discovery wrong status on missing pool (#179)

f2c2df7

* add proper 404 response when pool is not found

Improvement/log level (#180)

ce07605

* ability to set log level for orchestrator + validator
* log lvl fallback
* set blocktime to 2s for docker-compose

Synthetic Data Validator: Sequential and without cache (#187)

fd45f37

* Optimize toploc server interaction
* add redis cache to validator
* add toploc grace interval, adjust Dockerfile, adjust docker-compose, adjust env var setup

add discovery upload retries (#182)

17368a1

detect all gpus during hw check

3fb118b

use gpu with highest count

9e2c1fb

JannikSt requested review from Copilot and mattdf March 30, 2025 01:03

Copilot AI reviewed Mar 30, 2025


JannikSt requested a review from manveerxyz March 30, 2025 01:04

mattdf suggested changes Mar 30, 2025


JannikSt force-pushed the develop branch from 758d58e to bcf50a4 Compare April 12, 2025 16:50

JannikSt added 2 commits April 17, 2025 14:38

resolve conflicts

2040e25

cleanup

e937021

JannikSt closed this Apr 18, 2025

JannikSt wants to merge 97 commits into develop from fix/gpu-detection-when-diff-gpu-types


7 participants
