
Add support for autodetection of gres resources #181


Open

jovial wants to merge 21 commits into feat/nodegroups

Conversation

@jovial (Contributor, PR author) commented Apr 23, 2025

Adds support for setting the `AutoDetect` property on gres resources. This removes the need to manually specify `File` in the gres dictionary. You can only use one auto-detection mechanism per node, otherwise Slurm will complain (hence why it is a per-partition option and not a per-gres option).

Example:

# group_vars/all/openhpc.yml

openhpc_nodegroups:
    - name: cpu
    - name: gpu
      gres_autodetect: nvml
      gres:
        - conf: "gpu:nvidia_h100_80gb_hbm3:2"
        - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
        - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"

jovial requested a review from a team as a code owner April 23, 2025 17:07
jovial marked this pull request as draft April 23, 2025 20:20
jovial marked this pull request as ready for review April 24, 2025 09:04

@sjpb (Collaborator) left a comment

Have some concerns

@sjpb (Collaborator):

From the PR comment:

You can only use one auto-detection mechanism per node, otherwise Slurm will complain (hence why it is a per-partition option and not a per-gres option).

Can you explain why, with it "per-gres", you end up with multiple methods per node?

Nodes can be - and often are - in multiple partitions. So specifying it per-partition is not sufficient to guarantee this anyway, I think, unless there's some other subtlety in the logic.

I think what we need to support is something like this, and only like this:

openhpc_slurm_partitions:
    - name: gpu
      groups:
        - name: a100
        - name: h100
    - name: a100
      gres:
        - conf: "gpu:nvidia_a100_80gb_hbm3:2"
    - name: h100
      gres_autodetect: nvml
      gres:
        - conf: "gpu:nvidia_h100_80gb_hbm3:2"

i.e. no complicated fallbacks or overridden defaults etc. Maybe this just needs documenting, and we just let an error occur if someone does something wrong. #174 rolled up the slurm.conf NodeName= templating to allow defining nodes in multiple partitions; do we need something similar here? Or maybe not.
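
To make the intent concrete, this would correspond to a gres.conf roughly like the following, with each mechanism only ever applying to whole nodes (node names and the File path are illustrative, following the formats quoted elsewhere in this thread):

AutoDetect=off
NodeName=a100-[01-02] Name=gpu Type=nvidia_a100_80gb_hbm3 File=/dev/nvidia[0-1]
NodeName=h100-[01-02] Name=gpu Type=nvidia_h100_80gb_hbm3 AutoDetect=nvml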

@jovial (Contributor, PR author):

I see what you are saying: essentially you can't mix methods for a particular node. I'll dig out the error message. It seems like a host var/group var would be more natural:

gres_autodetect: nvml

i.e. outside of the openhpc_slurm_partitions definition. But would that be complicated by the host list expression?

@jovial (Contributor, PR author):

So something like:

[rocky@io-io-gpu-02 ~]$ sudo cat /var/spool/slurm/conf-cache/gres.conf
AutoDetect=off
NodeName=io-io-gpu-[01-02] Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb AutoDetect=nvml
NodeName=io-io-gpu-[01-02] Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb AutoDetect=nvml
NodeName=io-io-gpu-[01-02] Name=gpu Type=nvidia_h100_80gb_hbm3 File=/dev/nvidia0

produces:

Apr 24 10:49:49 io-io-gpu-02.io.internal slurmd[14141]: slurmd-io-io-gpu-02: fatal: gres.conf for gpu, some records have "File" specification while others do not
Apr 24 10:49:49 io-io-gpu-02.io.internal slurmd-io-io-gpu-02[14141]: fatal: gres.conf for gpu, some records have "File" specification while others do not
Apr 24 10:49:49 io-io-gpu-02.io.internal systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Apr 24 10:49:49 io-io-gpu-02.io.internal systemd[1]: slurmd.service: Failed with result 'exit-code'.
Apr 24 10:49:49 io-io-gpu-02.io.internal systemd[1]: Failed to start Slurm node daemon.

@jovial (Contributor, PR author):

Just to confirm, it does start without issue with something like this:

[rocky@io-io-gpu-02 ~]$ sudo cat /var/spool/slurm/conf-cache/gres.conf
AutoDetect=off
NodeName=io-io-gpu-[01-02] Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb AutoDetect=nvml
NodeName=io-io-gpu-[01-02] Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb AutoDetect=nvml
NodeName=io-io-gpu-[01-02] Name=gpu Type=nvidia_h100_80gb_hbm3 AutoDetect=nvml

@jovial (Contributor, PR author):

Another clarification is that:

[rocky@io-io-gpu-02 ~]$ sudo cat /var/spool/slurm/conf-cache/gres.conf
AutoDetect=off
NodeName=io-io-gpu-[01-02] Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb AutoDetect=nvml

will essentially just do autodetection for everything (not just the 1g.10gb instances).

@sjpb (Collaborator):

Hmm, I think I'm missing some context here! If you have autodetection, is there ever a case where you'd want to specify it manually? (i.e. can't we just say "don't do that"?) I can imagine there are nvidia nodes where you have autodetection and other nodes where you don't, so you have to specify both, but never for the same nodes.

@jovial (Contributor, PR author):

Sorry, those comments were more for my reference. I was just clarifying how it behaved when specified multiple times for the same node. I think you are right when you say that we should make sure each host only appears once if autodetection is enabled.

@jovial (Contributor, PR author):

I've made it work as a host/group var. This means you can't set conflicting values on different partitions. Let me know what you think.
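
As a minimal sketch, assuming the variable keeps the `gres_autodetect` name proposed above, the usage would look something like this in the inventory (the group_vars path and group name are hypothetical):

# inventory/group_vars/gpu/gres.yml - hypothetical location; any host/group vars file covering the GPU nodes works
gres_autodetect: nvml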

README.md Outdated
  - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
- - `file`: A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
+ - `file`: Omit if `gres_autodetect` is set, A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.

@sjpb (Collaborator):

Suggested change
- `file`: Omit if `gres_autodetect` is set, A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
- `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.

or move the addition to the end of the item 🤷 ?

@jovial (Contributor, PR author):

Done. I've left it at the beginning, as I felt it was the most important bit of information.

jovial requested a review from sjpb April 28, 2025 08:39
jovial marked this pull request as draft May 8, 2025 14:18
jovial changed the base branch from master to feat/nodegroups May 8, 2025 14:27
jovial marked this pull request as ready for review May 8, 2025 15:59

@jovial (Contributor, PR author) commented May 8, 2025

Ready for review, but merge #183 first (this PR targets that branch to avoid noise in the diff).
