-
Notifications
You must be signed in to change notification settings - Fork 19
Add support for autodetection of gres resources #181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
jovial
wants to merge
44
commits into
master
Choose a base branch
from
feature/gres-autodetect
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
44 commits
Select commit
Hold shift + click to select a range
4137b2b
Add support for autodetection of gres resources
jovial 464d952
Re-add whitespace
jovial 610a8ed
Fix jinja
jovial b8e1400
Move to separate field as it not possible to autodetect for a subset …
jovial 65957bb
...
jovial 681d9a0
Apply suggestions from code review
jovial 290744b
Update README.md
jovial 4b45244
Support hosts in multiple partitions when templating gres.conf
jovial 0682178
Merge remote-tracking branch 'origin/feature/gres-autodetect' into HEAD
jovial ec07266
Fix templating
jovial 863bea5
..
jovial f793689
...
jovial 32fbe3c
Suggestion from code review
jovial 1c281c4
Move donehosts out of loop
jovial ba8a38a
nodegroups using nodesets - doesn't handle empty nodegroups
sjpb 8f9436f
cope with empty nodegroups/partitions
sjpb 0abbf76
make gres work again
sjpb b8c64dc
make node/partition parameters more greppable
sjpb 6dabb2f
use features to simplify nodeset configuration
sjpb 4238c70
Merge remote-tracking branch 'origin/feat/nodegroups' into HEAD
jovial ea7902a
add nodegroup.features
sjpb d16b6ba
add validation
sjpb 4f3bbc8
document nodegroup.features to README
sjpb f126bba
add better examples in README
sjpb e993a54
tidy up README
sjpb e41cc84
fix validate task path
sjpb 3440050
fix lint error
sjpb 319ddf3
default partitions to nodegroups to make CI easier
sjpb c8e73ee
update molecule tests for openhpc_nodegroups
sjpb f5d0698
remove checks from runtime now validation defined
sjpb 51001ed
Update readme
jovial a551b52
Merge remote-tracking branch 'origin/feat/nodegroups' into HEAD
jovial 196716f
...
jovial 9f7b19d
fix NodeName= lines missing newlines between them when multiple hostl…
sjpb 02ba27c
remove tests for extra_nodes
sjpb 26000f4
Update gres.conf.j2
jovial 10a8ace
allow missing inventory groups (as per docs) when validating nodegroups
sjpb 3c706d7
only run validation once
sjpb 03dea2e
remove test14 from CI - extra_nodes feature removed
sjpb 175a1c0
update complex test for new group/partition variables
sjpb 2b22867
Merge remote-tracking branch 'origin/feat/nodegroups' into HEAD
jovial 4930029
Merge remote-tracking branch 'origin/feature/gres-autodetect' into HEAD
jovial 64e61e8
Merge remote-tracking branch 'origin/master' into HEAD
jovial eacee19
Fix README
jovial File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,19 @@ | ||
AutoDetect=off | ||
{% for nodegroup in openhpc_nodegroups %} | ||
{% for gres in nodegroup.gres | default([]) %} | ||
{% set gres_name, gres_type, _ = gres.conf.split(':') %} | ||
{% set inventory_group_name = openhpc_cluster_name ~ '_' ~ nodegroup.name %} | ||
{% set inventory_group_hosts = groups.get(inventory_group_name, []) %} | ||
{% set gres_list = nodegroup.gres | default([]) %} | ||
{% set gres_autodetect = nodegroup.gres_autodetect | default('off') %} | ||
{% set inventory_group_name = openhpc_cluster_name ~ '_' ~ nodegroup.name %} | ||
{% set inventory_group_hosts = groups.get(inventory_group_name, []) %} | ||
{% if gres_autodetect | default('off') != 'off' %} | ||
{% for hostlist in (inventory_group_hosts | hostlist_expression) %} | ||
NodeName={{ hostlist }} Name={{ gres_name }} Type={{ gres_type }} File={{ gres.file }} | ||
NodeName={{ hostlist }} AutoDetect={{ gres_autodetect }} | ||
{% endfor %}{# hostlists #} | ||
{% endfor %}{# gres #} | ||
{% else %} | ||
{% for gres in gres_list %} | ||
{% set gres_name, gres_type, _ = gres.conf.split(':') %} | ||
{% for hostlist in (inventory_group_hosts | hostlist_expression) %} | ||
NodeName={{ hostlist }} Name={{ gres_name }} Type={{ gres_type }} File={{ gres.file | mandatory('The gres configuration dictionary: ' ~ gres ~ ' is missing the file key, but gres_autodetect is set to off. The error occured on node group: ' ~ nodegroup.name ~ '. Please add the file key or set gres_autodetect.') }} | ||
{% endfor %}{# hostlists #} | ||
{% endfor %}{# gres #} | ||
{% endif %}{# autodetect #} | ||
{% endfor %}{# nodegroup #} |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the PR comment:
Can you explain why with it "per-gres" you end up with multiple methods per node?
Nodes can - and often are - in multiple partitions. So specifying it per-partition is not sufficent to guarantee this anyway, I think, unless there's some other subtlety in the logic.
I think what we need to support is something like this, and only like this:
i.e. no complicated fallbacks or overriden defaults etc. Maybe this just needs documenting, and we just let an error occur if someone does something wrong. #174 rolled up the slurm.conf NodeName= templating to allow defining nodes in multiple partitions, do we need something similar here? Or maybe not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you are saying, essentially you can't mix methods for a particular node. I'll dig out the error message. It seems like a host var/group var would be more natural:
Outside of the
openhpc_slurm_partitions
definition. But will that be complicated with the host list expression?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So something like:
produces:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, it does start without issue with something like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another clarification is that:
Will just essentially do autodetect for everything (not just the
1g.10gb
instances).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think I'm missing some context here! If you have autodetection, is there ever a case where you'd want to specify it manually? (i.e. can't we just say; "don't do that"?). I can imagine there's nvidia nodes where you have autodetection and other nodes where you don't, so you have to specify both, but never for the same nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, those comments were more for my reference. I was just clarifying how it behaved when specified multiple times for the same node. I think you are right when you say that we should make sure each host only appears once if autodetection is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made it work as a host/group var. This means you can't set conflicting values on different partitions. Let me know what you think.