[SlurmTopo] Add support for slurm Block Topology #3002
Conversation
        
          
Review threads on ...aws-parallelcluster-slurm/files/default/head_node_slurm/slurm/pcluster_topology_generator.py: four outdated comments marked resolved and two comments marked fixed.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@             Coverage Diff             @@
##           develop    #3002      +/-   ##
===========================================
- Coverage    75.50%   75.24%   -0.27%
===========================================
  Files           23       24       +1
  Lines         2356     2436      +80
===========================================
+ Hits          1779     1833      +54
- Misses         577      603      +26
Force-pushed from 17b882a to 544b10b
Force-pushed from 3a43f29 to 84276a2
            
          
Two more fixed threads on ...aws-parallelcluster-slurm/files/default/head_node_slurm/slurm/pcluster_topology_generator.py, and a resolved thread on cookbooks/aws-parallelcluster-slurm/resources/block_topology/partial/_block_topology_common.rb.
    owner 'root'
    group 'root'
    mode '0644'
    variables(is_amazon_linux_2: platform?('amazon') && node['platform_version'] == "2")
We are injecting this variable to skip the inclusion of topology configuration in slurm.conf.
What about having a more generic is_topology_plugin_supported rather than an OS-specific variable?
The advantage of the generic approach is that it is more open to future extensions. Example: if tomorrow we must skip it for another OS, we will not need another variable dedicated to that OS; we simply include the OS in the condition.
I will update it and keep it as is_block_topology_plugin_supported.
If we add more plugins in the future, it is not guaranteed that AL2 will be unsupported for them as well.
        
          
Resolved threads on ...cluster-slurm/templates/default/slurm/block_topology/slurm_parallelcluster_topology.conf.erb, and on cookbooks/aws-parallelcluster-slurm/templates/default/slurm/slurm.conf.erb (outdated).
… update: use the correct parameter name block_sizes, as per CLI changes
    continue

    # Check if the reservation is for NVLink and the size matches min_block_size_list
    if compute_resource_config.get("InstanceType") == "p6e-gb200.36xlarge":
Why just this instance type? What about .72xlarge? Is there any way to determine these more programmatically?
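One way to avoid hard-coding a single size (a sketch only, not the project's actual logic; the constant and helper names are hypothetical, and the family prefix is an assumption drawn from the instance names mentioned in this thread) is to match on the instance family rather than the full type string:

```python
# Sketch: match the whole p6e-gb200 NVLink family instead of one size.
# GB200_FAMILY and is_gb200_instance() are hypothetical names.
GB200_FAMILY = "p6e-gb200"

def is_gb200_instance(instance_type: str) -> bool:
    """Return True for any size in the family (e.g. .36xlarge, .72xlarge)."""
    family, _, _size = instance_type.partition(".")
    return family == GB200_FAMILY

print(is_gb200_instance("p6e-gb200.36xlarge"))  # True
print(is_gb200_instance("p6e-gb200.72xlarge"))  # True
print(is_gb200_instance("p5.48xlarge"))         # False
```

A more robust option might be to query instance capabilities (e.g. via the EC2 DescribeInstanceTypes API) instead of relying on a name prefix at all.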
Description of changes

- pcluster_topology_generator.py is used for creating the topology.conf file on the HeadNode. The p6egb200_block_sizes node attribute is added only for GB200 instances.
- We create slurm_parallelcluster_topology.conf, which is included as part of slurm.conf and created with any HeadNode instance; however, we only enable TopologyPlugin=topology/block for GB200 instances.

Tests

- AL2023: cluster creation is successful with p6egb200_block_sizes: 1, and the scheduling section updated to use 1 static node.
- AL2023: scontrol show topology doesn't return anything and returns an empty line.
- AL2: force-updated the same cluster to use ExtraChefAttributes: | {"cluster": {"p6egb200_block_sizes": "2" }} and updated the scheduling section to use 2 nodes. Result: none of the conf files were created.

STATUS: DRAFT, as I have commits added for testing.
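For context on what a generator like this produces, here is a minimal sketch of block-topology rendering. The function name, node-naming scheme, and grouping logic are illustrative assumptions, not the actual PCluster code; only the BlockName/Nodes/BlockSizes keys follow Slurm's topology/block topology.conf format:

```python
# Hypothetical sketch: group static nodes into fixed-size blocks and emit
# topology.conf lines for Slurm's topology/block plugin.

def render_block_topology(queue: str, num_nodes: int, block_size: int) -> str:
    """Emit one BlockName line per block plus a trailing BlockSizes line."""
    lines = []
    for block_idx, start in enumerate(range(1, num_nodes + 1, block_size), 1):
        end = min(start + block_size - 1, num_nodes)
        lines.append(
            f"BlockName=block{block_idx} "
            f"Nodes={queue}-st-compute-[{start}-{end}]"
        )
    lines.append(f"BlockSizes={block_size}")
    return "\n".join(lines)

print(render_block_topology("q1", num_nodes=4, block_size=2))
# BlockName=block1 Nodes=q1-st-compute-[1-2]
# BlockName=block2 Nodes=q1-st-compute-[3-4]
# BlockSizes=2
```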
References
aws/aws-parallelcluster#6928
#3001
#2996
aws/aws-parallelcluster#6930
Checklist

- If you're creating a patch for a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
- Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.