feat: Pathways use single resource group #600
base: main
Jobs requesting TPU resources may also request CPU and memory. However, when Pathways is enabled, Kueue cannot admit such jobs because there is no CPU or memory quota. This fix adds a very high CPU and memory quota for TPU/GPU resources and merges the Pathways resource group with the accelerator resource group. It also lets us run AXLearn jobs without making changes manually.
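For illustration, the merged resource group described above could look roughly like the sketch below. It only builds the dictionary shape that a Kueue `ClusterQueue` expects under `spec.resourceGroups` (`coveredResources`, `flavors`, `resources`, `nominalQuota`). The helper name `build_resource_group`, the flavor name, and the quota values are hypothetical and are not taken from the xpk code.

```python
# Hypothetical sketch: names and quota values are illustrative, not xpk's.
HIGH_QUOTA = 1_000_000  # assumed stand-in for "a very high number"


def build_resource_group(flavor_name, accelerator_resource, accelerator_quota):
    """Return one Kueue resource group covering the accelerator, cpu and memory."""
    resources = [
        {"name": accelerator_resource, "nominalQuota": accelerator_quota},
        # Effectively unlimited CPU/memory so Kueue can admit jobs that
        # request them alongside the accelerator.
        {"name": "cpu", "nominalQuota": HIGH_QUOTA},
        {"name": "memory", "nominalQuota": f"{HIGH_QUOTA}Gi"},
    ]
    return {
        # Kueue expects each flavor's resources in the same order as
        # coveredResources, so derive one list from the other.
        "coveredResources": [r["name"] for r in resources],
        "flavors": [{"name": flavor_name, "resources": resources}],
    }


group = build_resource_group("tpu-flavor", "google.com/tpu", 32)
```

With a single group like this, a job that requests `google.com/tpu`, `cpu` and `memory` together can be matched against one flavor instead of needing quota from a separate CPU-only group.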
Otherwise, the Pathways head pod will not be assigned to CPU-only resource flavors first.
Seems @lukebaumann encountered an issue when not using

Seems NAP without Pathways is also impacted. I think we need a different fix. See #603

This PR should solve NAP and AXLearn support as well. I would prefer to get this merged and will check with Luke on why it wasn't working for him.
If I'm not mistaken, this is no longer needed and should be closed. Flags were added to set CPU and memory limits manually when needed.
It's still needed for us to be able to run AXLearn on xpk clusters.

I think it's set in xpk/src/xpk/core/kueue_manager.py, line 324 in f0626b9.
Follow-up to #574, this time with a branch within the xpk repo.