@samos123
Collaborator

Jobs requesting TPU resources may also request CPU and memory. However, when Pathways is enabled, Kueue cannot admit such jobs because there is no CPU or memory quota.

This fix adds a very high CPU and memory quota for TPU/GPU resources and merges the Pathways resource group with the accelerator resource group.

This also allows us to run AXLearn jobs without having to make changes manually.
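
As a rough illustration of the merged resource group described above, here is a minimal Python sketch that builds one Kueue `resourceGroups` entry covering the accelerator, CPU, and memory together. The field names (`coveredResources`, `flavors`, `resources`, `nominalQuota`) follow the Kueue ClusterQueue API; the function names, flavor names, and quota values are hypothetical and are not taken from xpk's code. Listing an optional CPU-only flavor first is an assumption about how a Pathways head pod might be steered to CPU nodes.

```python
# Illustrative sketch only: function names, flavor names, and quota values are
# assumptions, not xpk's actual implementation.
VERY_HIGH_CPU = "1000000"        # assumed placeholder for "a very high number"
VERY_HIGH_MEMORY = "1000000Gi"   # assumed placeholder


def _flavor(name, accelerator_resource, accelerator_quota):
    """Build one flavor entry with quotas for every covered resource.

    Resources are listed in the same order as coveredResources below.
    """
    return {
        "name": name,
        "resources": [
            {"name": accelerator_resource, "nominalQuota": str(accelerator_quota)},
            {"name": "cpu", "nominalQuota": VERY_HIGH_CPU},
            {"name": "memory", "nominalQuota": VERY_HIGH_MEMORY},
        ],
    }


def build_merged_resource_group(
    accelerator_flavor, accelerator_resource, accelerator_quota, cpu_flavor=None
):
    """Return a single resource group covering the accelerator, cpu, and memory."""
    flavors = []
    if cpu_flavor is not None:
        # Assumption: the CPU-only flavor goes first and gets zero accelerator
        # quota, so CPU-only pods can be matched to it before the accelerator flavor.
        flavors.append(_flavor(cpu_flavor, accelerator_resource, 0))
    flavors.append(_flavor(accelerator_flavor, accelerator_resource, accelerator_quota))
    return {
        "coveredResources": [accelerator_resource, "cpu", "memory"],
        "flavors": flavors,
    }
```

Calling `build_merged_resource_group("tpu-flavor", "google.com/tpu", 16, cpu_flavor="cpu-flavor")` would yield one group with two flavors instead of two separate groups; the flavor names and the quota of 16 are made up for the example.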

Follow-up to #574, this time from a branch within the xpk repo.

Jobs requesting TPU resources may also request CPU and memory.
However, when Pathways is enabled, Kueue cannot admit such jobs
because there is no CPU or memory quota.

This fix adds a very high CPU and memory quota for TPU/GPU resources
and merges the Pathways resource group with the accelerator resource
group.

This also allows us to run AXLearn jobs without having to make changes
manually.
Otherwise the Pathways head pod will not first get assigned to CPU-only
resource flavors.
@samos123
Collaborator Author

Seems @lukebaumann encountered an issue when not using create-pathways. A potential fix is to remove the create-pathways command altogether, since it doesn't seem to be needed. We may then be able to get rid of the CPU resource flavor, which would also unblock AXLearn jobs.

@samos123
Collaborator Author

Seems NAP without pathways is also impacted. I think we need a different fix. See #603

@samos123
Collaborator Author

This PR should solve NAP and AXLearn support as well. Would prefer to get this merged and will check with Luke on why it wasn't working for him.

@scaliby scaliby changed the base branch from develop to main October 28, 2025 10:01
Collaborator

@SikaGrr SikaGrr left a comment

If I'm not mistaken, this is no longer needed and should be closed. Flags were added to set CPU and memory limits manually when needed.

@samos123
Collaborator Author

It's still needed for us to be able to run AXLearn on xpk clusters.

@jamOne-
Collaborator

jamOne- commented Oct 28, 2025

I think it's set in this line:

"coveredResources": ["cpu", "memory"],

@samos123 could you verify?
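
One way to check this is to look at which resource groups of a ClusterQueue cover a given resource. The snippet below is a minimal Python sketch under stated assumptions: it takes a ClusterQueue manifest that has already been loaded into a plain dict, and the helper name `groups_covering` and the sample manifest are hypothetical. Only the `spec.resourceGroups[].coveredResources` path comes from the Kueue ClusterQueue API.

```python
def groups_covering(cluster_queue, resource):
    """Return the indices of the resource groups whose coveredResources include `resource`."""
    groups = cluster_queue.get("spec", {}).get("resourceGroups", [])
    return [
        index
        for index, group in enumerate(groups)
        if resource in group.get("coveredResources", [])
    ]


# Hypothetical manifest with separate accelerator and cpu/memory groups.
sample_queue = {
    "spec": {
        "resourceGroups": [
            {"coveredResources": ["google.com/tpu"], "flavors": []},
            {"coveredResources": ["cpu", "memory"], "flavors": []},
        ]
    }
}

tpu_groups = groups_covering(sample_queue, "google.com/tpu")
cpu_groups = groups_covering(sample_queue, "cpu")
# True only when at least one group covers both the accelerator and cpu.
shares_group = bool(set(tpu_groups) & set(cpu_groups))
```

In this made-up manifest, `cpu` and `memory` are covered, but in a different group from the accelerator, so `shares_group` is false. Whether that is enough for AXLearn jobs is the open question in this thread.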
