karpenter_cloudprovider_instance_type_offering_available is returning the wrong data #7758

Open
jonathan-innis opened this issue Feb 21, 2025 · 2 comments
Labels: bug, good-first-issue, help-wanted, triage/accepted

Comments

@jonathan-innis
Contributor

Description

Observed Behavior:

The karpenter_cloudprovider_instance_type_offering_available metric should report offering availability globally so that we reduce cardinality (and mainly because per-NodePool availability doesn't make a ton of sense). Right now, because subnet discovery affects availability through GetInstanceTypes, the metric's value changes as different NodePools call GetInstanceTypes.

We need to compute the availability before we filter out offerings due to subnets. In other words, we should consider only the overall offering availability, removing offerings that are currently stored in our ICE cache.
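A minimal sketch of that ordering, not Karpenter's actual code: the `Offering` type, the string-keyed `iceCache` map, and both helper functions are hypothetical stand-ins, and the sketch assumes ICE-cached offerings are reported as unavailable rather than dropped. The point is only that the metric values come from the unfiltered offering list, while subnet filtering happens afterwards and per NodePool.

```go
package main

import "fmt"

// Offering is a hypothetical, simplified stand-in for an instance type offering.
type Offering struct {
	InstanceType string
	Zone         string
	CapacityType string
}

// key identifies an offering in the hypothetical ICE cache used below.
func (o Offering) key() string {
	return o.InstanceType + "/" + o.Zone + "/" + o.CapacityType
}

// globalAvailability derives availability from the ICE cache alone.
// It never looks at subnets, so the result is the same for every NodePool.
func globalAvailability(all []Offering, iceCache map[string]bool) map[string]bool {
	out := make(map[string]bool, len(all))
	for _, o := range all {
		out[o.key()] = !iceCache[o.key()]
	}
	return out
}

// filterBySubnetZones keeps only offerings in zones covered by one
// NodePool's subnets. It runs after the metric values have been computed.
func filterBySubnetZones(all []Offering, zones map[string]bool) []Offering {
	var kept []Offering
	for _, o := range all {
		if zones[o.Zone] {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	all := []Offering{
		{"m5.large", "us-west-2a", "on-demand"},
		{"m5.large", "us-west-2b", "on-demand"},
		{"m5.large", "us-west-2c", "spot"},
	}
	ice := map[string]bool{"m5.large/us-west-2c/spot": true}

	// Step 1: compute metric values from the unfiltered list.
	avail := globalAvailability(all, ice)
	for _, o := range all {
		fmt.Printf("%s available=%t\n", o.key(), avail[o.key()])
	}

	// Step 2: subnet filtering is applied per NodePool and does not touch avail.
	poolA := filterBySubnetZones(all, map[string]bool{"us-west-2a": true})
	fmt.Println("offerings visible to NodePool A:", len(poolA))
}
```

With this split, two NodePools with different subnet selectors would see different filtered lists but emit identical metric values.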

Expected Behavior:

karpenter_cloudprovider_instance_type_offering_available should return the global offering availability

Reproduction Steps (Please include YAML):

  • Create two NodePools with different subnet selectors
  • Note that the availability metric will oscillate depending on which NodePool is being used to call GetInstanceTypes()
@jmdeal
Contributor

jmdeal commented Feb 22, 2025

> and mainly because having a per-NodePool availability doesn't make a ton of sense

I don't think this is actually the case, at least in light of #7726. As we've discussed in that PR, we don't want to increase the cardinality of this metric to include reservation ID. Consider the case where we have multiple reserved offerings for the same zone spread across different NodePools. In that case it would be possible for one NodePool to be available, and another not to be.

We could definitely still take the stance that this metric will reflect if capacity is available across any NodePool for a given instance pool, but I do think there's an argument for labeling by NodePool. Maybe the cardinality makes it a non-starter, but I do think the use case exists.
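For illustration, the "available across any NodePool" stance could be sketched as an OR over per-NodePool observations. Everything here (`PoolKey`, `Observation`, `availableInAnyNodePool`) is a hypothetical name invented for the sketch, not something from the Karpenter codebase.

```go
package main

import "fmt"

// PoolKey identifies an instance pool without a NodePool or reservation ID label.
type PoolKey struct {
	InstanceType string
	Zone         string
	CapacityType string
}

// Observation is one NodePool's view of whether an instance pool has capacity.
type Observation struct {
	NodePool  string
	Key       PoolKey
	Available bool
}

// availableInAnyNodePool collapses per-NodePool views into one value per
// instance pool: true if at least one NodePool sees capacity.
func availableInAnyNodePool(obs []Observation) map[PoolKey]bool {
	out := make(map[PoolKey]bool)
	for _, o := range obs {
		out[o.Key] = out[o.Key] || o.Available
	}
	return out
}

func main() {
	key := PoolKey{"m5.large", "us-west-2a", "reserved"}
	obs := []Observation{
		{NodePool: "a", Key: key, Available: true},  // reservation with capacity left
		{NodePool: "b", Key: key, Available: false}, // reservation exhausted
	}
	fmt.Println(availableInAnyNodePool(obs)[key])
}
```

The trade-off is visible in the example: NodePool "b" has no capacity, but the collapsed value hides that.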

@jonathan-innis
Contributor Author

> Consider the case where we have multiple reserved offerings for the same zone spread across different NodePools. In that case it would be possible for one NodePool to be available, and another not to be.

Agreed on the technical details, but I disagree on the utility, at least in light of the trade-offs that would have to be made on metric cardinality for anyone with 10+ NodePools. For example, 800 instance types * 3 capacity types * 6 (max) zones = 14400 for a single NodePool alone. I think the utility starts to go down if you are just showing details that are pretty easily discoverable. To me, the utility lies in surfacing the ICE cache and general offering availability in the region.
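The arithmetic above, worked through as a small sketch. `seriesCount` is a hypothetical helper, and the numbers are the upper bounds quoted in this comment, not measured values.

```go
package main

import "fmt"

// seriesCount returns an upper bound on the number of label combinations
// when every dimension is fully populated.
func seriesCount(instanceTypes, capacityTypes, zones, nodePools int) int {
	return instanceTypes * capacityTypes * zones * nodePools
}

func main() {
	// Global metric, or equivalently a single NodePool.
	fmt.Println(seriesCount(800, 3, 6, 1)) // 14400
	// The same metric labeled by NodePool, with 10 NodePools.
	fmt.Println(seriesCount(800, 3, 6, 10)) // 144000
}
```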
