jmdeal
Member

@jmdeal jmdeal commented Dec 11, 2024

Fixes #N/A

Description
Adds status conditions for node drain and volume detachment to improve observability of the individual termination stages. This is a scoped-down version of #1837, which combines these changes with splitting each termination stage into a separate controller. I will continue to work on that refactor, but I'm decoupling these changes from it so I can focus on higher-priority work.

Status Conditions:

| Condition | Unknown | False | True |
| --- | --- | --- | --- |
| Drained | Karpenter hasn't attempted to drain the Node yet or the node is currently draining (marked with reason) | N/A | Karpenter has successfully drained the node and will proceed with the termination flow. |
| VolumesDetached | Karpenter hasn't checked for volume attachments yet or there are volume attachments which are currently blocking termination. This won't transition out of unknown until Drained transitions to true. | There are blocking volume attachments, but the nodeclaim's TerminationGracePeriod has elapsed. This state is only possible if TGP is configured. | All blocking volume attachment objects have been deleted. |
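
For readers who want the table as logic, here is a minimal Go sketch of the transitions it describes. It is not the code in this PR: the `TerminationState` struct, its field names, and the `Drained` / `VolumesDetached` helper functions are hypothetical, and the sketch assumes the table is the complete specification (for example, that `VolumesDetached` stays `Unknown` whenever `Drained` is not yet `True`).

```go
package main

import "fmt"

// Status mirrors the three values a Kubernetes status condition can take.
type Status string

const (
	Unknown Status = "Unknown"
	False   Status = "False"
	True    Status = "True"
)

// TerminationState is a hypothetical snapshot of what a termination
// controller has observed. The field names are illustrative only.
type TerminationState struct {
	// DrainComplete is true once the node has been successfully drained.
	DrainComplete bool
	// BlockingAttachments counts volume attachments still blocking termination.
	BlockingAttachments int
	// GracePeriodElapsed is true only if a TerminationGracePeriod is
	// configured on the NodeClaim and it has elapsed.
	GracePeriodElapsed bool
}

// Drained returns the Drained condition status. Per the table it is never
// False: it is Unknown before and during the drain, and True afterwards.
func Drained(s TerminationState) Status {
	if s.DrainComplete {
		return True
	}
	return Unknown
}

// VolumesDetached returns the VolumesDetached condition status. It stays
// Unknown until Drained is True, becomes True when nothing is blocking, and
// becomes False only when attachments remain but the grace period elapsed.
func VolumesDetached(s TerminationState) Status {
	if Drained(s) != True {
		return Unknown
	}
	if s.BlockingAttachments == 0 {
		return True
	}
	if s.GracePeriodElapsed {
		return False
	}
	return Unknown
}

func main() {
	states := []TerminationState{
		{},
		{DrainComplete: true, BlockingAttachments: 2},
		{DrainComplete: true, BlockingAttachments: 2, GracePeriodElapsed: true},
		{DrainComplete: true},
	}
	for _, s := range states {
		fmt.Printf("%+v -> Drained=%s VolumesDetached=%s\n", s, Drained(s), VolumesDetached(s))
	}
}
```

Keeping `Drained` two-valued matches the N/A entry in the False column, so a `False` status on `VolumesDetached` is the only signal that termination proceeded past a stage that did not finish cleanly.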

How was this change tested?
make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 11, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 11, 2024
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from 4bb4d97 to fb3ac47 Compare December 11, 2024 20:31
@coveralls

coveralls commented Dec 11, 2024

Pull Request Test Coverage Report for Build 15175010548

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 105 of 155 (67.74%) changed or added relevant lines in 4 files are covered.
  • 62 unchanged lines in 7 files lost coverage.
  • Overall coverage decreased (-0.1%) to 81.773%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/utils/node/node.go | 9 | 17 | 52.94% |
| pkg/controllers/node/termination/controller.go | 85 | 127 | 66.93% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/provisioning/scheduling/nodeclaim.go | 3 | 89.66% |
| pkg/controllers/disruption/singlenodeconsolidation.go | 4 | 93.62% |
| pkg/controllers/node/termination/controller.go | 5 | 67.56% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 88.76% |
| pkg/controllers/disruption/emptiness.go | 8 | 87.3% |
| pkg/controllers/disruption/multinodeconsolidation.go | 8 | 86.86% |
| pkg/controllers/disruption/validation.go | 27 | 81.92% |

| Totals | |
| --- | --- |
| Change from base Build 15120919025 | -0.1% |
| Covered Lines | 10220 |
| Relevant Lines | 12498 |


@engedaam
Contributor

/assign @engedaam

github-actions bot commented Jan 2, 2025

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2025
@jmdeal
Member Author

jmdeal commented Jan 11, 2025

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch 2 times, most recently from b527992 to 21176e1 Compare January 15, 2025 20:02
@jmdeal
Member Author

jmdeal commented Jan 15, 2025

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2025
@engedaam
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
Contributor

@engedaam engedaam left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jan 15, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from d536a96 to 43949ef Compare January 16, 2025 17:21
Contributor

@engedaam engedaam left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from 39b868c to aba9dfe Compare April 21, 2025 18:19
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from aba9dfe to 3fc5cac Compare April 28, 2025 18:52
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 30, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from 655f7a5 to a34eca9 Compare May 3, 2025 00:08
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 3, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from a34eca9 to 8361363 Compare May 20, 2025 20:43
@jmdeal
Member Author

jmdeal commented May 20, 2025

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2025
Member

@jonathan-innis jonathan-innis left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: engedaam, jmdeal, jonathan-innis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jmdeal,jonathan-innis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jmdeal
Member Author

jmdeal commented May 27, 2025

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2025
@k8s-ci-robot k8s-ci-robot merged commit 5a6707a into kubernetes-sigs:main May 27, 2025
22 of 24 checks passed
jonathan-innis added a commit to jonathan-innis/karpenter that referenced this pull request May 29, 2025
DerekFrank added a commit to DerekFrank/karpenter-testing-fork that referenced this pull request Jun 3, 2025
harshad3339 added a commit to acquia/karpenter that referenced this pull request Jul 31, 2025
* test: Lower resource requests for NodeClaim test (kubernetes-sigs#2229)

* perf: Don't deepcopy inside of watch handler functions (kubernetes-sigs#2232)

* test: Add random name string for NodePool and NodeClass (kubernetes-sigs#2231)

* test: Update E2E testing suite to be named Regression (kubernetes-sigs#2234)

* refactor: convert validation to an interface (kubernetes-sigs#2220)

* fix: allow non-churn empty nodes to be disrupted (kubernetes-sigs#2206)

* perf: Only deep copy nodes during GetCandidates once (kubernetes-sigs#2233)

* feat: add metrics for disruption candidate validation (kubernetes-sigs#2239)

* perf: Only call .Available() once which prevents duplicate allocs (kubernetes-sigs#2241)

* docs: update issue triage meeting schedule (kubernetes-sigs#2244)

* test: deflake NodeClaim and presubmit tests (kubernetes-sigs#2240)

* perf: Avoid deepcopy when get nodePool/cluster health (kubernetes-sigs#2247)

* perf: Improve OrderByPrice performance (kubernetes-sigs#2250)

* test: add validating admission policy for nodeclass status (kubernetes-sigs#2251)

Co-authored-by: Jonathan Innis <[email protected]>

* feat: drain and volume detachment status conditions (kubernetes-sigs#1876)

* fix: show the cron parse error to users to allow them to debug (kubernetes-sigs#2258)

* perf: Don't deep-copy nodes and nodeclaims in our synced check (kubernetes-sigs#2260)

* chore: Fix getting current script directory in install-kwok.sh (kubernetes-sigs#2262)

* perf: Perform quick checks in node health first (kubernetes-sigs#2264)

* chore: Update pod metrics when pod is completed (kubernetes-sigs#2259)

* fix: Correctly build nodepool mapping for complex clusters (kubernetes-sigs#2263)

* fix: fail open for missing nodeclaims in termination (kubernetes-sigs#2266)

* perf: Limit GetInstanceTypes() calls per-NodeClaim (kubernetes-sigs#2271)

* perf: Parallelize disruption execution actions (kubernetes-sigs#2270)

* fix: Fix node owner reference update (kubernetes-sigs#2274)

* perf: Be more resilient to deletion failures in disruption controller (kubernetes-sigs#2272)

* chore(deps): bump the go-deps group with 2 updates (kubernetes-sigs#2277)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: Ensure we can stand up multiple partitions with kwok (kubernetes-sigs#2283)

* chore: Inject resources into Kwok through a patch (kubernetes-sigs#2285)

* chore: Update NodeClaim E2E test to only replace one status condition (kubernetes-sigs#2284)

* chore: Avoid validating admission policy for clusters older then 1.30 (kubernetes-sigs#2289)

* chore(deps): bump the go-deps group with 2 updates (kubernetes-sigs#2295)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump go version to 1.24.4 (kubernetes-sigs#2298)

* chore: Only log that the command succeeded when it actually did (kubernetes-sigs#2302)

* fix: Fix bug with MarkForDeletion before creating replacements (kubernetes-sigs#2300)

* perf: Refactor the eviction queue to be multithreaded (kubernetes-sigs#2252)

* docs: Add Bizfly Cloud provider (kubernetes-sigs#2303)

* chore: Bump lifecycle cache expiration to one hour (kubernetes-sigs#2307)

* chore: Use cluster state to check replacement NodeClaim existence (kubernetes-sigs#2308)

* chore(deps): bump github.com/samber/lo from 1.50.0 to 1.51.0 in the go-deps group (kubernetes-sigs#2315)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump operatorpkg (kubernetes-sigs#2314)

* chore(deps): bump the k8s-go-deps group across 1 directory with 4 updates (kubernetes-sigs#2317)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: Refactor Orchestration Queue and Handle Mark/Unmark Deletion in Queue (kubernetes-sigs#2305)

* chore(deps): bump the k8s-go-deps group with 7 updates (kubernetes-sigs#2326)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* perf: multithreaded orchestration queue (kubernetes-sigs#2293)

* test: Add nodeclaim name when you have garbage collection (kubernetes-sigs#2333)

* perf: Reduce multiple patch calls in instance termination (kubernetes-sigs#2324)

* fix: add helm rbac for kwok-provider to update finalizers (kubernetes-sigs#2336)

Signed-off-by: Max Cao <[email protected]>

* feat: configure CRD status operator with larger histogram buckets (kubernetes-sigs#2328)

* chore(deps): bump sigs.k8s.io/yaml from 1.4.0 to 1.5.0 in the k8s-go-deps group (kubernetes-sigs#2339)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump github.com/docker/docker from 28.2.2+incompatible to 28.3.0+incompatible in the go-deps group (kubernetes-sigs#2340)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: Fix re-retrieving object on retry (kubernetes-sigs#2337)

* fix: Fix overriding error with patch call (kubernetes-sigs#2338)

* fix: add missing rlock to disruption queue (kubernetes-sigs#2348)

* test: allow e2e tests to output junit report (kubernetes-sigs#2334)

Signed-off-by: Max Cao <[email protected]>

* docs: Add Oracle Cloud Infrastructure (OCI) provider  (kubernetes-sigs#2342)

* fix: no longer allow the same hostname to take multiple capacity (kubernetes-sigs#2356)

* feat: support auto relaxing min values (kubernetes-sigs#2299)

* fix: update provider ID to ensure that Cloud Provider tests pass (kubernetes-sigs#2363)

* fix: remove unsupported capacity_type label from karpenter_nodeclaims… (kubernetes-sigs#2364)

* fix: update deletionTimestamp on terminating pods when after nodeDeletionTimestamp (kubernetes-sigs#2316)

Co-authored-by: Amanuel Engeda <[email protected]>

* chore: promote ReservedCapacity feature gate to beta (kubernetes-sigs#2365)

* fix: flakiness in expiration tests (kubernetes-sigs#2366)

* test: Bump the termination time for the deletion timestamp (kubernetes-sigs#2367)

* chore: cherry-pick kubernetes-sigs#2399 (kubernetes-sigs#2401)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Max Cao <[email protected]>
Co-authored-by: Amanuel Engeda <[email protected]>
Co-authored-by: Jonathan Innis <[email protected]>
Co-authored-by: Reed Schalo <[email protected]>
Co-authored-by: DerekFrank <[email protected]>
Co-authored-by: Jason Deal <[email protected]>
Co-authored-by: Reed Schalo <[email protected]>
Co-authored-by: Jonathan Innis <[email protected]>
Co-authored-by: Todd Neal <[email protected]>
Co-authored-by: Jigisha Patil <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Lê Minh Quân <[email protected]>
Co-authored-by: Max Cao <[email protected]>
Co-authored-by: Aidan Rowe <[email protected]>
Co-authored-by: Daniel Lopes <[email protected]>
Co-authored-by: Saurav Agarwalla <[email protected]>
Co-authored-by: cosimomeli <[email protected]>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
6 participants