Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default behavior to always wait for reconciliation & deletion confirmation #438

Closed
karlkfi opened this issue Oct 15, 2021 · 7 comments · Fixed by #449
Closed

Change default behavior to always wait for reconciliation & deletion confirmation #438

karlkfi opened this issue Oct 15, 2021 · 7 comments · Fixed by #449
Assignees

Comments

@karlkfi
Copy link
Contributor

karlkfi commented Oct 15, 2021

The Applier and Destroyer each have WaitTasks used to wait until objects are Current (after apply) or NotFound (after prune/delete), but this behavior is a confusing "wait for some but not others" that is hard to describe and not relative to configuration, but rather the types of resources being applied together.

  1. ReconcileTimeout, PruneTimeout, and DeleteTimeout have no explicit default value, which means their implicit default is 0.
  2. But then the default of 0 is ignored by the Solver and 1 minute is used instead:
  3. But the default of 1 minutes is only used if numObjSets > 1, otherwise the WaitTask is skipped.
  4. This causes the WaitEvent to be skipped:

Then in addition to that inconsistency, WaitTasks use DeletePropagationBackground by default, which doesn't wait until child objects are deleted before deleting the parent object. So things like ReplicaSets and Pods may still exist after a Deployment is confirmed to be NotFound.

As a user of cli-utils, I would rather that the applier/destroyer either waited for every resource to be reconciled/deleted or not wait at all.

In the context of kpt and Config Sync, I think the default behavior should be to wait forever until cancelled by the caller (they can implement their own global timeout using context.WithTimeout or other cancellation using context.WithCancel), now that both the applier and Destroyer accept context as input:

I also think the default should be changed to DeletePropagationForeground so that the applier/destroyer don't exit until deletes are fully propagated.

@karlkfi
Copy link
Contributor Author

karlkfi commented Oct 15, 2021

Proposed fixes:

  • Make the default timeout actually zero in the apply options and destroy options (don't override it in the solver).
  • When the timeout is zero, never timeout (unless the whole run is cancelled by the caller)
  • Always add the WaitTask, so the WaitEvent is always sent
  • Switch to DeletePropagationForeground by default
  • (Tech Debt) Clean up the WaitTask timeout to just use a context.WithTimeout that wraps the caller's context, instead of a manual timer. Then we can just use the cancelFunc instead of ClearTimeout
  • (Tech Debt) Add an exposed Task method to replace checkCondition & completeIfWaitTask instead of having the baseRunner call multiple private functions on the WaitTask.

@karlkfi
Copy link
Contributor Author

karlkfi commented Oct 15, 2021

/assign @karlkfi

@karlkfi
Copy link
Contributor Author

karlkfi commented Oct 15, 2021

@ash2k @haiyanmeng @mortent
Can yall comment of how this will affect your usage of cli-utils?
Does it sound acceptable?

@ash2k
Copy link
Member

ash2k commented Oct 15, 2021

@karlkfi I agree with the list of issues and I agree with the plan to address them 👍

In my case, I don't mind if cli-utils breaks API compatibility (if that would be necessary) to address any issues as long as:

  • the change breaks compilation so that I can catch it and/or
  • the change is documented in release notes so that I can adjust how I use the library (especially if the breaking change doesn't cause a compilation failure).

@karlkfi
Copy link
Contributor Author

karlkfi commented Oct 18, 2021

Related effort to make cancellation better/faster: #440

@mortent
Copy link
Member

mortent commented Oct 26, 2021

I agree with the issues and the proposed solution.

I think it would also be useful to be able to specify no-wait, which would apply all resources but not wait for reconciliation. When there are dependencies, it would just skip resources that depends on previous resources that haven't reconciled (i.e. just check once). But we can handle this separately at a later time.

@karlkfi
Copy link
Contributor Author

karlkfi commented Oct 26, 2021

I think it would also be useful to be able to specify no-wait, which would apply all resources but not wait for reconciliation.

Yeah, it sounds like @haiyanmeng wants this for Config Sync, but it also sounds like a short-term fix, not a long term solution.

When there are dependencies, it would just skip resources that depends on previous resources that haven't reconciled (i.e. just check once)

This is actually a lot more work. It would involve replacing or augmenting the asynchronous poller with a synchronous mechanism. I think it makes the most sense to do as part of the future async apply effort (which I've talked about but not spec'd out yet). Basically, when moving to a fully asynchronous apply, each apply/prune task needs to check that its dependencies are met before executing. That would mean each task needs to know it's dependencies (which they currently don't) instead of flattening the graph into phases. If we already had that, then making it only check once and skip if not ready would be easy. But that's not a short term solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants