Introduce WG Checkpoint Restore #8508
Conversation
Welcome @adrianreber!
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: adrianreber The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Hi @adrianreber. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
force-pushed from 33e97fb to abc1c26
/ok-to-test
Looking at #8519, I see that we are missing a charter.
In https://github.com/kubernetes/community/blob/master/sig-wg-lifecycle.md#GitHub it says to add a charter once this initial PR has been merged. That's why I skipped it.
the integration of Checkpoint/Restore functionality into Kubernetes.
charter_link: charter.md
stakeholder_sigs:
sig auth may have a big say in security of this whole restoration pipeline
Thank you for pointing this out! Security is definitely an important topic that we need to discuss with sig-auth, both for the checkpoint API and the restoration pipeline. The following paper and master's thesis describe our recent work on this topic:
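For context, the checkpoint API that already exists today is the kubelet's alpha "forensic container checkpointing" endpoint (KEP-2008, behind the ContainerCheckpoint feature gate). A minimal sketch of its URL shape; the node name, namespace, and pod/container names below are illustrative:

```python
# Sketch of the kubelet's alpha checkpoint endpoint (KEP-2008).
# The URL shape is the documented one; host, port, and names are examples.
def kubelet_checkpoint_url(node: str, namespace: str, pod: str,
                           container: str, port: int = 10250) -> str:
    # The kubelet serves: POST /checkpoint/{namespace}/{pod}/{container}
    return f"https://{node}:{port}/checkpoint/{namespace}/{pod}/{container}"
```

Calling this endpoint requires kubelet client credentials, which is exactly why the permissions model around it is a sig-auth topic.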
I added sig auth to the list of stakeholder sigs
this showed up in the sig-auth meeting, we may have missed the discussion around this WG
if this WG is contemplating taking state from a running pod / saving it / letting it be consumed on another node or from another pod or another namespace, then sig-auth is definitely interested in making sure the permissions model around that exists and is ~consistent with similar things Kubernetes does elsewhere (like PVC / snapshots)
We're happy to consult on that, I'm not sure our awareness / involvement rises to the level of sponsoring the WG :)
cc @kubernetes/sig-auth-leads
nod.. definitely needs an extra level of security due to customer data being serialized and available in the checkpoint, esp if not encrypted, but also due to windows of opportunity to do transactions/data manipulation.. then "undo" them by restoring a checkpoint
force-pushed from abc1c26 to 8bc6968
/assign ritazh (assigned as part of SIG Auth triage; to review the SIG Auth updates)
@kubernetes/sig-node-leads are you all +1, officially?
+1 from me
wg-checkpoint-restore/charter.md (Outdated)
- maintain a solid communication line between the Kubernetes groups and the
  wider CNCF community
- submit a proposal to the KubeCon/CloudNativeCon maintainers track
I have doubts whether this incentivizes the right behavior; it may encourage people to create WGs just to get a slot at KubeCon.
I agree with Antonio: this particular line should be removed; the previous point is sufficient.
This was just copied from 0743da8, but I will remove it.
the integration of Checkpoint/Restore functionality into Kubernetes.
charter_link: charter.md
stakeholder_sigs:
I agree with Janet here, but please make sure to show up and present the scope of this proposal to one of the future SIG-Apps calls.
wg-checkpoint-restore/charter.md (Outdated)
The Checkpoint/Restore Working Group aims to solve the problem of transparently
checkpointing and restoring workloads in Kubernetes, a functionality discussed
for over five years. The group will deliver the design and implementation of
Checkpoint/Restore functionality in Kubernetes, serving as a central hub for
Why does it have to be a central part of Kubernetes, when multiple external solutions already exist?
I personally like the idea of having it integrated, because then the ecosystem can rely on it. For instance, we could make eviction or preemption less disruptive in the kubelet and Kueue, respectively.
Why does it have to be a central part of Kubernetes, when multiple external solutions already exist?
Can you be more specific about what already exists? I am not sure what you are referring to.
wg-checkpoint-restore/charter.md
Outdated
Checkpoint/Restore functionality in Kubernetes, serving as a central hub for | ||
community information and discussion. This initiative addresses a wide range of | ||
problems, including fault tolerance, improved resource utilization, and | ||
accelerated application startup times. |
The first thing that I'd like to point out is that there are two main use cases:
- the whole control-plane snapshot
- workload
Which one is this group planning to cover? As I read this document, I see both used interchangeably, which is very confusing. That's why I'd start by clearly drawing the line between the two and properly documenting which of them (or both) you plan to tackle.
What do you mean by "control-plane snapshot"?
In our presentations, we use the following diagram to illustrate how checkpoint/restore operations work in Kubernetes:

Reference: Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes
- Ability to checkpoint and restore a container using kubectl
- Ability to checkpoint and restore a pod using kubectl
- Integration of container/pod checkpointing in scheduling decisions
Why would pod checkpointing have anything to do with scheduling?
Because, as far as I know, a pod is always scheduled onto one node. It doesn't sound useful to base scheduling on the ability to migrate individual containers. Container migration is an important first step, but for automatic scheduling decisions it would make more sense to be able to easily migrate a complete pod.
Why would pod checkpointing have anything to do with scheduling?
I agree that pod checkpointing needs to take scheduling into consideration. Let's assume the following scenario: after a Pod is checkpointed, it might not be deleted. However, since the memory, CPU, and GPU resources have already been dumped to a volume, the resources originally allocated to the Pod could then be reallocated to other Pods. When the Pod needs to be resumed later via restore, the scheduler should also be involved.
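The scenario above can be sketched as a toy resource-accounting model: a checkpointed pod no longer holds its allocation (its state lives on a volume), and restoring it must pass through scheduling again. All names and types here are hypothetical illustrations, not Kubernetes APIs:

```python
# Hypothetical sketch of checkpoint-aware resource accounting.
# Nothing here is a real Kubernetes API; names are illustrative.
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    cpu_m: int                   # requested CPU in millicores
    checkpointed: bool = False   # True once state is dumped to a volume

def free_cpu(node_capacity_m: int, pods: list[Pod]) -> int:
    """CPU the scheduler could hand out: checkpointed pods no longer
    hold their allocation, since their state lives on a volume."""
    used = sum(p.cpu_m for p in pods if not p.checkpointed)
    return node_capacity_m - used

def can_restore(pod: Pod, node_capacity_m: int, pods: list[Pod]) -> bool:
    """Restoring is a scheduling decision again: the pod's resources
    must be re-admitted before the checkpoint can be resumed."""
    return pod.checkpointed and free_cpu(node_capacity_m, pods) >= pod.cpu_m
```

For example, on a 4-core node, checkpointing a 2-core pod frees those 2 cores for other pods, and restoring it later succeeds only if 2 cores can be re-admitted.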
force-pushed from 224da25 to 63e2024
force-pushed from 63e2024 to 6df1e19
Co-authored-by: Viktória Spišaková <[email protected]> Co-authored-by: Antonio Ojea <[email protected]> Co-authored-by: Sergey Kanzhelev <[email protected]> Signed-off-by: Adrian Reber <[email protected]>
force-pushed from 6df1e19 to da35847
SIG Network questions (cc @danwinship @thockin): does checkpoint/restore move the network state, such as established TCP connections? If so, does it require IP mobility, or does it use Services or some other abstraction?
Migrating a container with established TCP connections is possible, but only if the restored container has the same IP as the checkpointed container. In Podman we were able to implement this, but I am not sure it is easily doable in Kubernetes. From our side, we would need the ability to create a pod with a specific IP, and then it would work. Is there something in Kubernetes that allows pod creation with a given IP address?
Pod IPs are an IPAM choice of the CNI plugin. The Kubernetes network model allows this, but implementing it is complex: most implementations imply locality for the assigned Pod IPs, so allowing IP mobility requires handling routing and IPAM across the entire cluster to keep IP connectivity, which adds considerable complexity. On the other hand, if you expose the Pod's application via a Service, the Gateway API, or some other high-level abstraction with a VirtualIP, it is simpler, since those abstractions already have cluster scope. That works for ingress traffic to the application; egress traffic is again more complex.
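The distinction discussed above can be sketched as a toy model: sockets inside a checkpoint image are bound to the original pod IP, while a Service VirtualIP only decouples the address that clients see for new connections. Everything below is an illustration, not a Kubernetes API:

```python
# Toy model of the two options discussed above; all names are
# hypothetical illustrations, not Kubernetes APIs.

def established_tcp_survives(checkpoint_ip: str, restored_ip: str) -> bool:
    """Established sockets in a checkpoint image are bound to the original
    pod IP, so they only survive an IP-preserving restore."""
    return checkpoint_ip == restored_ip

def route_via_service(endpoints: dict[str, str], service_vip: str) -> str:
    """Clients connect to the stable VirtualIP; the proxy resolves it to
    the current backend pod IP, so the pod IP may change across a restore
    for *new* connections (existing connections are not rescued)."""
    return endpoints[service_vip]
```

In this model, a restore onto a node that assigns a different pod IP breaks established connections, while updating the Service's endpoint to the new pod IP keeps the application reachable at the same VirtualIP.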
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Is there a Slack channel where we can discuss C/R-related ideas? Thanks
You are not the first to ask. We are waiting for the proposal to be accepted before setting up a Slack channel. I am not sure if there is another way to get a Slack channel without the proposal being merged.
@lujinda Please reach out to us in the Kubernetes Slack. You can find Viktoria, Adrian, and me there :)
As described in sig-wg-lifecycle.md this PR is the next step after sending an email to [email protected] about the creation of the Working Group Checkpoint Restore.
CC: @rst0git, @viktoriaas, @xhejtman