5 changes: 5 additions & 0 deletions OWNERS_ALIASES
@@ -146,6 +146,11 @@ aliases:
    - mwielgus
    - soltysh
    - swatisehgal
  wg-checkpoint-restore-leads:
    - adrianreber
    - haircommander
    - rst0git
    - viktoriaas
  wg-data-protection-leads:
    - xing-yang
    - yuxiangqian
1 change: 1 addition & 0 deletions liaisons.md
@@ -58,6 +58,7 @@ members will assume one of the departing members groups.
| [WG AI Gateway](wg-ai-gateway/README.md) | Stephen Augustus (**[@justaugustus](https://github.com/justaugustus)**) |
| [WG AI Integration](wg-ai-integration/README.md) | Paco Xu 徐俊杰 (**[@pacoxu](https://github.com/pacoxu)**) |
| [WG Batch](wg-batch/README.md) | Antonio Ojea (**[@aojea](https://github.com/aojea)**) |
| [WG Checkpoint Restore](wg-checkpoint-restore/README.md) | Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**) |
| [WG Data Protection](wg-data-protection/README.md) | Patrick Ohly (**[@pohly](https://github.com/pohly)**) |
| [WG Device Management](wg-device-management/README.md) | Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**) |
| [WG etcd Operator](wg-etcd-operator/README.md) | Maciej Szulik (**[@soltysh](https://github.com/soltysh)**) |
1 change: 1 addition & 0 deletions sig-api-machinery/README.md
@@ -55,6 +55,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-api-machinery:
* [WG AI Integration](/wg-ai-integration)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Structured Logging](/wg-structured-logging)


1 change: 1 addition & 0 deletions sig-apps/README.md
@@ -59,6 +59,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
The following [working groups][working-group-definition] are sponsored by sig-apps:
* [WG AI Integration](/wg-ai-integration)
* [WG Batch](/wg-batch)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Data Protection](/wg-data-protection)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)
1 change: 1 addition & 0 deletions sig-auth/README.md
@@ -66,6 +66,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-auth:
* [WG AI Integration](/wg-ai-integration)
* [WG Checkpoint Restore](/wg-checkpoint-restore)


## Subprojects
1 change: 1 addition & 0 deletions sig-list.md
@@ -65,6 +65,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md)
|[AI Gateway](wg-ai-gateway/README.md)|[ai-gateway](https://github.com/kubernetes/kubernetes/labels/wg%2Fai-gateway)|* Multicluster<br>* Network<br>|* [Keith Mattix](https://github.com/keithmattix), Microsoft<br>* [Flynn](https://github.com/kflynn), Buoyant<br>* [Kellen Swain](https://github.com/kfswain), Google<br>* [Nir Rozenbaum](https://github.com/nirrozenbaum), IBM<br>* [Shane Utt](https://github.com/shaneutt), Red Hat<br>* [Xunzhuo](https://github.com/xunzhuo), Tencent<br>|* [Slack](https://kubernetes.slack.com/messages/wg-ai-gateway)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-ai-gateway)|* WG AI Gateway Bi-Weekly Meeting (Earlier Option): [Mondays at 12PM UTC (bi-weekly)]()<br>* WG AI Gateway Bi-Weekly Meeting (Later Option): [Thursdays at 6PM UTC (bi-weekly)]()<br>
|[AI Integration](wg-ai-integration/README.md)|[ai-integration](https://github.com/kubernetes/kubernetes/labels/wg%2Fai-integration)|* API Machinery<br>* Apps<br>* Architecture<br>* Auth<br>* CLI<br>|* [Arda Guclu](https://github.com/ardaguclu), Red Hat<br>* [Arush Sharma](https://github.com/rushmash91), Amazon<br>* [Zvonko Kaiser](https://github.com/zvonkok), NVIDIA<br>|* [Slack](https://kubernetes.slack.com/messages/wg-ai-integration)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-ai-integration)|* WG AI Integration Weekly Meeting ([calendar](https://calendar.google.com/calendar/embed?src=71ef14cc0995618018b12614c63ca482d667e2922ff5b94d9fb0cfd32d4efada%40group.calendar.google.com)): [Wednesdays at 10:00 PT (Pacific Time) (weekly)](https://zoom.us/j/95637970280?pwd=3Ys5MQF5hKoeWDazUsMdgt5FiRxbSs.1)<br>
|[Batch](wg-batch/README.md)|[batch](https://github.com/kubernetes/kubernetes/labels/wg%2Fbatch)|* Apps<br>* Autoscaling<br>* Node<br>* Scheduling<br>|* [Kevin Hannon](https://github.com/kannon92), Red Hat<br>* [Marcin Wielgus](https://github.com/mwielgus), Google<br>* [Maciej Szulik](https://github.com/soltysh), Defense Unicorns<br>* [Swati Sehgal](https://github.com/swatisehgal), Red Hat<br>|* [Slack](https://kubernetes.slack.com/messages/wg-batch)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-batch)|* Regular Meeting ([calendar](https://calendar.google.com/calendar/embed?src=8ulop9k0jfpuo0t7kp8d9ubtj4%40group.calendar.google.com)): [Thursdays (starting February 15th 2024)s at 3PM CET (Central European Time) (monthly)](https://zoom.us/j/98329676612?pwd=c0N2bVV1aTh2VzltckdXSitaZXBKQT09)<br>
|[Checkpoint Restore](wg-checkpoint-restore/README.md)|[checkpoint-restore](https://github.com/kubernetes/kubernetes/labels/wg%2Fcheckpoint-restore)|* API Machinery<br>* Apps<br>* Auth<br>* Node<br>* Scheduling<br>|* [Adrian Reber](https://github.com/adrianreber), Red Hat<br>* [Peter Hunt](https://github.com/haircommander), Red Hat<br>* [Radostin Stoyanov](https://github.com/rst0git), University of Oxford<br>* [Viktória Spišaková](https://github.com/viktoriaas), Masaryk University<br>|* [Slack](https://kubernetes.slack.com/messages/wg-checkpoint-restore)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore)|
|[Data Protection](wg-data-protection/README.md)|[data-protection](https://github.com/kubernetes/kubernetes/labels/wg%2Fdata-protection)|* Apps<br>* Storage<br>|* [Xing Yang](https://github.com/xing-yang), VMware<br>* [Xiangqian Yu](https://github.com/yuxiangqian), Google<br>|* [Slack](https://kubernetes.slack.com/messages/wg-data-protection)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-data-protection)|* Regular WG Meeting: [Wednesdays at 9:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/j/6933410772)<br>
|[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture<br>* Autoscaling<br>* Network<br>* Node<br>* Scheduling<br>|* [John Belamaric](https://github.com/johnbelamaric), Google<br>* [Kevin Klues](https://github.com/klueska), NVIDIA<br>* [Patrick Ohly](https://github.com/pohly), Intel<br>|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting (Asia/Europe): [Wednesdays at 9:00 CET (Central European Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)<br>* Regular WG Meeting (Europe/America): [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)<br>
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle<br>* etcd<br>|* [Benjamin Wang](https://github.com/ahrtr), VMware<br>* [Ciprian Hacman](https://github.com/hakman), Microsoft<br>* [Josh Berkus](https://github.com/jberkus), Red Hat<br>* [James Blair](https://github.com/jmhbnz), Red Hat<br>* [Justin Santa Barbara](https://github.com/justinsb), Google<br>|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)<br>
1 change: 1 addition & 0 deletions sig-node/README.md
@@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-node:
* [WG Batch](/wg-batch)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)
1 change: 1 addition & 0 deletions sig-scheduling/README.md
@@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-scheduling:
* [WG Batch](/wg-batch)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)
39 changes: 39 additions & 0 deletions sigs.yaml
@@ -3724,6 +3724,45 @@ workinggroups:
      liaison:
        github: aojea
        name: Antonio Ojea
  - dir: wg-checkpoint-restore
    name: Checkpoint Restore
    mission_statement: >
      This working group aims to provide a central location for the community to discuss
      the integration of Checkpoint/Restore functionality into Kubernetes.

    charter_link: charter.md
    stakeholder_sigs:
Member

sig auth may have a big say in security of this whole restoration pipeline

Member

@rst0git, Jul 19, 2025

Thank you for pointing this out! Security is definitely an important topic that we need to discuss with sig-auth, both for the checkpoint API and the restoration pipeline. The following paper and master thesis describe our recent work on this topic:

Member Author

I added sig auth to the list of stakeholder sigs

Member

this showed up in the sig-auth meeting, we may have missed the discussion around this WG

if this WG is contemplating taking state from a running pod / saving it / letting it be consumed on another node or from another pod or another namespace, then sig-auth is definitely interested in making sure the permissions model around that exists and is ~consistent with similar things Kubernetes does elsewhere (like PVC / snapshots)

We're happy to consult on that, I'm not sure our awareness / involvement rises to the level of sponsoring the WG :)

cc @kubernetes/sig-auth-leads

Member

@mikebrow, Sep 9, 2025

nod.. definitely needs an extra level of security due to customer data being serialized and available in the checkpoint, esp if not encrypted, but also due to windows of opportunity to do transactions/data manipulation.. then "undo" them by restoring a checkpoint

Member

This is a valuable initiative. The charter mentions that the scope includes checkpointing and restoring 'workloads' and providing 'guidance for developers on checkpoint-friendly app design.' Given this focus, it's essential for SIG Apps to be involved as a key stakeholder.

Member

@janetkuo This is a good idea, thank you so much for suggesting it!

Member Author

Yes, thanks @janetkuo. I added SIG Apps to the proposal.

Contributor

I agree with Janet here, but please make sure to show up and present the scope of this proposal to one of the future SIG-Apps calls.

      - API Machinery
      - Apps
      - Auth
      - Node
      - Scheduling
    label: checkpoint-restore
    leadership:
      chairs:
        - github: adrianreber
          name: Adrian Reber
          company: Red Hat
          email: [email protected]
        - github: haircommander
          name: Peter Hunt
          company: Red Hat
          email: [email protected]
        - github: rst0git
          name: Radostin Stoyanov
          company: University of Oxford
          email: [email protected]
        - github: viktoriaas
          name: Viktória Spišaková
          company: Masaryk University
          email: [email protected]
    meetings: []
    contact:
      slack: wg-checkpoint-restore
      mailing_list: https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore
      liaison:
        github: BenTheElder
        name: Benjamin Elder
  - dir: wg-data-protection
    name: Data Protection
    mission_statement: >
38 changes: 38 additions & 0 deletions wg-checkpoint-restore/README.md
@@ -0,0 +1,38 @@
<!---
This is an autogenerated file!

Please do not edit this file directly, but instead make changes to the
sigs.yaml file in the project root.

To understand how this file is generated, see https://git.k8s.io/community/generator/README.md
--->
# Checkpoint Restore Working Group

This working group aims to provide a central location for the community to discuss the integration of Checkpoint/Restore functionality into Kubernetes.

The [charter](charter.md) defines the scope and governance of the Checkpoint Restore Working Group.

## Stakeholder SIGs
* [SIG API Machinery](/sig-api-machinery)
* [SIG Apps](/sig-apps)
* [SIG Auth](/sig-auth)
* [SIG Node](/sig-node)
* [SIG Scheduling](/sig-scheduling)



## Organizers

* Adrian Reber (**[@adrianreber](https://github.com/adrianreber)**), Red Hat
* Peter Hunt (**[@haircommander](https://github.com/haircommander)**), Red Hat
* Radostin Stoyanov (**[@rst0git](https://github.com/rst0git)**), University of Oxford
* Viktória Spišaková (**[@viktoriaas](https://github.com/viktoriaas)**), Masaryk University

## Contact
- Slack: [#wg-checkpoint-restore](https://kubernetes.slack.com/messages/wg-checkpoint-restore)
- [Mailing list](https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore)
- Steering Committee Liaison: Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**)
<!-- BEGIN CUSTOM CONTENT -->

<!-- END CUSTOM CONTENT -->
88 changes: 88 additions & 0 deletions wg-checkpoint-restore/charter.md
@@ -0,0 +1,88 @@

# WG Checkpoint Restore Charter

This charter adheres to the conventions described in the [Kubernetes Charter README] and uses
the Roles and Organization Management outlined in [sig-governance].

## Scope

The Checkpoint/Restore Working Group aims to solve the problem of transparently
checkpointing and restoring workloads in Kubernetes, a [functionality discussed
for over five years][kep2008]. The group will deliver the design and
implementation of Checkpoint/Restore functionality in Kubernetes, serving as a
central hub for community information and discussion. This initiative addresses
a wide range of problems, including fault tolerance, improved resource
utilization, and accelerated application startup times.

### In scope

- Identify core Kubernetes checkpoint/restore use cases (e.g., live migration,
fault tolerance, debugging, snapshotting) and gather stakeholder requirements.
- Investigate and propose Kubernetes APIs for checkpoint/restore operations.
- Work with SIGs for the best integration of checkpoint/restore functionality
and APIs.
- Provide guidance for developers on checkpoint-friendly app design and
Member

@SergeyKanzhelev, Jul 28, 2025

There may be an API needed to communicate between the app and the API server that a checkpoint is requested and/or that the app is ready for a checkpoint. That is something beyond just guidance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is actually something we have been discussing how to do in containers for years (outside of Kubernetes), but we never found the right way to do it. We were looking at kernel interfaces or systemd interfaces because for many applications it could be helpful to free temporary memory to reduce checkpoint size, or even to drop confidential information. Also, after restore it would be good to tell the application that certain cryptographic values may need to be reset or regenerated. I will try to include something mentioning this. Thanks.

recommendations for operators on feature management.
- Work closely with relevant upstream projects (CRI-O, containerd, CRIU, gVisor)
for alignment and integration.
- Revisit the existing implementations to find and remedy possible inefficiencies.
One example is the existing checkpoint archive format which has already been
identified as being a major source of slowdown.

### Out of scope

- Not focused on general OS-level checkpointing outside Kubernetes
pods/containers.
- Will not dictate internal application checkpointing logic; focuses on
Kubernetes platform orchestration of *container/pod state*.

## Stakeholders

Stakeholders in this working group span multiple SIGs that own parts of the
code in core Kubernetes components and addons.

- SIG API Machinery
- SIG Node
- SIG Scheduling
- SIG Auth
- SIG Apps

## Deliverables

The list of deliverables includes the following high-level features:

- In the early stage, we mainly want to offer a well-defined location for the
community to find information, ask questions, and discuss the next steps of
enabling checkpoint and restore in Kubernetes.

Later:

- Ability to checkpoint and restore a container using kubectl
- Ability to checkpoint and restore a pod using kubectl
- Integration of container/pod checkpointing in scheduling decisions
Contributor

Why pod checkpointing would have anything to do with scheduling?

Member Author

Because, as far as I know, a pod is always scheduled on one node. It doesn't sound useful to base the scheduling on the possibility to migrate containers. Container migration is an important first step, but for automatic scheduling decisions, it would make more sense to be able to easily migrate a complete pod.

Member

Our use-case is similar to how CRIU is integrated with Google's Borg[^1] and Microsoft's Singularity[^2] to enable preemptive and elastic scheduling.

[^1]: Task Migration at Scale Using CRIU
[^2]: Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads

> Why pod checkpointing would have anything to do with scheduling?

I agree that pod checkpointing needs to take scheduling into consideration. Let's assume the following scenario: after a Pod is checkpointed, it might not be deleted. However, since the memory, CPU, and GPU resources have already been dumped to a volume, the resources originally allocated to the Pod could then be reallocated to other Pods. When the Pod needs to be resumed later via restore, the scheduler should also be involved.


## Roles and Organization Management

This WG adheres to the Roles and Organization Management outlined in [wg-governance]
and opts in to updates and modifications to [wg-governance].

[wg-governance]: /committee-steering/governance/wg-governance.md

Additionally, the WG commits to:

- maintain a solid communication line between the Kubernetes groups and the
wider CNCF community

## Timelines and Disbanding

As a first mandate, the WG will propose a draft roadmap and identify key tasks in the first quarter of operation.

After that, the WG will facilitate collaboration among community members to explore possible APIs and draft proposals for their integration into Kubernetes, which will then be presented to the relevant SIGs.

Achieving the aforementioned deliverables, also mentioned in the `In Scope`
section, will allow us to decide when to disband this WG. There is no
expectation that the Working Group will be converted into a SIG long term.

[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md
[kep2008]: https://github.com/kubernetes/enhancements/issues/2008