diff --git a/OWNERS_ALIASES b/OWNERS_ALIASES index 592ce4e6cc8..8b4c7a5f30c 100644 --- a/OWNERS_ALIASES +++ b/OWNERS_ALIASES @@ -146,6 +146,11 @@ aliases: - mwielgus - soltysh - swatisehgal + wg-checkpoint-restore-leads: + - adrianreber + - haircommander + - rst0git + - viktoriaas wg-data-protection-leads: - xing-yang - yuxiangqian diff --git a/liaisons.md b/liaisons.md index 5640e615b20..cd56458fd61 100644 --- a/liaisons.md +++ b/liaisons.md @@ -58,6 +58,7 @@ members will assume one of the departing members groups. | [WG AI Gateway](wg-ai-gateway/README.md) | Stephen Augustus (**[@justaugustus](https://github.com/justaugustus)**) | | [WG AI Integration](wg-ai-integration/README.md) | Paco Xu 徐俊杰 (**[@pacoxu](https://github.com/pacoxu)**) | | [WG Batch](wg-batch/README.md) | Antonio Ojea (**[@aojea](https://github.com/aojea)**) | +| [WG Checkpoint Restore](wg-checkpoint-restore/README.md) | Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**) | | [WG Data Protection](wg-data-protection/README.md) | Patrick Ohly (**[@pohly](https://github.com/pohly)**) | | [WG Device Management](wg-device-management/README.md) | Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**) | | [WG etcd Operator](wg-etcd-operator/README.md) | Maciej Szulik (**[@soltysh](https://github.com/soltysh)**) | diff --git a/sig-api-machinery/README.md b/sig-api-machinery/README.md index 2fa05b2f5d3..523eb8d1e8f 100644 --- a/sig-api-machinery/README.md +++ b/sig-api-machinery/README.md @@ -55,6 +55,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. 
The following [working groups][working-group-definition] are sponsored by sig-api-machinery: * [WG AI Integration](/wg-ai-integration) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Structured Logging](/wg-structured-logging) diff --git a/sig-apps/README.md b/sig-apps/README.md index ce44ac07645..02d0754e209 100644 --- a/sig-apps/README.md +++ b/sig-apps/README.md @@ -59,6 +59,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-apps: * [WG AI Integration](/wg-ai-integration) * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Data Protection](/wg-data-protection) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sig-auth/README.md b/sig-auth/README.md index 39110757e10..2615e7b2088 100644 --- a/sig-auth/README.md +++ b/sig-auth/README.md @@ -66,6 +66,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-auth: * [WG AI Integration](/wg-ai-integration) +* [WG Checkpoint Restore](/wg-checkpoint-restore) ## Subprojects diff --git a/sig-list.md b/sig-list.md index a2de706ba1f..a8ce5be5bd2 100644 --- a/sig-list.md +++ b/sig-list.md @@ -65,6 +65,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md) |[AI Gateway](wg-ai-gateway/README.md)|[ai-gateway](https://github.com/kubernetes/kubernetes/labels/wg%2Fai-gateway)|* Multicluster
* Network
|* [Keith Mattix](https://github.com/keithmattix), Microsoft
* [Flynn](https://github.com/kflynn), Buoyant
* [Kellen Swain](https://github.com/kfswain), Google
* [Nir Rozenbaum](https://github.com/nirrozenbaum), IBM
* [Shane Utt](https://github.com/shaneutt), Red Hat
* [Xunzhuo](https://github.com/xunzhuo), Tencent
|* [Slack](https://kubernetes.slack.com/messages/wg-ai-gateway)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-ai-gateway)|* WG AI Gateway Bi-Weekly Meeting (Earlier Option): [Mondays at 12PM UTC (bi-weekly)]()
* WG AI Gateway Bi-Weekly Meeting (Later Option): [Thursdays at 6PM UTC (bi-weekly)]()
|[AI Integration](wg-ai-integration/README.md)|[ai-integration](https://github.com/kubernetes/kubernetes/labels/wg%2Fai-integration)|* API Machinery
* Apps
* Architecture
* Auth
* CLI
|* [Arda Guclu](https://github.com/ardaguclu), Red Hat
* [Arush Sharma](https://github.com/rushmash91), Amazon
* [Zvonko Kaiser](https://github.com/zvonkok), NVIDIA
|* [Slack](https://kubernetes.slack.com/messages/wg-ai-integration)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-ai-integration)|* WG AI Integration Weekly Meeting ([calendar](https://calendar.google.com/calendar/embed?src=71ef14cc0995618018b12614c63ca482d667e2922ff5b94d9fb0cfd32d4efada%40group.calendar.google.com)): [Wednesdays at 10:00 PT (Pacific Time) (weekly)](https://zoom.us/j/95637970280?pwd=3Ys5MQF5hKoeWDazUsMdgt5FiRxbSs.1)
|[Batch](wg-batch/README.md)|[batch](https://github.com/kubernetes/kubernetes/labels/wg%2Fbatch)|* Apps
* Autoscaling
* Node
* Scheduling
|* [Kevin Hannon](https://github.com/kannon92), Red Hat
* [Marcin Wielgus](https://github.com/mwielgus), Google
* [Maciej Szulik](https://github.com/soltysh), Defense Unicorns
* [Swati Sehgal](https://github.com/swatisehgal), Red Hat
|* [Slack](https://kubernetes.slack.com/messages/wg-batch)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-batch)|* Regular Meeting ([calendar](https://calendar.google.com/calendar/embed?src=8ulop9k0jfpuo0t7kp8d9ubtj4%40group.calendar.google.com)): [Thursdays (starting February 15th 2024)s at 3PM CET (Central European Time) (monthly)](https://zoom.us/j/98329676612?pwd=c0N2bVV1aTh2VzltckdXSitaZXBKQT09)
+|[Checkpoint Restore](wg-checkpoint-restore/README.md)|[checkpoint-restore](https://github.com/kubernetes/kubernetes/labels/wg%2Fcheckpoint-restore)|* API Machinery
* Apps
* Auth
* Node
* Scheduling
|* [Adrian Reber](https://github.com/adrianreber), Red Hat
* [Peter Hunt](https://github.com/haircommander), Red Hat
* [Radostin Stoyanov](https://github.com/rst0git), University of Oxford
* [Viktória Spišaková](https://github.com/viktoriaas), Masaryk University
|* [Slack](https://kubernetes.slack.com/messages/wg-checkpoint-restore)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore)| |[Data Protection](wg-data-protection/README.md)|[data-protection](https://github.com/kubernetes/kubernetes/labels/wg%2Fdata-protection)|* Apps
* Storage
|* [Xing Yang](https://github.com/xing-yang), VMware
* [Xiangqian Yu](https://github.com/yuxiangqian), Google
|* [Slack](https://kubernetes.slack.com/messages/wg-data-protection)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-data-protection)|* Regular WG Meeting: [Wednesdays at 9:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/j/6933410772)
|[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture
* Autoscaling
* Network
* Node
* Scheduling
|* [John Belamaric](https://github.com/johnbelamaric), Google
* [Kevin Klues](https://github.com/klueska), NVIDIA
* [Patrick Ohly](https://github.com/pohly), Intel
|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting (Asia/Europe): [Wednesdays at 9:00 CET (Central European Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)
* Regular WG Meeting (Europe/America): [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle
* etcd
|* [Benjamin Wang](https://github.com/ahrtr), VMware
* [Ciprian Hacman](https://github.com/hakman), Microsoft
* [Josh Berkus](https://github.com/jberkus), Red Hat
* [James Blair](https://github.com/jmhbnz), Red Hat
* [Justin Santa Barbara](https://github.com/justinsb), Google
|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)
diff --git a/sig-node/README.md b/sig-node/README.md index 1ef3742dbb3..1826db21a11 100644 --- a/sig-node/README.md +++ b/sig-node/README.md @@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-node: * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Device Management](/wg-device-management) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sig-scheduling/README.md b/sig-scheduling/README.md index 1d6b8c590f0..f0365b30bc6 100644 --- a/sig-scheduling/README.md +++ b/sig-scheduling/README.md @@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-scheduling: * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Device Management](/wg-device-management) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sigs.yaml b/sigs.yaml index 2c0af04b965..b94a7b69594 100644 --- a/sigs.yaml +++ b/sigs.yaml @@ -3724,6 +3724,45 @@ workinggroups: liaison: github: aojea name: Antonio Ojea +- dir: wg-checkpoint-restore + name: Checkpoint Restore + mission_statement: > + This working group aims to provide a central location for the community to discuss + the integration of Checkpoint/Restore functionality into Kubernetes. 
+ + charter_link: charter.md + stakeholder_sigs: + - API Machinery + - Apps + - Auth + - Node + - Scheduling + label: checkpoint-restore + leadership: + chairs: + - github: adrianreber + name: Adrian Reber + company: Red Hat + email: areber@redhat.com + - github: haircommander + name: Peter Hunt + company: Red Hat + email: pehunt@redhat.com + - github: rst0git + name: Radostin Stoyanov + company: University of Oxford + email: radostin.stoyanov@eng.ox.ac.uk + - github: viktoriaas + name: Viktória Spišaková + company: Masaryk University + email: spisakova@ics.muni.cz + meetings: [] + contact: + slack: wg-checkpoint-restore + mailing_list: https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore + liaison: + github: BenTheElder + name: Benjamin Elder - dir: wg-data-protection name: Data Protection mission_statement: > diff --git a/wg-checkpoint-restore/README.md b/wg-checkpoint-restore/README.md new file mode 100644 index 00000000000..b0ae98da899 --- /dev/null +++ b/wg-checkpoint-restore/README.md @@ -0,0 +1,38 @@ + +# Checkpoint Restore Working Group + +This working group aims to provide a central location for the community to discuss the integration of Checkpoint/Restore functionality into Kubernetes. + +The [charter](charter.md) defines the scope and governance of the Checkpoint Restore Working Group. 
+ +## Stakeholder SIGs +* [SIG API Machinery](/sig-api-machinery) +* [SIG Apps](/sig-apps) +* [SIG Auth](/sig-auth) +* [SIG Node](/sig-node) +* [SIG Scheduling](/sig-scheduling) + + + +## Organizers + +* Adrian Reber (**[@adrianreber](https://github.com/adrianreber)**), Red Hat +* Peter Hunt (**[@haircommander](https://github.com/haircommander)**), Red Hat +* Radostin Stoyanov (**[@rst0git](https://github.com/rst0git)**), University of Oxford +* Viktória Spišaková (**[@viktoriaas](https://github.com/viktoriaas)**), Masaryk University + +## Contact +- Slack: [#wg-checkpoint-restore](https://kubernetes.slack.com/messages/wg-checkpoint-restore) +- [Mailing list](https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore) +- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore) +- Steering Committee Liaison: Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**) + + + diff --git a/wg-checkpoint-restore/charter.md b/wg-checkpoint-restore/charter.md new file mode 100644 index 00000000000..64166643e2c --- /dev/null +++ b/wg-checkpoint-restore/charter.md @@ -0,0 +1,88 @@ + +# WG Checkpoint Restore Charter + +This charter adheres to the conventions described in the [Kubernetes Charter README] and uses +the Roles and Organization Management outlined in [sig-governance]. + +## Scope + +The Checkpoint/Restore Working Group aims to solve the problem of transparently +checkpointing and restoring workloads in Kubernetes, a [functionality discussed +for over five years][kep2008]. The group will deliver the design and +implementation of Checkpoint/Restore functionality in Kubernetes, serving as a +central hub for community information and discussion. This initiative addresses +a wide range of problems, including fault tolerance, improved resource +utilization, and accelerated application startup times. 
+ +### In scope + +- Identify core Kubernetes checkpoint/restore use cases (e.g., live migration, + fault tolerance, debugging, snapshotting) and gather stakeholder requirements. +- Investigate and propose Kubernetes APIs for checkpoint/restore operations. +- Work with SIGs for the best integration of checkpoint/restore functionality + and APIs. +- Provide guidance for developers on checkpoint-friendly app design and + recommendations for operators on feature management. +- Work closely with relevant upstream projects (CRI-O, containerd, CRIU, gVisor) + for alignment and integration. +- Revisit the existing implementations to find and remedy possible inefficiencies. + One example is the existing checkpoint archive format, which has already been + identified as being a major source of slowdown. + +### Out of scope + +- Not focused on general OS-level checkpointing outside Kubernetes + pods/containers. +- Will not dictate internal application checkpointing logic; focuses on + Kubernetes platform orchestration of container/pod state. + +## Stakeholders + +Stakeholders in this working group span multiple SIGs that own parts of the +code in core Kubernetes components and addons. + +- SIG API Machinery +- SIG Node +- SIG Scheduling +- SIG Auth +- SIG Apps + +## Deliverables + +The list of deliverables includes the following high-level features: + +- In the early stage, we mainly want to offer a well-defined location for the + community to find information, ask questions, and discuss the next steps of + enabling checkpoint and restore in Kubernetes. + +Later: + +- Ability to checkpoint and restore a container using kubectl +- Ability to checkpoint and restore a pod using kubectl +- Integration of container/pod checkpointing in scheduling decisions + +## Roles and Organization Management + +This WG adheres to the Roles and Organization Management outlined in [wg-governance] +and opts in to updates and modifications to [wg-governance]. 
+ +[wg-governance]: /committee-steering/governance/wg-governance.md + +Additionally, the WG commits to: + +- maintain a solid communication line between the Kubernetes groups and the + wider CNCF community + +## Timelines and Disbanding + +As a first mandate, the WG will propose a draft roadmap and identify key tasks in the first quarter of operation. + +After that, the WG will facilitate collaboration among community members to explore possible APIs and draft proposals for their integration into Kubernetes, which will then be presented to the relevant SIGs. + +Achieving the aforementioned deliverables, also mentioned in the `In scope` +section, will allow us to decide when to disband this WG. There is no +expectation that the Working Group will be converted into a SIG long term. + +[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md +[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md +[kep2008]: https://github.com/kubernetes/enhancements/issues/2008