2.0.0
Introduction
Version 2.0.0 brings newer capabilities to the LitmusChaos platform, enabling a more efficient practice of chaos engineering. The major version upgrade reflects significant improvements and new features in the platform, many of which were introduced & curated across several preceding 2.0 beta releases with community feedback (thanks to all the early adopters & beta testers for your continued support). Some of these changes, especially newer experiments and observability improvements, have been made available in 1.x too.
Litmus 1.x brought a cloud-native approach to chaos engineering, covering the definition and execution of chaos intent, along with a ready set of experiments maintained in the ChaosHub. Along the way, newer requirements were incorporated into the project, most notably a centralized approach for managing chaos across environments (K8s clusters and cloud instances) and the ability to define workflows that stitch together multiple experiments as part of a complex scenario.
The 2.0 GA release brings these features into the mainstream, having been validated for their usefulness & architecture. Subsequent improvements to these will be carried out in 2.x releases. Some salient features are described briefly in the sections below:
Chaos Center
- A chaos control plane or portal which provides centralized management of chaos operations on multiple clusters across datacenters/cloud. The control plane carries out experiments through agents installed on the registered clusters.
- Comprises documented APIs that can be used to invoke chaos programmatically
- Provides visualization capabilities and analytics around chaos execution.
- Supports a project-teams-users structure to enable collaboration within teams for chaos operations.
Litmus Workflows
- Introduces chaos workflows to (a) automate dependency setup, (b) aid creation of complex chaos scenarios with multiple faults, and (c) support definition of load/validation jobs along with chaos injection.
- Provides flexibility in creating/running workflows in different ways - via templates, from an integrated hub, and custom uploads.
Multi-Tenancy
- Supports setup (control plane & agents) and execution of chaos experiments in both cluster-scoped and namespace-scoped modes, to help operations in shared clusters with a self-service model.
Observability & Steady State Hypothesis Validation
- Provides an increased set of Prometheus metrics with additional filters - which can be used for instrumenting application dashboards to observe chaos impact
- Provides a diverse set of probes to automate validation of the steady-state hypothesis, thereby improving the efficiency of running automated chaos experiments.
GitOps for Chaos
- Integrates with Git-based SCM to provide a single source of truth for chaos artifacts (workflows), such that changes are synchronized bi-directionally between the git source and the chaos center, so that the latest artifact is pulled for execution.
- Provides an event-tracker microservice to automatically launch “subscribed” chaos workflows upon app upgrades effected by GitOps tools such as ArgoCD and Flux.
Non-Kubernetes Chaos
- Adds experiments to inject chaos on infrastructure (cloud) resources such as VMs/instances and disks (AWS, GCP, Azure, VMware), irrespective of whether they host a Kubernetes cluster or not.
Release Cadence & Versioning
The release cadence & naming conventions continue to adhere to the principles followed thus far in the Litmus project: the monthly minor version releases (2.x.0) will happen on the 15th, with patch releases/hotfixes going into 2.x.x, on a need/demand basis. The 1.x version will be stopped at 1.13.x (1.13.8 at this point) and further patches will be made only upon request/community need.
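The cadence described above (a minor release on the 15th of each month) can be sketched as a small date calculation. This is only an illustration: the helper name `next_minor_releases` is hypothetical, and it deliberately does not try to predict version numbers or patch releases, which are cut on demand.

```python
from datetime import date
from typing import List


def next_minor_releases(after: date, count: int) -> List[date]:
    """Return the next `count` dates that fall on the 15th, strictly after `after`."""
    year, month = after.year, after.month
    # If this month's 15th has already been reached, start from next month.
    if after.day >= 15:
        month += 1
    releases: List[date] = []
    while len(releases) < count:
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
        releases.append(date(year, month, 15))
        month += 1
    return releases


# Example: the three release dates following the 2.0.0 changelog cut-off (15/08/2021).
upcoming = next_minor_releases(date(2021, 8, 15), 3)
```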
Backward Compatibility
Litmus 2.x fully supports the 1.x mode of execution that users are familiar with, i.e., you can still deploy the latest chaos-operator deployment in admin/namespace mode, pull ChaosExperiment templates/CRs from the ChaosHub & trigger chaos by applying the ChaosEngine CR. The latest chaos-exporter & chaos-scheduler will continue to be operable as they are. However, the introduction of the Chaos Center (also commonly referred to as Litmus Portal by the beta test community) simplifies the above process greatly while giving you additional nuts & bolts.
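As a rough illustration of the 1.x-style flow, the sketch below assembles a ChaosEngine manifest as a plain dictionary and prints it as JSON. The helper `build_chaos_engine` is hypothetical, and all values (engine name, namespace, label, experiment, service account, env) are placeholders chosen for the example; consult the ChaosEngine schema documentation for the authoritative field list.

```python
import json


def build_chaos_engine(name, namespace, app_label, app_kind, experiment, env=None):
    """Build a ChaosEngine manifest as a dict (hypothetical helper, placeholder values)."""
    env_list = [{"name": key, "value": str(value)} for key, value in (env or {}).items()]
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            "annotationCheck": "false",
            "appinfo": {"appns": namespace, "applabel": app_label, "appkind": app_kind},
            # Assumption: a service account with this name already exists in the namespace.
            "chaosServiceAccount": f"{experiment}-sa",
            "experiments": [
                {"name": experiment, "spec": {"components": {"env": env_list}}}
            ],
        },
    }


manifest = build_chaos_engine(
    name="nginx-chaos",
    namespace="default",
    app_label="app=nginx",
    app_kind="deployment",
    experiment="pod-delete",
    env={"TOTAL_CHAOS_DURATION": 30},
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -` once the operator and the corresponding ChaosExperiment CR are installed.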
Migration from 1.x to 2.x
To make use of the Chaos Center and other capabilities of Litmus 2.0, please remove any existing ChaosEngines, uninstall the chaos operator deployment & follow the Litmus 2.x installation instructions.
If you would like to consume just the backend infrastructure components (chaos operator, CRDs, et al.), please follow the regular procedure of applying the latest operator manifest, or start using the operator helm chart to allow for subsequent helm upgrades.
If you are a beta user on 2.0.0-beta9, follow the upgrade procedure to start using the Litmus 2.0 GA build.
Documentation
The documentation has undergone considerable changes in terms of content and structure, and it continues to undergo improvements as of the 2.0 release. We expect that a few more iterations are needed to sort out the Information Architecture.
The installation details for the 2.0 platform, along with detailed introductions to concepts and architecture as well as a user guide, are now available at https://docs.litmuschaos.io/
The latest chaos experiment details, along with chaos custom resource schema specifications (tunables, examples, etc.) and detailed FAQs & troubleshooting info, can be found at https://litmuschaos.github.io/litmus/
For those continuing to use 1.x releases, please note that the docs have moved to: https://v1-docs.litmuschaos.io/
Misc (monthly changelog from 15/07/2021 to 15/08/2021)
Notes on changes to control plane (chaos center) since 2.0.0-beta9
- Added new API routes to check the status of the authentication server and to update the user details
- Added an API to terminate chaos workflow
- Added namespace scope support for event tracker
- Bug fixes/enhancements in the frontend
- Fixes a typo in the nodeSelector schema key
- Adheres to correct schema in the steady-state validation wizard for Litmus Probes
- Fixes the inability to login/authenticate after upgrade of chaos-center
Notes on changes to backend execution infrastructure (chaos operator, experiments) since 1.13.8
- Supports VMs belonging to scale sets (VMSS) as target resources in the Azure instance stop experiment
- Fixes the limitation/inability to perform abort operations in the namespaced mode of operation in a chaos operator.
- Fixes an issue (edge case in scaled scenarios) within the abort functionality for the “exec” based chaos experiments (pod-cpu-hog-exec & pod-memory-hog-exec) wherein chaos injection continues to occur even after an abort has been issued.
- Adds a fix to fail faster when helper pods do not run successfully in an experiment (fail immediately upon identifying helper failure instead of waiting for the customary statusCheckTimeout of 180s, as the helper pods are usually brought up with restartPolicy set to Never)
- Fixes the inability of certain experiments (pod-cpu-hog-exec, pod-memory-hog-exec, pod-dns-error and pod-dns-spoof) to select targets serially for cases where 0 < PODS_AFFECTED_PERC <= 100.
- Adds a condition to error out/call out the engine schema when neither the .spec.appinfo.applabel nor the TARGET_PODS env is specified.
- Adds the missing ability to perform an auxiliary application health check in the node-memory-hog experiment, and the missing support for specifying multiple target nodes via a comma-separated list in the node-cpu-hog, node-memory-hog & node-io-stress experiments.
- Fixes a regression in recent 1.13.x experiments wherein the .spec.appinfo.appkind is mandated (in order to derive the parent controller name for pods, as this is used to patch the chaosresult status with target info) even when .spec.annotationCheck is set to false. With this fix, you will see the older behavior wherein appkind can be left empty for cases where annotationCheck is set to false in the ChaosEngine CR.
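The two schema-related notes in the list above (the applabel/TARGET_PODS check and the relaxed appkind requirement) can be read as simple validation rules. The sketch below is not the operator's implementation; `validate_engine` is a hypothetical helper, and two points are assumptions of this sketch rather than statements from the release notes: that appkind is still needed when annotationCheck is true, and that an unset annotationCheck is treated as false.

```python
from typing import Dict, List


def validate_engine(spec: Dict, env: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a ChaosEngine spec; empty means the checks pass."""
    errors: List[str] = []
    appinfo = spec.get("appinfo") or {}

    # Rule 1: a target must be identifiable, either by label or by explicit pod names.
    if not appinfo.get("applabel") and not env.get("TARGET_PODS"):
        errors.append("either .spec.appinfo.applabel or the TARGET_PODS env must be set")

    # Rule 2: appkind may be empty when annotationCheck is false.
    # Assumption: it is required otherwise; an unset annotationCheck counts as false.
    annotation_check = str(spec.get("annotationCheck", "false")).lower() == "true"
    if annotation_check and not appinfo.get("appkind"):
        errors.append(".spec.appinfo.appkind is required when .spec.annotationCheck is true")

    return errors


# Example: no label and no TARGET_PODS yields one error; appkind may stay empty here.
problems = validate_engine({"appinfo": {}, "annotationCheck": "false"}, env={})
```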