3.0-Beta
#3801
3.0 Beta Programme
Hey Litmus community! First of all, thank you for all your support of the LitmusChaos project. The project team is excited to announce that we have embarked on the Litmus 3.0 Beta programme (set to continue over the next several months) to incubate important enhancements and seek feedback before the release goes GA.
Here are some of the motivations for a major version change in the project, which we hope will help you further your adoption of the tool and make your chaos engineering initiatives even more effective:
More Robust
Improved Chaos Orchestration: Today, Litmus provides several off-the-shelf faults as well as a framework to construct new experiments (BYOC), either from scratch or by reusing existing (in-house) tooling/scripts. To achieve this, it uses a balance of declarative definitions (e.g., app selection, steady-state checks, state management) and imperative definitions (e.g., the fault injection script executed by experiment pods). The latter makes it challenging, under certain conditions, to ensure that no residual chaos resources remain in the system (say, in cases where the experiment pods are evicted or killed). With 3.0 this will be mitigated so that no orphaned/residual chaos resources remain in the system, making chaos injection truer in spirit to a fully "reconciled" action. The changes will be made in the core experiment execution infrastructure (chaos-operator, chaos-runner, chaos-experiments).
Helm-Based Automation: As the earliest chaos project with support for multi-tenant chaos, both in terms of user management (via chaos projects) and targets onboarded (remote/multi-cluster or namespace support through chaos agents/subscribers), Litmus already provides the ability to do chaos across teams and across a fleet of clusters. However, the time-to-activation when adding remote clusters (the time taken from agent setup via litmusctl to successful chaos execution by selecting the remote chaos agent) can be further reduced and the process simplified. With 3.0, chaos-agent setup will be simplified via a dedicated Helm chart that will help users audit the specifications, tune for topology and resource constraints, and bootstrap clusters to become chaos-enabled right from bring-up. This area will also see some interesting community contributions becoming "mainstream" and "supported" - essentially Helm-based automation that will help create bespoke chaos workspaces for users based on simple profile(s).
Simplified UX: While Litmus already provides an easy path to construct complex chaos scenarios, the user experience around "what is ongoing" and "what is the impact/what the experiment run tells me" is paramount for users. With 3.0, this will be made more effective.
Leaner
Native Workflows: Argo has been the engine of choice for the infrastructure that powers our chaos workflows, thanks to its excellent flexibility (sequencing of tasks, support for custom steps, artifact templates & sinks, etc.). However, this also brings certain resource and speed trade-offs (each experiment or step requires more pods). While Argo will continue to power our chaos workflows, 3.0 will also see the introduction of Litmus-native workflows as well as the ability to directly launch ChaosEngines on desired targets. The latter is especially useful in cases where the chaos scenarios are simple and straightforward.
Better Experiment Scalability through per-node helpers: Helper pods are Litmus's way of ensuring you do not have to perennially run privileged daemonsets on your system as chaos agents. They are transient and target-app specific. However, today's mapping of one helper pod per target can cause scalability challenges when doing chaos against multi-replica deployments (100s of replicas) with higher POD_AFFECTED_PERCENTAGE settings. With 3.0 the helper model is being enhanced to a per-node model, while still retaining its transient and target-app specific properties. This will help further Litmus usage in more resource-heavy environments.
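To make the scalability argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not Litmus code: the function names, the round-down selection rule, the "at least one target" floor, and the pod/node layout are all assumptions made for illustration. It only compares how many helper pods a one-helper-per-target mapping needs against a one-helper-per-node mapping for the same set of targets.

```python
def select_targets(pod_to_node: dict, pods_affected_percentage: int) -> dict:
    """Pick the share of pods to target (hypothetical rule: round down, min 1)."""
    names = sorted(pod_to_node)
    if not names:
        return {}
    count = max(1, len(names) * pods_affected_percentage // 100)
    return {name: pod_to_node[name] for name in names[:count]}


def helpers_per_target(targets: dict) -> int:
    """Per-target model: one helper pod for every targeted pod."""
    return len(targets)


def helpers_per_node(targets: dict) -> int:
    """Per-node model: one helper pod for every node hosting a targeted pod."""
    return len(set(targets.values()))


# Illustrative layout: 200 replicas spread round-robin across 10 nodes.
pods = {f"pod-{i:03d}": f"node-{i % 10}" for i in range(200)}
targets = select_targets(pods, pods_affected_percentage=50)

print("targets:", len(targets))
print("helpers (per-target):", helpers_per_target(targets))
print("helpers (per-node):", helpers_per_node(targets))
```

Under these assumptions the per-target count grows with the number of affected replicas, while the per-node count is bounded by the number of nodes that host targets.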
Added Developer Focus
Workflow Support in chaos-ci-lib: While SRE-driven use cases are still the primary drivers for adoption of the chaos practice, the newer persona really pushing usage is that of QA teams and developers, primarily because of the benefits that left-shifting provides. For this group of users, the ease of executing chaos "tests" within pipelines is a major consideration. With 3.0, the Litmus chaos-ci-lib (currently based on 1.x) will be enhanced to work with the latest workflow-driven control/execution planes, with executions via the lib tracked on the Chaos-Center.
Codebase Refactor: Contributions to Litmus chaos experiments will be further simplified after the ongoing refactor of the experiments codebase, with less duplicated code and a more modular approach to steady-state checks and fault injections across target types. Developers will spend less time making their fixes or enhancing functionality.
Improved SDK: The SDK will be equipped with better test capabilities (including an experiment completeness score) and documentation, with additional "templates" for chaos (i.e., experiments with helper-based chaos injection, exec-based chaos injection, and 3rd-party-API-based chaos injection) so that developers have more scaffolding support and can spend more time on business logic.