In this interview with Help Net Security, Aaron Rinehart, CTO at Verica, explains the discipline of chaos engineering and how it can help organizations build more resilient systems.
Could you explain the discipline of chaos engineering?
The practice arose at Netflix, when the company was dealing with significant outages and service disruptions and needed a way to improve its reliability at scale. Chaos engineering is a proactive discipline of experimentation to help navigate complexity within distributed systems in order to build confidence in the system’s capability to withstand turbulent conditions in production. The goal of this experimentation is to form new knowledge that you can use to improve your systems.
This differentiates chaos engineering from ordinary testing, which requires pre-established knowledge of specific properties of the system in order to write tests that validate those properties. Instead, chaos engineering seeks to verify whether the output of the system matches expectations. If it does not, this new knowledge indicates that some form of vulnerability is present in the system and needs to be investigated and remedied.
It is also worth noting that chaos engineering is not, as is often assumed, simply “breaking things in production.” Chaos engineering seeks to illustrate where flaws and performance boundaries exist in complex systems in a safe and controlled manner.
How can chaos engineering help organizations build more secure and resilient systems?
Security chaos engineering is about increasing confidence that our security mechanisms are effective at performing under the conditions for which we designed them. The identification of security control failures through proactive experimentation builds confidence in your system’s ability to defend against malicious conditions in production.
Through continuous security experimentation you reduce the likelihood of being caught off guard by unforeseen disruptions. These practices better prepare you, your team, and your organization to be effective and resilient when faced with security unknowns.
What are the phases of chaos engineering?
To specifically address the uncertainty of distributed systems at scale, chaos engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow four steps:
- Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real-world events, such as servers that crash, hard drives that malfunction, and network connections that are severed.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
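The four steps above can be sketched in code. This is a minimal illustration, not a real chaos engineering tool: the `Cluster` class is a toy stand-in for a replicated service, and the names `run_experiment`, `retry_on_failure`, and `tolerance` are assumptions made for the example. Steady state is defined here as the fraction of requests that succeed.

```python
import random


class Cluster:
    """Toy stand-in for a replicated service (hypothetical, for illustration only)."""

    def __init__(self, replicas, retry_on_failure):
        self.up = [True] * replicas
        self.retry_on_failure = retry_on_failure

    def crash(self, index):
        # Simulated real-world event: one server goes down.
        self.up[index] = False

    def handle(self, rng):
        # Route the request to a random replica.
        first = rng.randrange(len(self.up))
        if self.up[first]:
            return True
        if self.retry_on_failure:
            # Retry once against a different replica.
            others = [i for i in range(len(self.up)) if i != first]
            return self.up[rng.choice(others)]
        return False


def success_rate(cluster, requests, rng):
    # Step 1: steady state is a measurable output, here the request success rate.
    return sum(cluster.handle(rng) for _ in range(requests)) / requests


def run_experiment(retry_on_failure, tolerance=0.01, requests=5000, seed=1):
    rng = random.Random(seed)
    control = Cluster(3, retry_on_failure)
    experimental = Cluster(3, retry_on_failure)

    # Step 2: hypothesis: both groups keep the same steady state.
    # Step 3: introduce the variable in the experimental group only.
    experimental.crash(0)

    control_rate = success_rate(control, requests, rng)
    experimental_rate = success_rate(experimental, requests, rng)

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    hypothesis_holds = abs(control_rate - experimental_rate) <= tolerance
    return control_rate, experimental_rate, hypothesis_holds


print(run_experiment(retry_on_failure=True))
print(run_experiment(retry_on_failure=False))
```

In this toy model the hypothesis survives when retries are enabled and is disproved when they are not, which is the kind of finding that gives a team a concrete target for improvement.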
Continuous verification is an additional phase in which chaos engineering experiments run continuously to verify that the output of the system stays in line with expectations.
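As a rough sketch of what continuous verification might look like, the loop below re-runs a set of checks on every round and records each round in which the observed output stops matching expectations. Everything here is hypothetical: the `Verification` type, the check names, and the simulated drift are assumptions for illustration, and in practice a scheduler or pipeline would trigger the rounds rather than a simple loop.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verification:
    name: str
    check: Callable[[], bool]  # True when the system output matches expectations


def verify_continuously(verifications: List[Verification], rounds: int) -> List[Tuple[int, str]]:
    """Run every check once per round and collect (round, name) for each failure."""
    failures = []
    for round_number in range(1, rounds + 1):
        for verification in verifications:
            if not verification.check():
                failures.append((round_number, verification.name))
    return failures


def make_drifting_check(breaks_at_call: int) -> Callable[[], bool]:
    """Simulate a property that silently stops holding from a given call onward."""
    calls = {"count": 0}

    def check() -> bool:
        calls["count"] += 1
        return calls["count"] < breaks_at_call

    return check


verifications = [
    Verification("fallback-serves-cached-page", lambda: True),
    Verification("tls-required", make_drifting_check(3)),
]
print(verify_continuously(verifications, rounds=4))
# [(3, 'tls-required'), (4, 'tls-required')]
```

The point of running the same experiment repeatedly is that a property verified once can drift later; the second check above passes twice before it starts failing.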
What are the main principles of chaos engineering?
1. Distributed systems are inherently unpredictable and chaotic.
Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.
2. Be proactive, not reactive.
We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; or cascading failures when a single point of failure crashes. We must address the most significant weaknesses proactively, before they affect our customers in production.
3. Manage complexity (vs. trying to reduce it).
We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.
4. Learn via experimentation.
An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment.
Many chaos engineering tools simply inject a failure into a system. Precious few do so in a manner designed to safely experiment rather than blindly disrupt. Teams interested in chaos engineering will want to look for a solution that focuses on identifying and communicating the safety margin of whichever systems they want to investigate, as this is the only way to move the needle on improving availability or security.
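One way to picture the difference between safe experimentation and blind disruption is a ramp with an abort condition: fault intensity increases step by step, and the experiment stops as soon as the steady-state metric falls below an agreed threshold. The sketch below is an assumption-laden illustration, not how any particular tool works. The function `find_safety_margin`, the toy latency model, and the constants are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def find_safety_margin(
    measure: Callable[[int], float],
    intensities: Iterable[int],
    abort_below: float,
) -> Optional[int]:
    """Ramp fault intensity and stop at the first level where the metric drops
    below abort_below. Returns the highest intensity tolerated, or None if
    even the first level breaches the threshold."""
    tolerated = None
    for level in intensities:
        if measure(level) < abort_below:
            break  # abort condition: do not escalate any further
        tolerated = level
    return tolerated


# Toy model (assumed values): a request succeeds if it finishes before the timeout.
BASE_LATENCY_MS = 120
TIMEOUT_MS = 500


def toy_success_rate(added_latency_ms: int) -> float:
    return 1.0 if BASE_LATENCY_MS + added_latency_ms <= TIMEOUT_MS else 0.0


margin = find_safety_margin(
    toy_success_rate, [0, 100, 200, 300, 400, 500], abort_below=0.99
)
print(margin)  # 300
```

In this toy model the service tolerates 300 ms of added latency, and the ramp never attempts the 500 ms level because it aborts at 400 ms. The tolerated level is a simple stand-in for the safety margin the experiment is meant to communicate.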
Source: https://www.helpnetsecurity.com/2022/02/03/chaos-engineering/