universal security Commons: 3/5

Chaos Engineering for Security

1. Overview

Chaos Engineering for Security, often called Security Chaos Engineering (SCE), is the discipline of deliberately introducing security-related failures into a system to test and improve its resilience against attacks. It addresses a specific problem: security controls and incident response procedures are usually assumed to work, and SCE validates them empirically under controlled conditions rather than through theoretical assessment alone. By injecting realistic failure conditions and attack behaviors, such as misconfigurations, network failures, or unauthorized access attempts, organizations can uncover weaknesses in their security posture before malicious actors exploit them. This builds confidence that the system can withstand turbulent conditions and maintain its critical functions even when under attack.

The historical context of this pattern is rooted in the broader field of Chaos Engineering, which was pioneered by Netflix in the early 2010s with the creation of their “Chaos Monkey” tool. Originally designed to test the resilience and availability of their IT infrastructure by randomly disabling production instances, the principles of Chaos Engineering have since been adapted and extended to the domain of cybersecurity. The realization that the same experimental approach could be used to verify security assumptions and improve defensive capabilities led to the emergence of Security Chaos Engineering. This evolution represents a significant shift in security thinking, moving from a reactive, compliance-driven mindset to a proactive, engineering-based approach that embraces failure as a means of learning and continuous improvement.

For organizations and the commons, this pattern matters because systems are becoming more distributed and interconnected. As the attack surface expands, anticipating and preventing every possible security failure becomes impractical. Security Chaos Engineering provides a structured, experimental method for building cyber resilience: the ability to anticipate, withstand, recover from, and adapt to adverse conditions, stresses, attacks, or compromises. By regularly conducting security experiments, organizations can identify and remediate specific weaknesses and also foster security awareness and collaboration between development, operations, and security teams. This proactive stance helps protect critical infrastructure, sensitive data, and the public services that rely on them, contributing to a more secure and resilient digital commons.

2. Core Principles

Hypothesize about Steady State: Before injecting any failures, it is crucial to define and understand the system’s normal operating behavior, or “steady state.” This involves establishing measurable metrics that indicate the system is healthy and secure. The hypothesis of a security chaos experiment is that the system will maintain this steady state even when a specific failure is introduced. If it does, the experiment provides evidence of resilience to that particular threat; if it does not, the deviation points to a weakness to fix.
Vary Real-World Events: The failures injected into the system should be based on real-world threats and attack vectors. This can include simulating misconfigurations, network segmentation failures, credential compromise, or the behavior of known malware. By using realistic scenarios, organizations can gain a more accurate understanding of how their systems will behave during an actual security incident.
Minimize Blast Radius: Security chaos experiments should be designed to have the smallest possible impact on the production environment. This is achieved by starting with small, controlled experiments in pre-production or isolated environments and gradually increasing the scope as confidence in the system’s resilience grows. The goal is to learn from failures without causing a major disruption to users or business operations.
Automate Experiments to Run Continuously: To keep pace with the rapid rate of change in modern software systems, security chaos experiments should be automated and integrated into the CI/CD pipeline. Continuous experimentation ensures that the system’s security posture is constantly being validated as new code is deployed and the infrastructure evolves. Automation also allows for a wider range of experiments to be run on a regular basis, providing a more comprehensive assessment of the system’s resilience.
Believe in the System, but Verify: A core tenet of Security Chaos Engineering is to move beyond assumptions and to empirically verify that security controls and procedures work as intended. While it is important to have confidence in the security measures that have been put in place, it is equally important to test them under realistic conditions to ensure they are effective.
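
The principles above can be sketched as a small experiment loop: check the steady state, inject one failure, check again, and always roll back. This is a minimal illustration under stated assumptions, not a reference implementation. The `Experiment` class, the `run_experiment` function, and the toy firewall and detector state are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # True while the system is healthy and secure
    inject: Callable[[], None]        # introduces one security failure
    rollback: Callable[[], None]      # undoes the injection


def run_experiment(exp: Experiment) -> str:
    """Return 'aborted', 'hypothesis held', or 'deviation found'."""
    # Never inject into a system that is already outside its steady state.
    if not exp.steady_state():
        return "aborted"
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        # Keep the blast radius small: undo the injection even if the check raises.
        exp.rollback()
    return "hypothesis held" if held else "deviation found"


# Toy system (hypothetical): a set of open ports plus a detector that alerts on port 22.
state = {"open_ports": {443}, "alerts": []}


def detector_scan() -> None:
    if 22 in state["open_ports"]:
        state["alerts"].append("unexpected port 22 open")


def steady_state() -> bool:
    # Healthy means: no risky port is open, or the detector has alerted on it.
    detector_scan()
    return 22 not in state["open_ports"] or bool(state["alerts"])


def open_ssh_port() -> None:
    state["open_ports"].add(22)


def close_ssh_port() -> None:
    state["open_ports"].discard(22)


result = run_experiment(
    Experiment("misconfigured-firewall", steady_state, open_ssh_port, close_ssh_port)
)
print(result)
```

Because the rollback sits in a `finally` block, the injected misconfiguration is removed whether or not the hypothesis holds. In a real system the steady-state check would query monitoring data rather than an in-memory dictionary.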

3. Key Practices

Threat Modeling: Before conducting any experiments, it is essential to perform a thorough threat modeling exercise to identify potential vulnerabilities and attack vectors. This involves analyzing the system architecture, data flows, and trust boundaries to understand where the system is most likely to be attacked. The output of the threat modeling process can then be used to design targeted security chaos experiments.
GameDays: GameDays are structured events where teams come together to run a series of security chaos experiments in a controlled environment. These exercises provide a valuable opportunity for developers, operations engineers, and security professionals to collaborate, share knowledge, and practice their incident response procedures. GameDays can help to build muscle memory and improve the organization’s overall security readiness.
Integrate with MITRE ATT&CK®: The MITRE ATT&CK® framework provides a comprehensive knowledge base of adversary tactics and techniques based on real-world observations. By mapping security chaos experiments to the ATT&CK framework, organizations can ensure that they are testing for the most relevant and up-to-date threats. This also provides a common language for describing and sharing information about security experiments.
Measure and Improve: The goal of Security Chaos Engineering is not just to find weaknesses, but to drive continuous improvement. This requires a focus on measurement and data analysis. Key metrics to track include Mean Time to Detect (MTTD), Mean Time to Respond (MTTR), and the percentage of experiments that result in a deviation from the steady state. These metrics can be used to identify areas for improvement and to demonstrate the value of the security chaos engineering program over time.
Start Small and Iterate: When first adopting Security Chaos Engineering, it is important to start with small, simple experiments and to gradually increase the complexity and scope over time. This allows the team to build confidence and to learn from their mistakes in a low-risk environment. A good starting point is to focus on a single, well-understood system and to run a few basic experiments, such as testing the response to a port scan or a simple denial-of-service attack.
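
One way to picture the ATT&CK mapping practice is a coverage report that groups experiment outcomes by technique ID. The sketch below is illustrative only: the experiment names, the result log, and the choice of in-scope techniques are made up for this example, while the technique IDs and names come from the public ATT&CK Enterprise matrix.

```python
from collections import defaultdict

# Hypothetical experiment log: (experiment name, ATT&CK technique ID, deviated from steady state?)
RESULTS = [
    ("port-scan-detection", "T1046", False),
    ("stolen-credential-login", "T1078", True),
    ("password-spraying", "T1110", False),
    ("stolen-credential-api-key", "T1078", False),
]

# Techniques the threat model marked as in scope (illustrative selection).
IN_SCOPE = {
    "T1046": "Network Service Discovery",
    "T1078": "Valid Accounts",
    "T1110": "Brute Force",
    "T1562": "Impair Defenses",
}


def coverage_report(results, in_scope):
    """Map each in-scope technique to 'untested', 'held', or 'deviation'."""
    by_technique = defaultdict(list)
    for _name, technique, deviated in results:
        by_technique[technique].append(deviated)

    report = {}
    for technique in in_scope:
        outcomes = by_technique.get(technique)
        if not outcomes:
            report[technique] = "untested"
        elif any(outcomes):
            # A single deviating experiment is enough to flag the technique.
            report[technique] = "deviation"
        else:
            report[technique] = "held"
    return report


report = coverage_report(RESULTS, IN_SCOPE)
for technique, status in report.items():
    print(f"{technique} {IN_SCOPE[technique]}: {status}")
```

A report like this makes gaps visible: techniques marked "untested" suggest the next experiments to design, and those marked "deviation" feed the remediation backlog.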

4. Implementation

Implementing a Security Chaos Engineering practice involves a systematic, multi-stage approach. The first step is to identify a suitable target system, ideally a non-critical application or service in a pre-production environment. Once a target has been selected, the team should conduct a thorough threat modeling exercise to identify potential vulnerabilities and to formulate a set of hypotheses about how the system will behave under different attack scenarios. For example, a hypothesis might be: “If a key authentication service becomes unavailable, the application will gracefully degrade and prevent unauthorized access to sensitive data.”
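
The example hypothesis above can be expressed as a small, self-contained test. The sketch assumes a toy `AuthService` and a `get_sensitive_record` handler, both hypothetical names invented here. It checks that the application fails closed when the authentication service is unavailable, returning an error instead of the data.

```python
from typing import Tuple


class AuthServiceUnavailable(Exception):
    """Raised when the (simulated) authentication service cannot be reached."""


class AuthService:
    def __init__(self) -> None:
        self.available = True
        self._valid_tokens = {"token-alice"}

    def verify(self, token: str) -> bool:
        if not self.available:
            raise AuthServiceUnavailable("auth backend unreachable")
        return token in self._valid_tokens


def get_sensitive_record(auth: AuthService, token: str) -> Tuple[int, str]:
    """Return an (HTTP-style status, body) pair; fail closed on auth outages."""
    try:
        allowed = auth.verify(token)
    except AuthServiceUnavailable:
        # Graceful degradation: report the outage but never serve the data.
        return 503, "authentication temporarily unavailable"
    if not allowed:
        return 401, "unauthorized"
    return 200, "sensitive record"


auth = AuthService()
baseline = get_sensitive_record(auth, "token-alice")

auth.available = False  # inject the failure
during_outage = get_sensitive_record(auth, "token-alice")
auth.available = True   # roll back

print(baseline, during_outage)
```

The hypothesis holds if the request made during the outage never returns the record, even for a token that is normally valid. A handler that swallowed the exception and continued would refute it.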

With a clear set of hypotheses in place, the next step is to design and execute the experiments. This involves selecting the appropriate tools and techniques to inject the desired failures into the system. There are a number of open-source and commercial tools available for Security Chaos Engineering, such as Chaos Toolkit, Gremlin, and the major cloud providers’ fault injection services. The experiments should be run in a controlled manner, with careful monitoring of the system’s behavior to detect any deviations from the expected steady state. After each experiment, the results should be analyzed to determine whether the hypothesis was validated or refuted. Any identified weaknesses or vulnerabilities should be documented and prioritized for remediation.

To ensure the long-term success of a Security Chaos Engineering program, it is essential to establish a set of key performance indicators (KPIs) to track progress and to demonstrate value to the business. These metrics might include the number of experiments run, the percentage of experiments that reveal a vulnerability, and the reduction in Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR) for security incidents. It is also important to foster a culture of continuous learning and improvement, where the insights gained from security chaos experiments are used to inform the design of more resilient systems and to improve the organization’s overall security posture.
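
The KPIs named above can be computed from a simple log of experiment records. The following is a rough sketch, assuming a hypothetical `ExperimentRecord` structure; it also assumes that MTTD is measured from injection to detection and MTTR from detection to response, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class ExperimentRecord:
    injected_at: datetime
    detected_at: Optional[datetime]   # None if the injection was never detected
    responded_at: Optional[datetime]  # None if nobody responded
    revealed_weakness: bool


def kpis(records: List[ExperimentRecord]) -> dict:
    if not records:
        raise ValueError("no experiment records to summarize")
    detect = [
        (r.detected_at - r.injected_at).total_seconds()
        for r in records if r.detected_at
    ]
    respond = [
        (r.responded_at - r.detected_at).total_seconds()
        for r in records if r.detected_at and r.responded_at
    ]
    return {
        "experiments_run": len(records),
        "pct_revealing_weakness": 100 * sum(r.revealed_weakness for r in records) / len(records),
        "undetected": sum(r.detected_at is None for r in records),
        "mttd_seconds": mean(detect) if detect else None,
        "mttr_seconds": mean(respond) if respond else None,
    }


# Illustrative data only; the timestamps are invented for this example.
t0 = datetime(2024, 1, 1, 12, 0)
records = [
    ExperimentRecord(t0, t0 + timedelta(seconds=60), t0 + timedelta(seconds=360), False),
    ExperimentRecord(t0, t0 + timedelta(seconds=180), t0 + timedelta(seconds=780), True),
    ExperimentRecord(t0, None, None, True),
    ExperimentRecord(t0, t0 + timedelta(seconds=120), t0 + timedelta(seconds=420), False),
]
summary = kpis(records)
print(summary)
```

Undetected injections are counted separately rather than folded into MTTD, since averaging only the detected cases would otherwise hide the most serious gaps.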

5. 7 Pillars Assessment

| Pillar | Score (1-5) | Rationale |
| --- | --- | --- |
| Purpose | 4 | The purpose of Security Chaos Engineering is clear and compelling: to proactively improve system resilience against cyber attacks. It provides a tangible way to move beyond theoretical security and into empirical validation, which is a significant step forward for any organization. |
| Governance | 3 | While the principles of Security Chaos Engineering provide a good framework for governance, implementing it effectively requires a high degree of maturity and discipline. Establishing clear rules of engagement, defining the blast radius, and ensuring that experiments are conducted safely can be challenging, especially in complex environments. |
| Culture | 4 | Security Chaos Engineering can have a profoundly positive impact on an organization’s culture. It encourages a shift from a culture of blame to a culture of learning, where failures are seen as opportunities for improvement. It also fosters collaboration between development, operations, and security teams, breaking down silos and creating a shared sense of ownership for security. |
| Incentives | 3 | The incentives for adopting Security Chaos Engineering are primarily intrinsic: the desire to build more secure and resilient systems. However, it can be challenging to quantify the ROI of the practice in the short term, which can make it difficult to secure budget and resources. To address this, it is important to track metrics such as the reduction in security incidents and the improvement in response times. |
| Knowledge | 5 | Security Chaos Engineering is a powerful tool for generating knowledge and insights about a system’s security posture. By conducting experiments and analyzing the results, organizations can gain a deep understanding of how their systems behave under stress and can identify hidden vulnerabilities that would be difficult to find through other means. This knowledge can then be used to improve the design of the system and to enhance the skills of the security team. |
| Technology | 4 | A growing ecosystem of open-source and commercial tools is available to support Security Chaos Engineering. These tools make it easier to design and execute experiments, to inject failures into the system, and to monitor the results. The major cloud providers also offer fault injection services that can be used to simulate a wide range of failures. |
| Resilience | 5 | The ultimate goal of Security Chaos Engineering is to improve system resilience, and it is highly effective in this regard. By proactively identifying and remediating weaknesses, organizations can build systems that are better able to withstand attacks and to recover quickly from failures. This leads to a more robust and reliable security posture, which is essential in today’s threat landscape. |
| Overall | 4.0 | Security Chaos Engineering is a powerful and effective practice for improving cyber resilience, with a strong focus on knowledge generation and cultural change. |

6. When to Use

To validate security controls: Use Security Chaos Engineering to empirically verify that your security controls, such as firewalls, intrusion detection systems, and access controls, are working as expected.
To prepare for incident response: By simulating real-world attack scenarios, you can train your incident response team and identify gaps in your playbooks and procedures.
To assess the security of new features: Before deploying new features to production, you can use Security Chaos Engineering to assess their resilience to common attack vectors.
To improve security awareness and culture: Running security chaos experiments can help to raise awareness of security issues across the organization and to foster a culture of shared responsibility for security.
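To support compliance requirements: In some industries, such as finance and healthcare, there are regulatory requirements for regular security testing. Records of security chaos experiments can serve as evidence that controls work as intended, although they typically complement rather than replace whatever specific assessments a regulation mandates.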

7. Anti-Patterns & Gotchas

Running experiments in production without proper safeguards: This can lead to major disruptions and outages. It is essential to start in a pre-production environment and to have a clear understanding of the potential blast radius.
Not having a clear hypothesis: Without a clear hypothesis, it is difficult to design a meaningful experiment and to interpret the results. Each experiment should be designed to answer a specific question about the system’s security.
Focusing only on breaking things: The goal of Security Chaos Engineering is not just to break things, but to learn from failures and to drive continuous improvement. It is important to have a process in place for analyzing the results of experiments and for implementing remediation measures.
Not involving the right people: Security Chaos Engineering should be a collaborative effort between development, operations, and security teams. Excluding any of these groups can lead to a lack of buy-in and to a less effective program.
Ignoring the results of experiments: The insights gained from security chaos experiments are only valuable if they are acted upon. It is important to have a process in place for tracking and prioritizing the remediation of identified vulnerabilities.
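
Several of these anti-patterns can be caught before an experiment starts with a simple pre-flight check. The sketch below is one possible shape for such a gate. The `ExperimentPlan` fields, the required sign-off teams, and the 5% blast-radius limit are all assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentPlan:
    hypothesis: str
    environment: str            # e.g. "staging" or "production"
    blast_radius_pct: float     # share of hosts or users the injection may touch
    has_rollback: bool
    owners: List[str]           # teams that signed off on the experiment
    production_approved: bool = False


REQUIRED_OWNERS = {"development", "operations", "security"}
MAX_BLAST_RADIUS_PCT = 5.0      # illustrative limit, not a recommendation


def preflight(plan: ExperimentPlan) -> List[str]:
    """Return the reasons the plan must not run; an empty list means go."""
    problems = []
    if not plan.hypothesis.strip():
        problems.append("no hypothesis stated")
    if plan.environment == "production" and not plan.production_approved:
        problems.append("production run without explicit approval")
    if plan.blast_radius_pct > MAX_BLAST_RADIUS_PCT:
        problems.append("blast radius exceeds limit")
    if not plan.has_rollback:
        problems.append("no rollback defined")
    missing = REQUIRED_OWNERS - set(plan.owners)
    if missing:
        problems.append("missing sign-off: " + ", ".join(sorted(missing)))
    return problems


risky = ExperimentPlan("", "production", 20.0, False, ["security"])
safe = ExperimentPlan(
    "Revoked credentials are rejected within one minute",
    "staging", 1.0, True, ["development", "operations", "security"],
)
print(preflight(risky))
print(preflight(safe))
```

A gate like this does not address the last anti-pattern, ignoring results; that still depends on a process for tracking and prioritizing remediation after each run.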

8. References

[Security-focused chaos engineering experiments for the cloud](https://www.datadoghq.com/blog/chaos-engineering-for-security/), Datadog
[Security Chaos Engineering 101: Fundamentals](https://mitigant.io/en/blog/security-chaos-engineering-101-fundamentals), Mitigant
Security Chaos Engineering: A new paradigm for cybersecurity
The Principles of Chaos Engineering
MITRE ATT&CK®