Chaos Engineering

Context:
- It can be used to achieve resilience against:
  - Infrastructure failures
  - Network failures
  - Application failures
See: Quality of Service, Fault Tolerance.

References

(Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Chaos_engineering Retrieved:2021-8-31.
- Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.

"14 Essential software engineering practices for your agile project."
- QUOTE: Chaos Engineering: Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic. You need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. You must address the most significant weaknesses proactively before they affect our customers in production. You need a way to manage the chaos inherent in these systems, take advantage of increased flexibility and velocity, and have confidence in your production deployments despite the complexity that they represent. An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. This is called Chaos Engineering. A good example of this would be the Chaos Monkey of Netflix.
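
The controlled experiment described in the quote can be illustrated with a minimal sketch. It assumes a hypothetical downstream dependency (`FlakyDependency`) into which faults are injected at a configurable rate, and a hypothetical request handler (`handle_request`) with an optional fallback value. None of these names come from the quoted sources or from any real chaos tool such as Chaos Monkey; they only illustrate the idea of stating a steady-state hypothesis ("requests are still served while the dependency is failing") and testing it under injected failure.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service that fails a configurable fraction of calls."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate  # probability that a call raises
        self.rng = rng                    # seeded RNG keeps the experiment repeatable

    def fetch(self, key):
        # Fault injection: raise instead of answering for a share of the calls.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return f"value-for-{key}"


def handle_request(dep, key, fallback=None):
    """Hypothetical request handler; serves `fallback` when the dependency fails."""
    try:
        return dep.fetch(key)
    except ConnectionError:
        if fallback is None:
            raise  # no fallback configured: the failure reaches the caller
        return fallback


def run_experiment(failure_rate, fallback, requests=1000, seed=0):
    """Return the fraction of requests served while faults are being injected."""
    dep = FlakyDependency(failure_rate, random.Random(seed))
    served = 0
    for i in range(requests):
        try:
            handle_request(dep, f"k{i}", fallback)
            served += 1
        except ConnectionError:
            pass  # an unserved request counts against the steady-state hypothesis
    return served / requests


if __name__ == "__main__":
    # Steady-state hypothesis: every request is served even if 30% of calls fail.
    print("with fallback:   ", run_experiment(0.3, fallback="cached-default"))
    print("without fallback:", run_experiment(0.3, fallback=None))
```

In this sketch the hypothesis holds when a fallback is configured and is refuted when it is not, which corresponds to the "improper fallback settings when a service is unavailable" weakness named in the quote. A real experiment would inject faults into live infrastructure and limit the blast radius, which this toy example does not attempt.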