Guidelines for Chaos Engineering, Part 1

Tyler Treat · Real Kinetic Blog
Jul 6, 2020 · 5 min read

Written by Nick Joyce

In this two-part series, we’ll discuss chaos testing as an engineering discipline. First, in part one, we’ll define what chaos testing is, what its goals are, and how to implement it effectively. This includes stepping through the iterative process of defining the steady-state, forming a hypothesis, running the experiment, and adapting the system.

In part two, we’ll talk through how to go about introducing chaos engineering as a practice within your organization. That is to say, we’ll start with specific tactics for performing chaos testing and later move to higher-level strategy for establishing a chaos engineering practice.

What is Chaos Testing?

Chaos testing is a technique used to determine and predict how a system behaves in the face of failure. It is an important part of building and maintaining resilient applications as well as ensuring adequate monitoring is in place. These types of tests are also referred to as “gameday exercises.”

Goals

The goal of chaos testing is twofold:

  1. Understand the behavior of systems in the face of failure or non-ideal conditions and ensure it aligns with expectations.
  2. Identify gaps in the monitoring and observability of systems and in your team’s ability to respond.

Chaos testing isn’t about building perfect systems. We can’t prevent complex systems from failing, so it’s important we can quickly detect and recover from failure.

How to Perform Chaos Engineering

Chaos engineering works by injecting artificial events, such as errors, traffic spikes, or added latency, into a system to understand how the system as a whole reacts when individual components fail or face non-ideal circumstances.
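
As a minimal sketch of what application-level injection can look like (the decorator, probabilities, and the fetch_order function below are illustrative, not a prescribed tool), a call can be wrapped so that a configurable fraction of invocations fail or slow down:

    import functools
    import random
    import time

    def chaos(failure_rate=0.05, latency_rate=0.05, extra_latency_s=0.5):
        """Wrap a function so that some calls fail or are artificially delayed.
        Intended to be enabled only in test or staging environments."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                roll = random.random()
                if roll < failure_rate:
                    raise RuntimeError("chaos: injected failure")
                if roll < failure_rate + latency_rate:
                    time.sleep(extra_latency_s)  # injected latency
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    @chaos(failure_rate=0.05, latency_rate=0.10)
    def fetch_order(order_id):
        # Hypothetical application call that the experiment targets.
        return {"id": order_id, "status": "shipped"}

Dedicated tools such as Chaos Monkey or Gremlin apply the same idea at the infrastructure and network layers rather than inside application code.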

Define Steady-State

Chaos testing starts with defining “steady-state” — that is, determining a set of metrics that measure business value. Business metrics that measure user engagement are the most suitable. Examples include “orders per minute” or “user sign-ups.” System metrics such as requests per second or CPU utilization are poor choices because they do not directly reflect the overall user experience of the service provided by the business. Thus, these steady-state metrics might correspond more closely to Service Level Objectives (SLOs) than Service Level Indicators (SLIs).
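
A steady-state definition can be as simple as a baseline value and an allowed band around it. The sketch below is only illustrative (the metric name, baseline, and tolerance are made up); in practice the observed value would come from your metrics system:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SteadyState:
        metric: str        # business metric, e.g. "orders_per_minute"
        baseline: float    # value observed under normal conditions
        tolerance: float   # allowed relative deviation, e.g. 0.05 for +/-5%

        def holds(self, observed: float) -> bool:
            # Steady-state holds while the observed value stays inside the band.
            return abs(observed - self.baseline) <= self.tolerance * self.baseline

    orders = SteadyState(metric="orders_per_minute", baseline=120.0, tolerance=0.05)
    print(orders.holds(117.0))  # True: within the band
    print(orders.holds(90.0))   # False: steady-state violated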

Observability and monitoring of the system are crucial to accurately measuring what steady-state looks like. Tooling for monitoring, logging, and tracing should be in place so you know when the system is not performing as expected.

Once steady-state has been determined, experiments are run to test the components of the system. The purpose of these experiments is to ensure the system reacts as we predict when one or more components are in a failed or degraded state.

Form a Hypothesis

In order to conduct effective tests, a testable hypothesis is necessary. Hypotheses should take the form: “there will be no change to the steady-state when X is injected into the system.”
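
One way to keep hypotheses honest is to write them down in a structured form that names the fault, the steady-state metric, and the allowed deviation. The record below is a sketch only; the fault description and numbers are hypothetical:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Hypothesis:
        fault: str         # the "X" being injected
        metric: str        # steady-state metric expected to hold
        baseline: float    # normal value of the metric
        tolerance: float   # allowed relative deviation while the fault is active

        def statement(self) -> str:
            return (f"There will be no change to {self.metric} "
                    f"(baseline {self.baseline}, within {self.tolerance:.0%}) "
                    f"when {self.fault} is injected into the system.")

    h = Hypothesis(fault="500ms of added latency on the orders database",
                   metric="orders_per_minute", baseline=120.0, tolerance=0.05)
    print(h.statement())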

The most impactful experiments may result in the loss of availability of one or more components of the system. Types of testing include:

  • Hardware failure (or virtual equivalent)
  • Changes to network latency/failure (a latency-injection sketch follows this list)
  • Resource starvation/overload
  • Dependency failures (e.g. database)
  • Retry storms (e.g. thundering herd)
  • Functional bugs (exceptions)
  • Race conditions (threading and concurrency)
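
As one example of the network latency item above, latency can be injected by routing a dependency’s traffic through a small proxy that delays each chunk of data. This is only a sketch, assuming a local dependency on port 5432 and an application reconfigured to connect through port 8474; purpose-built tools such as Toxiproxy implement the same idea with far more control:

    import asyncio
    import random

    LISTEN_ADDR = ("127.0.0.1", 8474)   # the application is pointed here (assumption)
    UPSTREAM = ("127.0.0.1", 5432)      # the real dependency, e.g. a database (assumption)
    LATENCY_RANGE_S = (0.1, 0.5)        # delay injected per chunk of data

    async def pipe(reader, writer, delay=False):
        # Copy bytes from reader to writer, optionally injecting latency.
        try:
            while data := await reader.read(4096):
                if delay:
                    await asyncio.sleep(random.uniform(*LATENCY_RANGE_S))
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_reader, client_writer):
        upstream_reader, upstream_writer = await asyncio.open_connection(*UPSTREAM)
        # Delay traffic flowing toward the dependency; leave the return path alone.
        await asyncio.gather(
            pipe(client_reader, upstream_writer, delay=True),
            pipe(upstream_reader, client_writer),
        )

    async def main():
        server = await asyncio.start_server(handle, *LISTEN_ADDR)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())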

Run the Experiment

Once a hypothesis has been determined, run the experiment while observing the system. If the system reacts and behaves as expected, the experiment is considered a success. If there is unexpected behavior or the service is impacted in any meaningful way, the experiment is considered a failure (though for the team, this is a big win because we now have a better understanding of our system!). Developing a deep understanding of how the system failed is crucial to determining what changes need to occur in order to re-run the experiment with a successful result.
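
In code form, an experiment run is a loop that injects the fault, samples the steady-state metric for a fixed window, and then rolls the fault back. The helper below is a sketch; measure, inject_fault, and remove_fault stand in for whatever metric query and fault-injection tooling you actually use:

    import time

    def run_experiment(measure, inject_fault, remove_fault,
                       baseline, tolerance=0.05, duration_s=300, interval_s=30):
        """Inject a fault, sample the steady-state metric, and report whether it
        stayed within the tolerance band. The three callables are placeholders
        for your own metric query and fault-injection tooling."""
        samples = []
        inject_fault()
        try:
            deadline = time.monotonic() + duration_s
            while time.monotonic() < deadline:
                samples.append(measure())      # e.g. current orders per minute
                time.sleep(interval_s)
        finally:
            remove_fault()                     # always roll the fault back

        worst = max(abs(s - baseline) for s in samples)
        if worst <= tolerance * baseline:
            return "success: steady-state held"
        return f"failure: steady-state deviated by up to {worst:.1f}"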

There will be times when it is necessary to put the system in a particular state that makes it vulnerable to failure, for example by increasing CPU or memory pressure or by changing the variability of network latency to simulate rare situations. This can be achieved through external load profiling or by adding artificial processes to the system. These techniques can expose hard-to-find issues that are difficult to debug.
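
For CPU pressure specifically, the artificial processes can be as simple as one busy-loop worker per core; tools like stress-ng do the same job with many more knobs. A rough sketch (the two-minute duration is arbitrary):

    import multiprocessing
    import time

    def burn_cpu(seconds):
        # Busy-loop for the given duration to create artificial CPU pressure.
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            pass

    if __name__ == "__main__":
        duration_s = 120  # arbitrary window for the experiment
        workers = [multiprocessing.Process(target=burn_cpu, args=(duration_s,))
                   for _ in range(multiprocessing.cpu_count())]
        for w in workers:
            w.start()
        for w in workers:
            w.join()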

Adapt

If an experiment failed, that is, the system’s behavior did not match our expectations when running it, it is important to establish a thorough understanding of the reasons why. Available observability tooling such as logging, metrics, and tracing should help uncover the root cause. If the tooling was not sufficient, this too must be improved; one of the goals of chaos testing is to identify gaps in the monitoring and observability of the system. Performing a root-cause analysis of why the experiment failed may be a cross-team effort. Chaos engineering doesn’t have to be confined to a single team!

Once the root cause(s) and observability deficiencies have been identified, work must be carried out to improve the resiliency of the components affected during the experiment. This may mean making architectural changes, improving infrastructure and application resilience, or updating monitoring and alerting policies.
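
As one concrete example of such an adaptation (the helper and its parameters are illustrative, not a prescribed fix), a dependency call that failed a latency experiment might be given an explicit timeout budget and jittered retries so that slowdowns degrade gracefully instead of cascading:

    import random
    import time

    def call_with_retries(op, attempts=3, timeout_s=1.0,
                          base_delay_s=0.2, max_delay_s=2.0):
        """Wrap a flaky dependency call with a timeout budget and jittered
        exponential backoff. `op` stands in for your own client call and is
        assumed to accept a timeout argument."""
        for attempt in range(attempts):
            try:
                return op(timeout=timeout_s)
            except Exception:
                if attempt == attempts - 1:
                    raise                                # out of retries; surface the error
                delay = min(max_delay_s, base_delay_s * 2 ** attempt)
                time.sleep(random.uniform(0, delay))     # full jitter avoids synchronized retries

The jitter matters here because synchronized retries are exactly the retry-storm failure mode listed earlier.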

When these changes have been made, the experiment should be re-run to ensure that they helped improve the resilience of the application.

Into the Great Unknown

You’ll notice that this iterative process of chaos engineering involves defining steady-state metrics, forming a hypothesis, and testing said hypothesis by applying various scenarios. This can be an effective way to better understand our systems and the effects of those scenarios, but that’s only part of the picture. Specifically, this process is dealing with the known unknowns — the things that we are aware of but don’t understand until they are tested. This doesn’t — and cannot — account for the unknown unknowns. These are the things that we are neither aware of nor understand. Simply stated, we cannot define steady-state or form a hypothesis for things that we are not even aware of. Thus, observability becomes paramount to improving our understanding of systems and informing our chaos engineering practice.

In part two, we’ll discuss how to go about introducing chaos engineering as a practice within your organization.

Real Kinetic helps companies adopt chaos testing within their engineering organizations. Learn more about working with us.

Managing Partner at Real Kinetic. Interested in distributed systems, messaging infrastructure, and resilience engineering.