What do we mean by ‘High Availability’ in the Cloud?

Published in Real Kinetic Blog, May 25, 2023

Summary

This article gives an overview of high availability in the cloud, the rough cost of reaching different availability targets such as five nines, and recommendations for how an engineering organization should think about availability.

Availability overview

Availability is usually expressed as a percentage of uptime within a given time period.

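Each additional nine substantially reduces the amount of downtime allowed under that availability target. Over a 365-day year, 99% allows roughly 3.65 days of downtime, 99.9% roughly 8.76 hours, 99.99% roughly 52.6 minutes, and 99.999% roughly 5.26 minutes.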
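The arithmetic behind these downtime budgets is simple enough to sketch. The snippet below is a minimal illustration, not part of the original article; the function name and the 365-day year and 30-day month are assumptions chosen for the example.

```python
# Downtime budget for a given availability target.
# Assumption: a 365-day year and a 30-day month; adjust to your own reporting period.

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime permitted over period_days at availability_pct."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_minutes(target, 365)
        per_month = allowed_downtime_minutes(target, 30)
        print(f"{target:>7}%  {per_year:10.2f} min/year  {per_month:8.2f} min/month")
```

Each step from one target to the next divides the budget by ten, which is why a single slow rollback can consume an entire month of allowance at four or five nines.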

Availability requires more than just infrastructure

Availability is not purely a concern for infrastructure teams. Meeting availability targets is a multi-team effort. Infrastructure, application, and support teams must work together to maintain the minimum availability requirements set by the business. When there is downtime, a comprehensive Root Cause Analysis (RCA) should be performed in a blameless manner in order to get a clear and precise understanding of the factors contributing to the failure and the steps needed to mitigate the issue now and in the future. Typically, it will take several cross-discipline teams working together to reduce downtime and availability issues when they arise.

Cost to increase from 99% to 99.9%

It is not uncommon to see an investment on the order of 2x in infrastructure, application design, and data availability and integrity in order to meet a 99.9% availability target. This includes changes and considerations such as:

Multi-master availability for databases
Multi-zonal deployments for all infrastructure and application components
At least one failover for every component of the system, well tested and exercised regularly
Some multi-regional support for critical business systems

Cost to increase from 99.9% to 99.99%

Reaching 99.99% uptime requires a further investment on the order of 5x. Every edge case, however obscure, now has a significant impact on the downtime of the application. Deployments and rollbacks must be very carefully managed and monitored. Failovers will likely need to be automated for stateful systems that do not support active-active deployments. Multi-region availability will be necessary for more systems.

Cost to increase from 99.99% to 99.999%

“Five nines” is often regarded within the industry as the gold standard for availability. However, exceedingly few applications, Google Search being one example, meet the rigorous criteria for this level of availability and justify the cost associated with it. The investment here is estimated at roughly 10x, compounded on top of the cost of four nines. Many regions are required, and nearly instant failover of large amounts of infrastructure must be automated and exercised regularly. Each component must have many failure isolation zones. Data must be replicated and kept in sync across all the regions. Organizations of considerable size and resources are required to achieve and maintain this level of availability.
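To see how these rough estimates stack up, the sketch below multiplies them through from a 99% baseline. It assumes the 2x, 5x, and 10x figures are read as successive multipliers, which is an interpretation of the estimates above rather than a precise cost model; the names are hypothetical.

```python
from typing import Dict, List, Tuple

# Rough per-step multipliers quoted in this article (estimates, not measurements).
STEP_MULTIPLIERS: List[Tuple[str, float]] = [
    ("99.9%", 2),
    ("99.99%", 5),
    ("99.999%", 10),
]


def cumulative_cost(baseline: float = 1.0) -> Dict[str, float]:
    """Compound the per-step multipliers, starting from the cost of a 99% target."""
    costs = {"99%": baseline}
    running = baseline
    for target, multiplier in STEP_MULTIPLIERS:
        running *= multiplier
        costs[target] = running
    return costs


if __name__ == "__main__":
    for target, cost in cumulative_cost().items():
        print(f"{target:>8}: {cost:g}x the 99% baseline")
```

Under that reading, five nines lands at roughly 100x the investment of a 99% target, which is the scale of commitment the section describes.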

An alternative approach to availability

Generally, it’s more cost-effective to optimize incident detection (Mean Time to Detect or MTTD) and recovery (Mean Time to Recovery or MTTR) and to support partial failure modes than to pursue five nines (or other high levels of) availability. This is particularly true of systems composed of many integrated components or services, since availability compounds: when a request depends on several components, the availability of the whole is at best roughly the product of the availabilities of its parts. Implementing multi-region infrastructure adds a significant amount of cost and complexity, particularly around data, which is why multi-zone architectures combined with mature incident response processes can be more effective and attainable for most organizations.
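Both effects can be illustrated with a short sketch. It assumes that component failures are independent and that every component is required to serve a request, and it uses the common steady-state approximation of availability as uptime divided by total time. The function names and the sample numbers are hypothetical, and treating MTTR as the time from detection to recovery is an assumption (definitions vary between organizations).

```python
from math import prod
from typing import Iterable


def serial_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a path that needs every component up (independent failures assumed)."""
    return prod(component_availabilities)


def availability_from_incidents(mtbf_hours: float, mttd_hours: float, mttr_hours: float) -> float:
    """Steady-state approximation: uptime / (uptime + downtime per incident).

    Assumption: downtime per incident is detection time plus recovery time.
    """
    downtime = mttd_hours + mttr_hours
    return mtbf_hours / (mtbf_hours + downtime)


if __name__ == "__main__":
    # Ten services at 99.9% each fall to about 99.0% when chained together.
    print(f"{serial_availability([0.999] * 10):.5f}")

    # Same failure rate (one incident per 30 days), different response speed.
    slow = availability_from_incidents(720, mttd_hours=1.0, mttr_hours=3.0)
    fast = availability_from_incidents(720, mttd_hours=0.1, mttr_hours=0.4)
    print(f"slow response: {slow:.5f}, fast response: {fast:.5f}")
```

With the failure rate held constant, shortening detection and recovery alone moves the result from below 99.5% to above 99.9% in this example, without adding any regions.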

Want to know more?

We have extensive experience designing and implementing highly available systems in the cloud. If there are specific topics you’d like to discuss in depth, we’d love to hear from you. These emails come directly to us, and we respond to every one.