A couple of months back, a customer asked why we were proposing a three Availability Zone (AZ) architecture instead of two. Their main question was: which failure modes do three AZs guard against that two AZs can’t? We gave the following two reasons:

- Improved availability. Since services and instances are deployed across all AZs, losing one of three AZs costs you a third of your capacity, whereas losing one of two AZs costs you half.
- Quorum. For services that need to maintain a quorum (for example, if you run your own Cassandra cluster), three AZs are better than two.

They were not very convinced, so we agreed to start with the two-AZ solution.

Over the last two months I have thought about this question multiple times, and I always felt there had to be a better argument in favor of three AZs. I googled and researched but couldn’t find much.

Today, while reading a book [1], I think I figured out the right reason why three AZs are better than two. If my understanding is incorrect, please post a constructive message.

Let’s take a simple case where we have a single microservice that needs to handle 100,000 requests per second. Each instance of the microservice can handle 1,000 requests per second, so we need 100,000 / 1,000 = 100 instances. For now, we will assume that each instance of the service runs on its own node (server).

You don’t want all your servers to be 100% utilized, so you have to provision extra resources to leave room for rolling updates and node failures. Assuming you want servers to be 80% utilized, you will need 100 / 0.8 = 125 servers (or instances of your service).

In a single AZ solution you will need 125 servers to give yourself enough room to handle server failures and rolling deployments.
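The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers in this post; the function name `instances_needed` and its parameters are made up for the example, and utilization is passed as a whole percentage to keep the math in integers.

```python
def instances_needed(target_rps: int, rps_per_instance: int, utilization_pct: int = 100) -> int:
    """Instances required so that each one stays at or below utilization_pct of its capacity.

    Hypothetical helper for this post's example, not taken from the book.
    """
    # Capacity each instance may actually use, scaled by 100 to avoid floats.
    usable_rps_x100 = rps_per_instance * utilization_pct
    # Ceiling division: round up, because you cannot run a fraction of an instance.
    return -((-target_rps * 100) // usable_rps_x100)


bare_minimum = instances_needed(100_000, 1_000)        # every instance at 100%
with_headroom = instances_needed(100_000, 1_000, 80)   # every instance at 80%
print(bare_minimum, with_headroom)  # 100 125
```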

What happens if you want to survive a single AZ failure? If your application is running in a single AZ and that AZ goes down then your service will have downtime. You probably don’t want that. So, the answer is to have more than one AZ.

Now, the question is whether we should have two or three AZs.

Let’s start with three AZs first and then we will come to two AZs. To keep the math simple I have used 126 nodes instead of 125 so that each AZ gets an equal number of nodes (42 each).

If we lose one AZ, we lose a third of our capacity and go down to 84 nodes. Our system will no longer be able to support 100,000 requests per second, because we need at least 100 nodes for that. With a single AZ failure, each remaining node would have to handle 100,000 / 84 ≈ 1,190 requests per second, which is more than the 1,000 requests per second it can handle.
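To double-check the failure scenario, here is a minimal sketch that recomputes the per-node load after one AZ is lost. It assumes nodes are spread evenly across AZs, as in the example; `load_after_az_loss` is a hypothetical helper name.

```python
def load_after_az_loss(total_nodes: int, az_count: int, target_rps: int):
    """Return (surviving nodes, requests per second per surviving node) after one AZ fails.

    Assumes total_nodes is spread evenly across az_count AZs.
    """
    nodes_per_az = total_nodes // az_count
    surviving_nodes = total_nodes - nodes_per_az
    return surviving_nodes, target_rps / surviving_nodes


surviving, rps_per_node = load_after_az_loss(total_nodes=126, az_count=3, target_rps=100_000)
print(surviving, round(rps_per_node))  # 84 1190

# Each node can only serve 1,000 requests per second, so this setup is overloaded.
print(rps_per_node > 1_000)  # True
```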

So, how many servers do we need in order to handle one AZ failure in a three AZ solution?

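The answer is 50 nodes per AZ. This is calculated using a simple formula, which follows from requiring that the AZs remaining after one fails still hold the minimum number of servers between them:

Number of nodes per AZ = minimum number of servers / (number of AZs - 1)

Number of nodes per AZ = 100 / (3 - 1) = 50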

The total number of nodes becomes 50 * 3 = 150 instead of 126.
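The formula translates directly into code. The sketch below is my own illustration, with two additions that are not in the formula itself: it rounds up when the division is not exact, and it rejects fewer than two AZs, since a single AZ cannot survive its own failure. As in the calculation above, the minimum of 100 servers is the bare minimum at full utilization.

```python
import math


def nodes_per_az(min_nodes: int, az_count: int) -> int:
    """Nodes to place in each AZ so the remaining AZs still hold min_nodes after one AZ fails."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    # Rounding up is an assumption of this sketch, for cases that do not divide evenly.
    return math.ceil(min_nodes / (az_count - 1))


per_az = nodes_per_az(min_nodes=100, az_count=3)
total = per_az * 3
print(per_az, total)  # 50 150
```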

As you can see, to safeguard against an AZ failure you need more nodes than you had provisioned in the single-AZ solution.

Now, let’s talk about the two-AZ solution. Applying the same formula with two AZs, we get 100 nodes per AZ, or a total of 200 nodes.

Number of nodes per AZ = minimum number of servers / (number of AZs - 1)

Number of nodes per AZ = 100 / (2 - 1) = 100

The total number of nodes becomes 100 * 2 = 200 instead of 126, which is 50 more than the 150 needed with three AZs.

I will close with the conclusion stated in the book [1]:

To ensure the ability to recover from a data center outage, the more data centers you have, the fewer nodes you need overall spread across those data centers.
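To see this trend in numbers, the sketch below applies the same formula to a range of AZ counts. AZ counts beyond three are not part of the example in this post and are included only to illustrate the trend; whether a given region offers that many AZs is a separate question. Rounding up for uneven splits is also my own assumption, and `total_nodes` is a hypothetical helper name.

```python
import math


def total_nodes(min_nodes: int, az_count: int) -> int:
    """Total nodes across all AZs so that losing any one AZ still leaves min_nodes."""
    per_az = math.ceil(min_nodes / (az_count - 1))  # round up for uneven splits
    return per_az * az_count


MIN_NODES = 100  # bare minimum for 100,000 requests per second in this post's example

for az_count in range(2, 7):
    total = total_nodes(MIN_NODES, az_count)
    print(f"{az_count} AZs: {total} nodes in total, {total - MIN_NODES} of them spare")
```

With the numbers from this post, the total drops from 200 nodes with two AZs to 150 with three, and keeps shrinking more slowly after that.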

This post was covered in KubeWeekly #327 – Link

## References

- [1] Chapter 2 of Architecting for Scale, 2nd Edition by Lee Atchison – Link