If you have done any experimentation in the cloud, you have likely realized that virtual server instances in the Amazon cloud are much less reliable than their real world counterparts. How do you compare availability in the cloud to that of a physical infrastructure, and how can you leverage the cloud to increase overall availability?
Uptime is measured as the percentage of a period (generally a month, a quarter, or a year) during which you can expect a system to be available. You calculate it by subtracting the sum of the expected downtimes of the components from the total time in the period and dividing by that total.
If you have a server that is almost certain to fail at least once in a 3 year period and it takes 24 hours to fix or replace that server, your expected downtime in a year is:
33% * 24 hours = 7.92 hours
To translate that into uptime, you subtract the downtime from the total available time in the period and divide by 8760 (the number of hours in a year):
(8760 - (33%*24 hours))/8760 hours = 99.91%
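The single-server arithmetic above can be sketched in a few lines of Python (the 33% failure probability and 24-hour repair time are this article's example numbers, not properties of any particular system):

```python
HOURS_PER_YEAR = 8760

def availability(annual_failure_prob, repair_hours):
    """Expected availability of a single component over one year."""
    downtime = annual_failure_prob * repair_hours
    return (HOURS_PER_YEAR - downtime) / HOURS_PER_YEAR

# A server with a 33% chance of failing in a given year,
# taking 24 hours to fix or replace:
print(f"{availability(0.33, 24):.2%}")  # → 99.91%
```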
That's not five nines (99.999%), but then again you would not expect that from a single server. It still looks good. So why do simple environments fail to meet reasonably high availability expectations? Let's see what happens when we add a database server:
(8760 - ((33%*24 hours) + (33%*24 hours)))/8760 hours = 99.82%
Now, let's add in a piece of server software that you automatically restart every day at 2am to avoid crashes (and each restart takes 6 minutes):
(8760 - ((36500%*0.1 hours) + (33%*24 hours) + (33%*24 hours)))/8760 hours = 99.40%
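Extending the same sketch, total expected downtime is simply the sum over the components; each one is a single point of failure. The list below uses the article's example numbers (two servers at 33%/24 hours, plus 365 nightly 6-minute restarts):

```python
HOURS_PER_YEAR = 8760

def system_availability(component_downtimes):
    """Availability when every component is a single point of failure."""
    total_downtime = sum(component_downtimes)
    return (HOURS_PER_YEAR - total_downtime) / HOURS_PER_YEAR

downtimes = [
    0.33 * 24,   # application server: 33% yearly failure, 24 h to replace
    0.33 * 24,   # database server: same odds
    365 * 0.1,   # nightly software restart, 6 minutes each
]
print(f"{system_availability(downtimes):.2%}")  # → 99.40%
```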
In other words, every time you add a component—no matter how reliable or unreliable that component—your availability rating suffers. When talking about the uptime of a web application, you normally have the following components in play:
- A network pipe into your server
- Networking equipment
- Database and application servers
- Server software
- Database software
I am sure I have left some important things out as well.
The way around the problem of availability in increasingly complex infrastructures is redundancy. For example, if you add a second application server into the mix, the expected downtime from the logical application server decreases since the overall system remains operating even when one of the two servers fails. The availability of a component with redundancy is based on the likelihood that both parts of the redundant component fail at the same time:
(33% * (24 hours)^2) / (8760 hours)^(2-1) = 0.022 hours of downtime, or 99.9998% uptime
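The redundancy formula above generalizes to n copies as (p * d^n) / T^(n-1), where p is the annual failure probability, d the repair time, and T the hours in a year. A minimal sketch of that approximation (as given in the text; it assumes independent failures):

```python
HOURS_PER_YEAR = 8760

def redundant_downtime(failure_prob, repair_hours, copies):
    """Expected hours of downtime per year when all `copies` redundant
    instances must be down at once (approximation from the text)."""
    return (failure_prob * repair_hours ** copies) / HOURS_PER_YEAR ** (copies - 1)

dt = redundant_downtime(0.33, 24, 2)
print(f"{dt:.3f} hours -> {(HOURS_PER_YEAR - dt) / HOURS_PER_YEAR:.4%}")
# → 0.022 hours -> 99.9998%
```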
Many people look at the cloud as a critical tool for inexpensively adding redundancy to their infrastructure. After all, when you look at increasing availability in a physical IT infrastructure, you add hardware redundancies. Unfortunately, adding hardware redundancies is very expensive. Hence the turn to the cloud.
Kill the Downtime
What people miss in the cloud is that cheap redundancy is not the true advantage of a cloud infrastructure. It's cutting downtime out of the picture. In the initial 99.91% availability calculation for a single server, I estimated 24 hours as the time to replace. Obviously, if you have hot spares lying around, you can decrease that time frame. But that adds cost just as sure as having a redundant environment.
On the other hand, the cloud allows you to recover in no worse than 10 minutes (assuming automated infrastructure management tools like enStratus):
33% * 0.17 hours = 0.056 hours of downtime, or 99.9994% uptime
The problem here is that virtual cloud instances lack the availability of your standard physical server. Based on my anecdotal experience, I would estimate an 80% chance of failure of a given instance over the course of a year:
80% * 0.17 hours = 0.136 hours of downtime, or 99.9985% uptime
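Plugging the cloud recovery time into the same availability sketch shows both scenarios side by side (the 10-minute automated recovery and the 80% anecdotal failure rate are the article's assumptions; the code uses exactly 10/60 hours rather than the rounded 0.17):

```python
HOURS_PER_YEAR = 8760

def availability(annual_failure_prob, recovery_hours):
    """Expected availability of a single instance over one year."""
    downtime = annual_failure_prob * recovery_hours
    return (HOURS_PER_YEAR - downtime) / HOURS_PER_YEAR

TEN_MINUTES = 10 / 60  # ~0.17 hours of automated recovery

# Physical-server failure odds, cloud recovery time:
print(f"{availability(0.33, TEN_MINUTES):.4%}")
# Anecdotal 80% annual failure rate for a cloud instance:
print(f"{availability(0.80, TEN_MINUTES):.4%}")
```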
In short, it is this reduction in downtime when failures occur, rather than cheap redundancy, that is where the cloud truly adds value to your infrastructure availability expectations.