The first step is to build in the ability to handle "mundane" failures. Mundane failures are things that impact parts of the cloud in which you are operating but leave other parts operational. The key to redundancy is thus to spread any given component in your application architecture across two or more data centers or availability zones. Where that's not entirely possible (some load balancers, database masters), make sure you have data replication in place that will enable you to rapidly recover from a mundane failure.
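As a minimal sketch of that idea, the check below flags any component that lives in only one zone. The inventory format, the component names, and the zone labels are all hypothetical; a real inventory would come from your provider's API or your own deployment records.

```python
from collections import defaultdict


def find_single_zone_components(instances, min_zones=2):
    """Return the components deployed in fewer than min_zones distinct zones."""
    zones_by_component = defaultdict(set)
    for component, zone in instances:
        zones_by_component[component].add(zone)
    return sorted(
        component
        for component, zones in zones_by_component.items()
        if len(zones) < min_zones
    )


# Hypothetical inventory of (component, availability zone) pairs.
inventory = [
    ("web", "us-east-1a"),
    ("web", "us-east-1b"),
    ("app", "us-east-1a"),
    ("app", "us-east-1c"),
    ("db-master", "us-east-1a"),
]

print(find_single_zone_components(inventory))  # ['db-master']
```

Anything this check reports, such as the database master in the example, is a candidate for the replication fallback described above.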

Develop a Multi-cloud Strategy
You should work with at least two cloud providers. It doesn't matter if you are operating across both clouds equally or if one serves as the primary with the other serving as the backup. You need accounts in at least two clouds and you need to understand how your systems will operate in each cloud.

Back Up Your Data Outside of the Cloud
AWS snapshots are a great feature and a terrible backup strategy. They're great for guaranteeing minimal data loss in the face of mundane failures, but they won't help you if the whole of AWS fails. You need to make sure your backups are copied into at least two clouds (part of the importance of a multi-cloud strategy). Furthermore, those backups should have no dependencies on the cloud in which they originated, other than the need to alter configuration files.
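One way to picture this is a copy step that refuses to run with fewer than two destinations and verifies every copy by checksum. The sketch below is an illustration only: local directories stand in for storage in two different clouds, and the function names, file name, and directory names are assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_backup(source, destinations):
    """Copy source into every destination directory and verify each copy."""
    if len(destinations) < 2:
        raise ValueError("a backup needs at least two independent destinations")
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in destinations:
        target = Path(directory) / source.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demo: two local directories stand in for storage in two different clouds.
with tempfile.TemporaryDirectory() as workdir:
    backup = Path(workdir) / "db.dump"
    backup.write_bytes(b"example backup contents")
    copies = replicate_backup(
        backup, [Path(workdir) / "cloud_a", Path(workdir) / "cloud_b"]
    )
    copy_names = [copy.parent.name for copy in copies]
    print(copy_names)  # ['cloud_a', 'cloud_b']
```

In practice each destination would be an upload to a different provider's storage service, but the rule is the same: at least two independent homes for every backup, each one verified.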

Assume Your Tools Will Fail Too
If you have a cloud-based tool like enStratus that manages your cross-cloud strategy, you should assume it will fail at exactly the same time as your primary cloud provider. While that is highly unlikely, the world is filled with stories of multiple independent failures working together to create a grand disaster. You should therefore regularly test the backups being generated by your tools and have a means to access those backups and manually execute their recovery into a backup cloud without the use of the tool.
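A backup test that does not depend on the tool can be as simple as opening the archive with standard utilities and comparing its contents to a checksum manifest. The sketch below assumes the backup is a tar archive and that a manifest of SHA-256 checksums was recorded when the backup was made; both are assumptions for illustration, as are the function and file names.

```python
import hashlib
import io
import tarfile


def build_archive(files):
    """Build a gzipped tar archive in memory (stands in for a real backup)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def verify_archive(archive_bytes, manifest):
    """Compare every file in the archive to the manifest; return the problems."""
    problems = []
    seen = set()
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            seen.add(member.name)
            data = tar.extractfile(member).read()
            expected = manifest.get(member.name)
            if expected is None:
                problems.append(f"unexpected file: {member.name}")
            elif hashlib.sha256(data).hexdigest() != expected:
                problems.append(f"corrupt file: {member.name}")
    problems.extend(f"missing file: {name}" for name in sorted(set(manifest) - seen))
    return problems


# Hypothetical backup contents and the manifest recorded at backup time.
files = {"db.sql": b"CREATE TABLE t (id INT);", "app.conf": b"mode=production"}
archive = build_archive(files)
manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}

print(verify_archive(archive, manifest))  # []
```

An empty list means the archive can be read and matches the manifest. A full drill would go further and restore into the backup cloud, but even this check catches backups that cannot be opened at all.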


I for one would be reluctant to buy from a vendor whose sales pitch is "assume we will fail at exactly the same time as your primary cloud provider."

Wow, just wow. Where do I sign up?


I very much appreciate the thinking behind your post. Back in the day, ~15 years ago, I ran a small but highly-available set of web servers for Apple - doing Apple's first e-commerce, before the iTunes store existed.

I would go out and preach the mantra: each and every component of your system _will_ fail. Make a list. Prioritize it, at least in order of what you believe is most likely to fail. Make a plan for everything on the list. For example, when _xxx_ fails, _yyy_ will need to be done. Figure out how to make those things happen quickly or automatically if you can.

Recently, I have had this brought home again in a big way. Our younger daughter now relies on an insulin pump. The medical team says: it's a device, it _will_ fail. Make sure you have a plan for what to do _when_, not _if_, it does, and practice being ready to execute your plan.



I guess it's the perfect segue from the '00s: shed all responsibility by putting your data in someone else's hands. No matter how much breakage or leakage happens, the excuse is: "Well, we saved a couple of bucks."


Failure Is a Feature of Reality

By George Reese
May 14, 2010 | Comments: 3

It's been a rough week or so for Amazon Web Services. They've suffered three independent outages as well as a number of other minor hardware failures. As with all well-publicized cloud outages, the media and others have latched on to these incidents to question the readiness of the cloud to handle important workloads. Failure, however, is not a cloud feature; failure is a feature of reality.

All systems—both natural and artificial—fail. In the computing world, the best we can hope for is the creation of redundant systems and backup systems that help minimize the impact of those failures. Where people run into trouble in the cloud is when they believe that "putting a system in the cloud" means not having to worry about redundancy and backup systems.

The survival of your applications and data is ultimately not something you can rely on any external entity to provide—you have to do it yourself. Amazon's response to the incidents reflects this sentiment. They take responsibility for the underlying failures and have indicated they are working to prevent similar issues in the future. At the end of their message, however, they state:

We also want to remind users to take advantage of the Amazon EC2 features designed to help mitigate this and other types of instance failures.

This statement reflects their overall philosophy that customers should design for failure. It's also the starting point for a great discussion on where planning for failure ceases to be the cloud provider's responsibility and becomes the customer's responsibility.

The first point in this discussion is that Amazon should not be held up as some sort of poster child based on this event. This month it was Amazon. Next month, it will be someone else. We are going to see failures in the cloud until the end of time. Given the hype being applied to the cloud, it's natural that each failure in the short term will generate a hype backlash.

Ask yourself a basic question: when my cloud provider (or MSP, or outsourcing partner) fails, what is acceptable downtime/data loss, and what can I do to guarantee that, no matter how badly my cloud provider fails, I will always be able to return to operation in less time than my acceptable downtime? Once you have an answer to this question, you have the outline for the steps you need to take to survive in the cloud or any other environment.

Note that this framework is not based on how reliable the underlying cloud provider is. Dealing with cloud failure is dependent entirely on how much downtime you are willing to absorb under the assumption that your cloud provider will fail. Because they will fail. Just like your own IT department fails from time to time; just like your managed services provider fails from time to time.
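To make the question concrete, you can write the answer down as numbers and check your recovery plan against them. The sketch below is only an illustration: the recovery steps, their durations, the backup interval, and the targets are all hypothetical, and it assumes the worst-case data loss equals the time between backups.

```python
from datetime import timedelta


def meets_recovery_targets(recovery_steps, backup_interval, max_downtime, max_data_loss):
    """Compare a recovery plan against acceptable downtime and data loss."""
    # Worst case: every recovery step runs one after another.
    worst_case_downtime = sum(recovery_steps.values(), timedelta())
    # Worst case: the failure hits just before the next backup would have run.
    worst_case_data_loss = backup_interval
    return {
        "worst_case_downtime": worst_case_downtime,
        "worst_case_data_loss": worst_case_data_loss,
        "downtime_ok": worst_case_downtime <= max_downtime,
        "data_loss_ok": worst_case_data_loss <= max_data_loss,
    }


# Hypothetical recovery plan for moving into a backup cloud.
plan = {
    "detect the outage": timedelta(minutes=15),
    "provision servers in the backup cloud": timedelta(minutes=45),
    "restore data from backups": timedelta(minutes=90),
    "repoint DNS": timedelta(minutes=30),
}

report = meets_recovery_targets(
    plan,
    backup_interval=timedelta(hours=6),
    max_downtime=timedelta(hours=4),
    max_data_loss=timedelta(hours=1),
)
print(report["downtime_ok"], report["data_loss_ok"])  # True False
```

In this example the plan restores service within the acceptable four hours, but backups taken every six hours cannot meet a one-hour data loss target, so the backup frequency is what has to change.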

The reliability of your cloud provider comes into play only when deciding how many failures you are willing to put up with. The only way you don't take any responsibility for failure is if you believe your cloud provider will never suffer a failure. That attitude, however, is nothing short of foolish. Here are some basic steps you can put in place to protect yourself:

Build in Redundancy