It's been a rough week or so for Amazon Web Services. They've suffered three independent outages as well as a number of other minor hardware failures. As with all well-publicized cloud outages, the media and others have latched on to these incidents to question the readiness of the cloud to handle important workloads. Failure, however, is not a cloud feature; failure is a feature of reality.
All systems—both natural and artificial—fail. In the computing world, the best we can hope for is the creation of redundant systems and backup systems that help minimize the impact of those failures. Where people run into trouble in the cloud is when they believe that "putting a system in the cloud" means not having to worry about redundancy and backup systems.
The survival of your applications and data is ultimately not something you can rely on any external entity to provide—you have to do it yourself. Amazon's response to the incidents reflects this sentiment. They take responsibility for the underlying failures and have indicated they are working to prevent similar issues in the future. At the end of their message, however, they state:
"We also want to remind users to take advantage of the Amazon EC2 features designed to help mitigate this and other types of instance failures."
This statement reflects their overall philosophy that customers should design for failure. It's also the starting point for a great discussion on where planning for failure ceases to be the cloud provider's responsibility and becomes the customer's responsibility.
The first point in this discussion is that Amazon should not be held up as some sort of poster child based on this event. This month it was Amazon. Next month, it will be someone else. We are going to see failures in the cloud until the end of time. Given the hype being applied to the cloud, it's natural that each failure in the short term will generate hype-backlash.
Ask yourself a basic question: when my cloud provider (or MSP, or outsourcing partner) fails, how much downtime and data loss is acceptable, and what can I do to guarantee that, no matter how badly my cloud provider fails, I will always be able to return to operation in less time than my acceptable downtime? Once you have an answer to this question, you have the outline of the steps you need to take to survive in the cloud or any other environment.
Note that this framework is not based on how reliable the underlying cloud provider is. Dealing with cloud failure is dependent entirely on how much downtime you are willing to absorb under the assumption that your cloud provider will fail. Because they will fail. Just like your own IT department fails from time to time; just like your managed services provider fails from time to time.
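To make that question concrete, here is a minimal sketch of the arithmetic, assuming a recovery plan can be broken into detection, failover, and restore steps. The class, the field names, and every number are hypothetical and exist only for illustration; they are not taken from Amazon or any other provider.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical recovery plan; all durations are in minutes."""
    detect_minutes: float           # time to notice the provider has failed
    failover_minutes: float         # time to bring up a replacement environment
    restore_minutes: float          # time to restore data from the latest backup
    backup_interval_minutes: float  # how often backups are taken off-provider

    def worst_case_downtime(self) -> float:
        # Assumes the steps run one after another, with no overlap.
        return self.detect_minutes + self.failover_minutes + self.restore_minutes

    def worst_case_data_loss(self) -> float:
        # A failure just before the next backup loses one full interval.
        return self.backup_interval_minutes


def meets_targets(plan: RecoveryPlan,
                  acceptable_downtime_minutes: float,
                  acceptable_data_loss_minutes: float) -> bool:
    """True if the plan recovers in less time than the acceptable downtime
    and loses no more data than the acceptable data-loss window."""
    return (plan.worst_case_downtime() < acceptable_downtime_minutes
            and plan.worst_case_data_loss() <= acceptable_data_loss_minutes)


if __name__ == "__main__":
    # Illustrative numbers only.
    plan = RecoveryPlan(detect_minutes=10, failover_minutes=30,
                        restore_minutes=45, backup_interval_minutes=60)
    print("Worst-case downtime (min):", plan.worst_case_downtime())
    print("Meets a 2-hour target:", meets_targets(plan, 120, 60))
    print("Meets a 1-hour target:", meets_targets(plan, 60, 60))
```

Nothing in the check refers to how reliable the provider is. The only inputs are what you can tolerate and what your own plan can deliver once a failure has already happened.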
The reliability of your cloud provider comes into play only when deciding how many failures you are willing to put up with. The only way you don't take any responsibility for failure is if you believe your cloud provider will never suffer a failure. That attitude, however, is nothing short of foolish. Here are some basic steps you can put in place to protect yourself:
- Build in Redundancy

I for one would be reluctant to buy from a vendor whose sales pitch is "assume we will fail at exactly the same time as your primary cloud provider."
Wow, just wow. Where do I sign up?
George,
I very much appreciate the thinking behind your post. Back in the day, ~15 years ago, I ran a small but highly-available set of web servers for Apple - doing Apple's first e-commerce, before the iTunes store existed.
I would go out and preach the mantra: each and every component of your system _will_ fail. Make a list. Prioritize it, at least in order of what you believe is most likely to fail. Make a plan for everything on the list. For example, when _xxx_ fails, _yyy_ will need to be done. Figure out how to make those things happen quickly or automatically if you can.
Recently, I have had this brought home again in a big way. Our younger daughter now relies on an insulin pump. The medical team says: it's a device, it _will_ fail. Make sure you have a plan for what to do _when_, not _if_, it does. And make sure you practice being ready to execute your plan.
Best,
Martin
I guess it's the perfect segue from the '00s: shed all responsibility by putting your data in someone else's hands. No matter how much breakage or leakage happens, the excuse is: "Well, we saved a couple of bucks."