It's been a rough week or so for Amazon Web Services. They've suffered three independent outages as well as a number of other minor hardware failures. As with all well-publicized cloud outages, the media and others have latched on to these incidents to question the readiness of the cloud to handle important workloads. Failure, however, is not a cloud feature; failure is a feature of reality.
All systems—both natural and artificial—fail. In the computing world, the best we can hope for is the creation of redundant systems and backup systems that help minimize the impact of those failures. Where people run into trouble in the cloud is when they believe that "putting a system in the cloud" means not having to worry about redundancy and backup systems.
The survival of your applications and data is ultimately not something you can rely on any external entity to provide—you have to do it yourself. Amazon's response to the incidents reflects this sentiment. They take responsibility for the underlying failures and have indicated they are working to prevent similar issues in the future. At the end of their message, however, they state:
"We also want to remind users to take advantage of the Amazon EC2 features designed to help mitigate this and other types of instance failures."
This statement reflects their overall philosophy that customers should design for failure. It's also the starting point for a great discussion on where planning for failure ceases to be the cloud provider's responsibility and becomes the customer's responsibility.
The first point in this discussion is that Amazon should not be held up as some sort of poster child based on this event. This month it was Amazon. Next month, it will be someone else. We are going to see failures in the cloud until the end of time. Given the hype being applied to the cloud, it's natural that each failure will, in the short term, generate a backlash against that hype.
Ask yourself a basic question: when my cloud provider (or MSP, or outsourcing partner) fails, how much downtime and data loss is acceptable, and what can I do to guarantee that, no matter how badly my cloud provider fails, I will always be able to return to operation within that acceptable downtime? Once you have an answer to this question, you have the outline of the steps you need to take to survive in the cloud or any other environment.
Note that this framework is not based on how reliable the underlying cloud provider is. Dealing with cloud failure depends entirely on how much downtime you are willing to absorb under the assumption that your cloud provider will fail. Because they will fail. Just like your own IT department fails from time to time; just like your managed services provider fails from time to time.
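To make the question concrete, here is a minimal Python sketch of how you might write down the answer for one application and check a recovery plan against it. Everything in it is an illustrative assumption rather than part of any provider's tooling: the `RecoveryPlan` class, its field names, the `find_gaps` helper, and the sample numbers are all hypothetical. It also assumes that the worst-case data loss equals the interval between backups stored outside the failing provider.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    """Hypothetical record of what one application can tolerate and what its plan delivers."""
    name: str
    acceptable_downtime: timedelta      # longest outage you are willing to absorb
    acceptable_data_loss: timedelta     # largest window of lost data you are willing to absorb
    estimated_recovery_time: timedelta  # worst-case time to return to operation after a provider failure
    backup_interval: timedelta          # time between backups kept outside the failing provider


def find_gaps(plan):
    """Return a list of ways the plan falls short of the stated tolerances."""
    gaps = []
    if plan.estimated_recovery_time > plan.acceptable_downtime:
        gaps.append(
            f"{plan.name}: recovery takes {plan.estimated_recovery_time}, "
            f"but only {plan.acceptable_downtime} of downtime is acceptable"
        )
    # Assumption: a failure just before the next backup loses one full interval of data.
    if plan.backup_interval > plan.acceptable_data_loss:
        gaps.append(
            f"{plan.name}: backups run every {plan.backup_interval}, "
            f"but only {plan.acceptable_data_loss} of data loss is acceptable"
        )
    return gaps


if __name__ == "__main__":
    # Sample numbers are made up for illustration.
    orders = RecoveryPlan(
        name="orders",
        acceptable_downtime=timedelta(hours=1),
        acceptable_data_loss=timedelta(minutes=15),
        estimated_recovery_time=timedelta(hours=3),
        backup_interval=timedelta(hours=6),
    )
    for gap in find_gaps(orders):
        print(gap)
```

Notice that nothing in the check asks how often the provider fails. It only compares what you can tolerate with what your own plan delivers once a failure has already happened.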
The reliability of your cloud provider comes into play only when deciding how many failures you are willing to put up with. The only way you don't take any responsibility for failure is if you believe your cloud provider will never suffer a failure. That attitude, however, is nothing short of foolish. Here are some basic steps you can put in place to protect yourself:
- Build in Redundancy