How's your disaster recovery plan?
I am sure it's well documented and enables your organization's operations to continue in the face of a significant loss of facilities and human life. And you test it at least once each quarter.
Probably not. How many of you did I lose with "How's your disaster recovery plan?" From what I have seen out there in the world, I would be willing to bet most of you. And those of you who do have something that would qualify as "a plan" probably never test it or, worse, depend on the survival of a handful of individuals for it to be executed at all.
We've seen a number of real disasters this past decade. A disaster is not simply a terrorist attack, a hurricane, or an earthquake. It's also a significant vendor suddenly declaring Chapter 7 bankruptcy. The challenge companies face in Disaster Recovery (DR) planning is that it has traditionally been complex and expensive. The cloud can remove many of the barriers to solid DR planning.
Suppose that, for whatever reason, you have suffered the sudden and unrecoverable loss of an entire data center. Do you know what you need to do to resume operations in another data center? How much data will you lose in resuming operations? How long will it take for you to resume operations? Acceptable answers to those questions depend on your business. The loss of one hour of data may be unacceptable for some businesses, while the loss of a whole week won't hurt others.
The most basic problem is that businesses typically can't answer the two core questions—how much data can we afford to lose, and how quickly must we resume operations—that define what constitutes an acceptable response to a real disaster. The few businesses that can answer those questions don't actually know if the processes they put in place will support their answers.
The first challenge is that being able to move to another data center in a reasonable time frame means already having contracts in place with an alternate data center. Some organizations rely on a single entity with geographic redundancy—but this reliance on a single entity is a mistake. As I mentioned earlier, a form of disaster is your managed services provider going out of business without warning. You therefore need more than geographic redundancy—you need organizational redundancy.
The second key challenge is the development of data backup routines that move the data out of the primary data center (and out of the control of the entity running that data center) and ensure that they are accessible in the event of a disaster. There's also a human aspect of this challenge that people often overlook: people die in disasters and it could be that the disaster that has taken out your primary data center has also taken key employees who have root passwords and the most detailed understanding of your DR procedures.
Cloud computing can serve as an important foundation of a rapid-recovery, low-data-loss DR solution. Imagine, for example, regularly synchronizing your production environments with Amazon S3. Assuming you have set up machine images that mirror your production environments, you should be able to rapidly recover into the cloud without paying to run an entirely redundant data center 24x7.
Here's how it works:
- Set up procedures for synchronizing data with Amazon S3 (or use a tool like enStratus to package the data regularly for DR deployment). These procedures should encrypt the data using the strongest encryption possible.
- Create machine images (AMIs) that have the same operating system, tools, core applications, and libraries as your production systems.
- (Optionally) Use a tool like enStratus to configure your DR environment to automate the DR process.
- Place all key authentication data as well as instructions for DR in a safety deposit box.
- Regularly test restoring your infrastructure based on the current data in the cloud.
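The packaging step in the checklist above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the `upload_to_s3` helper assumes the boto3 SDK and AWS credentials are available, and the bucket, key, and use of S3 server-side encryption are assumptions of this sketch—the checklist itself calls for strong client-side encryption before anything leaves your data center.

```python
# Sketch of a DR packaging step: archive a data directory, record a
# SHA-256 checksum for later integrity verification, and (optionally)
# push the archive to S3. Bucket names and keys are illustrative.
import hashlib
import tarfile
from pathlib import Path


def package_for_dr(data_dir: str, archive_path: str) -> str:
    """Create a gzipped tarball of data_dir; return its SHA-256 hex digest."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(data_dir, arcname=Path(data_dir).name)
    digest = hashlib.sha256()
    with open(archive_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_to_s3(archive_path: str, bucket: str, key: str) -> None:
    """Hypothetical upload step; requires boto3 and AWS credentials."""
    import boto3  # assumption: boto3 SDK is installed
    s3 = boto3.client("s3")
    # Server-side encryption is a floor, not a substitute for the
    # client-side encryption the checklist recommends.
    s3.upload_file(archive_path, bucket, key,
                   ExtraArgs={"ServerSideEncryption": "AES256"})
```

Storing the checksum alongside the archive lets a DR restore verify that the backup it pulled from the cloud is intact before rebuilding on top of it.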
If you are using a tool that automates the DR processes, you can programmatically run DR tests on a weekly basis and use testing tools to validate the success of each test.
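The validation half of such an automated test might look like the following sketch. A real run would first restore the environment from the latest cloud backup; the health-check URLs and the polling approach here are assumptions for illustration, and the `fetch` parameter exists so the logic can be exercised without live services.

```python
# Minimal sketch of validating a recovered environment: poll each
# service's health-check URL and report pass/fail. Suitable for
# driving from a weekly cron job after an automated test restore.
from urllib.request import urlopen


def check_endpoint(url: str, fetch=urlopen) -> bool:
    """Return True if the recovered service answers with HTTP 200."""
    try:
        with fetch(url, timeout=10) as resp:
            return resp.status == 200
    except Exception:
        return False


def run_dr_validation(urls, fetch=urlopen) -> dict:
    """Map each health-check URL to a pass/fail result."""
    return {url: check_endpoint(url, fetch) for url in urls}
```

Logging these results each week turns "we think our DR plan works" into a record showing that it does.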
The main benefit of this approach is that regular testing means you know this DR system will work for you, and you pay almost nothing for it until you actually need it. Nevertheless, it comes with two key challenges:
- You have to tailor the data synchronization processes to the volume of data you are synchronizing with the cloud and the type of data being synchronized. This data synchronization can get expensive (but not nearly as expensive as a fully redundant data center).
- You will need a reasonable level of structure in the way you package and deploy your custom applications.
The first challenge exists for any kind of DR infrastructure. If you have 1 TB of intranet data changing daily, you may not have enough Internet bandwidth to keep that data in sync with S3. Still, synchronizing it with a redundant data center is generally just as problematic.
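A back-of-the-envelope calculation shows why that 1 TB figure matters: moving it within a day requires a sustained link of roughly 93 Mbit/s, before any protocol overhead, encryption, or retries.

```python
# Sustained bandwidth needed to move a day's worth of changed data
# to the cloud within 24 hours (ignoring overhead and retries).
def required_mbps(bytes_per_day: float) -> float:
    """Megabits per second needed to transfer bytes_per_day in one day."""
    bits_per_day = bytes_per_day * 8
    return bits_per_day / 86_400 / 1_000_000  # 86,400 seconds per day


print(round(required_mbps(1e12), 1))  # 1 TB/day -> 92.6 Mbit/s sustained
```

Run the same arithmetic against your own daily change volume before committing to any synchronization strategy, cloud-based or otherwise.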
The second challenge is really an example of something you should be doing, but probably don't. I often encounter custom applications that must be hand-deployed each time they are rolled out into a new environment. In fact, that's probably the rule rather than the exception. That will work in the cloud, but you can't automate the DR processes, and you run the risk that a key piece of deployment knowledge is stuck in the head of a critical employee who was lost in the disaster at hand. It's worth the time to standardize your custom application deployments and make it possible to automate their deployment in a DR scenario.
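The standardization argument can be made concrete: if each application's deployment is described as data rather than carried in someone's head, a DR process can replay it automatically. The manifest format, application name, and step commands below are invented for illustration; the `runner` parameter lets the logic be tested without executing real commands.

```python
# Sketch of a manifest-driven deployment: ordered, repeatable steps
# that a DR process can execute with no human in the loop.
import subprocess

# Hypothetical manifest describing one application's deployment.
APP_MANIFEST = {
    "name": "billing-service",
    "steps": [
        ["tar", "-xzf", "/backups/billing-service.tar.gz", "-C", "/opt"],
        ["/opt/billing-service/bin/migrate-db"],
        ["/opt/billing-service/bin/start"],
    ],
}


def deploy(manifest, runner=subprocess.check_call):
    """Execute each deployment step in order; any failure aborts the deploy."""
    for step in manifest["steps"]:
        runner(step)
    return manifest["name"]
```

Once every application has such a manifest, the DR automation simply iterates over them—and no single employee's memory is a point of failure.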