Using the Cloud for Disaster Recovery

By George Reese
April 11, 2009

How's your disaster recovery plan?

I am sure it's well documented and enables your organization's operations to continue in the face of a significant loss of facilities and human life. And you test it at least once each quarter.

Right?

Probably not. How many of you did I lose with "How's your disaster recovery plan?" From what I have seen out there in the world, I would be willing to bet most of you. And those of you who do have something that would qualify as "a plan" never test it or, worse, depend on the survival of a handful of individuals to execute it.

We've seen a number of real disasters this past decade. A disaster is not simply a terrorist attack, a hurricane, or an earthquake. It's also a significant vendor suddenly declaring Chapter 7 bankruptcy. The challenge companies face in Disaster Recovery (DR) planning is that it has traditionally been complex and expensive. The cloud can remove many of the barriers to solid DR planning.

For whatever reason, let's say you have encountered a sudden and unrecoverable loss of an entire data center. Do you know what you need to do to resume operations in another data center? How much data will you lose in resuming operations? How long will it take for you to resume operations? Acceptable answers to those questions depend on your business. The loss of one hour of data may be unacceptable for some businesses, while the loss of a whole week won't hurt others.

The most basic problem is that businesses typically can't answer those two core questions (how much data they can afford to lose, and how long they can afford to be down) and so never define what constitutes an acceptable response to a real disaster. The few businesses that can answer those questions don't actually know whether the processes they have put in place will support their answers.

The first challenge is that being able to move to another data center in a reasonable time frame means already having contracts in place with an alternate data center. Some organizations rely on a single entity with geographic redundancy—but this reliance on a single entity is a mistake. As I mentioned earlier, a form of disaster is your managed services provider going out of business without warning. You therefore need more than geographic redundancy—you need organizational redundancy.

The second key challenge is developing data backup routines that move the data out of the primary data center (and out of the control of the entity running that data center) and ensure that it is accessible in the event of a disaster. There's also a human aspect of this challenge that people often overlook: people die in disasters, and the same disaster that takes out your primary data center may also take the key employees who hold the root passwords and the most detailed understanding of your DR procedures.

Cloud computing can serve as an important foundation of a rapid-recovery, low-data-loss DR solution. Imagine, for example, regularly synchronizing your production environments with Amazon S3. Assuming you have set up machine images that mirror your production environments, you should be able to recover rapidly into the cloud without paying to run an entirely redundant data center 24x7.
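
To make that concrete, here is a minimal recovery sketch in Python. It assumes the modern boto3 SDK (which postdates this article), and the region, AMI ID, instance type, bucket, and object key are hypothetical placeholders rather than anything prescribed by the approach itself.

```python
# Minimal recovery sketch: launch a recovery server from a pre-built image and
# stage the most recent backup from S3. All identifiers below are placeholders.
import boto3

REGION = "us-east-1"                      # hypothetical recovery region
RECOVERY_AMI = "ami-0123456789abcdef0"    # pre-built image mirroring production
BACKUP_BUCKET = "example-dr-backups"      # bucket holding the synchronized data
LATEST_BACKUP_KEY = "db/latest.dump.enc"  # most recent encrypted backup

ec2 = boto3.client("ec2", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# 1. Launch a recovery instance from the image that mirrors production.
instance_id = ec2.run_instances(
    ImageId=RECOVERY_AMI,
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# 2. Stage the most recent backup so it can be restored onto the new instance.
s3.download_file(BACKUP_BUCKET, LATEST_BACKUP_KEY, "/tmp/latest.dump.enc")
print(f"Recovery instance {instance_id} is running; backup staged for restore.")
```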

Here's how it works:

  • Set up procedures for synchronizing data with Amazon S3 (or use a tool like enStratus to package the data regularly for DR deployment). These procedures should encrypt the data using the strongest encryption possible; a minimal sketch follows this list.
  • Create machine images (AMIs) that have the same operating system, tools, core applications, and libraries as your production systems.
  • (Optionally) Use a tool like enStratus to configure your DR environment to automate the DR process.
  • Place all key authentication data as well as instructions for DR in a safety deposit box.
  • Regularly test restoring your infrastructure based on the current data in the cloud.
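
As referenced in the first step, a minimal synchronization sketch might look like the following. It assumes the boto3 and cryptography libraries; the bucket name, object keys, dump path, and environment variable are hypothetical, and real key management (the encryption key belongs in that safety deposit box, not in code) is deliberately left out.

```python
# Sketch of an encrypted backup push to S3. Paths, bucket, and key names are
# hypothetical; the Fernet key is read from the environment, never hard-coded.
import datetime
import os

import boto3
from cryptography.fernet import Fernet

BACKUP_BUCKET = "example-dr-backups"
DUMP_PATH = "/var/backups/production.dump"   # produced by your existing backup job

def push_encrypted_backup() -> None:
    # Encrypt before upload so the data is protected once it leaves your control.
    key = os.environ["DR_BACKUP_KEY"].encode()   # urlsafe base64 Fernet key
    with open(DUMP_PATH, "rb") as fh:
        ciphertext = Fernet(key).encrypt(fh.read())

    # Write a timestamped copy plus a stable "latest" pointer for recovery scripts.
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    s3 = boto3.client("s3")
    s3.put_object(Bucket=BACKUP_BUCKET, Key=f"db/{stamp}.dump.enc", Body=ciphertext)
    s3.put_object(Bucket=BACKUP_BUCKET, Key="db/latest.dump.enc", Body=ciphertext)

if __name__ == "__main__":
    push_encrypted_backup()
```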

If you are using a tool that automates the DR processes, you can programmatically run DR tests on a weekly basis and use testing tools to validate the success of the event.
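
As a rough illustration, the sketch below (again assuming boto3; the health-check path and success criterion are hypothetical) validates a test restore by checking that the recovered application answers, then tears the test environment down so the exercise costs only a few instance-hours.

```python
# Sketch of validating a scripted DR test: hit a health check on the restored
# instance, record the result, and terminate it. Endpoint details are hypothetical.
import urllib.request

import boto3

def validate_and_teardown(instance_id: str, region: str = "us-east-1") -> bool:
    ec2 = boto3.client("ec2", region_name=region)
    reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
    ip = reservation["Instances"][0].get("PublicIpAddress")

    # Treat the weekly test as a success if the restored app answers its health check.
    ok = False
    if ip:
        try:
            with urllib.request.urlopen(f"http://{ip}/health", timeout=30) as resp:
                ok = resp.status == 200
        except OSError:
            ok = False

    # Tear down the test environment; proving the restore works was the point.
    ec2.terminate_instances(InstanceIds=[instance_id])
    return ok
```
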
The main benefit of this approach is that you simply know this DR system will work for you and you pay nearly nothing for it. Nevertheless, it comes with two key challenges:

  • You have to tailor the data synchronization processes to the volume of data you are synchronizing with the cloud and the type of data being synchronized. This data synchronization can get expensive (but not nearly as expensive as a fully redundant data center).
  • You will need a reasonable level of structure in the way you package and deploy your custom applications.

The first challenge exists for any kind of DR infrastructure. If you have 1 TB of intranet data changing daily, you may not have enough Internet bandwidth to keep that data in sync with S3. Still, synchronizing it with a redundant data center is generally just as problematic.
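
A quick back-of-the-envelope calculation shows why: pushing 1 TB of changed data every 24 hours requires roughly 93 Mbit/s of sustained upstream bandwidth before any protocol overhead.

```python
# Back-of-the-envelope: sustained bandwidth needed to push 1 TB of changes per day.
changed_bytes_per_day = 1e12              # 1 TB of daily change
seconds_per_day = 24 * 60 * 60
bits_per_second = changed_bytes_per_day * 8 / seconds_per_day
print(f"{bits_per_second / 1e6:.0f} Mbit/s sustained")   # ~93 Mbit/s
```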

The second challenge is really an example of something you should be doing, but probably don't. I often encounter custom applications that must be hand-deployed each time they are rolled out into a new environment. In fact, that's probably the rule more than the exception. That will work in the cloud, but you can't automate the DR processes and you run the risk that a key piece of deployment is stuck in the head of a critical employee who was lost in the disaster at hand. It's worth the time to standardize your custom application deployments and make it possible to automate their deployment in a DR scenario.
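
As one illustration of what "a reasonable level of structure" could mean in practice, a deployment might be driven by a small manifest rather than by tribal knowledge, so the same script works in production and in a DR restore. The manifest format, paths, and commands below are hypothetical; a real environment would more likely use a configuration management tool, but the point is that nothing lives only in someone's head.

```python
# Sketch of a manifest-driven deployment. The manifest schema and commands are
# hypothetical; the idea is that deployment steps are recorded, not remembered.
import json
import subprocess

def deploy(manifest_path: str = "deploy.json") -> None:
    with open(manifest_path) as fh:
        manifest = json.load(fh)

    # Example manifest:
    # {"packages": ["nginx", "postgresql"],
    #  "artifact": "s3://example-dr-backups/app/app-1.4.2.tar.gz",
    #  "post_install": ["systemctl restart app"]}
    for pkg in manifest.get("packages", []):
        subprocess.run(["apt-get", "install", "-y", pkg], check=True)

    subprocess.run(["aws", "s3", "cp", manifest["artifact"], "/opt/app.tar.gz"],
                   check=True)
    subprocess.run(["tar", "-xzf", "/opt/app.tar.gz", "-C", "/opt"], check=True)

    for cmd in manifest.get("post_install", []):
        subprocess.run(cmd.split(), check=True)
```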



7 Comments

Awesome post, George. I am a big advocate for living and working in the cloud. Currently I back up my data in several places. First, obviously, is my external drive. But then after that is a variety of online backup services and cloud services: Carbonite, MS Mesh, Google Docs, Picasa, and Windows Live SkyDrive.

This way, no matter what happens, I have my data backed up in at least three places.

Robert Stanke

George,

This is another interesting (albeit not particularly new) idea. Back in 2007 I was working on repackaging Google Apps for disaster recovery but had trouble making it competitive, given it's a kitchen-sink solution and all I wanted was parts of it at a price point closer to the existing marketplace. Anyway, switching on something like Postini is just a case of updating the MX records...

So this particular use case - Infrastructure BC & DR - is also something that VMware et al have been doing for a while. Being able to do it with an arbitrary cloud infrastructure services product or provider would obviously be useful. I've added the use case to the Open Cloud Computing Interface (OCCI) wiki and will consider how to cater for image management out of band (e.g. over rsync) which will make regular incremental updates possible. I could envisage a cron job which simply rsyncs a physical block device to a remote raw disk file - at least for Unix systems... gets a bit hairier with Windows.

Sam

Great points, but your assumption is quite key! "You will need a reasonable level of structure in the way you package and deploy your custom applications." This is a little like high school physics experiments that tell you "first, assume no friction." The level of structure in how applications are packaged and deployed is THE weak point of nearly every firm I've worked at or consulted to, and it severely limits the potential for any kind of DR approach, much less a cloud-oriented one.

What do you think of the pricing for these services in the DR context? I use Moxy for my home network and am not impressed by the combination of price/performance, given that it took a week to back up 50GB initially and is costing me about $30/month. Thinking about how that scales for enterprises with much more data, I'm wondering how it stacks up with doing it inhouse if you already have the second site.

As you point out, both backup and restore could be problematic for large data sets.

At BAM we've specified and engineered large cloud computing systems for companies like Cisco, so Danny's comment above and our team's overall experience here are relevant in the DR context.

One of the largest cost problems from a DR perspective is going to be the hidden cost of programming on customized applications. J2EE stacks with memory leaks and hard coded calls and puts fall apart in the cloud. You could end up with a major code push on top of the hardware cost.

Secondly, the other big challenge is how heterogeneous application architectures use the messaging bus. Discrete objects and configuration items have to stand alone and only be used (ideally) when they are called. This is referred to as "statelessness". So in a SOA cloud implementation, the bus is only passing messages to other VMs that become stateful with instantiation and then close the programming loop after they've done their job.

At Burton Asset Management, we've found that this required Software as a Service training and awareness efforts, Service Oriented Architecture discussions, and ultimately having to onboard whole teams of programmers and deal with the "old" vs. the "new" players. From a SaaS standpoint, Java did so poorly in one instance that we had to move to a Flex/AIR programming model to make it work.

So, there could be loads of costs involved at the enterprise level when thinking about this as a DR solution.

See more thinking at www.thinkbam.com

Can anyone explain what the effects of cloud computing are on an existing contingency plan in a company?
