The AWS Outage: The Cloud's Shining Moment

The Amazon Web Services outage has a silver lining.

By George Reese
April 23, 2011 | Comments: 69

So many cloud pundits are piling on to the misfortunes of Amazon Web Services this week in response to the massive failures in the AWS Virginia region. If you think this week exposed weakness in the cloud, you don't get it: it was the cloud's shining moment, exposing the strength of cloud computing.

In short, if your systems failed in the Amazon cloud this week, it wasn't Amazon's fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon's cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider.

The AWS outage highlighted the fact that, in the cloud, you control your SLA, not AWS.

The Dueling Models of Cloud Computing

Until this past week, there's been a mostly silent war raging out there between two dueling architectural models of cloud computing applications: "design for failure" and traditional. This battle is about how we ultimately handle availability in the context of cloud computing.

The Amazon model is the "design for failure" model. Under the "design for failure" model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.

Most cloud providers follow some variant of the "design for failure" model. A handful of providers, however, follow the traditional model in which the underlying infrastructure takes ultimate responsibility for availability. It doesn't matter how dumb your application is, the infrastructure will provide the redundancy necessary to keep it running in the face of failure. The clouds that tend to follow this model are vCloud-based clouds that leverage the capabilities of VMware to provide this level of infrastructural support.

The advantage of the traditional model is that any application can be deployed into it and assigned the level of redundancy appropriate to its function. The downside is that the traditional model is heavily constrained by geography. It would not have helped you survive this level of cloud provider (public or private) outage.

The advantage of the "design for failure" model is that the application developer has total control of their availability with only their data model and volume imposing geographical limitations. The downside of the "design for failure" model is that you must "design for failure" up front.

The Five Levels of Redundancy

In a cloud computing environment, there are five possible levels of redundancy:

  • Physical
  • Virtual resource
  • Availability zone
  • Region
  • Cloud

When I talk about redundancy, I mean redundancy that enables you to survive failures with zero downtime: redundancy that simply lets the system keep moving when faced with failures.

Physical redundancy encompasses all traditional "n+1" concepts: redundant hardware, data center redundancy, the ability to do vMotion or equivalents, and the ability to replicate an entire network topology in the face of massive infrastructural failure.

Traditional models end at physical redundancy. "Design for failure" doesn't care about physical redundancy. Instead, it allocates redundant virtual resources like virtual machines so that the failure of the underlying infrastructure supporting one virtual machine doesn't impact the operations of the other unless they are sharing the failed infrastructural component.

The fault tolerance of virtual redundancy generally ends at the cluster/cabinet/data center level (depending on your virtualization topology). To achieve better redundancy, you spread your virtualization resources across multiple availability zones. At this time, I believe only Amazon gives you full control over your availability zone deployments. When you have redundant resources across multiple availability zones, you can survive the complete loss of (n-1) availability zones (where n is the number of availability zones in which you are redundant).
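
As an illustration of what that looks like in practice (my sketch, not anything from AWS documentation), spreading redundant capacity across zones can be as simple as a loop over the zones in a region. This assumes the classic boto EC2 interface; the AMI ID and key name are placeholders.

    # Sketch: launch one redundant web node in every availability zone of a
    # region, so losing any (n-1) zones still leaves at least one node serving.
    # Assumes the classic boto EC2 interface; AMI ID and key name are placeholders.
    import boto.ec2

    REGION = 'us-east-1'
    AMI_ID = 'ami-00000000'   # placeholder machine image
    KEY_NAME = 'example-key'  # placeholder SSH key pair

    conn = boto.ec2.connect_to_region(REGION)

    for zone in conn.get_all_zones():
        reservation = conn.run_instances(
            AMI_ID,
            instance_type='m1.small',
            key_name=KEY_NAME,
            placement=zone.name,   # pin this instance to a specific AZ
        )
        print('launched %s in %s' % (reservation.instances[0].id, zone.name))

    # A real deployment would also register each instance with a load balancer
    # and health checks so that traffic shifts automatically when a zone dies.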

Until this week, no one had needed anything more than availability zone redundancy. If you had redundancy across availability zones, you would have survived every outage suffered to date in the Amazon cloud. As we saw this week, however, an outage can take out an entire cloud region.

Regional redundancy enables you to survive the loss of an entire cloud region. If you had regional redundancy in place, you would have come through the recent outage without any problems except maybe an increased workload for your surviving virtual resources. Of course, regional redundancy won't let you survive business failures of your cloud provider.

Cloud redundancy enables you to survive the complete loss of a cloud provider.

Applied "Design for Failure"

In presentations, I refer to the "design for failure" model as the AWS model. AWS doesn't have any particular monopoly on this model, but their lack of persistent virtual machines pushes this model to its extreme. Actually, best practices for building greenfield applications in most clouds fit under this model.

The fundamental principle of "design for failure" is that the application is responsible for its own availability, regardless of the reliability of the underlying cloud infrastructure. In other words, you should be able to deploy a "design for failure" application and achieve 99.9999% uptime (really, 100%) leveraging any cloud infrastructure. It doesn't matter if the underlying infrastructural components have only a 90% uptime rating. It doesn't matter if the cloud has a complete data center meltdown that takes it entirely off the Internet.

There are several requirements for "design for failure":

  • Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure
  • Each application component must make no assumptions about the underlying infrastructure—it must be able to adapt to changes in the infrastructure without downtime
  • Each application component should be partition tolerant—in other words, it should be able to survive network latency (or loss of communication) among the nodes that support that component
  • Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure (full disclosure, I am CTO of a company that sells such automation tools, enStratus)

Applications built with "design for failure" in mind don't need SLAs. They don't care about the lack of control associated with deploying in someone else's infrastructure. By their very nature, they will achieve uptimes you can't dream of with other architectures and survive extreme failures in the cloud infrastructure.

Let's look at a design for failure model that would have come through the AWS outage in flying colors:

  • Dynamic DNS pointing to elastic load balancers in Virginia and California
  • Load balancers routing to web applications in at least two zones in each region
  • A NoSQL data store with the ring spread across all web application availability zones in both Virginia and California
  • A cloud management tool (running outside the cloud!) monitoring this infrastructure for failures and handling reconfiguration

Upon failure, your California systems and the management tool take over. The management tool reconfigures DNS to remove the Virginia load balancer from the mix. All traffic is now going to California. The web applications in California are stupid and don't care about Virginia under any circumstances, and your NoSQL system is able to deal with the lost Virginia systems. Your cloud management tool attempts to kill off all Virginia resources and bring up resources in California to absorb the added load.

Voila, no humans, no 2am calls, and no outage! Extra bonus points for "bursting" into Singapore, Japan, Ireland, or another cloud! When Virginia comes back up, the system may or may not attempt to rebalance back into Virginia.
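
To make that sequence concrete, here's a minimal sketch of the kind of watchdog such a management tool might run from outside the cloud. The region names and endpoints are placeholders, and the four helper functions are hypothetical stand-ins for real monitoring, DNS, and cloud-management APIs (this is not the enStratus or AWS API).

    # Sketch of an external watchdog implementing the failover sequence above.
    # The four helpers are hypothetical stand-ins for real monitoring, DNS,
    # and cloud-management calls.
    import time

    REGIONS = {
        'us-east-1': 'virginia-elb.example.com',     # placeholder endpoints
        'us-west-1': 'california-elb.example.com',
    }

    def check_health(endpoint):
        raise NotImplementedError('replace with an HTTP health check of the load balancer')

    def remove_from_dns(endpoint):
        raise NotImplementedError('replace with your dynamic DNS provider call')

    def terminate_region(region):
        raise NotImplementedError('replace with cloud API calls to kill resources in the region')

    def add_capacity(region, extra_nodes):
        raise NotImplementedError('replace with cloud API calls to launch replacement nodes')

    def failover(dead_region):
        survivors = [r for r in REGIONS if r != dead_region]
        remove_from_dns(REGIONS[dead_region])   # 1. stop sending traffic there
        terminate_region(dead_region)           # 2. try to kill whatever is left
        for region in survivors:                # 3. absorb the load elsewhere
            add_capacity(region, extra_nodes=2)

    def watchdog(poll_seconds=30, max_strikes=3):
        strikes = dict((r, 0) for r in REGIONS)
        while True:
            for region, endpoint in REGIONS.items():
                strikes[region] = 0 if check_health(endpoint) else strikes[region] + 1
                if strikes[region] >= max_strikes:  # only fail over on repeated failures
                    failover(region)
                    strikes[region] = 0
            time.sleep(poll_seconds)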

Relational Databases

OK, so I neatly sidestepped the issue of relational databases. Things are obviously not so clean with relational database systems, and the NoSQL system almost certainly would have lost a minimal amount of data in the cutover. If that data loss is acceptable, you had better not be running a relational database system. If it is not acceptable, then you need to be running a relational database system.

A NoSQL database (and I hate the term NoSQL with the passion of a billion white hot suns) trades off data consistency for something called partition tolerance. The layman's description of partition tolerance is basically the ability to split your data across multiple, geographically distinct partitions. A relational system can't give you that. A NoSQL system can't give you data consistency. Pick your poison.

Sometimes that poison must be a relational database. And that means we can't easily partition our data across California and Virginia. You now need to look at several different options:

  • Master/slave across regions with automated slave promotion using your cloud management tool
  • Master/slave across regions with manual slave promotion
  • Regional data segmentation with a master/master configuration and automated failover

There are likely a number of other options depending on your data model and DBA skillset. All of them involve potential data loss when you recover systems to the California region, as well as some basic level of downtime. All, however, protect your data consistency during normal operations—something the NoSQL option doesn't provide you. The choice of automated vs. manual depends on whether you want a human making data loss acceptance decisions. You may particularly want a human involved in that decision in a scenario like what happened this week because only a human really can judge, "How confident am I that AWS will have the system up in the next (INSERT AN INTERVAL HERE)?"
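
For the first option, here is a minimal sketch of what automated slave promotion might look like, assuming a MySQL master/slave pair driven through the MySQLdb module; the hostnames, credentials, and lag policy are placeholders, and the interesting part is that promotion is a policy decision about how much data loss you will accept.

    # Sketch: automated promotion of a cross-region MySQL slave when the
    # master region is lost. Hostnames and credentials are placeholders.
    import MySQLdb

    MASTER = {'host': 'db.virginia.example.com', 'user': 'admin', 'passwd': 'secret'}
    SLAVE = {'host': 'db.california.example.com', 'user': 'admin', 'passwd': 'secret'}
    MAX_ACCEPTABLE_LAG = 60  # seconds of replicated work we are willing to risk losing

    def master_is_down():
        try:
            MySQLdb.connect(connect_timeout=5, **MASTER).close()
            return False
        except MySQLdb.OperationalError:
            return True

    def promote_slave():
        conn = MySQLdb.connect(**SLAVE)
        cur = conn.cursor()
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        status = dict(zip([col[0] for col in cur.description], row))
        lag = status.get('Seconds_Behind_Master')
        if lag is None or lag > MAX_ACCEPTABLE_LAG:
            # Too far behind (or unknown): promoting now would silently drop
            # data, so this is exactly where a human should make the call.
            raise RuntimeError('replication lag %r exceeds policy' % lag)
        cur.execute("STOP SLAVE")
        cur.execute("RESET SLAVE")               # forget the old master
        cur.execute("SET GLOBAL read_only = 0")  # start accepting writes
        conn.close()
        # ...then repoint the application tier (DNS or config) at the new master.

    if master_is_down():
        promote_slave()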

The Traditional Model

As its name implies, the "design for failure" model requires you to design for failure. It therefore significantly constrains your application architecture. While most of these constraints are things you should be doing anyway, most legacy applications just aren't built that way. Of course, "design for failure" is also heavily biased towards NoSQL databases, which often are not appropriate in an enterprise application context.

The traditional model will support any kind of application, even a "design for failure" application. The problem is that it's often harder to build "design for failure" systems on top of the traditional model because most current implementations of the traditional model simply lack the flexibility and tools that make "design for failure" work in other clouds.

Control, SLAs, Cloud Models, and You

When you make the move into the cloud, you are doing so exactly because you want to give up control over the infrastructure level. The knee-jerk reaction is to look for an SLA from your cloud provider to cover this lack of control. The better reaction is to deploy applications in the cloud designed to make your lack of control irrelevant. It's not simply an availability issue; it also extends to other aspects of cloud computing like security and governance. You don't need no stinking SLA.

As I stated earlier, this outage highlights the power of cloud computing. What about Netflix, an AWS customer that kept on going because they had proper "design for failure"? Try doing that in your private IT infrastructure with the complete loss of a data center. What about another AWS/enStratus startup customer who did not design for failure, but took advantage of the cloud's DR capabilities to rapidly move their systems to California? What startup would ever have been able to relocate their entire application across the country within a few hours of the loss of their entire data center without already paying through the nose for it?

These kinds of failures don't expose the weaknesses of the cloud—they expose why the cloud is so important.


69 Comments

Couldn't agree more. When we designed the Xeround SQL Cloud Database-as-a-Service, we looked at all the layers you listed for alternate DRP design: Physical > Virtual resource > Availability zone > Region > Cloud.
With that in mind, we took a cloud-agnostic approach enabling our users to run their databases on any public cloud - Amazon (East, Europe, same/multi zone), Rackspace and soon many other IaaS providers worldwide - as well as private cloud (VMware vCloud)… In fact we also support a similar approach PaaS-wise, supporting Heroku, cloudControl and quite a few others coming soon.
We know running a DB in the cloud is tricky, so we took the DBaaS direction with a Hakuna Matata, worry-free philosophy: auto everything - healing, elastic scalability, distribution - with more front-end SQL/NoSQL APIs coming soon.
Try us at xeround.com (Razi Sharir)

I thought the failure was that multiple availability zones died simultaneously, something that by design and per Amazon's docs should never happen short of a hurricane in Virginia. Note that it is exponentially harder to distribute your app not only across AZs but across geographical areas as well: high-speed links connect AZs within a geo, but going from one geo to another is extremely slow and not realtime.

Of course you design for failure, it happens every day on AWS. But can you design around multiple datacenters (availability zones) dying simultaneously? When AWS told you not to worry about that eventuality? Probably not without downtime and some serious compromises.

It was a surprise that so many popular web services went down when Amazon went down. We always assumed one of two things would happen: that our infrastructure in the Amazon cloud would fail or our infrastructure in our data center would fail (hopefully not at the same time).

The problem is that once EVERYONE falls back to a service in another availability zone, that zone suddenly has to handle twice the load (probably a lot more when Virginia goes down, because it's generally believed to have the most instances). We saw pretty heavy slowdown across zones even with only a handful of people following this approach. You need to either bring another provider into the mix, or just have faith that AWS keeps piles and piles of spare capacity.

The ‘overload’ situation in such an event lasts for either a few minutes or a couple of hours... you are right, it affects ‘performance’... but that is what happens when you fail over business-critical apps to a DR site after the primary site goes down... you do it for just 20% of your total apps... and assume that they will fail back to the primary site in a few hours... DRs are not designed to meet performance targets, but to keep business-critical apps available in a scaled-down mode.

More bullshit.

Ayep. What BiggieBig said. The cloud fanatics really do need to wake the f__k up and smell the f__ckin' coffee.

Succinct, not much substance or thoughtful consideration but succinct.

Biggie, tell us more...Messieur Biggie. Inquiring minds want to know your point of view.

AWS previously assured us that multiple Availability Zones wouldn't realistically fail at the same time. Now that this has proved to be untrue, you choose to say, "Ah - you shouldn't have believed AWS, you should have been using multiple regions." Presumably when the next outage hits both US regions you'll say, "Ah - of course you should have used the EU and Asia regions as well."

We should recognize AWS as a single point of failure and look at hosting across multiple providers. Fool me once, shame on you; fool me twice, shame on me.

This does require sophisticated management tools like enStratus, but you should use those tools to avoid putting all your eggs into the AWS basket.

I'm not sure that the rest of the technology stack has necessarily caught up to this model though - in particular NoSQL databases aren't the panacea you appear to believe them to be. Hopefully all the pieces of the technology stack will evolve.

AWS has never in any conversation I have ever had said that multiple availability zones would not realistically fail at the same time. If they felt that way, don't you think they'd have an SLA better than 99.9%?

Of course, if you want to survive the failure of multiple availability zones, you should spread yourself across regions. I don't understand why this is so hard for people to understand.

Similarly, yes, you should have some ability to migrate your systems into another cloud. I don't think actual technical loss of all AWS regions (or even multiple regions) can happen absent nuclear war or an asteroid strike, but companies do go out of business/get sued/etc.

From the EC2 homepage (http://aws.amazon.com/ec2/):
"Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location."

From the EC2 FAQ (http://aws.amazon.com/ec2/faqs/):
"Q: How isolated are Availability Zones from one another?
Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone."

Seems pretty clear to me that multiple AZ failure is supposed to be unrealistic except in the case of disasters, and AWS even explicitly state that it would have to be a large scale disaster, not just a "measly" fire, tornado or flood :-)

In addition, AWS themselves engineered their own solutions reflecting this assumption (e.g. RDS Multi-AZ is multi-AZ, not multi-region)

Of course, you're right - AWS was over-promising here; we should have ignored what they stated and used multiple regions. But it's the same people and the same software that run those multiple regions, so I don't understand how you continue to have faith that multiple regions won't go down except in extraordinary circumstances.

I think we're in agreement that you can't trust a single AZ; we've learned in this outage that you can't trust a single Region. We only disagree in that you continue to have faith in multiple AWS regions, whereas I have no reason to believe that e.g. an AWS software bug won't get deployed to all regions, or that a rogue AWS employee won't somehow shut down all the regions.

As for your conversations with AWS, if they were in fact privately telling you that multiple AZ failure was likely, while publicly saying the opposite, I think you should publish that story.

Pretty telling that the author never replied to this comment, yet replied to comments below.

Cloud computing has its uses, but anyone trying to "polish the turd" that was this epic outage is nothing more than a glorified used car salesman. Amazon promised something that they didn't deliver - it's really that simple. They had more downtime this week than my company has had in the last 5 years, and that includes an entire physical relocation of our corporate office.

Trying to justify the outage as a failure of the customer is so ridiculous I'll never read anything by this author again.

Cloud is an approach and a concept of sharing IT resources, making them available on demand at low cost (because they are shared and not reserved or fully controlled by one user). The concept is evolving... putting all of your stuff in the cloud is nothing but expecting something the cloud may not yet be able to deliver. Evolution takes time, and issues like this would accelerate the evolution... Even if you build your own DCs across the globe there will be downtime. The downtime may be controlled, justified, a little longer or shorter, but there will be downtime. Time to appreciate Murphy's Law: "Anything that can go wrong, will go wrong."


But that's for EC2. EC2 isn't what went down, right?

This was foretold decades ago in the movie Animal House:

"You f****ed up. You trusted us!"

I really like how various vendors jumped in these comments saying how their products would have prevented it.

a.k.a. see "Animal House".

We documented our experience at bitmenu here:

http://blog.bitmenu.com/2011/04/production-push-redundancy-is.html

An event like this makes us value the new infrastructure even more.

I think Amazon should stick with selling everything under the sun except cloud web services. Leave google to master the cloud much like they master everything else on the internet. EC2 FAiL :D

Because Google has such a stellar track record?

Google has had frequent outages of the datastore (NoSQL) offered in Google App Engine.

Your check from Amazon is in the mail, tech shill guy.

"The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider"

Sadly, no. That the developers are in charge of this stuff is why so many sites were down completely. They're terrible at it, don't value it, and even when they try to roll it out they do it poorly.

An IT guy would have spent 15 minutes on day 1 thinking about disaster recovery. A developer always wants to do it tomorrow and tomorrow, as we all know, never comes.

Certainly, you can design for failure. And for those cases where failure is literally not an option, like things that deal with life safety or where thousands of dollars are lost every second, sure.

But one simple thing you don't address is that designing for failure is a lot more expensive during the development cycle. Yes, any bridge across troubled water can be over-built to ensure that it never, ever fails, but doing so is often so cost-prohibitive as to be unrealistic.

And lastly, your advocacy would have sounded more credible if you had stated up front that you are CTO of a company that purports to help people solve this problem in exchange for mucho dinero, rather than burying that fact in the middle of the article.

"In short, if your systems failed in the Amazon cloud this week, it wasn't Amazon's fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon's cloud computing model."

Oh yes, a company would provide you a cloud service to host your data and when that very service fails and renders your own operation useless it is your fault. So why pay for their service in the first place?

This sounded like an argument made by either a total irrational fanatic, or another network guy who is clueless about creating software, or both. Seems like outages these days are always the fault of the software developer(s) and never that of the one maintaining the network resource.

Anybody with common sense should stop reading right after that quoted sentence.

Well it's the fault of the software developers for thinking they don't need "clueless network guys"

You realize "in th3 cloud!!! ZOMG we're in cloud!!!" means nothing more than you're running a virtual machine in a data center somewhere. That's all amazon does. It's not magic - you're not safe because you're "in the cloud."

It's just a datacenter. That most software engineers don't know that is why they need "clueless network guys" to point it out to them.

Cloud is not for you... host your apps in your DCs.

Of course it's for me. I just know exactly what it is. You guys all think it's something mystical - it's bizarre.

"100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage."

The problem was Amazon had multiple data centers fail simultaneously. Your comments ridiculing people for not being smart enough to span multiple *regions* are a bit strained. According to Amazon's docs, spanning multiple data centers within a single region should have been sufficient to guard against failures short of a regional natural disaster.

So far, it looks like Amazon did not architect their systems to properly isolate data centers from one another, namely by allowing an EBS failure in one center to cascade to others via a single region-wide failure point. Customers bear only limited blame for expecting them to adhere to their stated architectural principles. If Amazon had instead made an even larger goof and architected a single *global* failure point for EBS, this outage would have cascaded to the other regions as well, and your argument would be completely moot.

It is Amazon's fault for not properly isolating data centers. It is customers' fault only to the extent that they relied on Amazon as their exclusive host -- not that they failed to understand how to architect their apps as per Amazon's guidelines.

See this blog: http://labs.mudynamics.com/2011/03/10/blitzio-couchdb-in-production/ We had everything set up right, except the web app for http://blitz.io was on Heroku, which happened to be on all AZs in Virginia. We had lots of dynos to handle the web-app load, but when Virginia went down, poof. All of our scale engines, our CouchDB clusters, everything stayed up except the web app. The AWS outage is more about PaaS offerings not distributing the app across multiple regions and handling catastrophic failures like this. AWS already has a consistent multi-region API to spread the love around the world. People just need to make use of it.

This article has a point; any company running on AWS could have designed its system to survive this outage. But this missed two key points:
1) How could they test this survivability, end-to-end, ahead of time? The rule is: if you didn't test it, it probably won't work. Companies that survived unscathed were prepared *and* lucky.
2) What about recovery? The statement "No humans" is wrong. A company may be able to design for the initial outage, but designing to automatically handle a period of days where Amazon are fiddling with flaky infrastructure is practically impossible. Everyone will have a lot of overtime afterwards, making sure their systems are working perfectly and data is consistent.

And the elephant in the room is that startups can only tackle a few problems at once. If they pile resources into 99.999% reliability, the opportunity cost is that the rest of their development goes slower and they fall behind the competition.

I'm curating a history of the outage at:
http://blog.marketingxd.com/post/4808529314/ice-cream-castles-in-the-air

Crafting the test scenarios for cloud computing can definitely be challenging, but it is doable.

My best advice is to assume that anywhere you have only one of something (e.g. an availability zone), you have some kind of single point of failure, and to shut down access to that single point of failure.

Then automate your tests!
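
For what it's worth, here is a minimal sketch of that kind of automated test using Python's unittest; the three helpers are hypothetical stand-ins for whatever tooling you use to cut off a zone and probe the application.

    # Sketch of an automated "kill one availability zone at a time" test.
    # block_zone(), unblock_zone(), and app_is_healthy() are hypothetical
    # stand-ins for your firewall/security-group tooling and health checks.
    import unittest
    from contextlib import contextmanager

    def block_zone(zone):
        raise NotImplementedError('cut off all traffic to %s here' % zone)

    def unblock_zone(zone):
        raise NotImplementedError('restore traffic to %s here' % zone)

    def app_is_healthy():
        raise NotImplementedError('probe the public endpoints of the application')

    @contextmanager
    def simulated_zone_outage(zone):
        block_zone(zone)
        try:
            yield
        finally:
            unblock_zone(zone)

    class SurviveZoneLossTest(unittest.TestCase):
        ZONES = ['us-east-1a', 'us-east-1b', 'us-east-1c']

        def test_app_survives_loss_of_any_single_zone(self):
            for zone in self.ZONES:
                with simulated_zone_outage(zone):
                    self.assertTrue(app_is_healthy(),
                                    'application died when %s was cut off' % zone)

    if __name__ == '__main__':
        unittest.main()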

You say that like you've never actually done it from soup to nuts, as if it were cheap and easy.

The cloud is basically just outsourcing to a mainframe-model.

And lo... the cloud has all the weaknesses of a mainframe, and no amount of attempting to shift responsibility onto a "design for failure" is going to change that...

... cos when you come down to it, THE most effective way of designing for failure is distributed redundancy, ie: away from the mainframe-model, and back to the raison d'etre of the internet in the first place.

Until the cloud becomes the mesh (ie: not owned by a single company) it is inherently weak. I mean, this last thing was a technical failure, but Amazon could simply decide to switch you off. Like they did with WikiLeaks.

I wondered when the Cloud Snow Job would begin, surprised it's on O'Reilly frankly.

What this article doesn't "get" is that Amazon fundamentally did not deliver what it said on its tin.

Also, the solutions espoused here are pretty standard traditional datacentre operating procedures, which cost real money - the whole point of the "Cloud" was to avoid these costs, else why bother?

As to "Applications built with "design for failure" in mind don't need SLAs." - run a mile from anyone who suggests that.

For us, who get paid by uptime, we get it: three days is too long.

Oh, now I get it. When the cloud fails gloriously, that shows us why the cloud is so important. Thank you very much for clearing that up. And what a shining moment it was. I personally can't wait for the next instance of shining moment for the cloud to show us once more how awesome it is in all it's shinyness.
Now, if you'll excuse me, I have to go knock my head against a wall to clear up my headache. You know, that also is a shining moment of headache cure. Knocking your head against a wall. Most people would say that this is bad when you have a headache, but I know it is really a shining moment to show that knocking my head against a wall really can clear up a headache. You should try it sometime!

I always thought it was a myth that IT guys are scared of the cloud because they think it's going to take away their job. Apparently from the comments on this post..this is no myth.

and google master the cloud?! Are you sniffing glue?

I think the vitriol behind some of these comments shows exactly why the business is moving to the cloud.

IT: The Department of No

Do you guys think people are really duped into moving to the cloud because of pots of gold and promises of no worries?

No. They are moving there because you, the IT leader, make it impossible to do their jobs. Procuring a server or even a VM in most organizations takes 3-6 weeks (or, in some cases, 3-6 months) and the business has real work to do that actually generates revenue for your company.

But you are saying no to the business and going on forums yapping about why the cloud sucks.

In the meantime, they are sticking systems without any controls or risk analysis or redundancy in the cloud.

Stop bashing the cloud and do your job. Help them move into the cloud appropriately.

The reason it can take some time to get a new system up (3-6 weeks is probably long for a small company but believable for a big one, and is certainly long for any company which has virtualized internally) is because that system needs to have controls and there needs to be risk analysis and redundancy planning exactly as you suggest. The cloud doesn't mitigate any of that. It just shifts the responsibility for it from the "Department of No" to the "Developer who doesn't know." It still takes time and effort and skill to do it right. Which I think was the original point of your post.
But the bleeding edge developers who want to do something cool and not have to worry about those pesky details have to either wait for somebody to do that part for them, or take it to the cloud and accept that their amazing app is going to be subject to the whims of Amazon's IT staff who don't give a darn about them and their puny app, instead of the whims of their own IT guy that they could be taking out for a beer every now and then. IT can be a good friend. But you can't treat them badly (or, for example, call them names) and expect them to still cater to your every whim when you have one. There are 50 other developers clamoring for the same thing. They're probably off helping the ones they like more.

Why do I keep reading all these articles about how great cloud computing is, and how the major failure of a cloud service provider just proves it.

Wait... what? How does failure - epic failure - failure that wasn't supposed to be possible - Titanic type failure - how does that prove how great cloud computing is?

Amazon screwed up. Their system failed. Cloud computing is not as reliable as touted. Period. End of story. Stop making excuses.

I can't be the only one hosting stuff on EC2 us-east-1d, and hasn't been affected _at all_ by these outages, right?

We had recently shifted to Amazon Web Services for hosting our Facebook game, Fun2Hit. We were running out of Virginia only and did not have any issues with any Amazon services (EC2, EBS or SimpleDB) for that game.
On the first day of the failure we did face problems launching new EBS volumes and EC2 instances for testing purposes for our forthcoming products. We shifted testing to the California region of AWS and it worked nearly fine, except that it took a little more time than usual to start and stop EBS.
I support the author on planning for failure. Planning in the cloud is similar to planning for bank failures. You do not put all your savings in one bank or one asset class. Similarly, you should diversify.

Cloud does provide benefits of arbitrage, redundancy, rapid prototyping and development, etc. We should also design applications for intra-service transfer and redundancy. For example, if you are using RDS and RDS in your region goes down, you can plan to have redundancy for the application ready in SimpleDB and S3. These two services were running fine even when RDS had issues. It is difficult to convince developers to keep this in mind, but it can be an effective weapon against this issue.

You can visit my Facebook page for more on cloud computing at Facebook.com/openpad or visit www.openclass.in

George,

I think that calling this week the cloud's "shining moment" is stretching things. It would be more accurate to say that, since it's the cloud, recovery is easier. With native Amazon tools in some cases, and strong devops tools and practices in others, DR can in fact be radically cheaper and easier in the cloud than in traditional, physical infrastructures.

Cheaper and easier, though, is only part of the point of the cloud. Another important point is that it lets IT focus more on the business and less on infrastructure complexities. At the moment, major cloud providers remove physical complexities but not really software ones. It's still up to us to design, build, deploy, and manage those great devops practices. Vendors like Enstratus take the next step and bite off that layer. Either poetically or ironically, depending on your viewpoint, they use the cloud to recover from failures in the cloud. (If nothing else, we're proving there is no such thing as THE cloud. Maybe we should call it "the sky" instead. The sky definitely was falling this week! :-).

In any case, it would be interesting to learn how much load the various cloud automation services handled this week. Can they scale if thousands of AWS customers use their service to migrate tens of thousands of servers and terabytes/petabytes of data all at once, or does the meltdown cascade from one level of the cloud to the next?

Data, on the other hand, is a whole different kettle of fish. If we all waited to migrate to the cloud until we'd implemented true design-for-fail architectures, cloud adoption would be at least an order of magnitude slower than it is. The bottom line is that we're still only partway through the journey to the holy grail of the cloud. So far we've pushed the complexity, difficulty, and cost up the stack from hardware to software and systems architecture. This week the tradeoffs that were made at the architecture level are revealing themselves for everyone to see.

Anything more substantial being shared from Amazon about what actually happened? So much we can all learn from a giant's points of failure. Any telling reports/analysis one can share?

Nothing yet.

And my comments are not meant to suggest this outage or the severity of it are excusable or acceptable.

The whole point is that regardless of how bad things get at an infrastructural level, the cloud enables you to survive it.

The layman's description of partition tolerance is basically the ability to split your data across multiple, geographically distinct partitions.

No, that's completely wrong. Partition tolerance is the ability to continue to meet your service guarantees in the presence of communication failures that isolate portions of your system.

Also, non-relational is not the same thing as eventually consistent (just ask the HBase folks). You can have strong consistency requirements without using a relational storage model.

When you say, "The knee-jerk reaction is to look for an SLA from your cloud provider," you are ignoring the point made above by Abol. Amazon claimed that zones were independent, but an EBS failure affected multiple zones. They fell into the same trap you are warning about: they didn't design for failure.

Actually, it's a good layman's definition of it. You seem to be conflating technical details with a simple, non-technical generalization. Same with the idea of NoSQL.

I am sorry, but this is too stinky a load of crap to be taken seriously. If the author wants to provoke an argument about how to improve application design, or to promote cloud computing, there are so many OTHER ways to say it. It really takes the Security Chief of Jonestown to put out something like this.

I don't buy it. This theory works with relatively small outages (say, a datacenter or two) affecting a small portion of the companies using the cloud (say, 40% of capacity). The bottom line is that it is a 'cloud' to the users, but real datacenters, real servers, and real network gear for Amazon. Not all Amazon datacenters (and associated gear) are created equal. No company, including Amazon, can design datacenters with exactly equal capacity or plan for multiple failures. If they did, this would've never made the news. For example, if the Virginia datacenter fails, why not just fail over to the West Coast datacenters? If they both fail, why not fail over to Europe or Asia? The truth is the capacities are not the same and they can't withstand a heavy outage in the more heavily used portion of the capacity. Say everybody signs up for redundant instances in both East and West Coast AZs; they will just overload the West Coast datacenter all the same.

"Amazon screwed up. Their system failed. Cloud computing is not as reliable as touted. Period. End of story. Stop making excuses."

Ok, let's go back in time with no electricity coming from utility companies.

I didn't realize Netflix was a customer of yours! It's impressive that you managed to keep a site that big up and running through it all. Congrats!!

"What about Netflix, an AWS customer that kept on going because they had proper "design for failure"? Try doing that in your private IT infrastructure with the complete loss of a data center. What about another AWS/enStratus startup customer..."

No, Netflix is not an enStratus customer.

AFAIK, they have their own, custom-built tools for managing availability and a very solid architectural strategy.

Did Netflix just get lucky because they were in CA zones to begin with?

I've seen a few comments that suggest that the whole US East region was/is unavailable. That's not true; it was just zone C. To weather this particular event, all you had to do if you were an AWS customer was one of:

a) rebalance out of zone C
b) design your system not to use EBS in the first place

Granted, this second option would still leave systems vulnerable since starting a new EC2 instance requires EBS at least temporarily.

It appears there were some issues with the EBS control plane that is used by all zones, and so there was an impact on all zones in the region. The impact on the other zones was somewhat less severe than the impact on the one AZ with the EBS mirror issues. Details here:

http://aws.amazon.com/message/65648/

This was a moderately interesting but also intensely frustrating posting.

"This should never have happened if you designed your services right, to never trust (that one) cloud!"

Yes, but the number of people who have all of the CS and IT architecture and IT operations backgrounds to understand how to not trust any single point of failure and actually design really robust systems around that sort of thing is not that large.

What is being suggested is that the entire industry must suddenly develop a higher level of technical competence than it now has, by a large factor.

Would this be a good thing? Of course. Is it practically going to reach the priority level that real operational organizations can make it happen? Unlikely.

Eventually, attempting to wring the last 9's out of a service, one runs into externalities such as partitioned and failing backbone ISPs, major DNS outages, physical damage to infrastructure, and other hard to solve problems. One can design right up to that ragged external unavoidable outage edge, with arbitrary amounts of time and money and expertise. I and a few others are happy to do that for clients. But knowing what is economical and sensible, and what is polishing the shine on areas when there are larger inherent risks accepted as costs of doing business, is important.

Actually, almost all you need to know to achieve AZ-redundancy is to simply follow programming best practices you were supposed to be following in the first place.

Getting x-region and x-cloud is moderately more difficult, but not as difficult as doing it for a traditional data center.

George Reese writes:
Actually, almost all you need to know to achieve AZ-redundancy is to simply follow programming best practices you were supposed to be following in the first place.

No, not even close. You need to follow system design and integration best practices you were supposed to be following in the first place, which includes programming and architecture and all the other subcomponents.

The number of organizations that actually meet system design and integration best practices, in the real world, is very small. Hence my frustration. Actual high availability and dependability is a much harder problem than people tend to think it is. Saying it's just a programming problem is obfuscating.

All the things that need to be done are described in literature and operational reports and so forth. None of them are secret or particularly obscure. But rigorous study of systems architecture needed to understand the scope of it well enough to conceive of it and implement it is rare.

Then stop writing software in the cloud.

So a cloud provider is just like any other datacenter, except you have to spend a lot more money in development to work around their unreliability. Awesome!

-- The AWS outage highlighted the fact that, in the cloud, you control your SLA in the cloud—not AWS.

Wow. Just.... WOW.

I think the author is mixing two things up.

1) Massive failure of AWS despite their SLA
2) Effect of Massive disaster of AWS failure for customers

These two might look like the same but they are obviously not. Failure of AWS is of course a good thing for cloud computing as a ringing bell which eventually makes us think more in depth when designing our architecture and considering these things. But surely any shortage in the applications of customers do not justify AWS failure.

The latest Dr. Who (on BBCA) had the doctor get killed because he got killed while regenerating from getting killed. And they said it three times so it must be true.

Yes, he saved himself by sending an invitation backwards in time, to himself, so that he could sort things out (One assumes. We'll have to wait for next week to see…). Is that what's meant by designing for failure?

Message to past self: tomorrow, my today, AWS is going to fail. Please start planning now because otherwise your Easter break will be just that: broken.

Nolio customers were able to automatically deploy their applications to US West as well as other cloud providers - http://www.noliosoft.com. If you're one of them, would love to hear your story.

Adrian Cockcroft recently addressed the issue of distributing relational databases in his blog (http://perfcap.blogspot.com/2010/11/nosql-netflix-use-case-comparison-for_17.html). It looks like there is a way for traditional SQL-based applications to scale-out effectively.


I don't know if I'd go so far as to call it a "shining moment". Even though these outages were technically avoidable, many internet shops either don't have the resources to have redundant everything or, more likely, don't have the know-how. That said, you're absolutely right, and more need to embrace design-for-failure methodologies. The Netflix Chaos Monkey is a great example of this.

At minimum, shops deployed in EC2 need scripts to rebuild all servers from bare metal, with AMIs, source code and configurations all in version control. And then, of course, a database backup: a snapshot in S3, a hot backup via Percona's xtrabackup, and a mysqldump (assuming you're using MySQL). The latter two can be copied offsite periodically.

-Sean Hull
My Discussion of the Amazon outage
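
(A minimal sketch of the nightly "dump it and copy it somewhere else" job described above, assuming MySQL and the classic boto S3 interface; the bucket name, credentials, and paths are placeholders.)

    # Sketch: dump all MySQL databases, compress the dump, and copy it to S3.
    # Bucket name, credentials, and paths are placeholders; boto reads its AWS
    # credentials from the environment or its config file.
    import datetime
    import subprocess
    import boto
    from boto.s3.key import Key

    BUCKET = 'example-offsite-backups'
    DUMP_CMD = 'mysqldump --all-databases -u backup -pSECRET | gzip'

    def nightly_backup():
        stamp = datetime.datetime.utcnow().strftime('%Y%m%d')
        local_path = '/var/backups/mysql-%s.sql.gz' % stamp
        # 1. Dump every database and compress it.
        with open(local_path, 'wb') as out:
            subprocess.check_call(DUMP_CMD, shell=True, stdout=out)
        # 2. Copy the dump to S3 (and, periodically, somewhere outside AWS too).
        bucket = boto.connect_s3().get_bucket(BUCKET)
        key = Key(bucket, 'mysql/%s.sql.gz' % stamp)
        key.set_contents_from_filename(local_path)

    if __name__ == '__main__':
        nightly_backup()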

The outage was summarized by someone at Rackspace who said (from memory here):

"The AWS outage is like a plane accident. It's news every time it happens, but that does not mean that driving is safer."

No one hears about the 50,000 cats who pulled out a server plug and brought down a small site, but when 50,000 sites go down at once, it's news.

Another way of putting it:
The newsworthiness of the story increases linearly with the number of simultaneous downtimes, but overall downtime is simply not affected by this simultaneity.

Yes - if you want to be up all the time you should spend $ doing the things he talks about, but for many sites the uptime percentage of a single AWS zone is good enough. S3 worked right through it all. Planes don't even make it off the runway some days due to weather, yet somehow we just accept the delay and move on.

Wondering if... A > B or A = B, where

A> On-premise: infrastructure provisioning, architecting, designing and developing for very high availability through redundancy, application level resilience features etc.
B> Cloud: cost of paying for multi-region redundancy, re-architecting, re-securing, architecting, designing and developing for very high availability through redundancy, application level resilience features, new cost structures with higher transactions/growth etc.

Note: I am not pushing for one or the other - I am just plain confused as to why costs do not figure in this never ending debate!

Any comments?
