On Why I Don't Like Auto-Scaling in the Cloud

By George Reese
December 6, 2008 | Comments: 53

I have gotten into heated discussions over this subject on Twitter. I enter sales meetings getting clients excited about dynamic scaling, only to have to vigorously talk them away from the idea of auto-scaling. I just don't like auto-scaling.

What is auto-scaling?

Auto-scaling is the ability (with certain cloud infrastructure management tools like enStratus, currently in a limited beta through the end of the year) to add capacity to and remove capacity from a cloud infrastructure based on actual usage. No human intervention is necessary.

It sounds amazing—no more overloaded web sites. Just stick your site into the cloud and come what may! You just pay for what you use.

But I don't like auto-scaling.

What is dynamic scaling?

Auto-scaling takes advantage of a critical feature of the cloud called dynamic scaling. Dynamic scaling is the ability to add capacity to and remove capacity from your cloud infrastructure on a whim, ideally because you know your traffic patterns are about to change and you are adjusting accordingly.

I like dynamic scaling.

Capacity Planning

If you care about the scalability of your applications—whether in the cloud or in a managed hosting infrastructure or in an internal infrastructure—you should thoroughly understand capacity planning. If you don't, you should pick up John Allspaw's book, The Art of Capacity Planning.

In short, capacity planning is how you understand your traffic patterns, how they change periodically, how you expect them to grow, and what kind of infrastructure is necessary to support those traffic patterns. You cannot design any kind of infrastructure without doing proper capacity planning. Otherwise, you will overspend on infrastructure or you will get slashdotted* and lose money.

Capacity planning is critical largely because it enables you to tie infrastructure costs to the benefit the organization will see from different combinations of capacity and demand.

Consider an example in which you know you have an average demand requiring a single server but, for an hour out of the year, you need ten servers.

Even with the cloud, it is possible that you can never justify the spend on meeting the one hour of demand. On the other hand, it is possible that the cloud is just the tool you need to make meeting that demand cost-effective. Or it could be that that one hour is so critical to your business that it makes the rest of the year irrelevant.
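To make that trade-off concrete, here is a rough back-of-the-envelope comparison. The $0.10 hourly rate is purely an illustrative assumption, not any provider's actual pricing:

    # Rough illustration of the one-hour-of-peak example above. The hourly rate
    # is a made-up placeholder; plug in your own provider's pricing and revenue.
    HOURS_PER_YEAR = 24 * 365
    RATE_PER_HOUR = 0.10   # assumed cost of one small instance, per hour

    baseline = 1 * HOURS_PER_YEAR * RATE_PER_HOUR        # one server, all year
    burst = 9 * 1 * RATE_PER_HOUR                        # nine extra servers for one hour
    peak_all_year = 10 * HOURS_PER_YEAR * RATE_PER_HOUR  # sized for peak, year-round

    print("baseline only:        $%.2f" % baseline)            # $876.00
    print("baseline + one burst: $%.2f" % (baseline + burst))  # $876.90
    print("peak all year:        $%.2f" % peak_all_year)       # $8760.00

Whether even the ninety-cent burst is worth it depends entirely on what that one hour is worth to the business, which is exactly the question capacity planning answers.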

Without proper capacity planning, you won't know what those traffic patterns are nor what kind of costs make sense to take on in support of those traffic patterns.

Can't you avoid capacity planning through auto-scaling?

No, you can't. Get the idea out of your head right now. For the most part, auto-scaling is nothing more than a crutch for those too lazy to do real capacity planning. True, if you configure your site for auto-scaling with no governors limiting the maximum capacity, you will never get slashdotted. And once the sudden, unexpected volumes subside, your infrastructure will return to its baseline configuration.

Here's why that's stupid:

1. Amazon and other clouds cannot respond fast enough to increased capacity needs.

It can take up to 10 minutes for your EC2 instances to launch. That's 10 minutes between when your cloud infrastructure management tool detects the need for extra capacity and the time when that capacity is actually available. That's 10 minutes of impaired performance for your customers (or perhaps even 10 minutes of downtime).

By the way, Amazon S3 has not proven itself to be the most stable of Amazon's cloud offerings. You could thus also find yourself totally unable to add capacity at a critical time.

Guess what? Almost all capacity changes are foreseeable. If you had done proper capacity planning, you would have had two key advantages:

  • You would have added the capacity before it was needed, guaranteeing that the proper capacity is always in place.
  • You would have discovered any operational issues with Amazon S3 before they impacted your operations (allowing you to take alternative steps to deal with the situation).

2. Got any disgruntled employees, unhappy customers, or malicious competitors?

Here's an easy way to go broke: Set up auto-scaling with no governors limiting the maximum capacity. Any yahoo can then execute a distributed denial of service attack (DDoS) against your infrastructure. It won't take down your environment because your cloud provider almost certainly can withstand a reasonable attack. It will, however, cause you to add more and more servers into your infrastructure until you go broke.

3. So you think you'll stick some governors in place...

You definitely should never have auto-scaling without governors in place, but they really won't do you any good. They will simply respond to one of two events:

  • Capacity demands you should have planned for, and thus don't need auto-scaling for.
  • Capacity demands you could not have planned for, and thus you have no idea whether the governor level you have set is even appropriate to the traffic.

4. So what about getting slashdotted?

Sometimes traffic is truly unexpected. But not as often as you think. If you know you are getting coverage in some publication, marketing should have done an ROI projection on the campaign and be able to provide you with expected response rates.

There is, however, the rare occasion when you knew you were going to get coverage in one place, but another much larger venue (like Slashdot) suddenly picked up the story and ran with it. On this rare occasion, you really, really, really would like your site to scale to match the needs of this unexpected traffic.

But you don't want it to auto-scale. Auto-scaling cannot differentiate between valid traffic and nonsense. You can. If your environment is experiencing a sudden, unexpected spike in activity, the appropriate approach is to have minimal auto-scaling with governors in place, receive a notification from your cloud infrastructure management tools, then determine the best way to respond going forward.

Here, the auto-scaling is simply a band-aid to enable a human to use dynamic scaling to define an appropriate, temporary capacity to support the unexpected change in demand.
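As a rough sketch of what that band-aid might look like, here is governor-limited scaling that hands the real decision to a human. The helper functions are hypothetical stand-ins for your monitoring, cloud API, and paging systems, not features of any particular product:

    # Sketch only: minimal reactive scaling capped by a governor, with a human
    # notified so the real decision stays with a person. The three helpers are
    # hypothetical stand-ins for monitoring, cloud API, and paging integrations.

    def get_load():
        # stand-in: return average utilization (0.0 to 1.0) from your monitoring tool
        return 0.82

    def launch_instance():
        # stand-in: call your cloud API here (e.g., an EC2 run-instances request)
        print("launching one additional instance")

    def notify_ops(message):
        # stand-in: page or email the on-call admin
        print("NOTIFY: " + message)

    MAX_INSTANCES = 4       # governor: never scale past the planned ceiling
    SCALE_UP_LOAD = 0.75    # consider adding capacity above this utilization

    def check_and_scale(current_instances):
        load = get_load()
        if load < SCALE_UP_LOAD:
            return current_instances
        if current_instances >= MAX_INSTANCES:
            notify_ops("at governor limit (%d), load %.2f; human decision needed"
                       % (current_instances, load))
            return current_instances
        launch_instance()   # stopgap capacity while a human validates the traffic
        notify_ops("added one instance at load %.2f; verify this traffic is legitimate" % load)
        return current_instances + 1

    if __name__ == "__main__":
        check_and_scale(2)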

5. Don't you lose a key value of the cloud without auto-scaling?

No. If you properly use dynamic scaling, you pay for exactly the capacity you need and nothing more. You still add capacity when you need it; you just add it according to a plan rather than willy-nilly based on perceived external events.

Dynamic scaling to a plan can also be automated. If you know you have a batch window from midnight to 3am, set your cloud infrastructure management tools to add capacity proactively at 11:30 and throttle back at 3:30. You just don't want the system automatically adjusting capacity based on usage.
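A minimal sketch of that kind of scheduled, plan-driven scaling, assuming the boto library for EC2 (the AMI ID and instance count are placeholders for whatever your capacity plan calls for):

    # Sketch only: dynamic scaling to a plan rather than to observed load,
    # using the boto EC2 API. The AMI ID and count are placeholders.
    import boto

    BATCH_AMI = "ami-00000000"   # hypothetical image for the batch workers
    BATCH_COUNT = 4              # capacity the plan calls for during the batch window

    def add_batch_capacity():
        conn = boto.connect_ec2()   # credentials come from the environment/boto config
        reservation = conn.run_instances(BATCH_AMI,
                                         min_count=BATCH_COUNT,
                                         max_count=BATCH_COUNT,
                                         instance_type="m1.small")
        return [inst.id for inst in reservation.instances]

    def remove_batch_capacity(instance_ids):
        conn = boto.connect_ec2()
        conn.terminate_instances(instance_ids)

Run add_batch_capacity() from cron (or your management tool's scheduler) at 11:30pm and remove_batch_capacity() at 3:30am; nothing here reacts to observed usage.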

* Getting slashdotted generally refers to a web site with modest activity suddenly going down due to a sudden influx of valid traffic, often because of coverage in a popular online publication like Slashdot.



53 Comments

George, thank you for sharing not only your knowledge but your opinion as well. Capacity planning seems like a ton of work, but then again, if you do your homework right, you're already ahead of the game.

I'm curious. Have you heard of or have you had experience with Morph Labs AppSpace and the AppCloud?

They are a part of the Amazon ecosystem but have developed the AppSpace and the AppCloud with it. I'd be interested to know what your thoughts are on it.

Cheers!

Carmen

I have heard of them, but I am not familiar with their offerings. So, at this time, I am definitely not qualified to comment on them.

As far as capacity planning being a lot of work, formal capacity planning certainly can be. But as with any process, I think scaling (excuse the pun) the capacity planning process to the nature of your business can give you meaningful results without forcing you into the unreliable world of relying on auto-scaling.


Hi George,

I think once you go into "automating" dynamic scaling -- it is hard to see why it is not auto-scaling.
Also, the cloud is not just about websites, but also about web services (or just services -- like backup), where usage patterns are much more predictable, and auto-scaling with governors makes lots of sense. If I know that as I get more backup users I should automatically increase the number of servers -- why not use it? It might not be as simple as that (base it on load averages, memory utilization, traffic patterns). Capacity planning is essential to properly set up auto-scaling. Governors are needed. Yet, there are quite a number of cases where auto-scaling would work well.

Automating dynamic scaling is not auto-scaling because it is based on planned usage rather than actual usage. It does not suffer from the issues I outlined with respect to auto-scaling. The only downside is the risk that your planning is wrong. But that's what monitoring is for.

You are right, I am very focused here on web sites and web applications.

As far as running a backup management infrastructure goes, in that case you are probably earning revenue for each backup you are running. So you don't really care what the nature of the increased traffic is; you just charge your end customer.

SmugMug uses auto-scaling for image processing:

http://blogs.smugmug.com/don/2008/06/03/skynet-lives-aka-ec2-smugmug/

Outside of one instance where it launched 250 XL nodes, it seems to be performing pretty well. Their software takes into account a large number of data points (30-50) when deciding to scale up or down. It also takes into account the average launch time of instances, so it can be ahead of the curve, while at the same time not launching more than it needs.
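A toy illustration of that "ahead of the curve" idea (not SmugMug's actual code, and the numbers are invented): project the backlog forward by the average launch time before deciding how many instances to start.

    # Toy illustration only: launch-time-aware scaling for a work queue.
    # Decide against the backlog projected to exist when new capacity is
    # actually ready, rather than against the backlog right now.
    AVG_LAUNCH_MINUTES = 3.0
    JOBS_PER_INSTANCE_PER_MINUTE = 10.0

    def instances_needed(queue_depth, arrivals_per_minute, current_instances):
        projected = queue_depth + arrivals_per_minute * AVG_LAUNCH_MINUTES
        drained = current_instances * JOBS_PER_INSTANCE_PER_MINUTE * AVG_LAUNCH_MINUTES
        remaining = max(projected - drained, 0.0)
        extra = int(remaining / (JOBS_PER_INSTANCE_PER_MINUTE * AVG_LAUNCH_MINUTES))
        return current_instances + extra

    # e.g., instances_needed(400, 50, 5) suggests adding capacity well before
    # the current five instances are saturated.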

Re: SmugMug

Like I said, using auto-scaling as a crutch for poor or non-existent capacity planning.

I actually owe SmugMug a big apology here.

I misread the post I was responding to as being about a web app, not image processing. My article was, as I have noted, relating to web applications. Image processing is actually a GREAT use of auto-scaling. I do like it in that environment.

LOL - I love it. In an article entitled "Why I don't like auto-scaling" Reese writes "the appropriate approach is to have minimal auto-scaling". Minimal what?

It's not auto-scaling that is bad... it's the improper use of auto-scaling that is bad.

This is an interesting discussion. I work for a Cloud Hosting company. Yes, we auto-scale. And guess what? We do it well, and in a very customer-friendly way.

Why? Because we know our customers. We talk to them, and we work with them. We are available 24/7, so if you scale at 3am on Sunday, we have you covered.

And we don't take ten minutes to scale up - and you don't pay based on time anyway - you pay based on usage.

So while I understand the basic argument (capacity planning *is* good), I disagree with the auto-scaling argument. As a long time web entrepreneur, I can think of dozens of times where I *wish* we had scaled... instead we failed.

Not delivering content that is under serious demand is failure. You have to be able to scale up and down quickly, pay for what you use (and have a business model that takes advantage of the traffic!).

Traffic spikes are NOT "perceived external events" - they are real people trying to see your content. You can serve them, or not. But they are customers.

Auto-scaling does work - we've proven it for nearly 100K domains.

Rob

Hi George

While autoscaling may mean an increase in cost, will it not in the end save a company the hours of the analyst needed to predict the behavioral patterns of its users, as well as cut the costs incurred in securing added server power in those cases when user numbers do not quite meet hopes or expectations?

Where I'm looking from, depending on the task at hand, autoscaling seems at times to be just the right tool for the job.

This post and the arguments presented are so dumb I was motivated to refute them via a post of my own.

http://sam.curren.ws/index.cfm/2008/12/6/Really-Bad-reasons-not-to-autoscale-cloud-based-systems

Re: Magnus

Most applications don't require some kind of wildly expensive capacity planning consultant.

When you are talking about web applications, you should be able to understand expected usage patterns fairly easily.

The hard part is generally understanding how your system responds under load.

But the bigger issue is ROI. It may simply be that there is no ROI for certain kinds of traffic and certain load levels.

Auto-Scaling is a fantastic tool - yes, it does mean you need to monitor your stuff but the reality is that in a highly social world it is almost impossible to plan for when your service "goes viral" because Scoble posted it or something.

I want my capacity to auto-scale and give me a chance to understand what's happening without dropping users on the floor. Auto-scaling isn't about not having to do capacity planning - it's about having a safety net if something wonderful happens.

This is one of the reasons we moved to the Mosso cloud (and will look into Azure in the future) - we get great scaling features that do NOT take very long to kick in and we are ready if the best case happens.

Much better for a small group than having a bunch of dedicated servers sitting idle "just in case" or paying for emergency capacity. With a good cloud host you're not going to get blindsided with a huge bill... just nice, smooth added capacity at a nice smooth cost.

Re: Rob

I support auto-scaling for my clients because it is a checklist feature for the cloud.

And of course auto-scaling works. That does not mean it is the right thing to do. Within the context of web applications, I cannot think of a single example in which auto-scaling is the right way to approach capacity management.

As a long-time entrepreneur myself, every instance in which I have wished I had more capacity has been an artifact of me running in a traditional infrastructure on traditional hardware where scaling up was a matter of weeks. Being in the cloud is good enough to deal with any concerns as long as you have good cloud infrastructure management tools.


Re: Soulhuntre

You are talking about letting your entire infrastructure be built around a single corner-case.

If you get a sudden spike in activity, you should be manually reacting to it with or without auto-scaling in place. That's because your governors in an auto-scaling setup should not allow for that kind of bizarre spike in the first place.


The main argument against auto-scaling seems to be that it's too slow to respond. Hmm. My average time to bring up a new EC2 instance is between 2 and 3 minutes. I'm sure it's possible it could take 10 minutes but it never has for me, in thousands of launches. That data point is definitely an outlier.

But, regardless of the launch time, the proposed alternative seems to be to make sure you have absolutely perfect capacity planning in place, then you don't need auto-scaling. Easier said than done, I think. Auto-scaling isn't an excuse to not do capacity planning, it's the way you cover yourself when your capacity planning falls short, which it will undoubtedly do from time to time.

Re: Mitch

The time to bring up an EC2 instance is just the first point listed, and it is likely the least important.

In my experience, time to launch is wildly unpredictable. That does make it hard to respond to significant but fleeting variations in traffic.

And no, I am not suggesting perfect capacity planning. I am suggesting basic capacity planning. That will do you a lot more good than auto-scaling.

The fact that there is a dichotomy being drawn between capacity planning and auto-scaling means that the tools we use for both are inadequate. We should have monitoring and trending tools that help do capacity planning, not just alerting. How many monitoring and trending tools support trend lines and performance ceilings? How many of those tools have usable APIs that allow us to build effective tooling on top of them?

Auto-scaling often sucks because it is a house built on sand - the foundation is based on a few simple assumptions that aren't nearly good enough to cover reality. Auto-scaling could be great, if we could inform it from proper tooling.

(Don's SmugMug scaling system is one example of this done "right" - it works so well because he has a ridiculous amount of instrumentation, specific to his application, that manages instances based on the same predictive model he would use for capacity planning. That's different than "launch another server if the load average is above 10 globally")

HA! Check it out!! www.shortershelflife.com

George, you are dismissing a logical concept, “auto sizing”, based on current physical implementations of that concept. A model where we don't have to worry about manually planning application resource requirements because they are automatically allocated based on need makes sense and certainly seems to be the right direction to be heading. One can predict a likely point in the not too distant future where planning I/O and other resource demands in detail will cease to be relevant, and service providers' charging metrics, such as those based on network traffic, will become outdated just as charging based on number of CPU instructions or CPU time ceased to be (and charging based on number of database transactions used ceased to be, and charging based on disk space is starting to be). Capacity planning as a discipline is becoming more holistic as the detail becomes less relevant to us humans; the focus will be on planning the capacity of an “environment” and leaving the individual allocations of the applications that run in that environment up to it.

So I will agree with you that Auto Sizing systems are not necessarily foolproof or to be trusted, but only for now.

It seems way too many of you are seeing the title "Why I don't like auto-scaling in the cloud" and internalizing "George thinks auto-scaling is the source of all evil and should be banned."

Come on.

As a total outsider to this industry (visiting from Slashdot, actually, which I suppose makes me a 'case in point'), I have a mundane observation I thought I might share...

These arguments remind me of the days back when cameras with automatic features began to make their way on to the market. Professional photographers hated them. Today, there's no point in hating them; such auto-features have become well-integrated into the machines pros perform their jobs with, --though they always do still demand the ability to tinker with camera settings directly.

But for the gen-pop, those automatic features today dominate photography; and they work surprisingly well. --It is easy now for the average person to take a very good picture with zero effort. It strikes me that many small to medium size companies are much like general consumers. They don't care about F-Stops and they don't care about Clouds, (or the whole stinking Internet, for that matter), and any way to remove the pain of having to think about their web-presence while still getting a reasonably good job done is a welcome and indeed, a highly valued service.

Automation is what the West has been built on for the last century. Whether or not this has ultimately been a good idea has yet to be seen, but the question is probably not going to be answered by this small, albeit fascinating, issue. Not today, not in the cloud.

Nice article, by the way. I enjoyed it very much.

Caps can be (and should be) a bit more granular. If every 'consumable' is metered like electricity, rules need to be put into place to limit these with a degree of sanity that does not shoot the whole idea of auto-scaling in the foot.

A degree of introspection is needed, i.e. if http appliances are stressed yet database storage volumes see little growth, something is wrong .. and it __should__ be up to the provider to limit that to a degree with a pre-defined threshold stating when "we need to call the customer".

Most home grown scaling systems are abhorrently brain dead, many just look at loads, network connections and little else. That's like looking at a window with the curtains drawn, seeing very little light coming through and deciding that a typhoon has landed. As you noted, many do not factor their own provisioning time when calculating a response.

There is a happy medium with auto scaling .. but it's based on the unique needs of every site, which __does__ require some resource planning on the part of the owner. Setting up these caps has to be simplified and as idiot proof as possible. Owners must be able to easily create a behavioral profile for their sites on which automatic decisions can be made.

This was a well written nudge to get IAAS providers moving in a saner direction. So many are just happy to say "WOW, WE HAVE A CLOUD!!" and leave it at that. It's the difference between engineers and people with big wallets.

My favorite Dilbert strip has the pointy haired boss demanding that Dilbert produce a schedule of "all unplanned outages." Dilbert puzzles for a while, then hands a calendar full of items to the boss. "Here," he says. Pointy Hair stares at the schedule. "Does CNN know about this?" he asks, somewhat stunned.

Capacity planning is all well and good, but nobody can foresee every event, or even most events. For example, it's usually impossible to predict being slashdotted. Auto-scaling clouds are perfect for those situations.

The ten-minute lag you complain about is a red herring. Auto-scaling doesn't have to happen at the tipping point. Any online system has a certain amount of hysteresis in it; you only trigger auto-scaling when a pre-determined load slope occurs -- say, x orders per minute per minute, or whatever.

I think what you should actually hate, George, is UNATTENDED auto-scaling. Let the system auto-scale, but then notify a human and provide ongoing status reports as traffic evolves. A human is in the loop to investigate the reason for the auto-scale, while the company enjoys the benefit of a potential sales or publicity windfall.

Technology is full of autonomic system success stories, from cameras to aircraft autoland systems to Mars rovers to self-driving vehicles. There is a very rich engineering discipline that provides well-tested predictive algorithms delivering the hysteresis and feedback intelligence necessary to make such systems very useful. That knowledge is being applied to cloud scaling by some very smart computer scientists, including some at Amazon and its partners.

And no, I don't think you believe autoscaling "is the source of all evil and should be banned." But you don't seem to have thought the concept through all the way. As time will demonstrate to you (as it did to professional photographers and NASA mission planners), automated systems such as auto-scaling cloud computing are both feasible and worthwhile.
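A minimal sketch of the load-slope trigger described in the comment above; the threshold and the sample data are arbitrary examples, not a recommendation:

    # Sketch only: trigger on the rate of change of load (hysteresis-friendly),
    # then hand the decision to a human. Threshold and samples are arbitrary.
    SLOPE_THRESHOLD = 5.0   # e.g., orders per minute, per minute

    def load_slope(samples):
        # samples: list of (minute, orders_per_minute) measurements, oldest first
        (t0, v0), (t1, v1) = samples[0], samples[-1]
        return (v1 - v0) / float(t1 - t0)

    def should_prescale(samples):
        return load_slope(samples) >= SLOPE_THRESHOLD

    if __name__ == "__main__":
        recent = [(0, 100), (1, 108), (2, 119), (3, 131)]
        if should_prescale(recent):   # slope here is about 10.3 orders/min/min
            print("start scaling up and notify a human to investigate")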

The arguments presented fall under the umbrella statement "software cannot smartly auto-scale". Wait 3 years. This article will be irrelevant.

Well, one other reason why automatic scaling should be handled carefully is that it breeds lazy developers. Working at an ASP, one of our customers has an app running with us which is one of the worst SQL users in the world: some pages require 13,000 (yes, that is thirteen thousand) queries to display their information. We informed them about this (during performance monitoring, we noticed the slow response and started to dig until we found it was not our infrastructure, but the app itself). At one point, this meant that the page only loaded after about 2 to 5 minutes. We added web servers and SQL nodes at that point for a different project, but they benefited from it, so the load time dropped to 1 minute.

At that point, the developers at the customer closed our ticket saying the "problem" was solved. No need to look into their code and SQL usage, because our adding of capacity solved it.

What I am saying is: if your app is getting slower, don't just ADD capacity; start looking for bottlenecks and fix them. With a bunch of tweaks, you might gain a lot more resources than you thought.

I work as a capacity planning (I prefer capacity management) consultant and I deal with these problems every day.

Most monitoring and trending tools are just made for alerting, while there is a whole bunch of capacity planning (management) tools capable of right-sizing infrastructures or simulating "what would happen if" scenarios, both in terms of infrastructure utilization and response time, which by the way cannot be managed by simple trending.

Almost any monitoring tool that I've used can be leveraged (using its API or its collected data) to build simulation scenarios with these capacity planning tools.

I also recommend reading N. J. Gunther's Guerrilla Capacity Planning.

Rob La Guesse makes a valid point.

I guess the key phrase here is Everything in Moderation.

A personal example -- on a really small scale -- follows


I pay fastmail to manage my mailbox, with numerous features, for a fixed fee per year.
I get a bandwidth allowance tied to the size of my mailbox, at 1GB per month. Since first signing up I have had 2GB of spare bandwidth, which I haven't ever needed yet, but I could buy more spare capacity in advance if I ever expect to need more.

This example illustrates what the customer and the provider expect: predictable service in an unpredictable world.

The prepaid spare capacity allows the service to pick up any peak load, and allows the customer to put a cap on any expense.

If additional load would translate directly into additional revenue, you have the time to decide to divert some of that revenue back into more capacity and availability.

Henk

Note that you are describing problems with today's implementation of auto-scaling in the cloud.

The problem you mention, that "scaling up" can take a lot of time (like 10 minutes in EC2), is just a temporary one.

I'm not advocating that organizations depend solely on auto-scaling -- that would be narrow-minded. But I think it is also narrow-minded to rely only on capacity planning -- guessing the future is not something we humans do very well. Even the non-lazy ones.

We've a little bit of a false dichotomy here: capacity planning can employ auto-scaling as part of the process, usually followed by alerting the sysadmin that we just had an insane growth rate!

On the other hand, George's original point is true, that auto-scaling in the cloud in preference to capacity planning is very unwise. You're handing the budget for front-end processors to the people who are renting them to you.

In a proper planning regime, one predicts growth, and when you fall above or below the prediction, one takes action.

Scheduling additional capacity for busy periods is normal, having a bit in hand to cover spikes is necessary, and adding some degree of automation to handle really unexpected spikes can be useful, but only if you have limits on it and warnings to the sysadmin, so she can decide if she wants to add more capacity for a real surge in business or put limits on a DOS attack.

We do a fair bit of that in the large-systems world, automatically shifting processors from one domain or LPAR to another when quarter-end or the Christmas rush approaches, and alarming when response time starts to grow so we can decide whether or not to move in more processors.

It's actually pretty easy: the hard part is deciding to do it before the anomalous spike. Many of my projects start after a customer has had a bad experience, and doesn't want to ever have another.

--dave

"If you know you are getting coverage in some publication, marketing should have done an ROI projection on the campaign and be able to provide you with expected response rates."

That's the best line in the article. Yeah, marketing departments are really well known for doing extra work ahead of time to notify IT of expected responses. In fact, I'm sure most small companies who benefit from cloud computing have very well staffed, experienced marketing departments.

"using auto-scaling as a crutch for poor or non-existent capacity planning."

And stating the above is a crutch to get more people to read your blog and use dynamic cloud management tools from your company.

"Any yahoo can then execute a distributed denial of service attack (DDoS) against your infrastructure... cause you to add more and more servers into your infrastructure until you go broke."

Yes, because if you have auto-scaling set up, clearly you would never monitor that scaling for weeks on end to see what was happening. And it would be a much better result if the yahoo launched the attack and, instead of paying a little more money temporarily while you worked on defending against the attack, your site just went completely down.

This is too easy, I don't have enough time to pick apart the whole article. Mel Beckman hit a number of good points.

I am not suggesting people be able to predict anything.

The reality, however, is that the unpredictable is the exception, not the rule. In fact, it is an extremely rare exception.

And such exceptional cases deserve human attention, not some automatic response.

Mel says, "It's usually impossible to predict being slash-dotted. Auto-scaling clouds are perfect for those situations."

Is it really? Is the slashdotted traffic actually traffic you want to incur actual expenses to support? How do you know the difference between that traffic and a DDoS attack? No auto-scaling tool can tell the difference between a sufficiently good DDoS attack and an actual unexpected spike in activity.

It's not worth a few minutes of your time once in a blue moon to validate the value of this traffic before altering your auto-scale governors?

Cecil T says, "Marketing departments are really well known for doing extra work ahead of time to notify IT of expected responses."

Marketing departments are used to IT doing a terrible job of supporting their needs. For the most part, when they do a marketing campaign, they know what kind of response they are expecting from that campaign.

Cecil T further says, "And stating the above is a crutch to get more people to read your blog and use dynamic cloud management tools from your company."

Except my company also supports auto-scaling. And the competition also supports dynamic scaling. There is no competitive positioning at hand in this opinion.

Cecil T also says, "And it would be a much better result if the yahoo launched the attack and instead of paying a little more money temporarily while you worked on defending the attack, your site just went completely down."

It may or may not be better for your site to go down completely. That's why you react BEFORE your capacity is saturated and analyze the unexpected traffic.

Vasco says, "Note that you are describing problems with today's implementation of auto-scaling in the cloud."

That's partially true, but it's important since we are working with today's cloud infrastructure. On the other hand, even if it were instant, it still would not change my argument. The bottom line of my argument is that, in the realm of web applications, auto-scaling is dangerous and almost never needed, and in the rare circumstances when it is needed, human intervention is nearly always superior.

Uh, in: "I am not suggesting people be able to predict anything" change "anything" to "everything".

Of course, I expect people to be able to predict most things :)

I think there is also one major misconception going on:

I very much believe in automation of dynamic scaling based on capacity planning.

What I am arguing against in this article is automating scaling based on "facts on the ground".

I agree with some of the issues that you have mentioned with autoscaling. I feel that in its current implementation it is too reactive and can leave things to be desired. However, I feel like we should use every tool at our disposal and combine these different scaling methods together to handle different situations.

Let's look at the different situations where scaling is useful. The first situation is an application which follows a certain pattern: it has heavy use during certain times of the day/week/month and then idles at other times. I agree with you that we should be smarter about planning for this, and not just let autoscaling (as currently implemented) take care of it. However, I feel that an even better solution would be a statistical/learning system that can figure this out and adjust capacity without any human intervention. The system should try to predict traffic ahead of time and prepare your capacity before those traffic spikes happen. This is a smarter way of autoscaling, as opposed to current systems which are reactive, use CPU load or current traffic to try to react, and may take a little time to ramp up. Cloud vendors such as RightScale already allow autoscaling and configurable dynamic scaling, but it would be great if they implemented a predictive system like this that can use past data to scale ahead of time.

The second situation occurs if you are releasing a new application with no historical trends, or expect a traffic spike due to a press release or another event. In this situation, you still need to schedule a capacity increase ahead of time, using dynamic scaling and traditional capacity planning techniques to predict capacity.

Finally, reactive autoscaling is still useful for those unplanned events such as a slashdotting. The system should notify sysadmins so they can monitor the situation; however, increasing your capacity in the meantime is still useful so that potential customers are not denied use of the application (I'm assuming governors are used to limit scaling). After the event, you can then do an analysis of it and see if it was a legitimate traffic source or something that did not provide any value, such as a DoS attack.

While each method has its shortcomings, I feel that we should use a combination of these methods to handle the different situations that occur. Why not set a base-level plan using an automated learning system, allow tweaks to it based on future planned events, and then let autoscaling kick in for any unplanned events?

Wow... sounds like a bunch of capacity planning engineers afraid of losing their jobs. Personally I am not opposed to some degree of capacity planning, but the very example you repeatedly use undermines your premise. Who ever knows when they're about to get slashdotted? How can the tech guys know the exact impact that the sales team's latest promotion is going to have on traffic? And 10 minutes of wait time isn't all that bad if you have software that looks at the traffic trend rather than waiting for there to be a capacity shortage (e.g. when you hit 60% capacity, you start up another server, and if you hit 70%, you spin up another, etc.). It might not be perfect, but it's better than trying to guess, errr... plan exactly how much capacity you will need at any given time.

I would agree with this article in a perfect world. Unfortunately...

Your clear conflict of interest makes your post worthless.

Sorry.

While I agree with the need for an enterprise to do capacity planning, I think that the discussion goes far beyond an overloaded website. I believe that the real value of autoscaling lies in the support of a service oriented architecture (SOA), especially when services are auto-discovered and workflows are created on the fly with mash-ups.

With a nod to Adam Jacob's comment, the work that goes in to capacity planning is too often taken for granted.

Auto-scaling, like auto-anything, doesn't take enough metrics into account. It's simply foolish to think that scaling up any infrastructure is just a matter of needing more servers, and perhaps the fact that clouds make it easy to add more servers has lured us into a false sense of complacency when it comes to scaling.

Anyone who's been in The Suck has seen situations where you keep having to switch between solving different bottlenecks. Experience makes us more aware of these situations. Automation can be a time saver, but it isn't aware.

This article is simply not useful to anyone looking to implement their own auto-scaling. I'm not sure what the author does for a living, but I have a strong suspicion that he does some sort of consulting for "dynamic scaling" and is losing business to companies like RightScale that can set you up with autoscaling minus the big consultant costs.

A couple of points
> It can take up to 10 minutes for your EC2 instances to launch.

I launch EC2 instances regularly and can't remember the last time an instance took more than 90 seconds. The average is around 40 seconds. A proper autoscaling solution doesn't wait until you're "full" before spinning up a new server, and can very easily handle the maximum 90 seconds of lag involved in spinning one up.

> It will, however, cause you to add more and more servers into your infrastructure until you go broke.

Except that you won't go broke. At $.90 per hour per server in the worst case, even an absolutely minimal notification setup through your autoscaling that sends you an email when new servers are spun up would be enough to effectively control those costs. Worst case, the DDoS attack is huge and organized and you might be out a hundred bucks because you were asleep during the whole attack. If that breaks you, then you were already broke.

For those readers who are looking for a conflict of interest:

No, I don't run a consulting shop that makes money on capacity planning. I have never been paid a dime to do capacity planning, and I don't expect I ever will. There are countless people much better at doing that than I.

Yes, I do own a company, enStratus, that provides cloud infrastructure management tools. One of its features is that it does auto-scaling based on complex, configurable criteria with governors.

If I were trying to move more product, I would be saying, "Go! Auto-scale!"

In general, if you are going to question someone's integrity, you should get your facts straight.

I do apologize for the conflict of interest attack. That was not helpful or called for. It's very possible for people to have different opinions without some sort of hidden agenda. Looks like I should have taken a couple more read-throughs before hitting "Submit."

Auto-scaling is a tool with strengths and weaknesses. I just don't see any real cost downside to the potential over-scaling given the ridiculously cheap incremental costs, especially when compared with the cost of your time to manage what is referred to as dynamic scaling here. For sure not everyone needs auto-scaling, but in many situations, it's superior to sitting around or stressing about when you're going to need to respond to a notification and do it manually, however assisted.

Wes says, "I launch EC2 instances regularly and can't remember the last time an instance took more than 90 seconds. The average is around 40 seconds."

Then you are lucky or you don't launch them that regularly.

Most of the time, it does take under a minute.

A lot of times, the launch commands fail and you have to retry. Though EC2 is stable, the web services API is less than robust.

And, often enough, a launch will take more than a minute.

If you are going to build a strategy where your time to launch matters, you should assume a scenario in which a) the launches fail and need to be retried and b) the launches take longer than normal.

Amazon's own documentation says 10 minutes.

Auto-scale currently is a checklist item, not something very robust.

The main problem with auto-scaling is that we're in the stone ages with the sorts of metrics we use to enact thresholds. Using CPU and/or I/O aggregates as a trigger to add resources works with only certain classes of applications. Yet it's capacity planning 101 that with many applications, adding the WRONG resources will make your overall application slower and less responsive.

Right now, it works well with stateless compute processes, and maybe with a read-mostly cached web farm. Even then, YMMV.

Great article!!
I think once you go into automating dynamic scaling, you can't go back. Capacity planning is essential to properly set up auto-scaling. Governors are needed.
But... auto-scaling sometimes sucks, because it is a house built on sand. If we could inform it from proper tooling it would be great.
Thanks anyway.
Melanie sweets.

To what extent are we really talking about our confidence in crafting algorithms that might (1) respond faster and more accurately to increasing demand, and (2) filter out the spurious or dangerous sources of increased demand, such as denial of service attacks?

Clearly, I disagree. Both in general and with the assertion that I'm "using auto-scaling as a crutch for poor or non-existent capacity planning". And here's why:

http://blogs.smugmug.com/don/2008/12/09/on-why-auto-scaling-in-the-cloud-rocks/

...and I am sure there are those who were opposed to the horseless carriage... Very immature view of where auto-scaling fits into an overall architecture.

Without knowing the exact details of auto-scaling versus dynamic scaling, what it sounds like you are doing, quite frankly, is stating that the feature set for auto-scaling cloud computing has not matured yet to your liking.

Perhaps auto-scaling with a greater feature set will be more promising. Perhaps creating intelligent filters surrounding those systems would even be sufficient in ideal conditions.

As with any other branch of knowledge, the "wizard" type of approach seems largely inadequate for absolute dependability, but certainly it does not invalidate the usefulness in a pinch.

-Alex

"10 minutes for your EC2 instances to launch"
You are doing it wrong! A proper deployment strategy would point you to using bundled AMIS. Go from minutes to seconds. You should do better research of the optimal uses of the cloud before shooting thing down. Startups can't forecast the spikes and can't commit to investing is something that might happened. PreEmptive Optimization!!

A couple of points:

#1 Bundling everything in your AMI is a poor deployment strategy for reasons much too numerous to enumerate here. Even if you do, a simple Linux AMI takes almost a minute to become available. Windows instances take 5 to 10 minutes.

#2 If you are planning things out, you need to assume the worst-case scenario. Whether or not you are bundling everything into the AMI, I have seen instances take as long as 7 minutes to start up. I use 10 minutes as a reasonable, conservative estimate.

"10 minutes for your EC2 instances to launch"
You are doing it wrong! A proper deployment strategy would point you to using bundled AMIS. Go from minutes to seconds. You should do better research of the optimal uses of the cloud before shooting thing down. Startups can't forecast the spikes and can't commit to investing is something that might happened. Preemptive Optimization!!
