You moved into the cloud to save some money. Now it's the first of the month and you're looking at your latest cloud provider bill. It's not at all what you planned.
How can that be? You probably did a solid ROI analysis and it clearly indicated that a cloud infrastructure was going to save you money.
Welcome to the dark side of cloud computing—the world of VM sprawl.
With the newfound ability to launch servers on demand comes the critical responsibility to shut them down when they are no longer serving their purpose. When people do their ROI analysis and embark on the cloud adventure, however, they often overlook the responsibility part of this equation. Newcomers to the cloud generally find it very easy to start up servers but very hard to shut them down. The result is a cloud infrastructure with an unfortunate number of pointless servers.
Unfortunately, Amazon does almost nothing to help customers properly track their cloud computing assets and retire them appropriately. They do a poor job of giving you any reasonable idea of what your infrastructure is costing you, and they don't provide you with solid metadata for understanding what resources are operating in the Amazon cloud.
For example, the typical list of servers in the AWS console lists servers with the following fields:
- Instance ID (e.g. i-da56cb5d)
- AMI ID (e.g. ami-bca121e9)
- Security Groups (e.g. default)
- Instance Type (e.g. m1.small)
- Status (e.g. running)
- Public DNS (e.g. ec2-255-255-255-255.compute-1.amazonaws.com)
- Keypair (e.g. mykp)
- Monitoring state (e.g. disabled)
If you can imagine a listing of 100 servers, how do you pick out which one is no longer in use? If you think the third one in that list is the one no longer in use, are you really, really sure you aren't clicking on a production box? Is letting it run at $0.10/hour worth it to avoid accidentally shutting down a production server?
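To put a number on that last question, here is a minimal sketch of the arithmetic. It assumes the $0.10/hour rate quoted above, a 30-day month, and servers that run around the clock; the function name `idle_cost` is purely illustrative.

```python
from decimal import Decimal

# The hourly rate quoted in the text; real rates vary by instance type.
HOURLY_RATE = Decimal("0.10")


def idle_cost(servers: int, days: int, hourly_rate: Decimal = HOURLY_RATE) -> Decimal:
    """Cost of leaving `servers` unused instances running 24 hours a day for `days` days."""
    return hourly_rate * 24 * days * servers


# One forgotten server over an assumed 30-day month, then ten of them.
print(idle_cost(1, 30))
print(idle_cost(10, 30))
```

A single forgotten server is cheap enough to ignore, which is exactly why nobody shuts it down. The cost only becomes visible once the count grows.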
Here are four tips to avoid VM sprawl in the public cloud:
1. Don't use the AWS console or the AWS command-line tools to manage your infrastructure.
These tools work fine when you are playing around with AWS, but they are wildly inappropriate for the management of a production cloud infrastructure. There are plenty of good cloud broker tools (full disclosure: my company, enStratus, makes one such tool) that can help you better track your cloud assets.
2. Segment your infrastructure into multiple accounts.
Do not mix production servers in the same accounts as QA, development, testing, or R&D work. When you mix servers, it suddenly becomes very scary to shut down servers—even when you are absolutely sure what purpose each one serves. There's always the nagging worry that you are shutting down a mislabeled production server, and that worry can prevent you from shutting down a VM that simply isn't being used. The basic rule should be: production environments are either fully managed by automation tools or they run 24x7. Servers in non-production environments are fair game to be shut down by a systems administrator at any time.
3. Define a formal naming convention for cloud assets and vigorously enforce it.
If you see a server named [clientname]-mysql-master-prd sitting in the development account, you are going to be cautious about shutting it down. It's probably just someone who was staging an environment to be put into production. At the very minimum, you will spend a good amount of time looking for the person who started it and making sure it's really not a production server.
Resource naming conventions are important in the cloud. A name alone should provide any administrator with enough information to know how to handle the server. If you follow strict naming conventions, you are very unlikely to get hit with out-of-control VM sprawl.
An appropriate naming convention is business-specific. Whatever your convention, the most important item is to make sure that transient servers are clearly labeled as transient. Remember, the problem is not people wantonly killing production servers—it's people forgetting why a given server was launched and leaving it up long after it should have been killed. Consider the following naming convention:
[user ID]-[function]-[killdate] (e.g. george-centosbuild-091005)
Anyone in the organization can leverage this information to determine an appropriate course of action:
- They know when the server should be considered fair game for shutdown
- They know why it was running in the first place
- They know who to ask about why it's running after its kill date
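A convention like this is also easy to check mechanically. The sketch below is one possible way to do it, not a prescribed tool. It assumes the kill date is written as YYMMDD (so 091005 means 5 October 2009), that user IDs and functions are lowercase alphanumerics without hyphens, and that a server becomes fair game the day after its kill date. The names `parse_server_name` and `is_fair_game` are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

# Assumed shape: [user ID]-[function]-[killdate], with killdate as six digits.
NAME_PATTERN = re.compile(
    r"^(?P<user>[a-z0-9]+)-(?P<function>[a-z0-9]+)-(?P<killdate>\d{6})$"
)


@dataclass(frozen=True)
class ServerName:
    user: str        # who to ask about the server
    function: str    # why it was launched
    kill_date: date  # when it may be shut down


def parse_server_name(name: str) -> Optional[ServerName]:
    """Return the parsed name, or None if it does not follow the convention."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    try:
        # Assumption: the date is YYMMDD. Adjust the format if yours differs.
        kill_date = datetime.strptime(match.group("killdate"), "%y%m%d").date()
    except ValueError:
        return None  # six digits that are not a real calendar date
    return ServerName(match.group("user"), match.group("function"), kill_date)


def is_fair_game(name: str, today: date) -> bool:
    """True only for a well-formed name whose kill date has already passed."""
    parsed = parse_server_name(name)
    return parsed is not None and today > parsed.kill_date


print(parse_server_name("george-centosbuild-091005"))
print(is_fair_game("george-centosbuild-091005", date(2009, 10, 6)))
```

Names that do not match the pattern are deliberately never reported as fair game. A nonconforming name is a reason to go find the owner, not a license to shut the server down.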
4. Define basic budget and procurement policies for each environment.
Hopefully you started out with a budget. If you don't already have one, you probably don't even know you are suffering from VM sprawl.
Each environment should have its own budget and well understood policies for provisioning systems. Though you don't want to end up in a situation that results in a 6-week timeframe for launching servers in the cloud, you also don't want a situation where anyone can launch a server in any environment for any reason without any need to answer to anyone.
Your objective should therefore be to define what kinds of provisioning should be taking place in each environment and what budget is approved for that environment. You should have controls in place that tell you when a specific environment is on track to go over budget before it actually does. If you have done everything else I have suggested, the only "at-risk" environments should be environments that require more liberal procurement policies. Fortunately, those environments should also be the ones with the least forgiving kill policies. As a result, you should be able to keep VM sprawl under control in the cloud.
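One simple form such a control could take is a straight-line projection of month-to-date spend. The sketch below is illustrative only. It assumes spend accrues evenly through the month and that you can get a month-to-date figure per environment from your provider or broker tool. The function names and dollar amounts are made up for the example.

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_to_date: float, today: date) -> float:
    """Extrapolate month-to-date spend linearly to the end of the month.

    Assumes `spend_to_date` covers day 1 through the end of `today`.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def on_track_to_exceed(spend_to_date: float, budget: float, today: date) -> bool:
    """True if the linear projection lands above the approved budget."""
    return projected_month_end_spend(spend_to_date, today) > budget


# Hypothetical development environment: $600 spent by 10 June, $1,500 approved.
print(projected_month_end_spend(600.0, date(2010, 6, 10)))
print(on_track_to_exceed(600.0, 1500.0, date(2010, 6, 10)))
```

A linear projection is crude, but it flags an environment a third of the way through the month instead of on the first of the next one.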