Sizing to Fail

By David Collier-Brown
May 30, 2009

It's a notoriously hard problem to size a production system when all you have are results from a smaller test or quality-assurance machine. I'm a modeler, so in general I try to build a capacity planning model to predict queue delays.

However, if you can overload the application in test, you can identify configurations which will never work, and then size the system to avoid them.

Introduction

Imagine you're the QA manager for a small company, and you've just finished the basic functionality tests of a new program on a small machine in your lab. Out of the blue, your management asks you for a sizing estimate for the program in production, with 1500 users. You've only ever tested with 100 simulated users in JMeter, you don't have a machine big enough to test 1500 users on, and management needs the answer by the end of today.

Stop. Don't run screaming from the building, however horrible this sounds. You can't tell management what will work, but you can tell them how large a system they'll need to avoid guaranteed failure, which may suffice.

What you can do is reduce the CPU available on your current machine and measure how many users it will support with decreasing amounts of CPU. From that you can estimate the minimum amount of CPU needed to support one user, and from there the CPU for 1500 users, all other things being equal. That last phrase is the important one: you can't guarantee the program doesn't have a bug that prevents it from scaling, but you can find out how much CPU it needs to support 10, 100 or 1500 users. Then repeat the experiment for memory, disk and network I/O.

What This Really Is

Some of you will recognize this from your university days as a sensitivity test, and will know what I'm up to. If you know how sensitive a program is to a shortage of each resource, you can make a reasoned guess about how much of each you will need for a larger number of users.

The idea is to create an equation that looks like

1 TPS = w CPU + x memory + y network + z disk

which you can then use to size a machine for some large number of transactions per second (TPS), bytes/second or users.
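As a toy illustration of how the equation gets used, here's a sketch in C. Every coefficient and the per-user transaction rate are made-up placeholders, not measurements; substitute the values from your own fit:

#include <stdio.h>

int main(void)
{
    double tps = 150.0;         /* assumed: 1500 users at 0.1 TPS each       */
    double w = 0.83;            /* percent of a CPU per TPS (hypothetical)   */
    double x = 2.0;             /* MB of memory per TPS (hypothetical)       */
    double y = 0.05;            /* Mbit/s of network per TPS (hypothetical)  */
    double z = 20.0;            /* disk IOPS per TPS (hypothetical)          */

    printf("CPU:     %.1f%% of one core\n", w * tps);
    printf("Memory:  %.1f MB\n",            x * tps);
    printf("Network: %.1f Mbit/s\n",        y * tps);
    printf("Disk:    %.1f IOPS\n",          z * tps);
    return 0;
}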

Tools to Use

To reduce resources, run a second program to use them up. For example, you could run an infinite loop adding 1 to a variable to use up CPU. Then use Linux or Solaris resource managers to control how much they use.
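A minimal sketch of such a CPU soaker in C (run one copy per core you want to keep busy, and kill it when the experiment is done):

/* Spin forever adding 1 to a counter; volatile stops the compiler
   from optimizing the loop away. */
int main(void)
{
    volatile unsigned long counter = 0;
    for (;;)
        counter += 1;
    return 0;   /* never reached */
}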
For example, if 100 users took roughly 80% of the CPU on a SPARC test machine, use Solaris Resource Manager to give the test program 70% of the CPU and record how many users you could run. Then repeat at 60, 50 and 40%. A quick spreadsheet regression would tell you that each user takes, for example, 0.831% of the CPU.
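If you'd rather not reach for a spreadsheet, the same least-squares fit is only a few lines of C. The data points below are invented for illustration; substitute the (CPU cap, users supported) pairs you actually measured:

#include <stdio.h>

int main(void)
{
    /* CPU caps in percent and the users supported at each cap (hypothetical). */
    double cpu[]   = { 80, 70, 60, 50, 40 };
    double users[] = { 96, 84, 72, 60, 48 };
    int n = 5;

    double su = 0, sc = 0, suu = 0, suc = 0;
    for (int i = 0; i < n; i++) {
        su  += users[i];
        sc  += cpu[i];
        suu += users[i] * users[i];
        suc += users[i] * cpu[i];
    }

    /* Fit cpu = a + b * users: b is the CPU% each extra user costs,
       a is the fixed overhead. */
    double b = (n * suc - su * sc) / (n * suu - su * su);
    double a = (sc - b * su) / n;
    printf("per-user CPU: %.3f%%, fixed overhead: %.1f%%\n", b, a);
    return 0;
}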

Memory is similar, but there will be a static amount that even a single user takes, and then a variable amount for each additional user. In this case, your test program would be little more than this:


#include <stdio.h>      /* printf, getc */
#include <stdlib.h>     /* atoi */
#include <unistd.h>     /* sbrk */

/* Grab argv[1] megabytes of heap and hold them until return is pressed. */
int main(int argc, char *argv[])
{
    if (sbrk(atoi(argv[1]) * 1024 * 1024) != (void *) -1) {
        printf("wasted %s MB of memory, press return to exit.\n", argv[1]);
        (void) getc(stdin);
    }
    return 0;
}
For I/O there are time-wasters like iozone and resource managers like the Linux cgroups.
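If you'd rather roll your own time-waster, a crude one in C just rewrites and syncs a single block forever, so each pass costs at least one physical write. The file name and block size below are arbitrary choices; run it in a directory on the disk under test:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char block[4096];
    memset(block, 'x', sizeof block);

    int fd = open("scratch.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (;;) {                           /* run until killed */
        if (pwrite(fd, block, sizeof block, 0) < 0) {
            perror("pwrite");
            break;
        }
        fsync(fd);                       /* force the write to the device */
    }
    close(fd);
    return 0;
}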

For strangling networks there is netcat to generate the load and trickle to limit it. If you're accessing the program via a network, you'll need to be careful that you don't also choke off the load generator's traffic at the same time you're limiting the rest of the network I/O.

To be sure you're applying an equal degree of choking in each experiment, choose a target response time, as reported by your load generator. A good choice is double the no-load response time. If your low-load response time was 3.1 seconds, for example, you'd limit the CPU to a given percentage, adjust the load generator until the average response time was 6.2 seconds, and then record the number of users.

This works because, at a ratio of response time to service time of 2:1, you're close to 100% utilization of the resource you're experimenting with. A smaller ratio won't be as close to full load, and a larger one will cause a very rapid increase in response time, making it hard to set the exact load you need.

A Worked Example

A company we consult for recently had exactly this kind of sizing problem.

They wanted to offer a new large-streamed-data service and had a 30-day trial copy of the program they planned to use. They couldn't build a full-scale test system in the time they had, so they ran a sensitivity test on the program and found that it was most sensitive to disk I/O operations per second (IOPS), and secondarily to having enough CPU and threads available to start a transfer quickly. Because it was a streaming operation, it was quite insensitive to memory, as long as there was enough for the I/O buffers.

This told them that for a particular size of initial offering, they'd need 30,000 IOPS and a reasonably fast multiprocessor, with the emphasis on having enough cores rather than on raw clock speed. That in turn meant they'd need a multi-core server and either 100 fast enterprise disks or a single SSD in front of thirty-odd slower terabyte drives.

Conclusions

You now have an initial answer for management: "We aren't sure what's needed for 1500 users, but we do know that if you build a system with fewer than 24 cores or a disk array that won't do 30,000 IOPS, it will fail."

You can't make that a positive statement, because it's easy for something you haven't measured to become the bottleneck and make the sizing fail. Buses, caches, adapter cards and locks are common examples that only a test with the real machine and load will expose. If you need accurate estimates, you need to measure service times and wait (queue) times and use a queuing model.

However, a sensitivity test is good enough for the proverbial "back of the envelope" estimate, and in many cases, that's all that's needed.

The example we used was deliberately chosen to be one of the cases where having an approximate answer quickly is fine. This scenario occurs early in the process of adopting a new program, when management needs an estimate just good enough to figure out how many rack units or floor tiles they will need, not the exact number of CPUs or HBAs they'll need to put into the cabinet. They need enough data to make a go/no-go decision, and you can easily give them that.

If they eventually do decide the program's a go, then you can model it out formally, prove it with a stress-test on the production hardware as part of the acceptance procedure and finally put it under a capacity planning regime as described in John Allspaw's The Art of Capacity Planning.


