Imagine for a moment that you're a capacity planner for a successful LAMP-based web site, and management has just "gifted" you with a new reporting program. You now have to somehow shoehorn it onto the server without hurting the performance of the existing programs.
Insoluble problem? Not a bit!
All you need is a resource manager like cgroups, the resource manager for Linux containers. With it you can assign most of the CPU, memory or I/O bandwidth to the more important programs and then give any other programs, including reporting, a fair share of what's left.
Dave's Laptop is Too Slow
As a practical example, my older Solaris laptop is just a bit too wimpy to run a particular CPU-hog data-analysis script. Whenever I run it, even with nice -20, it takes 100% of the CPU and a big chunk of the disk bandwidth, so much so that I can't keep up with my everyday work. I can't even read email when the script is hogging the machine. Somehow I need to limit the script to 25% or less of the system.
To solve this, I used the resource manager to give one share of the CPU to the hoggish background program, and three equal-sized shares to everything else. Now I can run the job and the only indication is that gnome-perfmeter is pegged at 100%. Interactive programs run at full speed.
What's happening is that the background program is cut to 25% whenever other programs need CPU, but is allowed to use all the CPU it wants when the user is neither typing nor moving the mouse.
While editing this article, I was running the job, and my process status looked like this:
The project (the Solaris equivalent of a cgroup) "background" contained the background job, and took on average 92% of the CPU. User "davecb" is me: I averaged only 6.5%, but whenever I ran anything, that program got a guarantee of no less than 25% of the CPU. That was far more than needed for Open Office, so interactive editing was as fast as ever. Well, at least as fast as OO ever is.
You literally needed to look at perfmeter to see that the batch job had completed.
Resource Managers for Unix and Linux
Back in the days of the IBM mainframes, resource management was critical. Hundreds of users needed to be able to get their fair share of a single machine. Nor could they afford to let a program with a memory leak steal all the memory from everyone else.
For many years, mini and microcomputers had so little performance that you only ran one program per machine, and didn't have to worry about sharing. Now, however, the average Linux machine has the power of an older mainframe and will be running seventy-odd processes by the time the first user has logged on.
Because of this, resource managers are coming back, initially with the commercial Unixes like Solaris and AIX, and now with Linux and BSD. Linux in particular is now a hotbed of resource management research, and the V2.6 kernel has a fair-share scheduler and cgroups, the performance management infrastructure for a whole suite of tools.
Cgroups stands for task control groups, an elegant extension to the completely fair scheduler to allow resource management. If a program is in a particular control group, it will be given a share of the resources of the machine. You get to specify how big or small that share can be.
What's elegant is that the shares are minimums, not maximums. If you give one group 10% of the bandwidth of a particular disk and another 90%, then if the more privileged group isn't using its full 90%, the other group can have whatever is left over.
Control groups were written as the low-level mechanism in Linux for containers and virtual machines. However, they aren't restricted to virtual machines: they can manage the resources, and therefore the performance, of ordinary processes, too.
A Cgroups Example: CPU Management
Like many other Unix constructs, cgroups are organized in a virtual filesystem, so you can inquire about and set cgroups by reading and writing files.
Let's start out by mounting a CPU container. We'll put it in a directory named cpu under /dev, to keep it out of the way:
# mkdir -p /dev/cpu
# mount -t cgroup -ocpu cpu /dev/cpu
Any directory created under /dev/cpu after this is magic: it will define a cgroup for CPU management, so you can say
# mkdir /dev/cpu/background
and background will become a cgroup, visible to the scheduler. If you cd to cpu/background, you will see several files, including notify_on_release, release_agent, cpu.shares and tasks.
If you say
$ firefox & echo $! | sudo tee /dev/cpu/background/tasks
that will start a firefox process and add it and all its subsequent children to the task set of the background cgroup.
If you then say
# echo 1 > /dev/cpu/background/cpu.shares
that will assign 1 share of the CPU to the background cgroup.
As you create additional cgroups and assign them shares of the CPU, the dispatcher will recalculate the percentage of the total CPU each cgroup will get, saving you from having to calculate percentages that add up to 100.
For the background example, we might create a cgroup for the background processes, another for the logged-in users and a third for root and daemons. If we gave them each one share, they'd each get a guarantee of no less than 33% of the CPU.
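To make the share arithmetic concrete, here is a small shell sketch that adds up the cpu.shares values of the groups under a mount point and prints the minimum percentage each one is guaranteed. The function name report_shares is our own invention, and it assumes the /dev/cpu layout used above. It also ignores any tasks left in the top-level group, which on a real system compete for the CPU as well.

```shell
#!/bin/sh
# report_shares ROOT (hypothetical helper, not part of cgroups):
# for every cgroup directory directly under ROOT, print its name and
# the minimum CPU percentage implied by its cpu.shares value.
report_shares() {
    root=$1
    total=0
    # First pass: add up the shares of all groups.
    for f in "$root"/*/cpu.shares; do
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    [ "$total" -gt 0 ] || return 1
    # Second pass: each group's guarantee is its shares over the total.
    for f in "$root"/*/cpu.shares; do
        [ -r "$f" ] || continue
        name=$(basename "$(dirname "$f")")
        awk -v n="$name" -v s="$(cat "$f")" -v t="$total" \
            'BEGIN { printf "%s %.1f%%\n", n, 100 * s / t }'
    done
}

# Typical use, once the groups exist:
#   report_shares /dev/cpu
```

With three groups holding one share each, every line reports 33.3%. Give one of them three shares and it moves to 60% while the others drop to 20%, without anyone having to recompute the other entries by hand.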
The Same Thing on Solaris
To do this on Solaris, you create a project called "background" and specify which users can use it. Then you turn on the fair-share-scheduler, which will allocate CPU by project.
# projadd -c "Background jobs" -U davecb background
# dispadmin -d FSS
When you want to start a process in the background group, you use newtask and tell it to use the background project
$ newtask -p background
If the process is already running, you can put it in the background project with
^Z
$ newtask -p background -c $!
and then put it in the shell's background with
$ bg
Implementing CPU management is pretty easy: if a process starts to use more than its fair share of the CPU, the scheduler doesn't dispatch it. The implementation is recursive, so the scheduler can support a hierarchy of containers or VMs, each containing their own control groups.
This is very simple and low-cost, but it only affects the CPU. A program which hogs memory or I/O bandwidth is only indirectly affected. My data-analysis script also uses a lot of the disk bandwidth. CPU management only cuts that down by accident: because the program is starved of CPU, it doesn't run often enough to use up the disk.
What would happen if the program was even slightly more disk-intensive, though? One option is to find a way to starve it some more. We could give all the other programs larger shares, which would help a bit, or we could go to an extreme and give the offending program zero shares of the CPU. That doesn't mean it won't run at all, just that it will never run when any other program whatsoever wants to.
To do that we edit the /etc/project file, and add the string project.cpu-shares=(privileged,0,none) to the end of the line for the background group. Be careful you spell it right and put it in the right field, as there are no diagnostics for typos in the projects file. The file should look like
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
background:100:Background jobs:davecb::project.cpu-shares=(privileged,0,none)
This might be a good setting for the reporting program we were given, especially if it is as compute-intensive as some we've encountered.
However, even zero shares won't work with a program like find. If a program spends all its time issuing reads, then throttling the CPU it gets will only slow it down a little.
I/O Management
The solution to this problem is to limit the I/Os that the program is allowed to issue. This is where Linux shines, courtesy of its I/O scheduler. The scheduler has been recently extended to understand cgroups, and can refrain from dispatching I/Os if they will exceed the cgroup's ration of bandwidth.
Like the CPU and memory limits, it's managed by setting values in a virtual filesystem
# mkdir /mnt/cgroup
# mount -t cgroup -oblockio blockio /mnt/cgroup
Note that this is a different virtual filesystem from the one in the CPU cgroup example: as cgroups are currently experimental, the developers haven't yet agreed on a single place to put them.
As with the cpu, we create a cgroup with mkdir and add the process-id of our problem program to the list of tasks in the cgroup.
# mkdir /mnt/cgroup/background
# /bin/echo 62810 > /mnt/cgroup/background/tasks
Now we can set an I/O limit, in this case 1 MB/s on the /dev/sda1 disk partition.
# /bin/echo /dev/sda1:1M >/mnt/cgroup/background/blockio.bandwidth
In classic Linux fashion, we can see how much is available by reading the blockio.bandwidth file we just wrote to, to get a report sorted by major and minor device-number.
# cat /mnt/cgroup/background/blockio.bandwidth
=== device (8,1) ===
bandwidth limit: 1024 KiB/sec
current i/o usage: 819 KiB/sec
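If you want to watch how close a group is to its ration, the report is easy to boil down with awk. The following is only a sketch: io_usage_pct is a name we made up, and it assumes the report keeps exactly the three-line layout shown above, which may well change while the interface is experimental.

```shell
#!/bin/sh
# io_usage_pct (hypothetical helper): read a blockio.bandwidth report
# on standard input and print, per device, the current usage as a
# percentage of the configured limit.
io_usage_pct() {
    awk '
        /^=== device/         { dev = $3 }      # e.g. "(8,1)"
        /bandwidth limit:/    { limit = $3 }    # KiB/sec
        /current i\/o usage:/ {
            if (limit > 0)
                printf "%s %.0f%%\n", dev, 100 * $4 / limit
        }'
}

# Demonstration using the sample report from the text.
io_usage_pct <<'EOF'
=== device (8,1) ===
bandwidth limit: 1024 KiB/sec
current i/o usage: 819 KiB/sec
EOF
# prints: (8,1) 80%
```

On a live system you would feed it the file itself, for example with io_usage_pct < /mnt/cgroup/background/blockio.bandwidth.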
Why You Should Limit I/O
The diagram below is the measured performance of a disk array. Note the degradation after 450 I/Os per second: instead of degrading gently from 10 to 40 milliseconds, the disk "hits the wall" and jumps to 50 and 100 milliseconds with only a tiny increase in load.
This is the classic behavior of a disk: this particular one can deliver 425 I/Os per second at 100% utilization. You can't exceed 100% utilization, so if you ask for 500 I/Os per second, only 425 are served. Seventy-five read or write requests will have to sit in a queue and wait.
This kind of abrupt slowdown is common to anything that can build up a queue, such as a CPU, a disk or a network, as well as any program that uses these resources. Avoiding horrible slowdowns is an excellent reason to use resource limits, to keep your devices in the "good" part of their response curves. Four hundred requests at 40 milliseconds is far better than 500 averaging more than 80 milliseconds each.
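The backlog arithmetic is worth doing once by hand. The sketch below uses a deliberately crude model, with a constant offered load and a constant service rate, so treat its numbers as an illustration of the trend rather than a prediction for your disks. The function name queue_after is ours.

```shell
#!/bin/sh
# queue_after OFFERED CAPACITY SECONDS (hypothetical helper):
# with OFFERED requests/s arriving at a device that can serve only
# CAPACITY requests/s, print how many requests are waiting after
# SECONDS and how long the device would need to drain them.
queue_after() {
    awk -v o="$1" -v c="$2" -v t="$3" 'BEGIN {
        q = (o > c) ? (o - c) * t : 0
        printf "%d queued, %.2f s to drain\n", q, q / c
    }'
}

queue_after 500 425 1     # prints: 75 queued, 0.18 s to drain
queue_after 500 425 10    # prints: 750 queued, 1.76 s to drain
queue_after 400 425 10    # prints: 0 queued, 0.00 s to drain
```

As long as the overload lasts, the queue and the wait keep growing. Below the limit, in this simple model, nothing accumulates at all, which is why a cap just under the knee of the curve pays off so well.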
A bit of experimentation with find will give you a rough estimate of the limit of your disks, and you can set a limit that keeps them from being driven into overload. Note that this is an absolute limit, not the proportional sharing we mentioned initially. We'll see later when to use each.
Ionice
Along with the cgroups mechanism, Linux also has an ionice program, which can be used from the command-line to manage the I/O of individual programs. Rather than limiting the I/O, it merely changes the relative priorities of programs.
If you say
# ionice -c 3 -p pid
the pid will be given the equivalent of zero shares, by being placed in the "only if everyone is idle" class. This is perfect for disk hogs, but only if you don't care when they're done.
Normally you'd say
# ionice -c 2 -n 7 command
which will give command the lowest priority (7) in the normal class. Any program with a smaller value for n will get more I/O priority than this one, so you'd put your hogs at -n 7 and your interactive programs at -n 0.
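If you find yourself typing this often, a tiny wrapper saves effort. This is a sketch under our own assumptions: run_low_io is a made-up name, and pairing ionice with nice -n 19 is our choice, not something ionice requires. The wrapper falls back to plain nice on systems where ionice is missing or not permitted.

```shell
#!/bin/sh
# run_low_io COMMAND [ARGS...] (hypothetical helper):
# run COMMAND at the lowest best-effort I/O priority and the lowest
# CPU priority; fall back to CPU niceness alone if ionice is unusable.
run_low_io() {
    if command -v ionice >/dev/null 2>&1 &&
       ionice -c 2 -n 7 true 2>/dev/null; then
        ionice -c 2 -n 7 nice -n 19 "$@"
    else
        nice -n 19 "$@"
    fi
}

# Example: a disk-heavy search that should stay out of the way.
#   run_low_io find / -name '*.core'
```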
There is also a real-time class, -c 1, which has very high priorities. However, it's easy to starve all the other programs on the machine with one real-time program, so it's rarely used except on single-purpose servers.
Solaris has no I/O scheduler at the moment, so I can't have ionice on my Solaris laptop. It is available on a Linux laptop, though. The LAMP reporting program we started with is also a good candidate for I/O prioritization.
Memory Management
The next thing that can be managed is memory: in both Solaris and Linux, you can put a memory limit on a cgroup, and if the group's resident set size (RSS) exceeds that number, its least-recently-used pages are swapped out.
This is excellent for managing a program with a memory leak, and can also be used for less severe problems. Programs which use a lot of memory tend to push out pages of other programs which haven't run lately, making them start back up slowly. If you find, for example, that Open Office has a tendency toward pushing out Thunderbird when you're using both, then putting a memory limit on OO will make Thunderbird restart more rapidly.
Like CPU, memory management cgroups use a virtual filesystem
# mount -t container container -o mem_container /mem
# mkdir /mem/oo
# echo -n 25600 >/mem/oo/mem_limit
# ooffice & echo $! | sudo tee /mem/oo/tasks
This will set a limit of 25600 4 KB pages for OO, or about 100 MB, a very tight limit. At the moment, the cgroups memory limit is a hard limit, and doesn't allow overcommitment even when you have lots of memory free. You therefore probably don't want to set OO quite that low.
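Because the limit is expressed in pages, it is easy to slip by a factor of a thousand. Here is a pair of conversion helpers as a sketch, assuming the 4 KB page size used above; the names mb_to_pages and pages_to_mb are ours, and on other hardware you should check the real page size with getconf PAGESIZE.

```shell
#!/bin/sh
# Page size in KB; 4 is an assumption that matches the example above.
PAGE_KB=4

# mb_to_pages MB: number of pages needed to hold MB megabytes.
mb_to_pages() { echo $(( $1 * 1024 / PAGE_KB )); }

# pages_to_mb PAGES: size in megabytes of PAGES pages (rounded down).
pages_to_mb() { echo $(( $1 * PAGE_KB / 1024 )); }

mb_to_pages 100      # prints: 25600
pages_to_mb 25600    # prints: 100
```

To give OO a roomier 512 MB, for instance, you would write the output of mb_to_pages 512 into mem_limit.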
A natural extension to this is a soft limit, one which guarantees a least upper bound like CPU shares. When there is plenty of free memory, a larger soft limit is enforced, but when there is demand for memory, the limit is reduced to the smaller hard limit. This is the kind of behavior we'd like to have for programs which aren't seriously ill-behaved.
For a very severe case, the cgroups mechanism can hand the misbehaving program and its pages off to the dreaded out of memory killer, to kill the process entirely and reclaim all its pages.
Solaris Memory Limits
In Solaris, the resident set size limit is set in the /etc/project file, just like CPU. In our case we'd write
oo:101::davecb,root::rcap.max-rss=10737418240
This can also be set from the command-line, using more convenient units such as GB, using either projadd or projmod:
# projmod -s -K rcap.max-rss=10GB oo
The same Solaris CPU and memory limits can be applied to virtual machines (variously called zones or containers) with the zonecfg command. This is the typical use of resource management in production Solaris shops.
Networks, Too
You can create cgroups of network-using programs in much the same way. Let's say you want to create an "ftp" group:
# mkdir -p /dev/ftp
# mount -t cgroup -otc tc /dev/ftp
# mkdir /dev/ftp/file_transfer
# ftp example.com & echo $! > /dev/ftp/file_transfer/tasks
Now assign a "class id" to the cgroup, in this case an arbitrary one, so you can use tc, the existing traffic control mechanism.
# echo 0x1234 > /dev/ftp/file_transfer/tc.classid
Finally, use tc to create a "hierarchical token bucket" (HTB) class that rate-limits traffic to 100 Mbit/s, and a filter to direct all traffic from the ftp cgroup to this new class.
# tc qdisc add dev eth0 root handle 1: htb
# tc class add dev eth0 parent 1: classid 1:2 htb rate 100mbit ceil 100mbit
# tc filter add dev eth0 parent 1: protocol ip prio 1 handle 800 cgroup value 0x1234 classid 1:2
This will limit all your FTPs on interface eth0 to a total of 100 Mbit/s, equally shared.
The Solaris equivalent used to be a separate tool, but a new "virtual NIC" mechanism was added to Open Solaris on December 4th, so now you can assign a maximum bandwidth or a relative priority to a "flow". In this case, flow is used in the networking sense, and can mean any group of ports and IP addresses, so you could have one flow for http and another with more resources for https.
Cooperative Networking with Trickle
However, an easier and more general approach is to use trickle, written by Marius Eriksen and described at http://monkey.org/~marius/trickle/trickle.pdf. Trickle does network traffic management for both Linux and Solaris, plus OpenBSD, NetBSD/Alpha and FreeBSD.
From the command-line, you simply use trickle as a prefix, like
$ trickle -d 20 ftp example.com
which limits the download speed of the ftp session to 20 KB/s.
It has a daemon and configuration file, so you can set default behaviors in /etc/trickled.conf and have them apply to all subsequent trickle runs. For example,
[ssh]
Priority = 1
Time-Smoothing = 0.1
[ftp]
Priority = 8
Time-Smoothing = 5
Length-Smoothing = 20
This sets all SSHs to the highest priority and most predictable responses. FTPs, on the other hand, get the lowest priority, and in addition can queue up requests for as long as 5 seconds (or 20 KB) to get better throughput, but at the expense of predictable interactive response.
You can set system-wide values, so
# trickled -d 80 -u 10 -s
will set the system-wide download speed to 80 KB/s and the upload speed to 10 KB/s, for an asymmetrical home network.
Trickle's traffic shaper is implemented as a shared library using LD_PRELOAD, so it doesn't require OS support or even root privilege for simple uses. A system administrator can apply it to everyone on a machine, of course, with some /etc/ld.so.preload trickery.
This is Still an Immature Area
The most obvious indication is that the developers haven't decided where to put the virtual filesystem. How Linux will decide what cgroups to put programs in at boot time is also tentative, as are measurement and management programs.
Another problem is that shares aren't easy to reason about: they guarantee that when the machine is loaded, you get a specific share. When it isn't loaded, you get more, but how much more is probabilistic. It depends on everything else that's running on the machine. That can be confusing, and hard to explain to your manager.
Because they're new and not easy to explain, some administrators use hard limits where they should use soft ones. Hard limits are trivial to reason about, and they are genuinely useful for preventing horrible catastrophes, such as a disk driven into infinite slowdown. They also help if you have a license that only allows you to use a certain number of CPUs. But you really shouldn't use hard limits as a general tool: doing so means that the program can never use any spare cycles that your machine has. Instead it wastes them, which is the same as wasting your boss's money.
A Mature Use in Apache
Apache uses hard limits in one of the right ways. There is an option, MaxClients, which sets the number of simultaneous requests the program will accept.
It is initially set to 256, which says you can handle 256 TPS. That's too high. It assumes one page transaction only takes 0.0039 second or 3.9 milliseconds. Most non-local pages are likely to take something like 0.9 seconds. To guarantee resources for a 0.9 second transaction time on a heavily loaded machine, you'd need to set the limit closer to 1/0.9, or about 1.1 times the number of CPUs (queues). Don't forget this last step, or you'll effectively limit your Apache to one CPU.
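The rule of thumb above is easy to turn into a one-liner. This is a sketch of that arithmetic only: max_clients is a name we made up, the response time is something you have to measure yourself, and the result is a starting point for tuning rather than a recommended setting.

```shell
#!/bin/sh
# max_clients CPUS RESPONSE_SECONDS (hypothetical helper):
# estimate a MaxClients value as CPUS / RESPONSE_SECONDS, rounded to
# the nearest whole number and never less than 1.
max_clients() {
    awk -v c="$1" -v r="$2" 'BEGIN {
        n = int(c / r + 0.5)
        if (n < 1) n = 1
        print n
    }'
}

max_clients 1 0.0039   # prints: 256  (what the default implies on one CPU)
max_clients 1 0.9      # prints: 1
max_clients 8 0.9      # prints: 9
```

The last line shows why the CPU count matters: leave it out on an eight-way machine and you would hold Apache to about one request at a time.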
Better still, an Apache extension could do that for you, based on a response time target you gave it and the number of CPUs you want it to use.
What's Next?
Right now, resource management is a little primitive on Linux, but it's such a hot-bed of interest that we expect to see a lot of advances in the next year.
Of course, we'll see integration work on cgroups, so that you can easily create a container or a VM with both absolute caps and relative shares. Probably we'll see reporting on a per-cgroup level, so you can tell if your container is using more or less than you expected. Plus a GUI or two.
ZFS, on Solaris and BSD, just gained a dynamic throttle, to keep a write intensive thread from using all the I/O bandwidth or buffer memory, and starving out other threads. Like Linux, ZFS implemented this by imposing 1-tick delays on the offending programs.
The big advances are likely to be in automatic recognition and management of bottlenecks. Once you start managing resources, it becomes easier to detect sudden catastrophic increases in, for example, disk-drive response time. A system administrator, or in an emergency a daemon, can apply a cap and force the misbehaving program to moderate its behavior. Such a daemon would at least allow the system administrator to log on and diagnose the problem: there's nothing harder to fix than a machine that's locked up so tight that only the power switch works.
Solaris has half of this, in its fault management architecture (FMA) daemon, while Linux has the other half, the I/O management mechanisms. I won't even try to guess who'll put the two together first.
We'll also see more applications doing some degree of self-management, like trickle or Apache. Oracle already tries to prioritize the log writer and database writer above the user processes, so we expect to see other applications do similar things.
IBM already has a full-fledged resource management daemon, but only on mainframes. On z/OS, a daemon manages whole virtual machines, and you can specify, for example, that normal production programs are to have a maximum response time of 3 seconds while reporting only needs a 40-second response time.
If the response time of the production programs threatens to exceed 3 seconds, the daemon grants them more resources at the expense of reporting. If reporting then slows, it will get more resources at the expense of the next-less-important program. This is called "goal mode" resource management, as you specify performance goals and the OS tries to deliver them. We'll know we're really making progress when we hear about goal mode for Linux or Solaris.
Do Try This At Home
In the meantime, you can use the mature parts today. Try trickle almost anywhere, especially for those interminable downloads at home. The same goes for ionice. At the very least, they will help your laptop or desktop survive CPU and disk hogs.
If you're running a reasonably recent 2.6 kernel, you can directly solve the problem we started this article with. Set up one or more cgroups for your main applications, with shares set in proportion to the CPU and I/O they use. Then create a new one for reporting, with a share proportional to what is left over.
If you're production folks running Solaris, you can already be doing this. If not, now's a good time to start, using the Solaris Container Manager GUI.
See what the performance is like, and then adjust the shares to ensure the main applications have plenty of resources.
Now's a good time to stress-test the reporting program, knowing it will leave normal production unaffected: you can then tell management exactly how much reporting they'll be able to do, and probably give them a good estimate of how much it will cost to add more report performance.

