Capacity Planning for Grid Computing:
Managing Virtual Distributed Systems
April, 2005
by Peg McMahon
Introduction
Whatever you call it - grid computing or utility computing or computing on demand - it is certainly at the heart of the computer room of the future.
At first glance, the introduction of virtual systems into the distributed processing world seems to be just another example of "everything old is new again". After all, virtual servers have been a mainstay of the mainframe computing world for decades.
A closer look, however, reveals a new twist. In workload-managed mainframe systems, the capacity planner at the very least knew the physical machine on which the application was running. In the very near future, this may not be the case.
What challenges does this present to the capacity planner and the performance analyst? How well are we equipped to meet the challenges?
Situation
The fortuitous confluence of a number of major trends has made grid computing possible.
Very powerful CPUs drive small, relatively inexpensive computers. The new, fast processors give these machines processing power that only a few years ago required top-of-the-line, high-dollar servers.
Open source software is reducing licensing costs, which in turn makes horizontally scalable systems economically feasible. It is now financially possible to consider deploying 10 or 20 small servers in place of one large server.
Standards-based software development allows applications and databases to be hardware agnostic. Applications can run on any system, anywhere.
As a result, it is no longer science fiction to envision a room full of computers connected by an almost neural network capable of adding or subtracting units as the workload demands. Applications merely present work to the master dispatcher saying, in effect, "Here’s the work, run it anywhere." The mission critical applications of the future could be handled much like web traffic is now.
The problem for the capacity planner, however, is that the workload completely loses its physical dimension. If our application’s current usage factor, for example, is transactions per CPU percentage, how do we obtain the required metrics when it is difficult to determine even something as fundamental as which physical machine was doing the work?
VMware introduced our company to the brave new world of virtual distributed computing. The plan called for consolidating upwards of 1500 extremely under-utilized one- and two-CPU Intel servers onto about 50 eight-CPU VMware servers without recoding the applications.
The application would not need to be recoded because the difference between its old server and the new one really shouldn’t matter. If it thought it needed two CPUs to run, fine. Let it think it had two full CPUs at its disposal, even if they are virtual and dynamically assigned.
Although the applications running on our VMware servers today are far from mission critical, the installation serves almost as a test bed for the problems that capacity planners will face in enterprise computing of the near future.
The Challenge
By definition, virtualization is not real. So what do the metrics returned by a virtual system really mean?
We confronted that problem when we decided to stack 40 virtual machines on each of these eight-CPU servers.
This required what mainframers will recognize as normal workload management activities. On the VMware ESX server, minimum CPU and memory guarantees were set for each guest system. Maximum limits were set if it seemed necessary.
In addition, our company standards required the installation of OpenView Performance Agent on each of the guest systems.
So metrics were available from the guest system. But what did they mean? If a guest system reported 60 percent CPU utilization, was it 60 percent of its two allocated CPUs? Or was it 60 percent of the half CPU that it actually was running on?
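The ambiguity is easy to see with a little arithmetic. The sketch below, in Python, converts a guest's reported percentage into physical-CPU equivalents under each of the two readings. The function name and its parameters are ours, chosen for illustration; this is not something the performance agent or the ESX server provides.

```python
def guest_cpu_in_physical_terms(reported_pct, basis_cpus, host_cpus):
    """Translate a guest's reported CPU percentage into physical terms.

    basis_cpus is the number of physical CPUs that "100 percent" is
    ASSUMED to represent. Which value is right is exactly the open question.
    Returns (physical CPUs consumed, percent of the whole host).
    """
    physical_cpus_used = reported_pct / 100.0 * basis_cpus
    host_share_pct = physical_cpus_used / host_cpus * 100.0
    return physical_cpus_used, host_share_pct


# A guest reports 60 percent on an eight-CPU host.
# Reading 1: 60 percent of its two allocated (virtual) CPUs.
as_allocated = guest_cpu_in_physical_terms(60, basis_cpus=2.0, host_cpus=8)
# Reading 2: 60 percent of the half CPU it was actually running on.
as_actual = guest_cpu_in_physical_terms(60, basis_cpus=0.5, host_cpus=8)

print(as_allocated)
print(as_actual)
```

Under the first reading the guest is consuming 1.2 physical CPUs, or 15 percent of the host. Under the second it is consuming 0.3 of a CPU, less than 4 percent of the host. The same reported number differs by a factor of four depending on the basis.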
Adding to the difficulty was the VMware ESX operating system itself. It is a heavily modified Linux kernel written to have minimal impact on the box. Although vmstat and the other command-line tools can be run, the kernel generates very few metrics regarding its own utilization.
And, naturally, the kernel gets first priority on machine cycles. Consider this situation: If the ESX operating system were doing something resource intensive, the guest systems might record very low CPU because that’s all the CPU that they were allowed to have. The low guest CPU utilization would not mean a lightly used server. It is conceivable that low guest CPU might mean a very highly utilized system. But since the utilization would be due to the host OS, it might be unrecorded or undetected.
These problems are not unique to VMware. They are part and parcel of the brave new world of virtual computing.
On a mainframe, you could solve this problem by normalizing the metrics, using core OS metrics and SMF metrics from the application. Unfortunately, these tools are absent in the current virtual world.
So at this point, users are scrambling to find answers to these metric questions. This quest is also providing a cottage industry for OEM software vendors.
But, in actuality, the vendors of the virtual systems are the only ones who possess complete knowledge of where the metrics reside and what they mean. As a responsible user community, it is our duty to pressure vendors into providing the tools to perform this kind of normalization.
Different Stages of Planning
Before:
Before moving to the virtual system, applications need to be thoroughly examined.
- How many CPUs to allocate?
- How much memory is needed?
- How much storage to assign?
- What are the performance characteristics? Batch or OLTP? CPU, IO or memory intensive?
Initial Installation:
What are the performance characteristics of the new system? We found that CPU queuing was a normal part of VMware operations.
Are the initial application allocations correct? Is each guest system running on the correct amount of CPU and memory? Has it been over- or under-allocated?
Ongoing:
How does performance compare to the baseline?
What is the overall trend of utilization? Is it growing linearly or exponentially?
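One simple way to answer the linear-or-exponential question is to fit both shapes to the utilization history and see which leaves less error. The following is a minimal sketch using ordinary least squares, not a recommendation of any particular forecasting product. The function names are hypothetical, and every reading must be greater than zero because the exponential fit takes logarithms.

```python
import math


def _least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def _sse(actual, predicted):
    """Sum of squared errors between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def classify_trend(utilization):
    """Return 'linear' or 'exponential', whichever model fits with less error.

    utilization is a list of equally spaced readings (for example, monthly
    average CPU percent), oldest first.
    """
    xs = list(range(len(utilization)))

    # Linear model: y = intercept + slope * x
    slope, intercept = _least_squares(xs, utilization)
    linear_fit = [intercept + slope * x for x in xs]

    # Exponential model: y = a * exp(b * x), fitted as a line through log(y)
    b, log_a = _least_squares(xs, [math.log(u) for u in utilization])
    exp_fit = [math.exp(log_a + b * x) for x in xs]

    # Compare both models in the original units so the errors are comparable.
    if _sse(utilization, exp_fit) < _sse(utilization, linear_fit):
        return "exponential"
    return "linear"


steady_growth = [20, 22, 24, 26, 28, 30]               # +2 points per period
compounding_growth = [10 * 1.2 ** i for i in range(8)]  # +20 percent per period
print(classify_trend(steady_growth))
print(classify_trend(compounding_growth))
```

With only a handful of readings the two shapes can be hard to tell apart, so the answer should be revisited as more history accumulates.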
Are the users reporting normal application behavior? Response times good? Any major projects planned?
So do you need detailed application metrics?
Maybe not
There are a variety of forecasting techniques available that do not rely on minute measurement of each guest server.
This provides an opportunity to make use of:
- Law of Large Numbers
- Central Limit Theorem
- Time Series Analysis
In our final analysis, the best idea seems to be to plan for the grid, not for the individual applications.
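The statistical argument for planning at the grid level can be sketched in a few lines. If the guests' demands are roughly independent (an assumption, and one that fails for workloads that all peak together), their means add but their standard deviations add only in quadrature, so the combined workload is far smoother than the individual ones. The per-guest figures below are invented for illustration, and the function name is hypothetical.

```python
import math


def grid_demand(guest_stats, z=2.0):
    """Aggregate per-guest demand into a grid-level planning figure.

    guest_stats: list of (mean, std_dev) pairs in physical-CPU units.
    z: how many standard deviations of headroom to plan for.
    ASSUMES guest demands are independent of one another.
    Returns (total_mean, total_std, planning_level).
    """
    total_mean = sum(m for m, _ in guest_stats)
    # Variances add for independent workloads; standard deviations do not.
    total_std = math.sqrt(sum(s ** 2 for _, s in guest_stats))
    return total_mean, total_std, total_mean + z * total_std


# Hypothetical host: 40 guests, each averaging 0.10 CPU with a 0.05 std dev.
guests = [(0.10, 0.05)] * 40
mean, std, plan = grid_demand(guests)

# The alternative: give every guest its own two-sigma headroom and add them up.
per_guest_plan = sum(m + 2.0 * s for m, s in guests)

print(round(plan, 2), round(per_guest_plan, 2))
```

With these made-up numbers, sizing each guest separately reserves 8 CPUs, while sizing the grid as a whole calls for about 4.6. The averages are the same in both cases; only the headroom shrinks.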
Tiering
If the decision is made to plan for the grid and not for the application, a tiering system should be applied.
Exceptions to the "plan for the grid" rule should be made for applications that are:
- Very large
- Very important
- Very dynamic
In essence, there are Tier 1 applications and The Rest.
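The tiering rule can be written down as a simple screen. In the sketch below, the field names and the cutoff values are assumptions made for illustration only; each site would have to decide what "very large" and "very dynamic" mean for its own workloads.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    peak_cpus: float           # peak demand in physical-CPU units ("very large")
    business_critical: bool    # flagged by the business ("very important")
    demand_variability: float  # std dev divided by mean demand ("very dynamic")


def tier(app, large_cpus=2.0, dynamic_ratio=1.0):
    """Return 'Tier 1' for exceptions that get individual planning,
    'The Rest' for applications that are planned as part of the grid.

    large_cpus and dynamic_ratio are HYPOTHETICAL cutoffs, not recommendations.
    """
    if (app.peak_cpus >= large_cpus
            or app.business_critical
            or app.demand_variability >= dynamic_ratio):
        return "Tier 1"
    return "The Rest"


apps = [
    App("payroll", peak_cpus=0.4, business_critical=True, demand_variability=0.3),
    App("intranet-wiki", peak_cpus=0.2, business_critical=False, demand_variability=0.5),
    App("report-batch", peak_cpus=3.0, business_critical=False, demand_variability=0.2),
]
tiers = {a.name: tier(a) for a in apps}
print(tiers)
```

Meeting any one of the three criteria is enough to make an application an exception, which matches the "or" in the list above.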
Knocking on Vendors’ Doors
As users, we in the computing community need to apply pressure to all the vendors to make our job a little easier.
Vendors need to provide a way to monitor the "dispatcher", the OS or scheduler piece that is parceling out the resources and controlling where the work is executed:
- Configuration information
- Utilization information
Vendors need to provide tools to normalize the data. How much of the "real" machine is the workload taking?
Vendors need to find a way to express their capacity in some transportable fashion (think MIPS). Without this, capacity planners cannot determine how much hardware to buy, and grid computing runs the risk of being locked into a single vendor. Lack of competition has financial ramifications for the enterprise.
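To see why a transportable unit matters, consider how a purchase would be sized if one existed. The sketch below assumes a hypothetical vendor-neutral "capacity unit" rating for each server model. No such rating exists today, which is the point; the function name, the ratings and the target utilization are all invented for illustration.

```python
import math


def servers_needed(demand_units, units_per_server, target_utilization=0.5):
    """Number of servers required to carry the workload.

    demand_units: total workload, in the HYPOTHETICAL transportable unit.
    units_per_server: the vendor's rating of one server in the same unit.
    target_utilization: fraction of each server we are willing to fill.
    """
    usable_units = units_per_server * target_utilization
    return math.ceil(demand_units / usable_units)


# The same 1000-unit workload sized against two imaginary server models.
model_a = servers_needed(1000, units_per_server=100)
model_b = servers_needed(1000, units_per_server=250)
print(model_a, model_b)
```

With a common unit, the two models can be compared directly on price: 20 of the smaller servers against 8 of the larger ones. Without it, the planner can size only the platform already measured.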
Summary
The 80/20 rule applies to virtual distributed environments. Planning for the grid and not for each individual application will work 80 percent of the time. There is a small group of very important, very large or very dynamic applications that will need more attention.
The 80/20 rule also applies to the phases of capacity planning. Eighty percent of the analysis work should be done before an application is moved to a virtual environment and in the early months of deployment. After that, less frequent monitoring is required. Detailed monitoring, at this point, should be concentrated only on systems not running within anticipated parameters.
Finally, we should press vendors into doing 80 percent of this work for us. They need to step forward with a meaningful metric strategy. They need to find a way to state server capacity in useful and transportable terms.
Last Updated 04/15/05