Guerrilla Capacity Planning
PART I: Hit-and-Run Tactics for Website Scalability
April 1, 2003
by Neil J. Gunther


1  Introduction

We so-called performance experts have a tendency to regurgitate certain
performance clichés to each other, and to anyone else who will listen.
Clichés like:

  1. Acme Corporation just lost a $40 million sale because their new
    application cannot meet service level targets under heavy load. How much money
    do they need to lose before they do capacity planning?

  2. Company XYZ spent a million dollars buying performance management tools
    but they won't spend $10 thousand on training to learn the capacity planning
    functionality. They just produce endless strip charts without regard
    for what that data might imply about their future.

Several years ago I stopped mindlessly reiterating statements like these and
took a hard look at what was happening around me. It was then that I realized
not only were people not gravitating towards capacity planning, they actually
seemed to be avoiding it at any cost! From this standpoint, we performance
experts appeared more like clergy preaching from the pulpit after the
congregation had well and truly vacated the church.

In trying to come to grips with this new awareness, I discovered some unusual
reasons why capacity planning was being avoided. Later, I began to ponder what
might be done about it and presented some of those ideas at the 1997 CMG
Conference [Gunther 1997].

My thinking has evolved over the past several years [Gunther 2002] and I would
like to share my current perspective with you in this article. Since I see
performance management differently from most, you may find my conclusions rather
surprising and perhaps, inspiring.


2  Doing More with Less

Traditional capacity planning has long been accepted as a necessary evil for
mainframe [Samson 1997] and data network
procurement [Cockcroft and Walker 2001]. The motivation in the past was
simple; the hardware components were expensive and budgets were limited.
Therefore, the expenditure of those dollars required careful and time-consuming
analysis.

Nowadays, however, hardware has become relatively cheap-even mainframe
hardware! The urge to launch an application with over-engineered hardware has to
be tempered with the less obvious caution that bottlenecks are more likely to
arise in the application design than in the hardware configuration. Simply
throwing more hardware at performance problems will not necessarily improve
performance. So, some kind of analysis and planning may still be required even
if you have all the hardware in the world.

To make matters worse, we now live in the brave new world of distributed
component-based computing and web-based architectures where we have many
software pieces in many hardware places. In stark contrast to the traditional
style of capacity planning for monolithic mainframes, we have a huge number of
incompatible variants to contend with:

  • Little or no instrumentation in third-party or in-house applications.
  • No such thing as UNIX! There's: AIX, HPUX, Solaris, BSDI, FreeBSD, RH
    Linux, Debian Linux, MacOS X, ...

  • There's not even one kind of Windows operating system anymore.
  • Scripts built on one UNIX variant almost invariably do not work on another.
  • Multiple COTS (Commercial Off-The-Shelf) applications running on multiple vendor platforms.
  • Component-based software: Java, .NET, CORBA, ODBC, enterprise beans, etc.
  • No common performance metrics like RMF (Resource Measurement Facility) or
    SMF (System Management Facility) available on MVS mainframes.

  • Most commercial tools have mainframe roots and thus tend to be
    server-centric in their data collection capabilities. Additional tools are
    needed for network and application data.

  • There's no convenient way to comprehend resource consumption across multiple tiers.

This makes analysis and planning of web sites far more difficult than
it needs to be.

In an attempt to ameliorate some of these challenges, CMG has stood behind the
development of performance measurement and management standards like UMA
(Universal Measurement Architecture) and ARM (Application Response Measurement),
now owned by The Open Group. Unfortunately, the vast panoply of tool and platform
vendors has not been convinced that there is revenue in these standards, so they
have not really caught on in the industry 2.

In summary then, we are building more complex architectures with less
instrumentation available to manage them. This is a very risky approach which
seems to be sanctioned in software engineering [Smith and Williams 2002] in a way that
would not be acceptable in most other engineering disciplines. I don't know
about you, but I'm glad Boeing doesn't build aircraft in such a risky way! Since
this attitude is such an obstacle to fostering performance analysis and
capacity planning, let's try to understand why high risk is considered
acceptable in the context of software engineering.


2.1  As Long as It Fails on Time!


Some managers believe they don't need to bother with capacity planning. How many
times have you heard that sentiment? I believe it rests on a misperception of risk.
Assessment of risk is often subverted by a false perception of risk: someone else
will lose $40 million because of poor performance, not me. See sidebar 2.1.1 on
Risk Management vs. Risk Perception for an explanation of why this inverted logic
runs so deep.

Management is generally employed to control schedules. To emphasize this fact
to my students [Gunther 2001], I tell them that managers will even let a project
fail-as long as it fails on time! Many of my students are managers and none of
them has disagreed with me yet. What this means is that managers are often
suspicious that capacity planning will interfere with project planning. Under
such scheduling pressures, the focus is on functionality first. Unfortunately,
new functionality is often over-prescribed because it is seen as a competitive
differentiator. All the development time therefore tends to be absorbed by
implementing and debugging the new functionality. In this climate, applications
often fail [Ackerman 2002] to meet performance expectations [Smith and Williams 2002]
as a result of management pressure to get the new functionality to market as
fast as possible.

2.1.1  Risk Management vs. Risk Perception

Consider the poor fellow driving to the airport with white knuckles because he
just saw a news report on CNN about a plane crash and now he's fretting over the
safety of his own flight. What's wrong with this picture?

Statistics tell us that he has a greater risk of being killed on the freeways
than the airways (by a factor of 30 or more). Our traveller has also heard
these same statistics on television. So, why doesn't he remind himself of this
important fact and look forward to his flight, in spite of there being an air
disaster that day? Try it some time. It doesn't work. It's a psychological issue,
not one of rational thought. On the freeway, our intrepid driver feels like
he is in control because he has his hands firmly on the steering wheel.
But on the aircraft, he is just another fearful passenger strapped into his seat.
This fear is registered at a deep personal level of (false) insecurity. He remains
oblivious to the possibility that he could have been completely obliterated by
another careless driver on the freeway.

And that is the essential difference between risk perception and risk
management. Managers are paid to be in control. Therefore, bad things will not
happen to the project they are managing because that imply they are not really
in control. Incidentally, our nervous traveller's best strategy is actually to
fly to the airport!

Let's face it, Wall Street 3 still rules our culture. Time-to-market dictates
the schedules that managers must follow. This is a fact of life in the new
millennium and a performance analyst or capacity planner who ignores that fact
puts his or her career in peril. So, not only are we supposed to do more with
less, we're supposed to do it in less time! In view of these seemingly insane
constraints, it is imperative that any capacity planning methodology not inflate
project schedules.


2.2  The Performance Homunculus

Performance management can be thought of as a subset of systems management activities.
Systems management includes activities like:

  • Backup/recovery
  • Chargeback
  • Security
  • Distribution of software
  • Performance management

Looked at in this way, performance management is simply another bullet item. But
this is another of those risk misperceptions. In terms of complexity, it demands
the most significant skill levels of any item on the list. It's rather like the
difference in medicine between the torso and the homunculus.

Indicating the location of an ailment to your doctor has meaning because your
body (torso) is referred to in geometric proportion. The homunculus, on the other
hand, represents the sensate proportion of our bodies. Reflecting this sensory
weight (see Figure 1), the hands and the mouth become huge
whereas the thorax and head appear relatively small. This is because we receive
vastly more sensory information through our fingers and tongue than we do via
the skin on our chest, for example.

homonc.gif
Figure 1: The sensory homunculus
 


The same proportionality argument can be applied to performance management skills.

Performance management skills are to the homunculus as systems management
skills are to the torso.

Almost every other item in the list above can be accommodated by purchasing the
appropriate COTS package and installing it. Not so for performance management.

In terms of coverage, performance management can be broken into three major areas:

  1. Performance monitoring

  2. Performance analysis

  3. Performance planning

Most attention is usually paid to level 1: performance monitoring
because it is generally easiest to address. If you want to manage performance
and capacity, you have to measure it. Naturally, this is the activity that the
majority of commercial tool vendors target. As a manager, if you spend $250,000
on tools, you feel like you must have accomplished something. Alternatively,
UNIX and NT system administrators are very good at writing scripts to collect
all sorts of data as part of their system administration duties. Since almost
nobody sports the rank of Performance Analyst or Capacity Planner
on their business card these days, that job often falls to the system
administrator as part of the systems management role. But data collection just
generates data. The next level (2) is analysis. The usual
motivation for doing any analysis these days is to fire-fight an
unforeseen performance problem that is impacting a release schedule or deployed
functionality. With a little more investment in planning (level
3), those unforeseen ``fires'' can be minimized. But level
3 is usually skipped for fear of inflating project schedules.
How can this Gordian knot be cut?


3  Guerrilla Capacity Planning

In my view, a more opportunistic approach [Tabor 1970] to capacity planning is needed.
Enter guerrilla capacity planning!


kong.gif
Figure 2: G-u-e-r-r-i-l-l-a, not gorilla.


The notion of tactical planning may seem self-contradictory, since planning is
usually taken to be a strategic, long-range activity. At the risk of mixing
metaphors, we can think of traditional capacity planning as being the 800-pound
gorilla! That gorilla needs to go on a diet to produce a leaner approach
to capacity planning that is compatible with the modern business environment
described in Section 2.1. By lean, I don't mean skinny.

Skinny would be like remaining stuck at level 1 where
there is a tendency to simply monitor everything that moves in the false hope
that capacity issues will never arise and thus, planning can be avoided
altogether. Monitoring requires that someone watch the 'meter needles' wiggle.
Inherent in this approach is the notion that no action need be taken unless the
meter redlines. But performance 'meters' can only convey the current state of
the system. Such a purely reactive approach does not provide any means for
forecasting what lies ahead. You can't forecast the weather by listening to
leaves rustle.

The irony is that a lot of predictive information is likely contained in the
collected monitoring data. But, like panning for gold, some additional processing
must be done to reveal the hidden gems about the future. Keeping in mind the
economic circumstances outlined earlier, moving to levels 2 and
3 must not act as an inflationary pressure on the manager's
schedules. Failure to comprehend this point fully is, in my opinion, one of the
major reasons that traditional capacity planning methods have been avoided.
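
To make the idea of panning for gold a little more concrete, here is a minimal
sketch (my own illustration, with hypothetical utilization numbers and an
assumed 75% planning threshold) of the kind of lightweight processing I have in
mind: fit a simple linear trend to a week of collected CPU utilization samples
and project when the server is likely to cross the threshold.

  # Minimal sketch: turn collected monitoring data into a forecast.
  # The daily utilization samples and the 75% threshold are hypothetical.

  def linear_fit(xs, ys):
      """Ordinary least-squares fit of y = a + b*x."""
      n = len(xs)
      mean_x = sum(xs) / n
      mean_y = sum(ys) / n
      slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
               / sum((x - mean_x) ** 2 for x in xs))
      intercept = mean_y - slope * mean_x
      return intercept, slope

  days = [1, 2, 3, 4, 5, 6, 7]                       # sample day
  util = [0.52, 0.54, 0.55, 0.58, 0.60, 0.61, 0.64]  # measured CPU busy fraction

  a, b = linear_fit(days, util)
  threshold = 0.75
  crossing_day = (threshold - a) / b   # day at which the trend reaches the threshold

  print("trend: utilization = %.3f + %.3f * day" % (a, b))
  print("projected to hit %.0f%% utilization around day %.0f" % (threshold * 100, crossing_day))

Of course, a straight line is only appropriate while the growth really is linear;
the point is that even this trivial amount of extra processing converts wiggling
meter needles into a statement about the future.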


3.1  It's Not a Model Railway

The goal of capacity planning is to be able to predict ahead of time that which cannot
be known or measured now. Prediction requires a consistent framework in which to
couch the assumptions. That framework is called a model. The word ``model'',
however, is one of the most overloaded terms in the English language. It can
mean everything from a model railway set to the model, Cindy Crawford. Consider
the model railway. The goal there is to cram in as much detail as the scale will
allow. The best model train set is usually judged as the one that includes not
just a scale model of the locomotive, and not just a model of an engineer
driving the scaled locomotive but, the one that includes the pupil painted on
the eyeball of the engineer driving the scaled locomotive!

This is precisely what a capacity planning model is not. For capacity planning,
the goal is to discard as much detail as possible while still retaining the
essence of the system's performance characteristics. This tends to argue against
the construction and use of detailed simulation models, in favor of the use of
spreadsheets or even automated forecasting. The skill lies in finding the
correct balance. Linear trending models may be too simple in many cases while
event-based simulation models may be overkill. To paraphrase Einstein: Keep the
model as simple as possible, but no simpler.
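
As one illustration of what ``simple but no simpler'' can look like (a sketch
under my own simplifying assumptions, not a prescription), consider an open
single-queue M/M/1 approximation whose only inputs are the per-request service
demand and the arrival rate. Unlike a straight-line trend, it captures the
characteristic nonlinear blow-up in response time as the server approaches
saturation, yet it fits in a few lines rather than a full-blown event-based
simulation. The 0.2 second service demand used here is a hypothetical value.

  # Sketch of a deliberately simple capacity model: an open single-queue
  # (M/M/1) response-time estimate, R = S / (1 - lambda * S).
  # The service demand S = 0.2 seconds per request is a hypothetical value.

  SERVICE_DEMAND = 0.2   # seconds of service per request

  def response_time(arrival_rate, service_demand=SERVICE_DEMAND):
      """Mean response time of an open M/M/1 queue; valid only below saturation."""
      utilization = arrival_rate * service_demand
      if utilization >= 1.0:
          raise ValueError("offered load exceeds server capacity")
      return service_demand / (1.0 - utilization)

  # Response time grows slowly at first, then explodes as the arrival rate
  # approaches the saturation point 1/S = 5 requests per second, which is
  # exactly the behavior a linear trend would miss.
  for rate in (1.0, 2.0, 3.0, 4.0, 4.5, 4.9):
      print("%.1f req/s -> R = %5.2f s" % (rate, response_time(rate)))

Whether a single-queue abstraction is adequate depends entirely on the system at
hand; the point is the discipline of discarding detail until only the
performance-determining parameters remain.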


3.2  No Compass Required

Traditional capacity planning has required relatively high precision because
many thousands of dollars were attached to each significant digit of the
calculation. In today's economic climate, however, managers usually just want a sense
of direction rather than the actual compass bearing.
In this sense, the precision of capacity predictions has become less
important than their accuracy. There is little virtue in spending two months
debugging and verifying a full-blown simulation if the accuracy of a simple
spreadsheet model will suffice.

At a more technical level, there is little support for high-precision
measurements in open systems. Take UNIX, for example. It's basically an
experiment that escaped from the lab circa 1975 and has been producing mutants
ever since. What little performance instrumentation exists was originally
implemented in the UNIX kernel for the benefit of the early developers, not for
the grand purpose of capacity planning. Nonetheless, every capacity planning
tool in existence today primarily relies on those same kernel counters with
little modification. And since the PC revolution of the 1980's, performance
management has become ad hoc, at best.


4  Summary

To summarize our theme so far: time is money; more so today than in Benjamin Franklin's
day. Web sites are distributed and more complex than in mainframe days.
Therefore, the traditional approach to capacity planning can no longer be
supported. And it's not really about hardware anymore. The new emphasis is on
software scalability and that impacts the way capacity planning should be
approached.

The key idea presented here is tactical planning, but there are at least
ten ways in which guerrilla capacity planning differs from traditional
capacity planning, as summarized in the following table.


Item          Traditional     Guerrilla
----          -----------     ---------
Budget        Big             None
Tools         Big             Tiny
Time scale    Strategic       Tactical
Approach      Passive         Proactive
Title         Business card   No badge
Schedule      Inflationary    Deflationary
Scope         Routine         Opportunistic
Reporting     Expected        Unexpected
Skill set     Narrow          Diversified
Focus         Hardware        Applications

Guerrilla capacity planning tries to facilitate rapid forecasting of
capacity requirements based on available performance data in such a way that
management schedules are not inflated. In PART II, I'll give some examples of
how Guerrilla Capacity Planning can work for you in the context of making
scalable web sites.

References

[Ackerman 2002]
Ackerman, E.
``Waging a Battle Against PC Bugs,''
SiliconValley.com
Posted Jan. 26, 2002
[Cockcroft and Walker 2001]
Cockcroft, A. and Walker, W.
Sun Blueprints: Capacity Planning for Internet Services,
Prentice-Hall, 2001.
[Dumke et al. 2001]
Performance Engineering: State of the Art and Current Trends,
(Eds.) Dumke, R., Rautenstrauch, C., Schmietendorf, A., Scholz, A.,
Springer Lecture Notes in Computer Science,
# 2047. Heidelberg: Springer-Verlag (2001).
[Gunther 1997]
Gunther, N. J.,
``Shooting the RAPPIDs: Swift Performance Techniques for Turbulent Times,''
Proc. CMG'97, 602-613.
[Gunther 2001]
Gunther, N. J.,
Lecture notes
for Guerrilla Capacity Planning course.
[Gunther 2002]
Gunther, N. J.,
``Hit-and-Run Tactics Enable Guerrilla Capacity Planning,''
IEEE IT Professional, pp.40-46
Jul-Aug, 2002.
[Samson 1997]
Samson, S. L.
MVS Performance Management: OS/390 Edition,
McGraw-Hill, 1997.
The z/OS edition is available in
digital format.
[Smith and Williams 2002]
Smith C. and Williams L.,
Performance Solutions: A Practical Guide to Creating Responsive,
Scalable Software,
Addison-Wesley, 2002.
[Tabor 1970]
Tabor, R.
The War of the Flea: A Study of Guerrilla Warfare Theory and Practice,
Paladin, London, U.K., 1970.

Footnotes:

1 Joint Copyright © 2002-2003 Performance
Dynamics Company and IEEE. All Rights Reserved. Permission has been granted to
CMG Inc., to publish this version in CMG MeasureIT.

2 Alan Schulman recently commented to me that what is
lacking is a champion like Barry Merrill. RMF and SMF didn't just magically appear
in MVS either.

3 When Einstein was asked what he thought was
the greatest force in the universe, he quipped, ``Compound interest!'' Today, he
might well say ``Wall Street!''


