June, 2009
by Ray Wicks
Introduction
This article is part 5 of an Introduction to Capacity Planning tutorial. It is designed for professionals entering or just getting started in the field. Emphasis is placed on large processor systems and examples will be largely drawn from z/OS but the concepts apply to all levels of operating systems and hardware. The tutorial is organized to discuss appropriate Performance Analysis principles and then delve into CP principles and practices. Measure IT has been featuring this tutorial over the last few months as shown below:
February: An overview
March: Modeling, validation, a performance view, & data acquisition.
April: An aside on queuing theory and probability
May: A mini capacity plan and a summary
This Month: Trending, rules of thumb and bibliography
Trending
Trending in capacity planning is a useful and relatively simple technique.
The purpose is to both visually and mathematically observe the past history of resource usage and predict the future requirement.
Implicit in this process is the grand assumption that the past can be used as a predictor of the future. In other words, the behavior in the past will continue in a similar manner.
In IT, "the past" means the recent past. Too much can change with this technology to rely on years of history. You should look at era segments: periods of business interest.
When you look at a historical graph, the shape and scale of the graph can be an interesting indicator of what to expect. Keep in mind that you may need more than numbers. The business and technical environment can be significant. Major technical innovations being considered may destroy any possibility of using past trends as a future predictor. And finally, be smart and lazy. There's an old saying attributed to Napoleon and told to me by Siebo Friesenborg: "When looking for help, look for smart lazy people. Smart energetic people already have their own crusade and won't stop to help you. Dumb lazy people can't help. And dumb energetic people should be shot!"
Here's some historical data for 130 weeks. It's a single system set of samples of CPU% data. Each point is the prime shift average for 40 hours: one number for each prime shift week.

What happened around week 125? Was it an upgrade? Was work moved to another system? If the graph were normalized to MIPS instead of CPU%, any model upgrades would be transparent. For example, a 500 MIPS machine at 50% = 250 MIPS used. If one were to upgrade to a 1000 MIPS machine with exactly the same workload (i.e., the same MIPS consumed), the CPU% would drop to 25%. With a MIPS scale, it would remain level.
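The arithmetic behind MIPS normalization can be sketched in a few lines of Python. The function names are illustrative only; the numbers are the ones from the example above.

```python
def mips_used(capacity_mips: float, cpu_percent: float) -> float:
    """Convert a CPU% reading into MIPS consumed on a machine of known capacity."""
    return capacity_mips * cpu_percent / 100.0


def cpu_percent_on(used_mips: float, capacity_mips: float) -> float:
    """CPU% that the same MIPS consumption shows on a machine of another capacity."""
    return 100.0 * used_mips / capacity_mips


before = mips_used(500, 50)                # MIPS consumed on the 500 MIPS machine
after_pct = cpu_percent_on(before, 1000)   # CPU% after moving to 1000 MIPS
after = mips_used(1000, after_pct)         # MIPS consumed is unchanged

print(before, after_pct, after)
```

Plotting `mips_used` for every weekly sample, rather than the raw CPU%, is what keeps an upgrade from showing up as a cliff in the graph.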

This MIPS scaling provides a much smoother picture. What do you predict in the future? Even visually you can make a reasonable projection. With Excel, it looks as follows.

With Excel you can not only fit a linear trend line to the data, you can also extend the trend some number of periods forward or backward, giving you a good look at the projected values. This picture looks safe and secure. If only reality were always so tractable. Here's what actually happened after week 130.
After week 130, the system appears to have entered a period of instability or chaos. Without the rest of the data, we would have been led astray by our assumption that the past would be a good indicator of the future.
I deliberately hid part of the data to stress this very grand assumption.
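For readers without Excel at hand, the linear trend line and its forward extension amount to an ordinary least-squares fit. The following is a minimal sketch in Python; the data is synthetic and purely illustrative (it is not the data behind the graphs), and the function names are assumptions made for this example.

```python
def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extend(slope, intercept, start, periods):
    """Projected values for `periods` steps beginning at index `start`."""
    return [intercept + slope * x for x in range(start, start + periods)]


# Synthetic weekly MIPS readings: 130 weeks of steady growth (illustrative only).
history = [200 + 1.5 * week for week in range(130)]

slope, intercept = linear_fit(history)
forecast = extend(slope, intercept, start=len(history), periods=12)

print(f"growth per week: {slope:.2f} MIPS")
print(f"projected week 130: {forecast[0]:.1f} MIPS")
```

The fit knows nothing about what happens after the last sample. It simply continues the line, which is exactly the grand assumption at work.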
Now what would you do with the complete set of data? If you try a linear fit, an exponential fit, or even a polynomial fit with Excel, none of them seem too enticing. (Remember the polynomial fit with the RIOC data from the March article?) What to do?
There is in fact an advanced technique, not offered by Excel, called Time Series Analysis (or Box-Jenkins analysis), which handles data such as this. Most often it is not something that one tackles by oneself. One applies a commercial package like the Time Series package under SAS or the AGSS package under APL2. Using the AGSS package, one can produce a result as follows.

All time series packages, although relatively easy to use, require some sophistication and care. But as you can see, the resulting prediction does go in the expected direction.

The bounding lines indicate the confidence interval of the prediction. This is nice with a sophisticated package. As with any prediction, the confidence of the prediction is important. However, here it looks good.
What would you have done without the supporting package? Well, up to week 130 you could have made a good prediction with a PLFF: a Portable Linear Fit Function. What's that? A pencil is a PLFF. Lay a pencil on the graph and it would look good as a predictor. Use a thicker pencil for a confidence interval.
For the periods after week 130, roll the pencil up the graph to assume that the periods after the chaotic periods would trend as the past. (Call it a linear transformation.) And again, it doesn't look too bad.
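The pencil trick can be imitated numerically. The sketch below, in Python, fits a line to the stable weeks, uses two residual standard deviations as the "thickness" of the pencil (a crude band, not a formal confidence interval), and then "rolls" the pencil by keeping the slope and re-centering the line on recent points. The data is synthetic and all names and the choice of two standard deviations are assumptions for illustration.

```python
import statistics


def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pencil(ys, thickness=2.0):
    """Fitted line plus a half-width of `thickness` residual standard deviations."""
    slope, intercept = linear_fit(ys)
    residuals = [y - (intercept + slope * x) for x, y in enumerate(ys)]
    return slope, intercept, thickness * statistics.pstdev(residuals)


def roll_pencil(slope, recent_weeks, recent_values):
    """Keep the old slope; choose the intercept that centers the line on recent points."""
    offsets = [y - slope * x for x, y in zip(recent_weeks, recent_values)]
    return sum(offsets) / len(offsets)


# Synthetic data (illustrative only): steady growth with a +/-3 MIPS wobble,
# then a higher level once the unstable stretch has settled.
stable = [200 + 1.5 * w + (3 if w % 2 else -3) for w in range(130)]
slope, intercept, half_width = pencil(stable)

recent_weeks = list(range(150, 160))
recent_values = [300 + 1.5 * w for w in recent_weeks]
rolled_intercept = roll_pencil(slope, recent_weeks, recent_values)

week = 170
center = rolled_intercept + slope * week
print(f"week {week}: {center - half_width:.0f} to {center + half_width:.0f} MIPS")
```

Rolling the pencil is only as good as the belief that the old growth rate resumes after the disturbance. The code makes that assumption explicit by reusing `slope` unchanged.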

Notice that the use of a PLFF comes close to the time series analysis. But we expected that. Our visual expectations (PS) have been ratified by the conceptual structure (CS) once again. Both help to check one another.
In summary, here are the steps to use in doing trending analysis.
The alternatives for capacity planning vary from using rules of thumb to running a benchmark.
Rules of Thumb (ROTs) are often quite adequate for CP. Here one would apply a simple rule which says that the CPU% limit for quality work (overall CPU% - discretionary CPU%) is about 90%. Once the growth shows that the quality work (Q-Work) is over 90%, you have to do something.
ROTs are also useful for a health check, as we shall see.
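The 90% rule for quality work is simple enough to write down directly. This is a minimal Python sketch; the function names are illustrative, and the straight-line growth used to estimate the remaining weeks is an assumption, not part of the rule itself.

```python
import math

Q_WORK_LIMIT = 90.0  # percent: the rule-of-thumb ceiling for quality work


def quality_work_pct(overall_cpu_pct, discretionary_cpu_pct):
    """Q-Work: overall CPU% minus discretionary CPU%."""
    return overall_cpu_pct - discretionary_cpu_pct


def needs_action(overall_cpu_pct, discretionary_cpu_pct, limit=Q_WORK_LIMIT):
    """True once the quality work exceeds the limit."""
    return quality_work_pct(overall_cpu_pct, discretionary_cpu_pct) > limit


def weeks_until_limit(current_q_pct, growth_pct_per_week, limit=Q_WORK_LIMIT):
    """Whole weeks until a straight-line trend reaches the limit (assumed linear growth)."""
    if current_q_pct >= limit:
        return 0
    if growth_pct_per_week <= 0:
        return None  # flat or shrinking: the trend never reaches the limit
    return math.ceil((limit - current_q_pct) / growth_pct_per_week)


print(needs_action(98.0, 5.0))        # Q-Work is 93%
print(needs_action(95.0, 10.0))       # Q-Work is 85%
print(weeks_until_limit(78.0, 0.5))   # 12 points of headroom at 0.5 points per week
```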
We live our lives with rules of thumb.
Most rules of thumb we live by without ever questioning their veracity. In IT we need some idea of the veracity and the applicability of a ROT. One way of testing your understanding of any rule is to discover exceptions to the rule. In other words, you understand a rule by trying to break it. Logically it looks like this:
∀xΦx .≡. ~∃y~Φy. This says: "for all x, x has property Φ" is equivalent to "there does not exist a y that does not have property Φ." In other words, to break a general rule, one has to find just one exception.
Here are some observations about Rules.
First and foremost, as Hagar observes (©Reprinted with special Permission of King Features Syndicate), there is a rule about rules: all rules have exceptions (except this one of course).
The rules that follow are guidelines which are valuable for scanning large amounts of data looking for exceptions. Most of the time we have expectations, and our Conceptual Structure or Perceptual Structure ignores the expected features. Exceptions attract our attention.
Rules usually apply for ordinary situations. In IT, most rules are made for the ordinary circumstance. An acceptable response time is one defined for ordinary loading rates. In an emergency overload circumstance, the rules or expectations change.
Most rules are not crisp. If a rule claims that "tall" means being greater than or equal to 6 feet in height, what is a person who is 5.995 feet tall? Well, that's close enough too. How about 5.992 feet? And so on. Most rules are fuzzy. The "Tall Rule" has a truth curve which is a spline: the degree of truth ranges over [0, 1] and is smooth across some region. Similarly for IT rules: a good DASD response time is 3 milliseconds. So are 3.01 and 3.1, to some degree.
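A fuzzy rule of this kind can be sketched as a truth function that returns a degree between 0 and 1 instead of a yes or no. The Python below uses a cubic "smoothstep" ramp as one simple choice of smooth curve. The end points of each ramp (5.5 feet, and 6 milliseconds as the point where a response time is no longer "good" at all) are assumptions picked only for illustration.

```python
def smoothstep(x, low, high):
    """Smooth ramp from 0 (x <= low) to 1 (x >= high), flat at both ends."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    t = (x - low) / (high - low)
    return t * t * (3 - 2 * t)


def tall(height_ft, not_tall_at=5.5, tall_at=6.0):
    """Degree of truth of 'tall'; the 5.5 ft lower end is an assumed value."""
    return smoothstep(height_ft, not_tall_at, tall_at)


def good_dasd_response(ms, good_at=3.0, bad_at=6.0):
    """Degree of truth of 'good DASD response time'; the 6 ms end is an assumed value."""
    return 1.0 - smoothstep(ms, good_at, bad_at)


for height in (6.0, 5.995, 5.75, 5.0):
    print(f"{height} ft is tall to degree {tall(height):.3f}")

for ms in (3.0, 3.01, 3.1, 4.5):
    print(f"{ms} ms is good to degree {good_dasd_response(ms):.3f}")
```

A crisp rule is the special case where the ramp has zero width. Widening it is what lets 5.995 feet, or 3.01 milliseconds, count as "close enough."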

As indicated, Rules of Thumb are useful for scanning large amounts of data in a Health Check engine. The Rule Engine scans through the performance database.

Each category of rules looks at its appropriate data: Sysplex rules examine Sysplex data. In this simple scheme, each rule returns a value:
Ideally, each rule also returns some text describing the rule and how it applies in this situation.
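The shape of such a rule engine can be sketched briefly in Python. Everything here is hypothetical: the record layout, the category names, the status values "ok" and "exception", and the single sample rule (which borrows the 3 millisecond DASD guideline mentioned earlier) are placeholders chosen for illustration, not the rules or return values of any real health check tool.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule: str
    status: str  # hypothetical values: "ok" or "exception"
    text: str    # what the rule is and how it applies to this record


def dasd_response_rule(record):
    """Sample rule: flag a volume whose response time exceeds a 3 ms guideline."""
    rt = record["response_ms"]
    status = "ok" if rt <= 3.0 else "exception"
    text = f"{record['volume']}: response time {rt} ms against a 3 ms guideline"
    return Finding("dasd_response", status, text)


# Each category of rules sees only its own category of data.
RULES_BY_CATEGORY = {
    "sysplex": [],  # no sample rules defined here
    "dasd": [dasd_response_rule],
}


def run_engine(database, rules_by_category):
    """Apply every rule in a category to every record of that category."""
    findings = []
    for category, rules in rules_by_category.items():
        for record in database.get(category, []):
            for rule in rules:
                findings.append(rule(record))
    return findings


# Hypothetical slice of a performance database.
database = {
    "dasd": [
        {"volume": "VOL001", "response_ms": 2.4},
        {"volume": "VOL002", "response_ms": 7.9},
    ]
}

findings = run_engine(database, RULES_BY_CATEGORY)
for finding in findings:
    if finding.status != "ok":
        print(finding.text)
```

Keeping the explanatory text next to the status means the report can show only the exceptions and still say why each one was raised.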
The Rules follow.
Sysplex Rules.
Balance Systems Rules

Remember the earlier discussion of balanced resources? MIPS used, memory used, and I/O used should be in some proportion. This motivated the development of the Metric Resource Table.
The table gave us the 10th and 90th percentiles of common resource ratios. We could use these ratios to develop an overview graph.

This graph is used to identify which ratios fall out of the expected metrics range. A bar could be null too if the data is not available.
If the bar is above the 90th percentile line, it means that the value was in the top 10% of the samples reviewed. Similarly, if the bar is below the 10th percentile line, the value is in the bottom 10%. Neither is good or bad, it's a flag to examine the amount of resource available. The workload may warrant the exception.
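The percentile check can be sketched with Python's standard `statistics` module. The reference samples and the ratio being tested are hypothetical stand-ins; a real table would come from the installations actually reviewed.

```python
import statistics


def percentile_band(samples):
    """Return the (10th, 90th) percentile cut points of the reference samples."""
    deciles = statistics.quantiles(samples, n=10)  # nine cut points
    return deciles[0], deciles[-1]


def flag(value, band):
    """Classify one measured ratio against the band; None means no data (a null bar)."""
    if value is None:
        return "no data"
    low, high = band
    if value > high:
        return "above 90th percentile"
    if value < low:
        return "below 10th percentile"
    return "within range"


# Hypothetical reference samples for one resource ratio (illustrative only).
reference = list(range(1, 101))
band = percentile_band(reference)

for measured in (95, 50, 5, None):
    print(measured, "->", flag(measured, band))
```

As the text says, a flag is neither good nor bad. It only marks a ratio worth a second look.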
LPAR Rules

System Image Rules
Workload Rules
Channel Path Rules
DASD Rules
Tape Rules
These rules sound intuitive. But they come from someone who knows more about DASD than tape. So caveat emptor.
Central Storage Rule
Since 64-bit is relatively new, here's the only CS rule I found.
Summary of methodologies
In summary, ROTs can be very useful, but keep in mind that ROTs may not be adequate for some detailed questions. If you want to compare two processors that have the same total MIPS, but one has 4 CPs and the other 2 CPs, the answer is beyond most ROTs. The rules may say that a DASD threshold has been exceeded, but they do not tell you the impact on transaction response time if the DASD response time is reduced. An LPAR rule may look at the configuration, but it cannot assess the impact on CICS performance. For questions such as these, more sophisticated techniques are indicated.
Trending is a relatively simple technique in concept. Trending does predict the future if the future looks like the past. But no technique guarantees that unforeseen events won't grab you. Time Series Trending is a significant step up but it is rather complicated. Trending can answer overall CP questions such as when we will need more resource and how much.
Analytic Modeling usually requires pre-built packages. The technique can be fast to solve and relatively easy to use. Text book models are often too simple to use. The flow among service centers is statistically driven and usually predefined. How accurate are analytic techniques? They usually predict utilization within 5% and response times within 30%. Data acquisition is key. That's why so much time was spent above in sample selection and data gathering. Calibration (getting the model to agree with reality) can be tough. Custom analytic models are really tough usually requiring a technical staff to both use them and understand them. Fortunately, services are available.
Simulation solutions also involve pre-built packages. They are slower to solve and can be relatively easy to use. Like analytic models, flow is statistically driven and usually predefined but can be customized. How accurate are they? Utilization is usually within 5% and response times within 30%. Also data acquisition is key and calibration can be tough. Custom models are built from service center building blocks. One does not write a simulation from scratch. Simulation languages do exist. They do require special skills. And as always, services exist.
The Cadillac of tools is the benchmark. That's where you run, in miniature, an actual system. It does take a lot of work in preparation. You (or someone) provide the hardware and software to run your workload. You provide the databases and other input data required. It does mimic the running environment the best. You measure actual software flow and queuing and software usage. It is expensive. This cost factor usually limits the variations you run. However, unlike any other technique, a benchmark does test the environment. You find out if it even works. An analytic or simulation technique may show that what you modeled may work. You do not verify that what is modeled is what will actually be running.
The choice of technique comes down to a few factors.
Skills
What is required to do PA and CP? Knowledge of the hardware architecture is required. Processor alternatives enable you to assess the impact to performance, capacity, availability, flexibility. Without it, you are at the mercy of any vendor. For the I/O subsystem, you have to be familiar with the architecture of paths, DASD, Tape, etc. How about Networks? Well those people speak a language all their own.
For the software, you have to be very familiar with the control programs such as z/OS. You have to know the major applications you are dealing with such as CICS, IMS, DB2, etc. Knowledge of the software (and hardware) means that you are familiar with the measurements and the meaning of the measurement points.
Once you have the data, the next level skill is embodied in Modeling & Statistics. You have to have some rudimentary skills here even if you use pre built packages. Using the tools to get output is one thing. Getting meaningful output is quite another.
Organization

Briefly, a good organization structure is a blessing for a capacity planner. On the left, the PA and CP organization is closest to the troops on the ground. This is great for communication with that group. However, in many cases, the PA and CP team has to exert influence upon many departments at many levels in an organization. That's why the position of the CP function is better placed as shown on the right. It is close to the business modelers and close enough to the CIO that she can be used as a big stick to intimidate the lower echelon. The development and test teams do not get measured on the things they do for other groups. They need motivation.
Overall Summary
View your data regularly. It is only in this regular report that you can develop a PS. You will begin to see the pattern. Once you have an expectation, you can identify any abnormalities.
You should watch carefully for any configuration changes both hardware and software. Changes in the configurations can change the overall behavior in a dramatic fashion. Monitor changes in software applications. In many installations, the application teams are responsible for their own software. New levels in Applications can have far reaching effects.
What are your overall metrics? If you developed your own metric ratios, where would you be? By tracking these metrics you can detect subtle changes in the system and system usage.
For the processor, make sure that the Priorities/Policies are properly set. Times do change. You should be able to match the policies with expected averages and distributions. Do you have any high CPU Delays? Are you sure about the pattern of business unit activities? In LPAR, you can't focus on a single partition to the neglect of the rest.
For storage, just make sure you have enough. You can keep track of who's using it and how much is being used. Of course you don't want any paging. In a 64 bit environment, keep track of the Available Frames.
For the I/O subsystem, monitor the usage skew (check Relative Intensity) by both controller and actuator. Look for high I/O delay. The important I/O metrics are the IOSQ time (check RT/ST), the pend time (shared DASD), the disconnect time (trouble), and the connect time. A good metric is the Intensity. Where possible, use the DFSMS facility to monitor key data sets (usage).
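These I/O metrics fit together in a small calculation. The Python sketch below uses the conventional decomposition of response time into IOSQ, pend, disconnect, and connect time, takes service time as pend + disconnect + connect, and takes intensity as I/O rate times response time. Those definitions are assumptions stated here for the example (some shops leave pend out of service time), and the sample volumes are made up.

```python
from dataclasses import dataclass


@dataclass
class DeviceSample:
    volume: str
    io_rate: float  # I/Os per second
    iosq_ms: float
    pend_ms: float
    disc_ms: float
    conn_ms: float

    @property
    def service_ms(self):
        """Service time, assumed here to be pend + disconnect + connect."""
        return self.pend_ms + self.disc_ms + self.conn_ms

    @property
    def response_ms(self):
        """Response time: queue time in front of the device plus service time."""
        return self.iosq_ms + self.service_ms

    @property
    def rt_over_st(self):
        """RT/ST: values well above 1 point at queuing for the device."""
        return self.response_ms / self.service_ms

    @property
    def intensity(self):
        """I/O rate times response time (ms of I/O time per elapsed second)."""
        return self.io_rate * self.response_ms


def relative_intensity(samples):
    """Each volume's share of total intensity, a simple view of skew."""
    total = sum(s.intensity for s in samples)
    return {s.volume: s.intensity / total for s in samples}


# Made-up samples for two volumes.
a = DeviceSample("VOL001", io_rate=100, iosq_ms=1.0, pend_ms=0.5, disc_ms=0.5, conn_ms=1.0)
b = DeviceSample("VOL002", io_rate=50, iosq_ms=0.0, pend_ms=0.5, disc_ms=0.0, conn_ms=1.5)

shares = relative_intensity([a, b])
print(a.response_ms, a.rt_over_st, a.intensity)
print(shares)
```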
Keep in mind a few key principles. When using any rule or technique, be aware of the exceptions to the rule and places where the technique may not be valid. A conceptual or perceptual structure helps but it can make you see things that just aren't there.
Impeccable mathematics does not replace knowledge of the facts. The graph may look good and yet be very misleading.
Protect yourself at all times. They may just be out to get you.
Business decisions can override technical issues. Your technical case may be sound and yet not match the business model. It helps to know both.
Sometimes being understood is more important than being very accurate. If you have to explain something to another person, being wrapped in a very technical mantle may not be an asset. Being "very" accurate may be a luxury of the idle. It just takes time to answer some questions in great detail.
Other than the technicalities, there may be a hidden agenda. After all, people are people.
Bibliography
The Art of Computer Systems Performance Analysis, by Raj Jain, Wiley. I like this one. It is thorough and complete. A very good reference. It may be hard to find.
Capacity Planning for Web Performance, by Daniel A. Menasce and Virgilio A.F. Almeida, Prentice Hall. A good book on network structure and terminology and introduction to the topic.
Probability, Statistics, and Queuing Theory, by Arnold O. Allen, Academic Press Inc. This is the classic in queuing theory.
Performance by Design: Computer Capacity Planning by Example, by Daniel A. Menascé, Virgilio A. F. Almeida, and L. W. Dowdy. The web site http://cs.gmu.edu/~menasce/perfbyd/ has a lot of .xls modeling worksheets.
MVS I/O Subsystems, by Gilbert E. Houtekamer and H. Pat Artis, McGraw-Hill. More than you want to know about the I/O subsystem. A definitive source. It is available only online at www.perfassoc.com.
Exploring IBM S/390 Computers, by Jim Hoskins and George Coleman, Maximum Press. A general introduction to S/390 hardware and architecture. (with IBM G326-3006-06)
Statistical Concepts and Methods, by Gouri Bhattacharyya and Richard A. Johnson, John Wiley & Sons.
The Practical Performance Analyst, by Neil J. Gunther, Authors Choice Press. A very good book.
On Demand Computing: Technologies and Strategies, Craig Fellenstein, IBM Press. A good introduction to On Demand architectures.
IBM Manuals
GC28-1761 MVS™ Planning: Workload Management. A guide to WLM.
SC28-1950 Resource Measurement Facility Report Analysis. A guide to report reading.
SC28-1951 Resource Measurement Facility Performance Management Guide. A good tutorial to get started.
SG24-5975 IBM zSeries 900 Technical Guide. A good hardware architecture and implementation Red Book.
LY28-1042 RMF™ Support for LPAR Management Time. Want to know how LPAR works?
SC28-1187 Large Systems Performance Reference by John Fitch. John goes into detail about the LSPR data.
SG24-4356 System/390® MVS Parallel Sysplex Performance. A good Red Book on Parallel Sysplex RMF reports and data.
SG24-4680 System/390 MVS Parallel Sysplex Capacity Planning. A good Red Book on the function and capacity of Parallel Sysplex.
EXCEL:
Applied Statistics For Engineers and Scientists Using Excel and MINITAB, by David Levine, Patricia Ramsey, Robert Smidt, Prentice Hall. This comes with a CD containing handy Excel Add-Ins.
Excel Data Analysis by Jinjier Simon, Wiley. Nice basic reference concentrating on data presentation.
Tools
Organizations - CMG
Almost any volume of the Computer Measurement Group (CMG) Proceedings is worth looking at for performance and capacity planning articles. Web Site: http://www.cmg.org/
Notices
This material is copyright by Ray Wicks 2007-2008.
Selected material reproduced by permission of IBM Corporation.
Many terms are trademarks of different companies and are owned by them.
Thanks to Bernice Riley and Alvero Salla for looking at this and making technical and editorial corrections.