Getting Started in z/OS Capacity Planning

Trending, rules of thumb and bibliography

June, 2009
by Ray Wicks

About the Author
Ray Wicks

Ray has spent most of his career at IBM in the performance analysis and capacity planning end of the business in Poughkeepsie and at the Washington Systems Center. He is the major contributor to IBM's internal PA & CP tool zCP3000. This tool is used extensively by the IBM services and technical support staff worldwide to analyze existing z/OS configurations (processor, storage, and I/O) and make projections for capacity expectations.

In support of this effort, Ray has given classes and lectures worldwide. He won the prestigious CMG A.A. Michelson award in 2000.

Introduction

This article is part 5 of an Introduction to Capacity Planning tutorial. It is designed for professionals entering or just getting started in the field. Emphasis is placed on large processor systems and examples will be largely drawn from z/OS but the concepts apply to all levels of operating systems and hardware. The tutorial is organized to discuss appropriate Performance Analysis principles and then delve into CP principles and practices. Measure IT has been featuring this tutorial over the last few months as shown below:

February: An overview

March: Modeling, validation, a performance view, and data acquisition

April: An aside on queuing theory and probability

May: A mini capacity plan and a summary

This Month: Trending, rules of thumb and bibliography


Part V - Trending, Rules of Thumb, Bibliography


Trending

Trending in capacity planning is a useful and relatively simple technique.

 The purpose is to both visually and mathematically observe the past history of resource usage and predict the future requirement.

Implicit in this process is the grand assumption that the past can be used as a predictor of the future. In other words, the behavior in the past will continue in a similar manner.

In IT, the past means the recent past. Too much can change with this technology to use years of history. You should look at era segments: periods of business interest.

When you look at a historical graph, the shape and scale of the graph can be an interesting indicator of what to expect. Keep in mind that you may need more than numbers. The business and technical environment can be significant. Major technical innovations being considered may destroy any possibility of using past trends as a predictor of the future. And finally, be smart and lazy. There's an old saying attributed to Napoleon and told to me by Siebo Friesenborg: "When looking for help, look for smart lazy people. Smart energetic people already have their own crusade and won't stop to help you. Dumb lazy people can't help. And dumb energetic people should be shot!"

Here's some historical data for 130 weeks. It's a single system set of samples of CPU% data. Each point is the prime shift average for 40 hours: one number for each prime shift week.

What happened around week 125? Was it an upgrade? Was work moved to another system? If the graph were normalized to MIPS instead of CPU%, any model upgrades would be transparent. For example, a 500 MIPS machine at 50% = 250 MIPS used. If one were to upgrade to a 1000 MIPS machine with exactly the same workload (i.e., the same MIPS consumed), the CPU% would drop to 25%. With a MIPS scale, it would remain level.
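
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name mips_used is made up here, and it assumes a machine's capacity can be summed up in a single MIPS rating.

```python
def mips_used(cpu_pct, machine_mips):
    """Convert a CPU% sample into MIPS consumed on a machine with the given MIPS rating."""
    return machine_mips * cpu_pct / 100.0


# The example from the text: the same workload before and after an upgrade.
before_upgrade = mips_used(50.0, 500.0)    # 500 MIPS machine running at 50%
after_upgrade = mips_used(25.0, 1000.0)    # 1000 MIPS machine running at 25%
print(before_upgrade, after_upgrade)       # both samples represent the same MIPS used
```

On a MIPS scale the two samples plot at the same height, which is why the upgrade disappears from the graph.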

This MIPS scaling provides a much smoother picture. What do you predict in the future? Even visually you can make a reasonable projection. With Excel, it looks as follows.

With Excel you can not only fit a linear trend line to the data, you can also extend the trend some number of periods forward or backward, giving you a good look at the projected values. This picture looks safe and secure. If only reality were always so tractable. Here's what actually happened after week 130.
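
For readers without Excel at hand, the same kind of linear trend line can be sketched in plain Python. This is an ordinary least-squares fit offered as an illustration, not a description of what Excel does internally. The function names and the weekly values are made up; they are not the data behind the graph.

```python
def linear_fit(ys):
    """Least-squares line through (1, ys[0]), (2, ys[1]), ...; needs at least two samples.

    Returns (slope, intercept).
    """
    n = len(ys)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project(ys, periods_ahead):
    """Extend the fitted line periods_ahead periods past the last sample."""
    slope, intercept = linear_fit(ys)
    n = len(ys)
    return [intercept + slope * (n + k) for k in range(1, periods_ahead + 1)]


# Hypothetical weekly MIPS-used samples (illustration only).
weekly_mips = [200.0, 204.0, 208.0, 212.0, 216.0]
print(project(weekly_mips, 3))
```

The projection is only as good as the grand assumption behind it: the fit knows nothing about what happens after the last sample.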

 

After week 130, the system appears to have entered a period of instability or chaos. Without the rest of the data, we would have been led astray by our assumption that the past would be a good indicator of the future.

I deliberately hid part of the data to stress this very grand assumption.

Now what would you do with the complete set of data? If you try a linear fit, an exponential fit, or even a polynomial fit with Excel, none of them seem too enticing. (Remember the polynomial fit with the RIOC data from the March article?) What to do?

There is in fact an advanced technique not offered by Excel called Time Series Analysis (or Box-Jenkins analysis) which handles data such as this. Most often it is not something that one tackles by oneself. One applies a commercial package such as the Time Series package under SAS or the AGSS package under APL2. Using the AGSS package one can produce a result as follows.

All time series packages, although relatively easy to use, require some sophistication and care. But as you can see, the resulting prediction does go in the expected direction.

The bounding lines indicate the confidence interval of the prediction. This is nice with a sophisticated package. As with any prediction, the confidence of the prediction is important. However, here it looks good.

What would you have done without the supporting package? Well, up to week 130 you could have made a good prediction with a PLFF: a Portable Linear Fit Function. What's that? A pencil is a PLFF. Lay a pencil on the graph and it would look good as a predictor. Use a thicker pencil for a confidence interval.

For the periods after week 130, roll the pencil up the graph to assume that the periods after the chaotic periods would trend as the past. (Call it a linear transformation.) And again, it doesn't look too bad.

Notice that the use of a PLFF comes close to the time series analysis. But we expected that. Our visual expectations (the perceptual structure, PS) have been ratified by the conceptual structure (CS) once again. Both help to check one another.

In summary, here are the steps to use in doing trending analysis.

  1. Select a variable for the trending study. For CP, it is usually a common resource in the data model such as CPU% or MIPS used.
  2. Choose an appropriate sample, most often a series of business interest periods.
  3. Verify workload composition. Make sure the samples have the expected workloads present.
  4. Choose a description. Three of the most common are the average, the 90th percentile, and the maximum.
  5. Collect samples by the business period.
  6. Fit the data with a regression analysis (or time series).
  7. Make a projection out a reasonable number of periods.
  8. And(!), track the results to gain confidence. Did it go as expected?
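
Steps 4 and 5 can be sketched as follows: each business period's raw samples are reduced to one number per description. This is a minimal sketch. The nearest-rank definition of the 90th percentile is one common choice and an assumption here (other tools interpolate), and the sample values are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def describe(samples):
    """Reduce one business period's samples to the three common descriptions."""
    return {
        "average": sum(samples) / len(samples),
        "p90": percentile(samples, 90),
        "maximum": max(samples),
    }


# Hypothetical CPU% samples for one prime-shift period (illustration only).
period = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(describe(period))
```

Whichever description is chosen, use the same one for every period so that the fitted trend compares like with like.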

Summary

The alternatives for capacity planning vary from using rules of thumb to running a benchmark.

Rules of Thumb (ROTs) are often quite adequate for CP. Here one would apply a simple rule which says that the CPU% limit for quality work (overall CPU% - discretionary CPU%) is about 90%. Once the growth shows that the Q-Work is over 90%, you have to do something.
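
As a minimal sketch, the quality-work rule reduces to one subtraction and one comparison. The function names are made up for illustration, and the 90% limit is the approximate figure quoted above, not a hard constant.

```python
def quality_cpu_pct(overall_cpu_pct, discretionary_cpu_pct):
    """Q-Work: overall CPU% minus the CPU% used by discretionary work."""
    return overall_cpu_pct - discretionary_cpu_pct


def needs_action(overall_cpu_pct, discretionary_cpu_pct, limit=90.0):
    """True once quality work exceeds the rule-of-thumb limit (about 90%)."""
    return quality_cpu_pct(overall_cpu_pct, discretionary_cpu_pct) > limit


# A system at 98% busy with 15% discretionary work still has room for quality work.
print(needs_action(98.0, 15.0))
# The same 98% with only 5% discretionary work does not.
print(needs_action(98.0, 5.0))
```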

ROTs are also useful for a health check, as we shall see.



System Configuration Health Check

We live our lives with rules of thumb.

  • Honor your Father & Mother
  • Do unto others as you would have them do unto you.
  • Do unto others before they do unto you.
  • Keep your CPU%<90%
  • Don't swim soon after eating.
  • It is better to give than receive.

We live by most rules of thumb without ever questioning their veracity. In IT we need some idea of the veracity and the applicability of a ROT. One way of testing your understanding of any rule is to discover exceptions to the rule. In other words, you understand a rule by trying to break it. Logically it looks like this:

∀x Φx ≡ ~∃y ~Φy. This says: "for all x, x has property Φ" is equivalent to "there does not exist a y that does not have property Φ." In other words, to break a general rule, one has to find only one exception.

Here are some observations about Rules.

First and foremost, as Hagar observes (©Reprinted with special Permission of King Features Syndicate), there is a rule about rules: all rules have exceptions (except this one of course).

The rules that follow are guidelines which are valuable for scanning large amounts of data looking for exceptions. When we have expectations most of the time, our Conceptual Structure or Perceptual Structure ignores the expected features. Exceptions attract our attention.

Rules usually apply for ordinary situations. In IT, most rules are made for the ordinary circumstance. An acceptable response time is one defined for ordinary loading rates. In an emergency overload circumstance, the rules or expectations change.

Most rules are not crisp. If a rule claims that "tall" means being greater than or equal to 6 feet in height, what is a person who is 5.995 feet tall? Well, that's close enough too. How about 5.992 feet? And so on. Most rules are fuzzy. The "Tall Rule" has a truth curve which is a spline: the degree of truth lies in [0, 1] and is smooth across some region. Similarly for IT rules: a good DASD response time is 3 milliseconds. So are 3.01 and 3.1, to some degree.
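
A fuzzy rule of this kind can be sketched with a smooth ramp between "not true at all" and "fully true". The cubic ramp below is one simple choice of curve, and the edges of the fuzzy regions (5.8 feet, 4 ms) are assumptions picked purely for illustration.

```python
def smooth_ramp(x, lo, hi):
    """Degree of truth rising smoothly from 0 at lo to 1 at hi, clamped outside that region."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    t = (x - lo) / (hi - lo)
    return t * t * (3.0 - 2.0 * t)


def truth_tall(height_ft, fuzzy_from=5.8, crisp_at=6.0):
    """Degree to which a height counts as tall; fuzzy_from is a hypothetical lower edge."""
    return smooth_ramp(height_ft, fuzzy_from, crisp_at)


def truth_good_dasd_response(ms, good_at=3.0, bad_at=4.0):
    """Degree to which a DASD response time counts as good; bad_at is a hypothetical upper edge."""
    return 1.0 - smooth_ramp(ms, good_at, bad_at)


print(truth_tall(5.995), truth_good_dasd_response(3.01), truth_good_dasd_response(3.1))
```

A rule engine can then compare the degree of truth against its own cutoffs instead of flipping abruptly at a single threshold.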

As indicated, Rules of Thumb are useful for scanning large amounts of data in a Health Check engine. The Rule Engine scans through the performance data base.

Each category of rules looks at its appropriate data: Sysplex rules examine Sysplex data. In this simple scheme, each rule returns a value:

  • Null: Rule doesn't apply or data not available.
  • Green: data looks OK to some degree
  • Yellow: data looks suspicious and warrants review.
  • Red: The rule finds something extreme in the data. The rule doesn't like this at all!

Ideally, each rule also returns some text describing the rule and how it applies in this situation.
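
The scheme can be sketched as a rule that returns a status plus explanatory text. This is an illustrative skeleton, not the design of any actual Health Check engine: the class names and the dictionary key cpu_pct are hypothetical, and the thresholds are borrowed from the System Image Rules listed later in this article.

```python
from enum import Enum
from typing import NamedTuple


class Status(Enum):
    NULL = "null"        # rule doesn't apply or data not available
    GREEN = "green"      # data looks OK
    YELLOW = "yellow"    # suspicious, warrants review
    RED = "red"          # extreme


class Finding(NamedTuple):
    status: Status
    text: str


def system_cpu_rule(sample):
    """Example rule: Red above 98% CPU, Yellow above a saturation design point of about 90%."""
    cpu = sample.get("cpu_pct")          # hypothetical field name
    if cpu is None:
        return Finding(Status.NULL, "System CPU% not available.")
    if cpu > 98.0:
        return Finding(Status.RED, f"System CPU% is {cpu:.1f}: some work has to be hurting.")
    if cpu > 90.0:
        return Finding(Status.YELLOW, f"System CPU% is {cpu:.1f}: above the saturation design point.")
    return Finding(Status.GREEN, f"System CPU% is {cpu:.1f}.")


def run_rules(rules, sample):
    """Apply every rule to one sample and collect the findings."""
    return [rule(sample) for rule in rules]


for finding in run_rules([system_cpu_rule], {"cpu_pct": 95.0}):
    print(finding.status.value, "-", finding.text)
```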

The Rules follow.

 

Sysplex Rules

  • The Coupling Facility Utilization < 50%. The CF is required to be an extremely responsive service. In most cases another processor is spinning while waiting for an answer. The degree of delay (queuing) should be kept at an absolute minimum.
  • Lock Structure
    • Lock Contention < 2%. Lock contention means the entire application is waiting. Not very desirable. Hence the low threshold.
    • False Lock Contention < 0.1%. False lock contention requires additional processing in the CF.
  • Cache Structure
    • % of SYNC requests converted to ASYNC < 10%. Converting a synchronous request to asynchronous is expensive.
  • Subchannel Busy < X%. Since there can be a number of subchannels between a host and CF, the rule is a computed one: X is the utilization such that the probability of all subchannels being busy is < 0.1 (e.g., 25% for 2 subchannels). Remember the Erlang-C example earlier?
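
The computed subchannel threshold can be sketched with the Erlang-C formula, treating the subchannels as the servers of an M/M/c queue. That queuing model is an assumption of this sketch, and the function names are made up. The threshold is found by bisection, since the probability of all servers being busy rises steadily with utilization.

```python
def erlang_c(servers, utilization):
    """Probability that all servers are busy (an arriving request must wait) in an M/M/c queue."""
    if utilization <= 0.0:
        return 0.0
    if utilization >= 1.0:
        return 1.0
    offered = servers * utilization              # offered load in Erlangs
    blocking = 1.0                               # Erlang-B, built up by the standard recurrence
    for k in range(1, servers + 1):
        blocking = offered * blocking / (k + offered * blocking)
    return blocking / (1.0 - utilization * (1.0 - blocking))


def subchannel_busy_threshold(subchannels, max_prob=0.1):
    """Largest per-subchannel utilization keeping P(all subchannels busy) at or below max_prob."""
    lo, hi = 0.0, 1.0
    for _ in range(60):                          # bisection on a monotonic function
        mid = (lo + hi) / 2.0
        if erlang_c(subchannels, mid) <= max_prob:
            lo = mid
        else:
            hi = mid
    return lo


for n in (1, 2, 4):
    print(n, round(100.0 * subchannel_busy_threshold(n), 1))
```

For 2 subchannels this reproduces the 25% figure quoted in the rule; more subchannels allow a higher busy percentage before the rule trips.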

Balanced Systems Rules

Remember the earlier discussion of balanced resources? MIPS used, memory used, and I/O used should be in some proportion. This motivated the development of the Metric Resource Table.

The table gave us the 10th and 90th percentiles of common resource ratios. We could use these ratios to develop an overview graph.

This graph is used to identify which ratios fall out of the expected metrics range. A bar could be null too if the data is not available.

If the bar is above the 90th percentile line, it means that the value was in the top 10% of the samples reviewed. Similarly, if the bar is below the 10th percentile line, the value is in the bottom 10%. Neither is good or bad; it's a flag to examine the amount of resource available. The workload may warrant the exception.
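
The check behind each bar can be sketched as a comparison against the table's 10th and 90th percentile values. The function name and the returned strings are made up for illustration; the example band is the RIOC green range quoted in the System Image Rules later in this article.

```python
def band_flag(value, p10, p90):
    """Place a resource ratio relative to the 10th and 90th percentile lines."""
    if value is None:
        return "null"                        # data not available
    if value > p90:
        return "above 90th percentile"       # in the top 10% of samples reviewed
    if value < p10:
        return "below 10th percentile"       # in the bottom 10%
    return "within expected range"


# RIOC band taken from the System Image Rules: [1.374, 4.203].
for rioc in (0.9, 2.5, 5.0, None):
    print(rioc, "->", band_flag(rioc, 1.374, 4.203))
```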

LPAR Rules

  • Yellow: the ratio of the number of shared logical CPs to the number of shared physical CPs is, say, > 2 (see zPCR from IBM for the value). Staying below this avoids excessive LPAR overhead.
  • Yellow: the ratio of the DASD I/O Rate to MSUs Used < 30. The Low I/O LSPR Mix should be used to estimate CEC power. The estimate of CEC power can vary with the mix used, which is especially important when comparing processor models.
  • Yellow: The short CP factor < 0.6. This factor is the expected percent of time for an LCP to be dispatched on a GCP at or near 100% CEC utilization. The rule guards against sluggish response time when a partition with a short CP factor holds a resource. The factor is computed as follows:
    • (Weight*#GCPs)/#LCPs
    • Example: (0.5*3)/2 = 0.75
  • Yellow: a partition is near 100% of its specified weight. When a CEC is near 100%, the weights are enforced. With all partitions demanding their share (weight), a 25% logical CPU% could really be 100%.
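
The short CP factor calculation can be sketched directly from the formula in the list above. The function names are made up, and the weight is assumed to be expressed as a fraction of the CEC (0.5 for a 50% share), as in the worked example.

```python
def short_cp_factor(weight_fraction, gcps, lcps):
    """(Weight * #GCPs) / #LCPs, with the partition weight given as a fraction of the CEC."""
    return weight_fraction * gcps / lcps


def short_cp_yellow(weight_fraction, gcps, lcps, threshold=0.6):
    """True when the short CP factor falls below the rule's 0.6 threshold."""
    return short_cp_factor(weight_fraction, gcps, lcps) < threshold


# The example from the list: 50% weight, 3 GCPs, 2 LCPs.
print(short_cp_factor(0.5, 3, 2), short_cp_yellow(0.5, 3, 2))
# A hypothetical partition with a 20% weight and 4 LCPs on a 4-GCP CEC.
print(short_cp_factor(0.2, 4, 4), short_cp_yellow(0.2, 4, 4))
```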

System Image Rules

  • Red: System CPU% > 98%. Some work has to be hurting. Does this include the Trash Applications (discretionary work)?
  • Yellow: System exceeds Saturation Design Point (usually around 90%).
  • Yellow: There's a system CPU% sample less than 30%. This is an indicator that you might want to toss some samples in your analysis.
  • Yellow: System relative I/O content value (RIOC=(DASD_IO)/MIPS_Used) in upper or lower 10 percentile. Green range: [1.374,4.203].
  • Yellow: DASD_IO/MSUs_used < 30. This is an indicator used in the Low I/O LSPR mix.
  • Yellow: Sample Duration CV > 0.1 (CV = Coefficient of Variation = σ/µ). When selecting Peak or 90th percentile intervals, it is useful to know that the durations are not very constant.
  • Yellow: High priority (Important?) Workload QCPU% (Quality CPU) exceeds SDP. QCPU% is the workload's CPU% + the CPU% of higher priority work. Even though you may run at 100% most of the time, the important applications are well below that (below a QCPU% threshold).
  • Yellow: Low Priority Work exceeds SDP. Do you care?
  • Red: Workload identified as Single Task (multi thread) CPU% is close to exhausting a single CP (95 < CPU%*#CPs). On a 4 CP system, an overall CPU% of 25% is really 4*25% of a single CP. Look for a level area in the sample display.
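
Two of the computed rules above can be sketched in a few lines. Whether σ should be the population or the sample standard deviation is not stated in the rule, so the population form used here is an assumption, as are the function names and the sample values.

```python
import statistics


def coefficient_of_variation(values):
    """CV = sigma / mu, using the population standard deviation (an assumption)."""
    return statistics.pstdev(values) / statistics.mean(values)


def single_cp_pct(workload_cpu_pct, cps):
    """CPU% of one CP used by a single-task workload, given its CPU% of the whole system."""
    return workload_cpu_pct * cps


def single_task_red(workload_cpu_pct, cps, limit=95.0):
    """True when the workload is close to exhausting a single CP (95 < CPU% * #CPs)."""
    return single_cp_pct(workload_cpu_pct, cps) > limit


# Hypothetical sample durations in seconds: steady versus erratic intervals.
print(coefficient_of_variation([900, 900, 900, 900]) > 0.1)
print(coefficient_of_variation([600, 900, 1200]) > 0.1)
# The example from the rule: 25% of a 4 CP system is all of one CP.
print(single_cp_pct(25.0, 4), single_task_red(25.0, 4))
```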

Workload Rules

  • Capture Ratios
    • Red: System wide Capture Ratio (Sum of workload CPU : Total CPU) varies too much over samples: CV > 0.1.
    • Yellow: 0.6 < CR < 0.8. Capture ratio looks low. A workload may be missing from the input data.
    • Red: CR < 0.6. A workload is missing.
  • Yellow: CPU or I/O delay exceeds 5% of active samples. (This is reported in RMF Monitor III and saved in SMF data base.) Queuing is bad but the value may be immaterial. What's the SLO? How close are you to the SLO?
  • Yellow: Overall Peak to Average ratio >5. This is an indicator of Bursty Workload(s). Examine samples with care before selection for modeling.
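
The capture ratio thresholds can be sketched as follows. The function names and the workload figures are made up. The rules do not say how to treat a ratio of exactly 0.6 or what to return at 0.8 and above, so counting 0.6 as yellow and 0.8 or more as green are assumptions of this sketch.

```python
def capture_ratio(workload_cpu, total_cpu):
    """Sum of the CPU attributed to workloads divided by the total CPU measured."""
    return sum(workload_cpu) / total_cpu


def capture_ratio_flag(cr):
    """Red below 0.6, yellow from 0.6 up to 0.8, green otherwise (boundary handling assumed)."""
    if cr < 0.6:
        return "red"        # a workload is missing
    if cr < 0.8:
        return "yellow"     # looks low; a workload may be missing from the input data
    return "green"


# Hypothetical CPU seconds for three workloads against 100 CPU seconds measured in total.
cr = capture_ratio([30, 25, 20], 100)
print(cr, capture_ratio_flag(cr))
```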

Channel Path Rules

  • Yellow: A path (CHPID) exceeded 45% busy.
  • Yellow: Bus Busy (FICON) exceeded 50% (prior to FICON Native Express 2). After that it's 35%, since it is unlikely to get near 50%.
  • Active AOEs (Understanding FICON Channel Path Metrics, Artis & Ross, at www.perfassoc.com), computed as CHPID_Rate*(DISC+CONN)
    • Yellow: 3<#AOEs<5
    • Red: 6< #AOEs
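
The AOE expression has the shape of Little's Law: the average number in service equals the arrival rate times the time in service. A minimal sketch follows, assuming the CHPID rate is in I/Os per second and the disconnect and connect times are in milliseconds. Those units are assumptions of the sketch, not something the rule states, and the function name and sample figures are made up.

```python
def active_aoes(chpid_rate_per_sec, disc_ms, conn_ms):
    """CHPID_Rate * (DISC + CONN), with the times converted from milliseconds to seconds."""
    return chpid_rate_per_sec * (disc_ms + conn_ms) / 1000.0


# Hypothetical channel path: 2000 I/Os per second, 1.0 ms disconnect, 1.0 ms connect.
print(active_aoes(2000.0, 1.0, 1.0))
```

The result can then be compared against the Yellow and Red ranges listed above.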

DASD Rules

  • Yellow: Response time for non-trivial (rate>2) is > 5 ms.
  • Yellow: Service time is > 3 ms.
  • Yellow: Response/Service > 1.5
  • Yellow: Contention Intensity > 100 ms.
  • Intensity
    • Yellow:  >200
    • Red: >500
  • Read Hit Ratio (CU) < 0.8
  • Cache residency time = CS / (IO*MR*56), where CS is cache size in bytes, IO is the total I/O rate, MR is the miss ratio, and 56 is the track size in KB. (I/O Subsystem Configurations for ESA, by Bruce McNutt, IBM Systems Journal, Vol. 32, No. 2, 1993)
    • Yellow: Residency < 180
    • Red: Residency < 30
  • Fast Write Bypass%. Indicates NVS full.
    • Yellow: % > 1
    • Red: % > 5
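
The cache residency rule can be sketched as below, with some loud assumptions. The cache size and the track size must be in the same unit for the division to make sense, so this sketch converts a cache size in bytes to KB before dividing by the 56 KB track size. It also assumes the I/O rate is per second and that the residency thresholds are in seconds. The function names and sample figures are made up.

```python
TRACK_KB = 56.0    # track size in KB, as given in the rule


def cache_residency_seconds(cache_bytes, io_rate_per_sec, miss_ratio):
    """CS / (IO * MR * 56), with CS converted from bytes to KB (unit handling is an assumption)."""
    cache_kb = cache_bytes / 1024.0
    return cache_kb / (io_rate_per_sec * miss_ratio * TRACK_KB)


def residency_flag(seconds):
    """Red below 30, yellow below 180, otherwise green (the rule names only yellow and red)."""
    if seconds < 30.0:
        return "red"
    if seconds < 180.0:
        return "yellow"
    return "green"


# Hypothetical controller: a cache holding 1000 tracks, 100 I/Os per second, 10% misses.
residency = cache_residency_seconds(1000 * 56 * 1024, 100.0, 0.1)
print(round(residency, 1), residency_flag(residency))
```

Each miss stages one track into the cache, so the residency is roughly how long a track survives before the misses cycle it out.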

Tape Rules

These rules sound intuitive. But they come from someone who knows more about DASD than tape. So caveat emptor.

  • Disconnect time : Connect time > 2 ms. Looks like the control unit is overloaded.
  • Connect time < 5 ms. An indicator of small blocks. A tape drive doesn't like stop and go traffic.
  • Mount Delay > 30 seconds. You need a robot or a faster operator.
  • #Addresses Allocated : # Addresses. Tape drives are a one user device.
    • Yellow: 50%
    • Red: 75%

Central Storage Rule

Since 64 bit is relatively new, here's the only CS rule I found.

  • 64 Bit: keep CS Available frame queue > 400

Summary of methodologies

In summary, ROTs can be very useful, but keep in mind that ROTs may not be adequate for some detailed questions. If you want to compare two processors that have the same total MIPS, but one has 4 CPs and the other 2 CPs, the answer is beyond most ROTs. The rules may say that a DASD threshold has been exceeded, but they do not tell you the impact on transaction response time if the DASD response time is reduced. An LPAR rule may look at the configuration, but it cannot assess the impact on CICS performance. For questions such as these, more sophisticated techniques are indicated.

Trending is a relatively simple technique in concept. Trending does predict the future if the future looks like the past. But no technique guarantees that unforeseen events won't grab you. Time Series Trending is a significant step up but it is rather complicated. Trending can answer overall CP questions such as when we will need more resource and how much.

 

Analytic Modeling usually requires pre-built packages. The technique can be fast to solve and relatively easy to use. Textbook models are often too simple to use. The flow among service centers is statistically driven and usually predefined. How accurate are analytic techniques? They usually predict utilization within 5% and response times within 30%. Data acquisition is key. That's why so much time was spent above in sample selection and data gathering. Calibration (getting the model to agree with reality) can be tough. Custom analytic models are really tough, usually requiring a technical staff to both use them and understand them. Fortunately, services are available.

Simulation solutions also involve pre-built packages. They are slower to solve and can be relatively easy to use. Like analytic models, flow is statistically driven and usually predefined but can be customized. How accurate are they? Utilization is usually within 5% and response times within 30%. Also data acquisition is key and calibration can be tough. Custom models are built from service center building blocks. One does not write a simulation from scratch. Simulation languages do exist. They do require special skills. And as always, services exist.

The Cadillac of tools is the benchmark. That's where you run, in miniature, an actual system. It does take a lot of work in preparation. You (or someone) provide the hardware and software to run your workload. You provide the data bases and other input data required. It does mimic the running environment the best. You measure actual software flow and queuing and software usage. It is expensive. This cost factor usually limits the variations you run. However, unlike any other technique, a benchmark does test the environment. You find out if it even works. An analytic or simulation technique may show that what you modeled may work. You do not verify that what is modeled is what will actually be running.

The choice of technique comes down to a few factors.

  • What questions do you have? Simple questions may be resolved quickly with simpler and cheaper techniques.
  • What questions must you answer? Just because you have received a list of questions to answer doesn't mean you have to answer them all. Some are strategic and others are nice to know.
  • How much does it cost to get an answer? This often is enough to decide what one will work on.
  • What is the cost of getting it wrong? Years ago, a processor was a capital investment of huge proportions that was kept on the books for 10 years.
  • What's the time line you are working under? How much time do you have to prepare an answer?
  • What happens if you get it wrong? What's the business impact?

Skills

What is required to do PA and CP? Knowledge of the hardware architecture is required. Knowing the processor alternatives enables you to assess the impact on performance, capacity, availability, and flexibility. Without it, you are at the mercy of any vendor. For the I/O subsystem, you have to be familiar with the architecture of paths, DASD, Tape, etc. How about Networks? Well, those people speak a language all their own.

For the software, you have to be very familiar with the control programs such as z/OS. You have to know the major applications you are dealing with such as CICS, IMS, DB2, etc. Knowledge of the software (and hardware) means that you are familiar with the measurements and the meaning of the measurement points.

Once you have the data, the next level skill is embodied in Modeling & Statistics. You have to have some rudimentary skills here even if you use pre built packages. Using the tools to get output is one thing. Getting meaningful output is quite another.

Organization

Briefly, a good organization structure is a blessing for a capacity planner. On the left, the PA and CP organization is closest to the troops on the ground. This is great for communication with that group. However, in many cases, the PA and CP team has to exert influence upon many departments at many levels in an organization. That's why the position of the CP function is better placed as shown on the right. It is close to the business modelers and close enough to the CIO that she can be used as a big stick to intimidate the lower echelon. The development and test teams do not get measured on the things they do for other groups. They need motivation.

Overall Summary

View your data regularly. It is only in this regular report that you can develop a PS. You will begin to see the pattern. Once you have an expectation, you can identify any abnormalities.

You should watch carefully for any configuration changes both hardware and software. Changes in the configurations can change the overall behavior in a dramatic fashion. Monitor changes in software applications. In many installations, the application teams are responsible for their own software. New levels in Applications can have far reaching effects.

What are your overall metrics? If you developed your own metric ratios, where would you be? By tracking these metrics you can detect subtle changes in the system and system usage.

For the processor, make sure that the Priorities/Policies are properly set. Times do change. You should be able to match the policies with expected averages and distributions. Do you have any high CPU Delays? Are you sure about the pattern of business unit activities? In LPAR, you can't focus on a single partition to the neglect of the rest.

For storage, just make sure you have enough. You can keep track of who's using it and how much is being used. Of course you don't want any paging. In a 64 bit environment, keep track of the Available Frames.

For the I/O subsystem, monitor the usage skew (check Relative Intensity) by both controller and actuator. Look for high I/O delay. The important I/O metrics are the IOSQ (check RT/ST), Pend time (shared DASD), the disconnect time (trouble), and the connect time. A good metric is the Intensity. Where possible, use the DFSMS facility to monitor key data set usage.

Keep in mind a few key principles. When using any rule or technique, be aware of the exceptions to the rule and places where the technique may not be valid. A conceptual or perceptual structure helps but it can make you see things that just aren't there.

Impeccable mathematics does not replace knowledge of the facts. The graph may look good and yet be very misleading.

Protect yourself at all times. They may just be out to get you.

Business decisions can override technical issues. Your technical case may be sound and yet not match the business model. It helps to know both.

Sometimes being understood is more important than being very accurate. If you have to explain something to another person, being wrapped in a very technical mantle may not be an asset. Being "very" accurate may be a luxury of the idle. It just takes time to answer some questions in great detail.

Other than the technicalities, there may be a hidden agenda. After all, people are people.

Bibliography

 

The Art of Computer Systems Performance Analysis, by Raj Jain, Wiley. I like this one. It is thorough and complete.  A very good reference. It may be hard to find.

Capacity Planning for Web Performance, by Daniel A. Menasce and Virgilio A.F. Almeida, Prentice Hall. A good book on network structure and terminology and introduction to the topic.

Probability, Statistics, and Queuing Theory, by Arnold O. Allen, Academic Press Inc. This is the classic in queuing theory.

Performance by Design: Computer Capacity Planning by Example, by Daniel A. Menascé, Virgilio A. F. Almeida, and L. W. Dowdy. The web site http://cs.gmu.edu/~menasce/perfbyd/ has a lot of .xls modeling worksheets.

MVS I/O Subsystems, by Gilbert E. Houtekamer and H. Pat Artis, McGraw-Hill. More than you want to know about the I/O subsystem. A definitive source. Available only online at www.perfassoc.com.

Exploring IBM S/390 Computers, by Jim Hoskins and George Coleman, Maximum Press. A general introduction to S/390 hardware and architecture. (with IBM G326-3006-06)

Statistical Concepts and Methods, by Gouri Bhattacharyya and Richard A. Johnson, John Wiley & Sons.

The Practical Performance Analyst, by Neil J. Gunther, Authors Choice Press. A very good book.

On Demand Computing: Technologies and Strategies, Craig Fellenstein, IBM Press. A good introduction to On Demand architectures.

IBM Manuals

GC28-1761   MVS™ Planning: Workload Management. A guide to WLM.

SC28-1950  Resource Measurement Facility Report Analysis.  A guide to report reading.

SC28-1951  Resource Measurement Facility Performance Management Guide.  A good tutorial  to get started.

SG24-5975   IBM zSeries 900 Technical Guide. A good hardware architecture and implementation Red Book.

LY28-1042   RMF™ Support for LPAR Management Time. Want to know how LPAR works?

SC28-1187   Large Systems Performance Reference by John Fitch.  John goes into detail about the LSPR data.

SG24-4356   System/390® MVS Parallel Sysplex Performance.  A good Red Book on Parallel Sysplex RMF reports and data.

SG24-4680   System/390 MVS Parallel Sysplex Capacity Planning.  A good Red Book on the function and capacity of Parallel Sysplex.

EXCEL:

Applied Statistics For Engineers and Scientists Using Excel and MINITAB, by David Levine, Patricia Ramsey, Robert Smidt, Prentice Hall. This comes with a CD containing handy Excel Add-Ins.

Excel Data Analysis, by Jinjier Simon, Wiley. Nice basic reference concentrating on data presentation.

Tools

  • Gatherers/Monitors
    • RMF from IBM & CMF from Boole & Babbage
    • RMF SpreadSheet Reporter. This processes machine readable RMF reports and translates the text into Excel worksheets.
    • Tivoli/Omegamon Monitors
  • Models
    • BEST/1
    • PERFMAN
    • Tivoli Performance Simulator (zTPM)
  • Stat Packages
    • EXCEL (+add-ins)
    • 1-2-3
    • SAS
    • Mathematica
    • MATLAB
    • APL's AGSS
  • Vendor Services & Internal Pre-Sales Tools (IBM's zPCR, zCP3000)

Organizations - CMG

Almost any volume of the Computer Measurement Group (CMG) Proceedings is worth looking at for performance and capacity planning articles. Web Site: http://www.cmg.org/

Notices

This material is copyright by Ray Wicks 2007-2008.

Selected material reproduced by permission of IBM Corporation.

Many terms are trademarks of different companies and are owned by them.

Thanks to Bernice Riley and Alvero Salla for looking at this and making technical and editorial corrections.