MeasureIT

Uncaptured CPU Overheads, SRM / RMPTTOM, and evolution to the IBM System z9 EC Processor
A Site Experience (Applicable to Many Others)
February, 2007
by Geoff Adams

About the Author
Geoff Adams, National Australia Bank

Geoff Adams is a Senior Capacity Planner at NAB, with over 30 years of IT experience. Geoff has performed a variety of roles over this period in Government, Insurance, Telecommunications and Financial organisations, from the early years as a Computer Operator, then moving to an Applications Programmer/Analyst; through to CICS systems programming, and Capacity and Performance management of large IBM mainframes. Geoff has past achievements and contributions in computer performance evaluation, winning the CMG Australia annual conference’s best paper in August 2004, and later that year presenting at the CMG event in Las Vegas.


1. Introduction

This article details our site's experiences when implementing the new IBM System z9 EC 2094 model processors. The installation process started in late June 2006 and although tuning the new configuration is still a work in progress, we believe others can benefit from our experiences thus far.

To set the scene - a large variation in the "Uncaptured" CPU component was detected, with an associated fall in the capture ratio.

The scale of this rise in Uncaptured was considered to be out of proportion with the increase in the measurable workloads.

We duly followed this up with IBM via the normal channels for this kind of enquiry and then escalated it to ensure it received the attention that we believed that it deserved. It was recognised that an incorrect value existed in the SRM tables for z/OS 1.7 for the z9 EC processor. Further investigations highlighted an opportunity to enhance SRM to reduce the frequency of some non-critical algorithms which were currently contributing to higher than necessary overheads.

We believe that these findings can be of benefit to most IBM mainframe sites in the short term by adjusting a parameter called RMPTTOM, and in the longer term when IBM make the proposed SRM enhancements available. IBM announcements have occurred making initial recommendations and providing the necessary technical guidance, and we expect that these will continue as enhancements become available.

The information presented in this paper is based on the measurements taken that we believed were appropriate for our configuration and workload characteristics. It may not be representative for other sites' circumstances. As always, Your Mileage May Vary (YMMV). Trademarks and normal disclaimers apply.

I would like to thank IBM for addressing the issue for our site and their contributions to this article.

Other sites will receive the benefits of our experience if they choose to follow the revised IBM recommendations through adopting any proposed enhancements into their own environments, after thorough testing, of course.

1.1 Background

Our site upgraded two of its IBM zSeries z900 model processors to the new System z9 EC technology during June and August 2006 as part of a refresh project, driven by the timing of lease expirations.

Performance benchmark workloads performed within expected tolerances; however, it was felt that the Uncaptured CPU overheads had increased out of proportion to the workload. In other words, the Capture Ratio reduced by several percent. Generally, performance benchmarks would only apply to the measurable workloads, e.g. CICS, IMS, TSO & Batch.

IBM was notified via a formal ETR and data/diagnostics were provided to assist investigations into the root cause.

1.2 Overview of Uncaptured

It is extremely difficult to understand what drives increases in Uncaptured CPU as there are no tools to measure what is essentially the UNMEASURED workload. This is an area of ongoing challenge for those in the Capacity and Performance domain, as reasons are not generally known for variations in Uncaptured usage levels. Moreover there is generally no-one to ask; you are on your own!

The largest increases appeared to be a function of the number of address spaces running; that is, the larger the number of tasks, the higher the Uncaptured CPU overhead. We had 2 LPARs for development that had a very high number of tasks, with one having 1,376 tasks for on-line testing alone. The capture ratio on the latter reduced from percentages in the low 80's to the low 70's. Benchmark workloads would, I expect, only run a limited number of tasks; therefore it was our extreme workload in this environment that helped in identifying and isolating the root cause.

This change in the capture ratio was to become a pressing concern during September for the production processor. When we applied the uplifted numbers to the capacity forecast model using the previous month's metrics it impacted our predicted upgrade schedules. The new forecast had us bringing the next production upgrade forward to before Christmas, along with the requisite financial pain this unplanned expense would cause. Bringing these facts to management's attention caused a renewed focus on the matter.
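To illustrate why a fall in the capture ratio moves a forecast so much, here is a minimal sketch in Python. It assumes the captured (measurable) workload stays constant and uses 82% and 72% purely as stand-ins for "low 80's" and "low 70's"; the 1,000 MIPS figure and the function name are hypothetical, not values from our site.

```python
def total_cpu_for_captured(captured_mips: float, capture_ratio_pct: float) -> float:
    """Total CPU (captured + uncaptured) implied by a capture ratio in percent."""
    if not 0 < capture_ratio_pct <= 100:
        raise ValueError("capture ratio must be in (0, 100]")
    return captured_mips / (capture_ratio_pct / 100.0)


# Illustrative values only: 82% and 72% stand in for "low 80's" and "low 70's".
captured = 1000.0  # hypothetical captured workload in MIPS
before = total_cpu_for_captured(captured, 82.0)
after = total_cpu_for_captured(captured, 72.0)
uplift_pct = (after / before - 1.0) * 100.0
print(f"total before: {before:.0f} MIPS, after: {after:.0f} MIPS, uplift: {uplift_pct:.1f}%")
```

Under these assumptions the same measurable workload needs roughly 14% more total CPU, which is the kind of uplift that pulls an upgrade date forward.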

An IBM z/OS performance team member suggested we trial changes to the SRM invocation constant, the RMPTTOM parameter in SYS1.PARMLIB(IEAOPTxx), and this had an immediate impact in reducing CPU on the systems programmer test LPAR.

IBM advised there were two issues:

  1. The SRM table had an incorrect value for the z9 EC processor which caused extra interrupts and produced unnecessary overhead.
  2. The SRM design was developed in the 1970s. As processors have become faster and the number of address spaces summed for all logical partitions has increased, the design has become less than optimal. Several functions do not need to be processed as often as the current implementation allows, while other functions are time sensitive and still need to be processed in a timely fashion on a production system. The fix for APAR OA18452 is planned to provide design improvements that change the frequency of some less critical SRM functions to reduce uncaptured time. It is expected this fix will enable the default value for RMPTTOM to provide a reasonable SRM processing cost for z/OS production images.

This link provides an explanation of Uncaptured CPU in MXG and SAS language terms

2. The Way Forward

After reviewing trace data IBM suggested in a teleconference that we could reduce overheads with no risk for the z9 EC models by doubling the RMPTTOM parameter to 2000 (default is 1000). This removes the impact caused by the bug whereby an incorrect value was set for the z9 EC within SRM.

The above has now been the subject of an announcement delivered by the Washington Systems Centre, WSC FLASH 10526, in which "IBM recommend all customers running on zSeries and System z processors rated at a per CP speed greater than 100 MIPS update their RMPTTOM setting to 2000."

The IBM proposed enhancements to SRM (reference: APAR OA18452), as mentioned in the above FLASH, will take some time to develop and test. They address the frequency of the non-critical algorithms that are currently called each time SRM is invoked. A joint or separate announcement would be made for this performance improvement, with availability likely in early 2007.

Because changing the RMPTTOM parameter is not without risk, and the effect is entirely workload dependent, it is recommended that you proceed with caution when trialling increases to the parameter.

The RMPTTOM parameter resides in the IEAOPTxx member of SYS1.PARMLIB, for example RMPTTOM=2000. You can change RMPTTOM dynamically via the SET OPT=xx operator command.

This link provides the RMPTTOM parameter definition

2.1 Trials and impacts of RMPTTOM settings

In general the optimal setting for RMPTTOM is workload dependent; in particular, the SRM overhead attributed to the Uncaptured CPU level is a function of the number of tasks on the LPAR.

We have followed IBM's recommendations:

"Installations may choose to experiment with larger RMPTTOM values in the 3,000-10,000 range for production LPARs running on processors with a per CP speed above 100 MIPS. In most cases the risk of making these changes is low and the primary impact would be the accuracy of period switch and ability to meet goals. It is an installation's responsibility to evaluate the benefits or impacts of such a change. In any case IBM recommends values above 10,000 should be avoided for production LPARs.

For non-production LPARs the installation may decide higher values may be acceptable due to the nature and importance of the workload on the LPAR. It should be remembered higher values can impact responsiveness and system efficiency. Values above 20,000 for a z9 should be avoided since the SRM cost is reduced considerably at 20,000 and one probably does not want to risk poor resource management. A setting of 20,000 allows an SRM timer interrupt every 28 milliseconds or about every 17 million instructions (of a single CP) on a z9. One should use lower limits for z990 and z900 based on the respective processing power, 15,000 for the z990 and 10,000 for the z900 to enable SRM to manage resources at the same 28 millisecond interval (approximately)."
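The figures in the quotation above can be cross-checked with a little arithmetic. The sketch below is in Python and uses only the numbers IBM quotes (28 milliseconds, 17 million instructions, and the 20,000 / 15,000 / 10,000 ceilings). The per-unit figure assumes the interval scales linearly with RMPTTOM, which is my simplification; the real conversion comes from an internal SRM table and is subject to rounding.

```python
# Figures quoted by IBM for a z9 at RMPTTOM=20,000.
INTERVAL_MS = 28.0
INSTRUCTIONS_PER_INTERVAL = 17_000_000

# SRM timer interrupts per second at a 28 ms interval.
interrupts_per_second = 1000.0 / INTERVAL_MS

# Single-CP speed implied by 17 million instructions every 28 ms.
implied_mips_per_cp = INSTRUCTIONS_PER_INTERVAL / (INTERVAL_MS / 1000.0) / 1_000_000

# RMPTTOM ceilings quoted for roughly the same 28 ms interval on each family.
ceilings = {"z9": 20_000, "z990": 15_000, "z900": 10_000}

# ASSUMPTION: linear scaling, so this is microseconds of real time per RMPTTOM unit.
us_per_unit = {cpu: INTERVAL_MS * 1000.0 / value for cpu, value in ceilings.items()}

print(f"{interrupts_per_second:.1f} interrupts/sec, ~{implied_mips_per_cp:.0f} MIPS per CP")
for cpu, micros in us_per_unit.items():
    print(f"{cpu}: {micros:.2f} microseconds per RMPTTOM unit")
```

A 28 ms interval works out at about 36 SRM invocations per second, and the instruction count implies a single CP of roughly 600 MIPS.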

Although we have trialled values above 20,000 in our non-production LPARs without an issue, this is not recommended, as indicated above. Furthermore, a long term resolution is likely to make the default appropriate again, so leaving raised values in place may introduce a risk if they are not re-evaluated at that time.

2.2 Summary of trials to date

For our site the following table and charts represent a summary of our trials to date:

Table 1: Prime Shift averages of selected LPAR metrics for RMPTTOM trials

Quite a dramatic drop occurred once we trialled the change to the RMPTTOM parameter in our Systems Programmer test LPAR:

Chart 1: SYSP-T LPAR trends of metrics of interest - Prime Shift Averages

Subsequently the other development and production LPARs had RMPTTOM changed (14 LPARs across 3 processors, 2 x z9, 1 x z990), some of which are shown in the charts below as well:

Chart 2: DEVT-A LPAR trends of metrics of interest

Chart 3: DEVT-B LPAR trends of metrics of interest

Chart 4: PROD-B LPAR trends of metrics of interest

2.3 Number of Tasks

Our Test LPARs mirror our production environment running duplicate sets of the On-line Production tasks. These are CICS, IMS and a particular banking application. By definition these are not swappable as they utilise cross memory services. The unique nature of these banking application test environments with their high usage of address spaces is the reason why the overhead was more apparent at our site than perhaps others. Large numbers of TSO and/or Batch do not influence the overhead to the same degree as a large percentage of TSO users can be swapped out at any one time when not active, and in the case of batch the number of concurrent jobs can be limited by the number of initiators combined with various wait conditions causing swap out.

As SRM processes the dispatching chain, the longer this chain is, the more overhead is generated, and this overhead is then allocated to the Uncaptured CPU bucket.

I have developed a method of tracking the number of tasks (or address spaces) in an LPAR that may assist others by using the RMFINTRV member in MXG. It is quite useful to track this as it is a means of understanding variations to Uncaptured CPU overheads. It is best to average a range of intervals for comparison purposes. I use Prime Shift (08:00 - 17:00) primarily, and our peak hour (11:00 - 12:00).

Depending on your own level of customisation to IMACWORK, it is simply a matter of summing the system (MVSACTV) and system STC / STC (****ACTV) variables; in our case we also break down the Online tasks into prod and test. I then add in TSOMAX, and for Batch (Prod and User) I allow for 20% of the interval by multiplying the submission (****TRAN) rate by 180 (we use 15 minute / 900 second intervals, hence 20% of 900 = 180 in the example below). The reason for this is that Batch jobs are likely to start and finish, or wait for resources, within an interval, resulting in less overall SRM overhead than a non-swappable STC.

An example of the SAS code:

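    /* Started tasks and online regions (workload names are site-specific) */
    TOTSTC = SUM(MVSACTV,SYSSACTV,ONLPACTV,ONLTACTV,STCXACTV,MVACTV,0);
    /* Batch: submission rate x 180 (20% of the 900 second interval) */
    IF PBATTRAN GT 0 THEN PBATTSK = PBATTRAN * 180; ELSE PBATTSK = 0;
    IF UBATTRAN GT 0 THEN UBATTSK = UBATTRAN * 180; ELSE UBATTSK = 0;
    /* TSO: TSOMAX for the interval */
    IF TSOMAX GT 0 THEN TSOTSK = TSOMAX; ELSE TSOTSK = 0;
    /* Total tasks and capture ratio (percent) */
    TTSK = SUM(TOTSTC,PBATTSK,UBATTSK,TSOTSK,0);
    CAPRATIO = (1-(UNCPMIPS / TOTLMIPS))*100;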

Then you can track the 3 variables (UNCPMIPS, CAPRATIO and TTSK) for your period by date as in the graphs shown above.
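For readers without MXG to hand, the same calculation can be tried outside SAS. The following is a minimal Python sketch of the logic in the SAS fragment, not a replacement for it. All input values are hypothetical and the function names are my own; the only figures taken from the method above are the 900 second interval and the 20% batch allowance.

```python
from typing import Iterable, Optional


def sas_sum(values: Iterable[Optional[float]]) -> float:
    """Mimic SAS SUM(..., 0): missing values (None here) are ignored."""
    return float(sum(v for v in values if v is not None))


def batch_tasks(submission_rate: Optional[float],
                interval_seconds: int = 900,
                fraction: float = 0.20) -> float:
    """Batch allowance: submission rate x 20% of the interval (180 for 900 s)."""
    if submission_rate is None or submission_rate <= 0:
        return 0.0
    return submission_rate * interval_seconds * fraction


def capture_ratio(uncaptured_mips: float, total_mips: float) -> float:
    """Capture ratio in percent, as in the CAPRATIO statement."""
    return (1 - uncaptured_mips / total_mips) * 100


# Hypothetical interval values, for illustration only.
started_tasks = [120, 45, 300, 900, 60, None]  # one count is missing
tso_max = 150
total_stc = sas_sum(started_tasks)
total_tasks = sas_sum([total_stc, batch_tasks(0.05), batch_tasks(0.02), tso_max])
print(f"tasks: {total_tasks:.1f}, capture ratio: {capture_ratio(280.0, 1000.0):.1f}%")
```

The batch term is deliberately discounted: a job that starts and ends inside the interval counts for far less than a non-swappable started task that is present throughout.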

2.4 How to determine the SRM invocation interval

The value that you have set for RMPTTOM is converted for the processor that you run on from an internal table. You can derive the actual SRM invocation interval for each of your LPARs by several means:

  • By using your real-time monitoring facilities where they can display this field. We use Mainview so this is available by navigating to the SRMOPT screen.
  • Another way, which most probably emulates the above, is to use Mainview (or IPCS, Omegamon or another tool) to follow the control block chain. The CVT at offset 25Cx contains the pointer to the Resource Manager Control Table (RMCT). Offset 10x contains the pointer to the RMPT. RMPTTOM at offset 1Cx (4 bytes) is the SRM second in microseconds (which needs conversion from hex to decimal). Note: The statement above is true for z/OS 1.7 and above. Prior to z/OS 1.7, the value is in units of 1.024 milliseconds.
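The hex to decimal conversion mentioned in the second method is simple but easy to get wrong by hand. Here is a minimal Python sketch of it. The function name is hypothetical, and the sample value 00006D60 is not taken from a real system; it was chosen because it converts to the 28 millisecond interval quoted by IBM in section 2.1.

```python
def srm_interval_ms(rmpttom_hex: str, zos_1_7_or_later: bool = True) -> float:
    """Convert the 4-byte RMPTTOM field (as a hex string) to milliseconds.

    z/OS 1.7 and above: the field is in microseconds.
    Prior to z/OS 1.7:  the field is in units of 1.024 milliseconds.
    """
    raw = int(rmpttom_hex, 16)
    if zos_1_7_or_later:
        return raw / 1000.0
    return raw * 1.024


# Hypothetical field contents, for illustration only.
interval_ms = srm_interval_ms("00006D60")
rate_per_second = 1000.0 / interval_ms
print(f"interval: {interval_ms:.1f} ms, rate: {rate_per_second:.1f}/sec")
```

The interval and the rate per second are the two figures worth recording for each LPAR when comparing RMPTTOM settings.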

Note: There is rounding occurring when you use small incremental values for RMPTTOM. That is, when changing from say 2500 to 3000, or 4000 to 4500, for the z9 EC there will be no change to the derived actual SRM invocation interval in place. It is very important to be aware of this to avoid any confusion when assessing impacts to Uncaptured CPU after changing RMPTTOM. In my opinion you are best to make your changes in 1000's for RMPTTOM.

Table 2: SRM table for RMPTTOM setting z9 EC & z990 processors- invocation interval : rate/sec

Note : The values for the z9 EC above are the subject of a PTF fix and are provided as a guide.

Chart 5: SRM Chart for RMPTTOM setting z9 EC and z990 processors- invocation interval : rate/sec

3. Performance Benchmarks

As mentioned, Performance Benchmarks, including IBM's LSPR, would only cover the measurable workloads.

I have undertaken formal and informal Performance Benchmarks over the last 9 years of IBM (and previously IBM compatible vendors) mainframe processors and during this period there has been a rapid increase in speed of processor technology. During 1984 I also benchmarked another large scale computer architecture in a technology refresh circumstance.

I have used my own custom MXG/SAS analysis for this purpose in the IBM environments.

This particular experience may drive some improvements in methods used in LSPR benchmarks to better emulate commercial IBM mainframe sites.

Given the findings in this paper, Performance Benchmarks should in future include reference to the Uncaptured CPU component, as this makes up a part of the workload that is site specific.

4. Summary

Tracking all mainframe workloads diligently, including Uncaptured, and questioning appropriate systems support teams or other resources of each variance to try to determine a reason, can and does pay off with measurable cost savings and performance benefits.

As there can be a high degree of variability with capture ratios, I would recommend data be analyzed over a large period, dating back multiple months. Before one concludes that a significant change to Uncaptured has indeed occurred, determine if the impact is sustained beyond several days prior to instigating any follow up actions.
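One way to apply this "sustained change" test is sketched below in Python. It compares the most recent days against a baseline average taken from the preceding period. The 60 day baseline, 5 day run and 3 point tolerance are hypothetical defaults chosen for illustration, not values we used, and the function name is my own.

```python
from statistics import mean
from typing import Sequence


def sustained_drop(daily_capture_ratio: Sequence[float],
                   baseline_days: int = 60,
                   min_run_days: int = 5,
                   tolerance_pct: float = 3.0) -> bool:
    """True if every one of the last min_run_days values sits more than
    tolerance_pct points below the mean of the preceding baseline_days."""
    needed = baseline_days + min_run_days
    if len(daily_capture_ratio) < needed:
        raise ValueError(f"need at least {needed} daily values")
    baseline = mean(daily_capture_ratio[-needed:-min_run_days])
    recent = daily_capture_ratio[-min_run_days:]
    return all(value < baseline - tolerance_pct for value in recent)


# Synthetic data for illustration: a lasting fall versus a one-day dip.
lasting = [82.0] * 60 + [75.0] * 5
one_day_dip = [82.0] * 60 + [82.0, 75.0, 82.0, 81.0, 82.0]
print(sustained_drop(lasting), sustained_drop(one_day_dip))
```

A single bad day does not trigger the check, which keeps normal day to day variability in the capture ratio from prompting follow up actions.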

It is very worthwhile to establish reporting to track the number of tasks in each LPAR in tandem with Uncaptured CPU and the Capture Ratio. This gives you a workload profile of the LPAR, a baseline for the level of Uncaptured CPU and the Capture Ratio and allows variations to the number of tasks that influence SRM overheads to be monitored. When trialling changes to RMPTTOM this reporting can deliver improved knowledge to base decisions on, as well as understanding the impact of processor upgrades, other configuration changes (eg. PR/SM), and changes to the level of z/OS and other system software.

Despite the difficulties involved in influencing reductions in overheads, I track Uncaptured CPU usage with some diligence as in rare circumstances a change can be identified as a culprit via a process of elimination and confirmed by other sources.

In this particular case, with IBM's assistance, to date we have reduced Uncaptured CPU by over 6% of the total site mainframe CPU utilization. This may reach 7% after further planned RMPTTOM changes. By any assessment this is substantial and will contribute to delaying future processor upgrades, and in the short term eliminate the need for an un-scheduled upgrade prior to the Christmas peak.

We are now in a much improved position in terms of understanding the influences to the overheads being attributed to Uncaptured and the benefits are being realised.

You can "Get MIPS Back" too, by following the IBM recommendations for RMPTTOM. IBM's planned enhancements to z/OS SRM will deliver an improved capture ratio for the longer term. Your own gains may not be as substantial; however, they will be positive.

This is all good news!

5. References

IBM Link : SRM APAR OA18452
WSC Flash : 10526

MXG Listserver archive: http://peach.ease.lsoft.com/archives/mxg-l.html search on RMPTTOM

MXG Source Library dataset on mainframe LPAR: <HLQ17>.<HLQ2>.MXG.SOURCLIB
MXG website: http://www.mxg.com/

IBM Manual: z/OS MVS Initialization and Tuning Guide. (requires IBM userid and password)

Author Contact Email:

LPAR trend Charts showing Uncaptured increase from the z9 EC install on 25-Jun-2006 for DEVT processor, and 06-Aug-2006 for PROD, and subsequent trials of the RMPTTOM parameter for a reduced level of SRM overhead from 27-Sep-2006. Refer to the grey area second up from the horizontal axis.

Appendix I.

Chart 6: DEVT-A LPAR Prime Shift Trends - Workload breakdown

Chart 7: SYSP-T LPAR Prime Shift Trends - Workload breakdown

Chart 8: DEVT-B LPAR Prime Shift Trends - Workload breakdown

Chart 9: PROD-B LPAR Prime Shift Trends - Workload breakdown

 

Last Updated 06/05/09



Computer Measurement Group