Using Principal Component Analysis to Analyze Data and Generate Forecasts

June, 2007
by Rick Isom

About the Author
Rick Isom

Rick Isom is a capacity planner with the capacity planning group at Nationwide Insurance in Columbus, OH. He has over 35 years experience in computer operations, computer performance reporting, and capacity planning. He has published articles in Mainframe Journal, Enterprise Systems Journal, and the Journal of Computer Resource Management. He has presented several times at CMG's yearly conference. He can be reached at 614-249-6580, or .

Introduction

Many corporations have customers spread across the entire country. Analyzing and forecasting large and complex corporate applications containing vast amounts of data present an ongoing challenge for a capacity planning group. For example, analysts of large and complex DB2 applications must simultaneously track several user access factors that affect CPU usage in different but significant ways. For instance, both the type and time of user access can be highly unpredictable and, therefore, erratically impact the application's CPU usage.

To illustrate the interwoven nature of a growing corporate IT infrastructure, some DB2 applications may have three main interfaces: a web-based interface that includes the Internet and Intranet, batch updates from legacy systems, and other more traditional interfaces such as servers, desktop, and telephone. Because of ever-changing corporate business requirements and the ever-increasing complexity of computer application interaction, a methodology called principal component analysis, which narrows the number of variables in the target data, can be extremely useful in determining which are the main variables affecting computer resources. These major variables can be converted into a multiple regression formula that a capacity planner can use to generate forecasts.

Several examples of innovative analysis of computer applications using the principal component methodology are presented in this paper and listed as follows:

  • Models created to show the effect of certain ad hoc DB2 queries on CPU usage
  • A CPU forecast generated for three IMS applications and 27 IMS transaction types
  • Analysis of a complex customer relationship management (CRM) application

Principal Component Analysis / SAS code

Principal component analysis has the following characteristics [1]:

  • It is a variable reduction procedure that results in a relatively small number of components and makes no assumptions.
  • Components are used as the predictor variables in a multiple regression analysis.
  • An eigenvalue represents the amount of variance that is attributed to a given component.
  • The total variance is equal to the number of components analyzed. (In Figure A the number is seven.)
  • An eigenvalue value greater than 0.7 is an important guideline in determining the number of components.
  • The cumulative percentage of the components selected should account for at least 70% and up to 80% of the total variance.
  • There should be at least three variables present to generate a multiple regression formula.

To initiate a principal component analysis, the general form for the SAS program is as follows:


		proc factor data = data-set-name
		method=prin
		    priors=one
		    mineign=p
		    scree
		    rotate=varimax;
		    var   variables-to-be-analyzed;
		    run;
		

Approaches Using Principal Component Analysis

Query Blockers
Many companies are using a DB2 data warehouse application for data mining [2]. These applications process long running queries that search for significant marketing patterns and trends to assist a company in becoming more profitable. However, such long running queries frequently consume large amounts of CPU time. Principal component analysis can be used to identify queries that consume excessive amounts of prime time CPU usage and target them to process in an alternative time frame as ‘illustrated' in the following figures A and B.

Figure A shows that only 8 percent of all prime time DB2 queries are in the over 180 CPU seconds category.

Figure A
Figure A

However, Figure B shows that those queries in the over 180 CPU seconds category use 79 percent of the CPU time.

Figure B
Figure B

Figure C shows the results of the accumulated data analyzed by day and hour using the SAS program to identify the principal components. A multiple regression formula can be generated using these variables.

Figure C
Figure C

In Figure D, application of the multiple regression formula identifies a significant decrease in CPU usage for prime time DB2 queries running between 120 and 180 CPU seconds and over 180 CPU seconds when just those two variables are reduced by 30%. This scenario alerts a capacity planner, who then can communicate to management, that control of such queries by using query blockers and/or moving long-running queries to the off-shift can result in a major saving of CPU resources.

Figure D
Figure D

In summary, information in Figures A, B, C and D illustrates a step by step principal component analysis. This process enables a capacity planner to isolate major variables that contribute to high hourly CPU consumption. Then using those variables the capacity planner can create a multiple regression formula. Finally, the results generated by that formula can be analyzed to understand the CPU usage effects of long running queries during prime time and the subsequent need for query blockers to control CPU resources.

IMS Transactions - an Example
For a large corporation, the key applications in place for the nightly update processing of all accounts, and the daytime processing of those accounts, focused on three IMS applications. These applications were critical to the maintenance and access of the corporation's customer account information. Insuring that these applications had sufficient (24 hour) computer capacity was a paramount concern. Principal component analysis methodology identified those IMS transactions that were crucial to generating a forecast. For example, the data was summarized at the date and hourly level showing CPU time and the total number of transactions by IMS type. The three key IMS applications labeled in this paper were application A (eighteen different transaction types), application B (eight types), and application C (one type).

Next, based on volume, five main transactions were selected from application A, three from application B and one from application C. Using principal component analysis, three weeks of data was collected and analyzed during prime time, 8 am to 5 pm. The nine transactions were reduced to a final four: two transaction types from application A and one transaction type each from applications B and C.

It is important to note that once the data was collected and copied into a dataset, the process can then be repeated to run in a variety of time frame scenarios. These scenarios can then be compared to determine if there is any time frame in which there could be a CPU capacity problem. For example, in one helpful comparative analysis, the IMS transactions during prime time, 8 am to 5 pm, can be compared with IMS transactions from 12 midnight to 6 am. On a broader scale, once a formula has been created, if one of the major IMS transactions undergoes a major code rewrite and measurements are taken, projections can be generated to show the IMS transition impact on the processor's CPU usage.

CRM/DB2 Analysis & Forecasting
For many companies that have a large, nation-wide customer base, developing a CRM application has become essential to the integration of customer information [3]. However, a complex CRM application presents a formidable task for a capacity planning group because of its basic charatertics [4]:

  • "Carries unpredictable and highly volatile loads - since most web-enabled applications can be reached by anyone with Internet access, there is no customer-friendly way to control the
  • Number of users accessing these applications at any time.
  • Supports enormous numbers of customers - accessibility by anyone, anytime, anywhere makes it necessary to develop applications of immense size.
  • Consists of complex multi-platform architectures - many new web-enabled applications must interface with existing databases and legacy applications developed for a variety of platforms."

Figure E contains simulated data for eight variables analyzed for a day, as well as DB2 updates designated as the primary variable used to generate a CPU requirement forecast for grave shift (midnight to 6 am).

Figure E - Simulated DB2 data
Figure E - Simulated DB2 data

For many DB2/CRM applications, DB2 updates are primarily done on grave shift (midnight to 6 am). It is important to remember that a DB2 update statement can update one or many rows. For example, suppose I own a business selling widgets and I need to make a change to my customer information base regarding what city one lives in. I could execute the following to change the customer zone:


			Update create.table1
			Set customer_zone='east'
			Where customer_city='new york';
		

If I had one customer where customer_city was equal to New York, I would update one row, and any associated index rows where customer_city is a column in the index. If I had 1000 customers where customer_city was equal to New York, I would update 1000 rows and any associated index rows [5].

Figure F shows simulated results in analyzing DB2 updates between midnight and 6 am, applies principal component analysis, and creates a multiple regression formula. The scenario average represents simulated data from a CRM/DB2 system. To test the formula, take the peak hour for three days, enter measurements into the formula, then compare predicted CPU seconds with actual CPU seconds.

Figure F - Simulated DB2 Scenarios
Figure F - Simulated DB2 Scenarios

Figure G shows the values used for the sync I/Os per DB2 update, async I/Os per DB2 update, and GETPAGES per DB2 update which are required as input when generating the various models (Figure M) using the multiple regression formula. The simulated formula is as follows:

Predicted CPU Seconds = Intercep + (A*DB2 Updates) + (B*Sync I/Os) + (C*Async I/Os) + (D*DB2 Getpages)

Figure G
Figure G

Figure H shows projected CPU requirements for various DB2 update volumes.

Figure H
Figure H

Conclusion

For capacity planners, who are responsible for analyzing and interpreting large amounts of data and then generating budget-affecting forecasts, principal component analysis is a valuable tool. It can be used both efficiently and effectively to analyze a large number of variables. That makes it a useful, cost-effective alternative for corporations, especially when reduced budgets no longer cover the cost of modeling software or personnel with modeling expertise. Most important, principal component analysis allows the capacity planner innovative, self-directed approaches to analyze the data and generate the forecasts for which he is responsible.


References:

[1] Hatcher, L.H., Stepanski, E. J., "A Step-by-Step Approach to Using the SAS System for Univariate and Multivariate Statistics", SAS Institute Inc., Cary, NC, 1994, p.450-457

[2] Isom, Rick, "Data Assets - Forecasting Large Data Warehouse Applications in a Banking Environment", Enterprise Systems Journal, Oct 2000, p.40-43.

[3] Isom, Rick, "Analyzing and Forecasting a Customer Relationship Management (CRM) Application", CMG 2002, p.387

[4] Buzen, J. and Shum, A., "Industry-wide Implications of the Capacity on Demand & Pay-As-You-Go Phenomena", CMG 2001 - Late Breaking Papers, p. 66-67

[5] Conversation with Melissa H., Senior DB2 Systems Programmer, Bank of America, Oct, 2002