Using ITIL Best Practices to Create a Capacity Management Process
October 1, 2003
by Chris Molloy
1 Background
The IBM™ Corporation has provided the performance management of the distributed computing environment for the customer for several years. As part of providing this service, IBM has used a service called Server Resource Management (SRM). SRM provides historical reporting of servers, at the server metric resource level (e.g. CPU, memory, and disk utilization at the server level). The paper entitled "Six Levels of Sophistication for Capacity Management" by George Thompson [1] outlines progressive levels of capacity management.
SRM is designed to meet the requirements of sophistication level 2. Sophistication level 2 is the level where there is a historical reporting process in place at the server level. Server level metrics include such items as CPU, memory, and disk utilization. SRM stores several metrics at the 15 minute, hour, day, week, and month level, which can be used for both performance management and capacity planning. The additional sophistication levels beyond level 2 include the historical reporting of information at the process level on a server, the automated forecasting of workload based on historical trends, and the automatic conversion of business drivers to IT resource requirements (e.g. projected sales of 1000 items translate to specific requirements for CPU, memory, and disk).
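The rollup of 15 minute samples into coarser reporting levels can be illustrated with a short sketch. This is not SRM's implementation; the function name `roll_up`, the use of a simple average, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def roll_up(samples, level):
    """Average (timestamp, value) samples into 'hour' or 'day' buckets."""
    # The bucket key is the timestamp truncated to the requested level.
    formats = {"hour": "%Y-%m-%d %H:00", "day": "%Y-%m-%d"}
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp.strftime(formats[level])].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Hypothetical data: one day of 15 minute CPU utilization samples (percent).
start = datetime(2003, 3, 5)
samples = [
    (start + timedelta(minutes=15 * i), 20.0 + (i % 4) * 10.0)
    for i in range(96)
]

hourly = roll_up(samples, "hour")  # 24 hourly averages
daily = roll_up(samples, "day")    # 1 daily average
```

Averaging is only one possible rollup. A capacity planner may also want to keep the peak value per interval, since averages can hide short periods of saturation.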
Competitive pressure and the need for additional reduction in the total cost of ownership of the customer IT environment have resulted in the customer asking IBM to extend its formal performance management process with the creation of a formal capacity management process. On March 5, 2003, the two companies met to discuss these additional requirements. It was decided that Chris Molloy from IBM and a customer representative would develop a proposal for IBM to provide a documented capacity planning process, which IBM would then use to provide capacity planning service for the customer.
2 What is ITIL
The Information Technology Infrastructure Library (ITIL) is a set of best practices for Information Technology (IT) service management. "Starting as a guide for UK government, the framework has proved to be useful to organizations in all sectors through its adoption by many companies as the basis for Service Management" (from page 1 of the Best Practice for Service Delivery [2]).
ITIL is divided into two major sections. The first section is Service Delivery, which looks at what service the business requires of the provider in order to provide adequate support to the business customers. The second section is Service Support, which looks at ensuring that the User has access to the appropriate services to support the business functions. Capacity Management is part of Service Delivery.
The ITIL best practices for capacity management are described in chapter 6 of the Best Practice for Service Delivery book. The ITIL capacity management discipline includes both performance management (optimization of existing resources) and capacity management (ensuring sufficient resources exist to meet forecasted requirements).
There are two types of ITIL certification. The first type is a certification of the personnel. The three levels of personnel ITIL certification are foundation (general ITIL knowledge), practitioner (specific knowledge on an ITIL discipline), and manager (specific knowledge on how to manage all the ITIL disciplines). The Examination Institute for Information Sciences conducts these certifications. Their web site is located at www.exin-exams.com.
The second type of ITIL certification is the PinkVerify™ certification offered by Pink Elephant, Inc. The following is a quote from their web site at www.pinkelephant.com: "If you see the PinkVerify™ logo associated with any specific IT Service Management software toolset, it means that it has been objectively assessed according to the criteria specified by the OGC and certified by a qualified Pink Elephant IT Service Management Consultant, as meeting the minimum functional requirements to support the ITIL framework."
While the Pink Verify process has certified several tools in different ITIL disciplines, it has yet to publish certification guidelines or certify tools in the ITIL capacity management discipline.
The major deliverable from the ITIL capacity management process is the creation and maintenance of the capacity plan. This plan ties future business requirements to IT resources, and outlines the IT recommendations required to meet those business requirements.
The ITIL capacity management best practice consists of three sub-processes: business capacity management (responsible for ensuring that the future business requirements for IT Services are considered, planned and implemented in a timely fashion), service capacity management (responsible for ensuring that the performance of all services, as detailed in the targets in the SLAs and SLRs, is monitored and measured, and that the collected data is recorded, analyzed, and reported), and resource capacity management (responsible for ensuring that all components within the IT Infrastructure that have finite resource are monitored and measured, and that the collected data is recorded, analyzed, and reported). It is thought that the existing performance management service being provided satisfactorily addresses the service and resource capacity management requirements, and that the new capacity management process being developed will address the business capacity management requirements.
Additional information on ITIL can be found at the ITIL web site, located at www.itil.co.uk/.
3 Why ITIL?
In the initial discussion with the customer, it was brought out that the customer had no formal process documentation for capacity management. This left us open to creating a new process. ITIL was chosen for the following reasons:
- Leverage work already done, expediting the creation of the new capacity management process.
- ITIL is growing in the industry as an approved framework for IT system management (the web site indicates that over 20,000 companies are using ITIL best practices).
- ITIL is an objective (third party) "stake in the ground" for us to use as a starting point.
- Use of ITIL for capacity management opens the way for the customer to use the other ITIL best practices for other disciplines.
4 Methodology
This section outlines the methodology needed to establish the capacity management process.
The first step in creating the formal capacity management process is to agree on the scope of the new capacity management process. Since performance management for the servers in scope is already being performed, we need to agree on what additional scope needs to be added for this new offering.
As outlined in section 6.1 of the ITIL documentation, the capacity management process encompasses the following:
- The monitoring of performance and throughput of IT Services and the supporting infrastructure components.
- Undertaking tuning activities to make the most efficient use of existing resources.
- Understanding the demands currently being made for IT resources and producing forecasts for future requirements.
- Influencing the demand for resources, perhaps in conjunction with Financial Management.
- The production of a Capacity Plan which enables the IT Service provider to provide services of the quality defined in the Service Level Agreements (SLAs).
The existing performance management service provided addresses items one and two listed above, and partially addresses item three. The partial addressing of item three comes from the SRM red action list, which identifies servers over a two month period that have exceeded warning levels of capacity thresholds. In order to fully address item three, this should be extended to create a forecast, based on the combination of historical utilization data and projected business requirement impact. The new capacity management process will also need to address items four and five.
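The idea behind a threshold-based action list can be sketched as follows. This is a minimal illustration and not the SRM red action list logic: the function name `red_action_list`, the threshold values, and the utilization figures are hypothetical.

```python
def red_action_list(utilization, thresholds):
    """Return {server: [metric names]} for servers whose peak utilization
    in the reporting period exceeded a warning threshold."""
    flagged = {}
    for server, metrics in utilization.items():
        exceeded = sorted(
            name
            for name, samples in metrics.items()
            if name in thresholds and samples and max(samples) > thresholds[name]
        )
        if exceeded:
            flagged[server] = exceeded
    return flagged


# Hypothetical warning thresholds (percent) and peak values observed
# over a two month period.
thresholds = {"cpu": 80.0, "memory": 85.0, "disk": 90.0}
utilization = {
    "server-a": {"cpu": [55, 61, 83, 70], "memory": [60, 62, 64, 66], "disk": [40, 41, 42, 43]},
    "server-b": {"cpu": [30, 35, 33, 31], "memory": [50, 52, 51, 53], "disk": [70, 72, 75, 78]},
}

flagged = red_action_list(utilization, thresholds)
```

A list of this kind only reports what has already happened. Fully addressing item three requires a forecast that combines the historical data with the projected business requirements.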
As outlined in section 6.1.3 of the ITIL documentation, the process should encompass, for both the operational (production) and development environments:
- All hardware - from PCs, through file servers, up to mainframes and super computers.
- All networking equipment (LANs, WANs, bridges, routers, etc.)
- All peripherals (bulk storage devices, printers, etc.)
- All software - operating system and network software, in-house developments and purchased packages.
- Human resources, but only where a lack of human resources could result in a delay in end-to-end response time.
Since IBM does not provide the entire IT service for the customer, the hardware scope will be confined to the servers for which IBM is contracted to provide operational service. There will be no network or peripheral capacity management as part of this proposal. The software will include the operating system, and major applications that run on the server. While a complete analysis and creation of an application profile of resources is indicated by ITIL, the scope of this proposal will only address the major server level utilization characteristics (e.g. CPU, memory, and disk requirements) of the applications running on the server.
The action plan listed in Appendix 1 contains an item for consensus on the scope described above. This white paper was written by IBM, and provided to the customer, in order to document the approach and scope. The customer representative reviewed this information, and she agreed to the approach and scope without any changes.
The second step in creating the formal capacity management process is to perform an assessment of the current capacity management being performed, and come to consensus on the new level of sophistication of capacity management. It was agreed to in our initial meeting that we could not simply jump to the sixth level of capacity management sophistication from where we are today. Instead, we will come to consensus on where we want to be next, and what investment it will take to get there. Further enhancements to the capacity management process past that point would be in scope for future proposals.
The ITIL best practice contains a self-assessment of the capacity management discipline. We will complete two copies of the assessment, the current state and the future (desired) state.
The action plan listed in Appendix 1 contains an item for consensus on the assessment described above. Jane performed both the current and future assessment using the ITIL assessment for capacity planning. The assessment has 9 sections and 58 questions. We reviewed the spreadsheets, and the only change that was made was in the future assessment to question 36, where we agreed that the answer should be yes. This question dealt with having suitable tools to support capacity management, and it was agreed that the proposal should include those tools.
A summary of the assessment is included in Appendix 4. The implementation of the scope of this proposal will move capacity planning at the customer from passing 2 of the 9 sections to passing 8 of the 9 sections. The one section that will not pass, external integration, primarily asks about integration with other ITIL disciplines. Since capacity planning is the first ITIL discipline to be implemented at the customer, the interfaces to the other disciplines are not in place. This leaves open the possibility in the future to establish more ITIL disciplines, and to interface with them. The implementation of the scope of this proposal will raise the raw score of the assessment from 26 to 100. An additional 15 points of raw score could be obtained by interfacing with other ITIL disciplines and by resolving minor issues in some of the other sections. We reviewed each of those issues, and concluded that we did not want to propose investing in them at this time.
The third step in creating the formal capacity management process is to map the roles and responsibilities of the current service being provided to the roles and responsibilities of the proposed additional service. ITIL Annex 6A contains a roles and responsibilities list for capacity management. We will use this list to create a matrix that will identify the gaps between the existing and proposed services. This matrix is included in Appendix 2.
The action plan listed in Appendix 1 contains an item for consensus on the matrix described above. Chris updated the matrix to outline what was currently being done (performance) and what is included in this proposal (capacity). The customer reviewed the information, and agreed to it without changes. There were five items from the ITIL responsibility list that we decided not to propose investing in at this time (listed in the none column in the table). This provides room for further expansion of the capacity management process. We then looked at who could perform the new responsibilities. We decided that the only new responsibility that had to be performed by the customer was the one concerning "Maintaining a knowledge of future demand". We interpreted this to mean that someone needs to understand the changes to the customer business, and to bring those into capacity planning so that they can be translated into IT resource requirements. This would have to be done by customer personnel. The customer, IBM, or other appropriate service providers could perform the other responsibilities.
Once the scope, proposed assessment, roles, and responsibilities are agreed to, the last step is to agree on the content of the capacity plan deliverable, and to determine what resources (e.g. template, programming, personnel, training, etc.) are needed to create the capacity plan. It is assumed that the capacity plan will be an input to the upcoming fiscal fall plan cycle. If this is the case, there may not be enough time to automate some of the contents of the 2003 capacity plan (which takes effect in 2004), specifically the charts that show the historical utilization and forecast of future requirements. We will therefore create a preliminary capacity plan in 2003 (for 2004) without a significant amount of graphics. If this meets our requirements, this plan will be used as a template for future capacity plans. If the graphics are found to be required, then an additional Request For Service (RFS) can be submitted to address those requirements.
The action plan listed in Appendix 1 contains several items associated with the creation of the capacity plan. Appendix 3 has a draft of the Capacity Plan table of contents.
5 Update from 4/01/03 Review
On 4/01/03, we met with customer management to review the current draft of the white paper and our progress on the project. Customer management gave us additional insight into what they were looking for, and asked that we incorporate the following items:
- Provide alternatives to IBM doing the work.
- Add who should do the work
- Show return on investment
- List benefits of capacity management
- Outline process of who is doing what
We reviewed alternatives for performing the work represented by the proposal. The only new responsibility that needs to be performed by customer personnel is the responsibility of identifying the changing customer business strategy and how it relates to IT resource requirements. The rest of the requirements can be performed by the customer, IBM, or other qualified service providers. After a high level analysis of the three alternatives (customer, IBM, other), we recommend that IBM do the work. The main reason for this is that we feel that SRM can be used (with possible small enhancements) as the tool for capacity management. Use of any other service provider or consulting service would require other tools to be installed, and would increase the cost of providing the service. If additional analysis of alternatives is desired, a request for service (RFS) should be created, and distributed to appropriate service providers.
SRM 5.4 was recently announced, and supplemented the current capacity planning reports with linear regression forecasting. In addition, support for VMware logical partitions on xSeries servers was added. During the summer of 2003, support for LPARs on pSeries servers will be added, providing information both at the partition and the server level. These improvements are in addition to the capacity planning reports that SRM already provides. The combination of these functions should be sufficient to support the capacity management function being proposed, without additional tool cost. If additional function is needed, a separate RFS will be submitted.
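As an illustration of what linear regression forecasting involves, the sketch below fits an ordinary least-squares trend line to a utilization history and estimates when the trend would reach a threshold. It does not represent how SRM 5.4 computes its forecasts; the function names, the 80 percent threshold, and the monthly figures are hypothetical.

```python
def fit_line(values):
    """Ordinary least-squares fit of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values, threshold):
    """Periods after the last observation until the trend line reaches
    the threshold, or None if the trend is flat or falling."""
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical monthly average CPU utilization (percent) for one server.
history = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]

slope, intercept = fit_line(history)
forecast_next = intercept + slope * len(history)  # projected next month
months_left = periods_until(history, 80.0)        # months until 80 percent
```

A straight-line trend only extrapolates past behavior. Known business changes, such as a new application or a projected increase in sales, still have to be added to the forecast by someone who understands the customer business.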
We discussed how to demonstrate return on investment, and benefits of capacity planning. We agreed that a formal financial business case is not warranted at this time, but that examples should be given to demonstrate significant return on investment.
The first example of ROI is in performing server consolidation projects. By consolidating servers, utilization can be increased, resulting in fewer resources being required. This also provides for the ability to redistribute resources in more of a dynamic fashion, using capacity on demand and dynamic resource reallocation technologies. A second example of ROI is in infusing new technologies into the IT environment. The customer is already investing in new technologies such as logical partition (LPAR) capable Unix servers and Storage Area Networks (SANs). By having someone responsible for reviewing these technologies and recommending improvements, a more cohesive strategy and positioning resulting in cost savings will be achieved.
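The consolidation argument can be made concrete with a rough sizing sketch. The approach shown (a first-fit decreasing packing of peak CPU demand) is an assumption chosen for illustration and is not a method prescribed by ITIL or used by SRM. The workload names, the demand figures, and the 70 percent target ceiling are hypothetical.

```python
def consolidate(peak_loads, target_capacity):
    """First-fit decreasing: place each workload on the first host that
    still has room, opening a new host only when none does."""
    hosts = []  # each host is a list of (name, load) tuples
    ordered = sorted(peak_loads.items(), key=lambda item: item[1], reverse=True)
    for name, load in ordered:
        if load > target_capacity:
            raise ValueError(f"{name} does not fit on a single host")
        for host in hosts:
            if sum(amount for _, amount in host) + load <= target_capacity:
                host.append((name, load))
                break
        else:
            hosts.append([(name, load)])
    return hosts


# Hypothetical peak CPU demand of six existing servers, expressed as a
# percentage of one target host, with a 70 percent utilization ceiling.
peak_loads = {"app1": 15, "app2": 30, "app3": 10, "app4": 25, "app5": 20, "app6": 35}
hosts = consolidate(peak_loads, target_capacity=70)
```

With these figures the six workloads fit on two hosts. A real consolidation study would also have to consider memory, disk, peak timing, and availability requirements, which is why historical utilization data is needed as input.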
The ITIL capacity management discipline (section 6.4.2) outlines two major benefits: increased efficiency through cost savings, and reduced risk. Properly implemented capacity management processes lead to deferred expenditure because it is possible to defer the cost of new equipment to a later date (if at all) by improving the mapping of business changes to IT requirements (especially with respect to when the changes are anticipated). Additionally, planned buying should be less expensive than panic (reactive) buying.
Effective capacity management reduces the risk of performance problems and failure in the following ways:
- For existing applications the risk is minimized through managing the resources and service performance.
- The risk to new applications is reduced through application sizing - as new applications can have an adverse effect upon existing applications, the risk to those applications is also minimized.
- The number of urgent changes to increase capacity is reduced, and hopefully eliminated, through effective capacity planning.
The establishment of formalized capacity management processes introduces changes to the business cycle. An outline of this cycle is shown in Appendix 5. Capacity management should be incorporated into the annual business planning process, and the fall plan interlocks of resources for the following year.
6 Recommendations
The first recommendation of this project is that the customer make an additional investment in order to meet the additional requirements for establishing a formal capacity management process.
We initially recommend that one full time equivalent (1 FTE) be invested to perform this capacity management function. After one year, the amount of investment needs to be reviewed to determine if adjustments are required.
The second recommendation of this study is that SRM be used as the historical data collection process, and that all servers have SRM installed on them. It would take a significant amount of manual effort to collect performance statistics on servers that do not have SRM installed on them. In addition, it would be hard to create workload projections without historical information to base the projections upon. There are currently several issues preventing SRM from being installed on several of the servers (e.g. concern with installing SRM on servers that have not been backed up yet), and these issues need to be resolved so that historical data to be used for capacity planning can be obtained.
7 Conclusions
The following conclusions can be made from the creation of a formal capacity planning process:
- ITIL best practices for capacity management can be successfully used to develop a formal capacity planning process for the customer.
- The ITIL Capacity Management Self Assessment can be used to assess the status of your current capacity management functions, and to understand what other items could be done to improve your capacity management.
- The ITIL Capacity Management roles and responsibility list can be used to create a statement of work for capacity management activity.
- Additional responsibilities can be added to provide improved capacity management for the servers listed in the scope. With the exception of identifying future business requirements, the new responsibilities could be performed by any qualified personnel reporting to the customer, IBM, or other service providers.
8 References
[1] "Six Levels of Sophistication for Capacity Management". G. Thompson, CMG 2000, December 2000.
[2] "Best Practice for Service Delivery". Office of Government Commerce, United Kingdom, 2001. ISBN 0-11-330017-4.
9 Trademarks
IBM is a registered trademark of IBM Corporation in the United States, other countries, or both.
ITIL is a registered trademark of the Office of Government Commerce, an office of HM Treasury, United Kingdom.
Pink Verify is a registered trademark of Pink Elephant, Incorporated, in the United States, other countries, or both.
Other company, product or service names may be the trademarks or service marks of others.
Appendix 1: Task Plan
Capacity Management has overall responsibility for ensuring that there is adequate IT Capacity to meet required levels of service and for ensuring that senior IT management is correctly advised on how to match Capacity and demand, and to ensure that use of existing Capacity is optimized (capacity optimization of existing resources is commonly referred to as part of performance management).
(1) - Forecasting is not in the current reports, but will be included in the proposed service.
(2) - Network resources are out of scope