MeasureIT

ITIL Availability Management: Beyond the Framework
Part 1
July, 2006
by James Yaple

About the Author
James Yaple, Austin Automation Center

James Yaple works as an IT Specialist for the Austin Automation Center, an enterprise hosting facility for the U.S. Department of Veterans Affairs. His focus is on Unix systems architecture, availability and performance. James is the AAC's technical representative to the Standard Performance Evaluation Corporation (SPEC) and the Storage Performance Council (SPC).


Introduction

Availability Management under the IT Infrastructure Library (ITIL) framework is a strategic process associated with the delivery of new services and management of service levels.

The goal of availability management is to optimize the functioning and system efficiency of an IT infrastructure, services and additionally, the service provider. The objective is to deliver cost-effective and sustained system functionality (or “availability”) to support achievement of business (customer) objectives. [Art of Service]

Integration among ITIL process areas can often generate confusion regarding process area boundaries. This confusion is a natural by-product of ITIL interdependencies. If process areas are strictly defined, then integration between areas decreases, but the organization moves towards technology silos.

Many organizations considering ITIL struggle with how to get started. The ITIL framework is a collection of leading practices, rather than an implementation roadmap.

This article is intended to assist in moving beyond the conceptual ITIL framework and applying it to implementation. This will be accomplished by starting at the ITIL foundation level with a brief description of the ITIL framework. The next level is the ITIL practitioner level, where the meaning and scope of availability will be examined, especially as it relates to IT system performance. Progressing to the service management level, specific activities in the availability management and other ITIL processes will be described and analyzed with their relationship to other activities.

A case study is used to illustrate concepts of ITIL availability management.

Overview of ITIL

ITIL is a collection of leading management practices for IT service providers, originally developed by the United Kingdom government in response to logistical difficulty encountered during the Falklands War. By the late 1980s, it had developed into a set of about sixty books by the Central Computer and Telecommunications Agency (CCTA) and managed by the EXIN Foundation. ITIL conceptually falls under the larger umbrella of IT Service Management (ITSM), which utilizes three pillars to set the stage for effective technology management: people, processes and products. People rely on communication, training and definition of responsibilities. Processes are central to ITIL and include the service delivery and service support standards. Products are tools that complement ITSM procedures.

The IT Service Management standards of ITIL are broken down into ten processes and one function under service delivery and service support. Service delivery is the management of the IT services themselves, and involves a number of management practices to ensure that IT services are provided as agreed between the Service Provider and the Customer. It is comprised of service level management, capacity management, availability management, IT service continuity management and financial management. Service support is focused on supporting and improving the quality of existing services. This is comprised of incident management, problem management, change management, release management, configuration management and service desk.

Figure 1 - The ITIL Model

Here is a description of each ITIL process [OGC 2006]:

  • Configuration management is responsible for maintaining information about Configuration Items (CIs) required to deliver an IT Service, including their Relationships, managed throughout the Lifecycle of the CI. The primary objective of Configuration Management is to underpin the delivery of IT Services by providing accurate data to all IT Service Management Processes when and where it is needed.

  • Incident management is responsible for managing the Lifecycle of all Incidents. The primary Objective of Incident Management is to return the IT Service to Customers as quickly as possible.

  • Problem management is responsible for managing the Lifecycle of all Problems. The primary objectives of Problem Management are to prevent Incidents from happening, and to minimize the Impact of Incidents that cannot be prevented. Problem Management includes Problem Control, Error Control and Proactive Problem Management.

  • Change management is responsible for controlling the Lifecycle of all Changes. The primary objective of Change Management is to enable beneficial Changes to be made, with minimum disruption to IT Services.

  • Release management is responsible for Planning, scheduling and controlling the movement of Releases to Test and Live Environments. The primary objective of Release Management is to ensure that the integrity of the Live Environment is protected and that the correct Components are released. Release Management works closely with Configuration Management and Change Management.

  • Capacity Management ensures IT processing and storage capacity match the evolving demands of the business in a cost-effective and timely manner. It consists of three sub-processes: Business Capacity Management, Resource Capacity Management and Service Capacity Management. Business Capacity Management is a forward-looking sub-process that translates business requirements into infrastructure demands. Resource Capacity Management monitors and manages existing infrastructure Configuration Items (CIs) with finite capacity against an established set of utilization thresholds, using threshold-based management reporting schemes and other techniques. Service Capacity Management is a client-facing sub-process that performs a similar monitoring process for SLAs.

  • Availability management is responsible for defining, analyzing, Planning, measuring and improving all aspects of the Availability of IT services. Availability Management is responsible for ensuring that all IT Infrastructure, Processes, Tools, Roles etc are appropriate for the agreed Service Level Targets for Availability.

  • IT service continuity management is concerned with managing an organization's ability to continue to provide a pre-determined and agreed level of IT Services to support the minimum business requirements following an interruption to the business.

  • Financial management is responsible for managing an IT Service Provider's Budgeting, Accounting and Charging requirements.

  • Service level management is responsible for negotiating Service Level Agreements, and ensuring that these are met. SLM is responsible for ensuring that all IT Service Management Processes, Operational Level Agreements, and Underpinning Contracts, are appropriate for the agreed Service Level Targets. SLM monitors and reports on Service Levels, and holds regular Customer reviews.

  • The Service desk function (note that this is not considered a process) is a Single Point of Contact between the Service Provider and the Users. A typical Service Desk manages Incidents and Service Requests, and also handles communication with the Users.

ITIL is concerned with delivering and supporting services appropriate to business requirements of the organization. ITIL provides best practices for ITSM processes, promoting quality, business effectiveness and efficiency in the use of information systems. ITIL processes encourage an atmosphere where IT service providers understand the business objectives on three levels: strategic (decision making), tactical (implementation), and operational (support and maintenance). Business people want to focus on business impacts and rely on an IT service provider for technical requirements.

Consider an insurance company that wants to market and service policies online through a web application. The business can realize a savings by enabling customers to choose self-service for their accounts rather than contacting a live operator. The business wants to provide an interface that can provide self-service functions continuously to customers via the Internet. The application needs to provide service to as many users as visit the firm’s web site and have the capability of adding new business functionality over time. The firm realizes that technical incidents may occur, but resolution must be rapid and future incidents prevented. In case of a crisis, such as a hurricane or fire, customer data must be protected and accessible.

Traditional IT management might approach this example from the bottom-up. It looks at the technology first and measures performance, availability and reliability available within the technology organization. The view from the bottom, often looking up through organizational silos, does not take into account how changes or problems with technology components can affect outward-facing business services. To improve service delivery, IT resorts to adding resources, often making expenditures blindly.

ITIL is designed to enable the IT service provider to better respond to business needs. The first step is connecting IT priorities to business goals. IT service management starts with a top-down approach that puts business requirements first and then generates the architecture to support business needs. IT service providers need to evolve beyond order takers (“Our business needs a web site”) to understanding business requirements and risks, enabling joint decisions and strategies to meet the business objective.

Plan-Do-Check-Act

ITIL also makes use of the “Deming wheel” or “Shewhart cycle”, commonly referred to as the plan-do-check-act (PDCA) cycle. PDCA consists of four steps to progress from “problem faced” to “problem solved” that involve measuring, implementing and re-measuring Key Performance Indicators (KPIs). A KPI is a metric that is used to help manage a Process, IT Service or Activity. Many Metrics may be measured, but only the most important of these are defined as KPIs and used to actively manage and report on the Process, IT Service or Activity. The cycle is continuous, in that once the cycle is completed, it begins again. [HCi]

The four steps in the PDCA cycle can be described as follows:

  • Plan how to improve a service first by identifying and measuring Key Performance Indicators (KPIs) and come up with ideas for improving them.

  • Do (implement) changes designed to improve the KPIs. Try to minimize disruption while testing whether the changes will work or not.

  • Check KPI performance to evaluate if the changes are achieving the desired result.

  • Act (implement) changes on a larger scale if the experiment is successful.

Figure 2 - The PDCA Cycle

ITIL Availability Management

Availability Management ensures IT services are available when the customer needs them. A system is available when the customer receives the service stated in the SLA. The measurement and reporting of availability is based on a consensus between customer requirements and the capabilities of the IT support organization. [HP 2003]

The degree of availability of a component or service is often expressed as a number of nines. The number of nines references the percentage of time for which the system is available. Three nines means that a configuration item or service is available 99.9% of the time and five nines means 99.999%. These numbers become more significant when you look at these figures in terms of downtime over a fixed period. For example, based on 24/7/365 service hours, a system with 99.9% availability can be expected to experience about 8.76 hours of downtime in a year, while a system with 99.999% availability will be down for just 315 seconds in a year.

In general practice, availability calculations differentiate between planned and unplanned downtime. Downtime for maintenance, negotiated with the customer, is typically integrated into an SLA [HP 2003].

In negotiations between the service provider and the customer, availability can also be discussed as an amount of downtime within a measurement interval. For example, a customer may indicate they cannot tolerate more than five minutes downtime per year. It is important for parties to understand and agree how availability is calculated given the different ways availability can be expressed.
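
The conversion between an availability percentage and a downtime allowance can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the interval is assumed to be a non-leap year of continuous (24x7) service hours.

```python
# Assumption: 24x7 service hours over a non-leap year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_for_availability(availability_pct, interval_seconds=SECONDS_PER_YEAR):
    """Return the downtime (in seconds) implied by an availability percentage."""
    return interval_seconds * (1 - availability_pct / 100)


def availability_for_downtime(downtime_seconds, interval_seconds=SECONDS_PER_YEAR):
    """Return the availability percentage implied by a downtime allowance."""
    return 100 * (1 - downtime_seconds / interval_seconds)


# Three nines and five nines expressed as yearly downtime.
print(f"99.9%   -> {downtime_for_availability(99.9) / 3600:.2f} hours per year")
print(f"99.999% -> {downtime_for_availability(99.999):.0f} seconds per year")

# A customer who tolerates at most five minutes of downtime per year.
print(f"5 min/year -> {availability_for_downtime(5 * 60):.5f}% availability")
```

Under these assumptions, a five-minute yearly allowance works out to slightly better than five nines, which shows why both parties need to agree on the measurement interval and service hours before quoting a percentage.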

The formulas for availability used in this article are shown below:

Availability % = (Actual Availability / Agreed Availability) × 100%

Actual Availability = Size of measurement interval − Downtime

Downtime = Time to repair, or Service restoration time − Detection time
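
As a minimal sketch, the formulas above can be applied to a single incident as follows. The function names and the sample timestamps are hypothetical, and the sketch assumes that the agreed availability equals the full measurement interval (here, 30 days of continuous service).

```python
from datetime import datetime, timedelta


def incident_downtime(detected_at, restored_at):
    """Downtime = service restoration time - detection time."""
    return restored_at - detected_at


def availability_percent(agreed_availability, downtimes):
    """Availability % = actual availability / agreed availability * 100.

    Assumption: the agreed availability is the whole measurement interval,
    so actual availability is that interval minus the summed downtime.
    """
    total_downtime = sum(downtimes, timedelta())
    actual_availability = agreed_availability - total_downtime
    return actual_availability / agreed_availability * 100


# Hypothetical example: one 90-minute outage in a 30-day interval.
agreed = timedelta(days=30)
outage = incident_downtime(
    detected_at=datetime(2006, 7, 10, 9, 0),
    restored_at=datetime(2006, 7, 10, 10, 30),
)
print(f"Downtime: {outage}, availability: {availability_percent(agreed, [outage]):.3f}%")
```

If planned maintenance windows are excluded by the SLA, the agreed availability passed in would be smaller than the calendar interval; that adjustment is left out of this sketch.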

Beyond the calculation of availability, it is critical to have a common definition of availability. Reaching consensus on a definition matters more than which specific definition is chosen. Here is the ITIL definition of availability:

Ability of a Configuration Item or IT Service to perform its agreed Function when required. Availability is determined by Reliability, Maintainability, Serviceability, Performance, and Security. Availability is usually calculated as a percentage. This calculation is often based on Agreed Service Time and Downtime. It is Best Practice to calculate Availability using measurements of the Business output of the IT Service. [OGC 2006]

As an alternative, here is a similar definition:

Availability is the degree to which an application or service is available when, and with the functionality [that] users expect. Availability is measured by the perception of an application's end user. End users experience frustration when their data is unavailable, and they do not understand or care to differentiate between the complex components of an overall solution. Performance failures due to higher than expected usage create similar issues as are created by the failure of critical components. [Oracle 2004]

An IT service provider may need to agree to provide availability for components or services outside of its control. This is accomplished using contractual vehicles. A service provider can exercise control over external suppliers through an underpinning contract (UC), and over internal suppliers through an operational level agreement (OLA).

Relationship of ITIL Activities and Processes

Each ITIL process is related to all of the others through one or more activities. For example, availability requirements are defined by the customer SLA, but SLA performance is defined by the measurement of availability. This duality is at the center of ITIL, and causes some of the organizational confusion about how to begin an ITIL implementation. In other words, which comes first, service level or availability management? The relationship of availability to other ITIL processes will be explored using a simplified model, beginning with identification of activities associated with availability management.

ITIL activities that involve Availability Management include:

  • Determining availability requirements in business terms.

  • Identifying vital business functions and conducting impact analysis.

  • Predicting and designing for expected levels of availability and security.

  • Producing and maintaining an availability plan.

  • Collecting, analyzing, maintaining and reporting availability data.

  • Monitoring ITIL Key Performance Indicators (KPIs), including Mean Time Between Failures (MTBF), Mean Time to Repair (MTTR) and Mean Time Between System Incidents (MTBSI).

  • Monitoring SLA and OLA targets and external supplier serviceability achievements.

  • Conducting root cause analysis of low availability.

Below is a simplified diagram of the relationship between availability management and the other ITIL processes.

Figure 3 - ITIL Relationships

Availability management, as drawn here, has two primary inputs: Service level and Incident Management. These inputs correspond to two key aspects of availability management: the pre-implementation (service delivery) activities, where availability requirements (KPIs) are determined and designed into the service; and operational (service support) activities, where KPIs are monitored and improved. This should look familiar, as it draws from the ITIL model.

This diagram also reflects the PDCA cycle. Service level management provides availability requirements (plan), and receives feedback on availability indicators (check). Incident management also shows the PDCA cycle in action. As availability incidents occur, the root cause is analyzed in the problem management process. If the problem management PDCA cycle (not shown) identifies the root cause and resolution, it can be communicated to the service desk for use in future availability incidents. The result is that the downtime KPI is minimized.

Availability and Incident Management

Another primary input into availability is Incident Management. Incident management activities provide many availability KPIs, including downtime, mean time to repair (MTTR), mean time between failures (MTBF) and mean time between system incidents (MTBSI). These indicators can be described by the incident life-cycle, as shown in the diagram below:

Figure 4 - The Incident life-cycle

The metrics inherent in the diagram above are:

  • How long it takes to detect an incident

  • Total downtime per service (application/customer)

  • How long it takes to recover from an incident

  • The availability of the services

  • How frequently incidents occur

  • The improvement in availability of the IT service

These metrics can be related to availability Key Performance Indicators, such as:

  • percentage reduction in the unavailability of services and components

  • improvement in the mean time between failures (MTBF)

  • reduction in the mean time to repair (MTTR)

  • percentage reduction in failures of third-party performance on MTBF/MTTR against [underpinning] contract targets [OGC 2002]
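
To illustrate how these indicators could be derived from incident records, here is a minimal Python sketch. The article does not spell out the formulas, so the definitions used are an assumption based on a common reading of the incident life-cycle: MTTR as the average downtime per incident, MTBF as the average uptime between a restoration and the next incident, and MTBSI as the average time between the starts of consecutive incidents. The function name and sample data are hypothetical.

```python
def incident_kpis(incidents):
    """Compute MTTR, MTBF and MTBSI from a list of incidents.

    incidents: list of (start_hour, restored_hour) tuples, sorted by start,
    expressed as hours since the beginning of the measurement interval.
    Requires at least two incidents so that intervals between them exist.
    """
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")

    consecutive = list(zip(incidents, incidents[1:]))
    repair_times = [restored - start for start, restored in incidents]
    uptimes = [nxt[0] - prev[1] for prev, nxt in consecutive]
    between_starts = [nxt[0] - prev[0] for prev, nxt in consecutive]

    def mean(values):
        return sum(values) / len(values)

    return {
        "MTTR": mean(repair_times),     # average downtime per incident
        "MTBF": mean(uptimes),          # average uptime between incidents
        "MTBSI": mean(between_starts),  # average time between incident starts
    }


# Hypothetical incident log: three outages lasting 2, 1 and 4 hours.
kpis = incident_kpis([(100, 102), (300, 301), (650, 654)])
for name, hours in kpis.items():
    print(f"{name}: {hours:.2f} hours")
```

Tracking these values per reporting period gives the trend data needed for KPIs such as "improvement in MTBF" or "reduction in MTTR".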

Performance: Availability or Capacity Management?

System performance in terms of end user response time is commonly included in service capacity management, a sub-process of ITIL capacity management. However, it is also an availability performance indicator. Based on the ITIL definition, availability is determined by reliability, maintainability, serviceability, performance and security. So far, this article has used two similar definitions for availability. First, a system is available when the customer receives the service specified in the SLA. Second, the service is available when it provides the access and functionality users expect. Both of these definitions require satisfactory performance.
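
One way to picture performance as an availability indicator is to count a monitoring sample as "available" only when the service responded and did so within the agreed response time. The sketch below is an illustration under that assumption, not a method prescribed by ITIL or by this article; the function name, the sample data and the three-second threshold are all hypothetical.

```python
def performance_adjusted_availability(samples, max_response_seconds):
    """Percentage of samples where the service was up AND fast enough.

    samples: list of (responded, response_seconds) tuples from periodic
    probes; response_seconds is None when the probe got no response.
    """
    acceptable = sum(
        1
        for responded, response_seconds in samples
        if responded and response_seconds <= max_response_seconds
    )
    return 100 * acceptable / len(samples)


# Hypothetical probe results: one outage and one unacceptably slow response.
probes = [(True, 0.8), (True, 1.2), (True, 6.5), (False, None), (True, 0.9)]

up_only = 100 * sum(1 for responded, _ in probes if responded) / len(probes)
with_performance = performance_adjusted_availability(probes, max_response_seconds=3.0)
print(f"Responding: {up_only:.0f}%, responding within 3 s: {with_performance:.0f}%")
```

The gap between the two figures is the portion of "unavailability" that a purely up/down measurement would miss.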

ITIL Summary

This installment reviewed multiple ITIL elements and their relationships. Part 2 will be focused on a case study intended to explore how to get started. The ITIL framework is a collection of best practices, not an implementation roadmap. The objective of the case study is to move beyond the ITIL framework and into implementation.


References

[Art of Service] ITIL Fact Sheets and Mind Maps, The Art of Service Pty Ltd (date unknown).

[OGC 2006] Glossary of Terms, Office of Government Commerce (OGC) ITIL Knowledge Centre, May 2006.
http://www.get-best-practice.biz/glossary.aspx?product=glossariesacronyms

[HP 2003] ITIL Essentials for IT Service Management, Student Workbook, HP Education Services (Quint Wellington Redwood), 2003.

[Oracle 2004] Oracle Database High Availability Architecture and Best Practices 10g Release 1 (10.1), Oracle Corporation, June 2004.
http://oraclesvca2.oracle.com/docs/cd/B14117_01/server.101/b10726.pdf

[OGC 2002] Best Practices for Planning to Implement Service Management, Office of Government Commerce (OGC), The Stationery Office, 2002.

 

Last Updated 06/05/09



Computer Measurement Group