June, 2009
by Margaret Greenberg
Both at CMG's annual conference and regional meetings, attendees place a high value on sessions that provide "how to" information. Charles Foy's paper, "Say Goodbye to Postmortems, Say Hello to Effective Problem Management," is an award-winning example: it offers valuable first-hand experience for those who wish to improve Service Level statistics and seriously cut downtime. If you are near a CMG region where Charlie will be speaking in 2009, attend the meeting to hear his experiences and ask him questions.
In his words:
What would we have done differently? If you mean - what could we have done to improve the development of our process, I would respond with: "I would have liked to attend a CMG conference where someone presented how they created their process!" Seriously, the sharing of information, lessons learned, and best practices is invaluable no matter what you are developing: a problem management process, a capacity planning process, a statistical analysis process, etc.
We had several problem management processes in place, and we needed to standardize on one and make it even better to meet the needs of our customers. The project took six months to define and was implemented over the next year. Since then, we have periodically reviewed the process and updated the standardized templates for communicating the issue root cause, the permanent resolution and the preventive measures required to preclude recurrence.
We realized that a system that classified outage root causes in a consistent manner could be searched for trends that might not otherwise be obvious. Once you have a series of similar root causes and their outage time impacts, you have a tremendous amount of business intelligence. Remediation of issues can be quantified and costs of implementation defined. These can then be compared with the impacts of the outages over time, and informed business decisions made to address the issues.
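As a rough illustration of the trend search described above, here is a minimal sketch that totals outage counts and minutes per root-cause category and ranks the categories by impact. The category names and outage durations are hypothetical and are not taken from the Siemens system; the function name `summarize_by_root_cause` is likewise an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical outage records: (root-cause category, outage minutes).
outages = [
    ("change/untested-patch", 42),
    ("hardware/disk", 15),
    ("change/untested-patch", 63),
    ("process/missed-handoff", 20),
    ("hardware/disk", 30),
    ("change/untested-patch", 25),
]


def summarize_by_root_cause(records):
    """Return {category: (count, total_minutes)}, largest total minutes first."""
    totals = defaultdict(lambda: [0, 0])
    for category, minutes in records:
        totals[category][0] += 1        # number of outages in this category
        totals[category][1] += minutes  # accumulated outage time
    ranked = sorted(totals.items(), key=lambda item: item[1][1], reverse=True)
    return {category: (count, minutes) for category, (count, minutes) in ranked}


summary = summarize_by_root_cause(outages)
for category, (count, minutes) in summary.items():
    print(f"{category}: {count} outages, {minutes} minutes")
```

A ranking like this only works if root causes are classified consistently. Its totals could then be set against the estimated cost of each remediation.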
One of our neighboring companies had a disaster and we took advantage of that situation to improve. The question of how much time we saved is coupled with effectiveness. As we were developing our process, we focused on the 'easy' outages with one (apparent) root cause. We learned of this more complex outage and decided to use it to test our developing process. It illuminated for us the truth about Problems - there is usually more than one root cause and more than one action item to mitigate it. If we had developed our process on the assumption of one root cause and one action item to address it, we would not have had a robust and effective tool.
I think the effect on Siemens management was more to illuminate how many things can, and do, go wrong to cause an issue: a parallel to Murphy's Law - whatever can go wrong will go wrong and at the worst possible time.
We have effectively reduced unplanned outages, which is the positive benefit above and beyond all others. Our applications' availability has increased. Our planned maintenance times are more precise. And our customers are more satisfied. As this is not a new process but rather part of our workflow, our gains since CMG08 are in line with gains over the past several years.
Our standardized problem management system has resulted in reduced downtime due to the following factors:
We knew that other managers and staff would have valuable insight into the future process. In addition, we knew that their participation would add to their sense of ownership. In fact, we did get a lot of great ideas that made the end product much better, for example, the addition of an external communication template. Similar sections in the new external template and the existing internal template are written simultaneously, saving time and ensuring consistency of internal and external communications.
We had a lot of discussion on how to categorize root causes, both primary and contributing root causes. The challenge was to classify them at a high enough level to reveal trends but at a low enough level to provide actionable intelligence. Our defect-tracking database allowed for multiple levels of identifiers, called keywords, which aided categorization. In addition, this database allowed multiple defects to be associated with one another in a relational construct, yet each defect record had its own corrective action that could be tracked independently, from design through implementation.
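The structure described above might be modeled roughly as follows. This is a sketch under stated assumptions, not the schema of the actual defect-tracking database: the class names, field names, status values and sample records are all hypothetical. It shows one problem linked to several defects, each with multi-level keywords and its own independently tracked corrective action.

```python
from dataclasses import dataclass, field


@dataclass
class Defect:
    defect_id: str
    keywords: tuple           # multi-level classification, e.g. ("software", "configuration")
    role: str                 # "primary" or "contributing" root cause
    corrective_action: str
    status: str = "design"    # hypothetical lifecycle: design -> ... -> implemented


@dataclass
class Problem:
    problem_id: str
    summary: str
    defects: list = field(default_factory=list)  # associated defect records

    def open_actions(self):
        """Corrective actions that have not yet been implemented."""
        return [d for d in self.defects if d.status != "implemented"]


def count_by_keyword(problems, level):
    """Count defects grouped by the first `level` keyword levels."""
    counts = {}
    for problem in problems:
        for defect in problem.defects:
            key = defect.keywords[:level]
            counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical example: one outage with a primary and a contributing root cause.
outage = Problem("P-001", "Application unavailable after release", [
    Defect("D-101", ("software", "configuration"), "primary",
           "Add configuration validation to the release checklist", "implemented"),
    Defect("D-102", ("process", "monitoring"), "contributing",
           "Add a proactive alert for the failed component"),
])

print(count_by_keyword([outage], level=1))
print([d.defect_id for d in outage.open_actions()])
```

Counting at `level=1` gives the high-level view needed to reveal trends, while a deeper level keeps enough detail to be actionable.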
There are two keys to getting to the root cause: the first is understanding that there is seldom a single root cause, especially in major incidents; the second is having a process that supports exploration of the depth of root causes, using a method such as the "five whys."
Most companies that adopt ITIL (IT Infrastructure Library) best practices realize that most of the ITIL disciplines are already in place in some form or another, and it is just a matter of identifying the existing processes and matching them to ITIL. Siemens was no different; we merely identified our existing best practices, merged them into one process, and then assisted our co-workers with migration.
Our initial process had too many steps. After gathering feedback, we re-designed the process to reduce handoffs. We now issue daily reports to detail which problems are resolved and which still have outstanding requirements. New requirements arise periodically, typically for a classification of a problem that we do not already have in place. Values we have added over the years include whether the outage was detected proactively and whether the outage was the result of a recent change. Additionally, we have added a value to capture specific application components that change from release to release.
I have been most fortunate to have presented this paper in New York, Connecticut, Pennsylvania, Germany, England and Italy. What I find is that no matter where I go, experiences are very similar. We all have problems, problemas, herausforderungs, etc., that we need to address in a systematic and comprehensive fashion. We all are dealing with a multitude of platforms, complex applications spread over several tiers, and emerging technologies that challenge us daily. The information I learn from speaking with other CMG members world-wide is phenomenal, and I look forward to the rest of this year (only ½ over!) presenting in other locations in the US and around the world.
I have learned that we are all in a similar boat, no matter what type of application we are hosting or platform we are tuning. At the Central Europe CMG in Germany, I saw a presentation on sub-capacity pricing. Even though it was presented in German (and my German language skills are a bit rusty), the slides said it all - the presenters could have come from any mainframe in any shop. The questions presented by the audience and the experiences related by the CMG member co-presenting with IBM were all ones I had experienced here stateside. At every CMG I attend, I learn a lot, and the regional meetings are no different. In fact, just today I sent a note to CMG Italy to get a presentation on data deduplication that was presented in May - it is very timely as I am now on a project to strategize data storage.
I have had conversations with folks who, after hearing the presentation, are now on the same page as we are in understanding that a service outage - a Problem - is in reality the result of a defect or several defects. And tracking defects and how they are remediated is paramount to success. Several people who were tracking and managing problems using one root cause said they will now try to modify their approach to use multiple root causes or contributing root causes. And one person contacted me asking what defect tracking application we used. It turns out his company had the same application in another division... the same situation we encountered these many years ago!
You cannot address that which you cannot measure. Service outages are defects, whether in process, software or hardware. Once you identify and measure them, you can improve.
Some of his responses were edited from an interview by Pam Derringer, News Contributor, 09 Mar 2009, for SearchDataCenter.com in an article titled "Siemens Healthcare cuts IT problem management downtime."
Charlie will be on the MCMG agenda November 18 in Chicago. You are invited. Meanwhile, check the CMG calendar for other local meetings.