November, 2007
by Osman Aykut
As if the busy mainframe administrator didn't have enough to do trying to keep the z/OS environment up and running efficiently, today's distributed environments present an entirely new series of challenges. Applications that live "in the cloud" and outside the jurisdiction of the z/OS administrators are quickly becoming the number one consumers of CPU resources within the mainframe, as end-user requests reach deep into the heart of the database, either directly through DB2 Connect or after firing off a legacy transaction in a CICS or IMS/DC environment.
In a distributed environment, there are a host of tools like those from CA Wily, BMC and Compuware that monitor, measure and troubleshoot at the application level (including those that monitor database performance, Service Level Management and Network performance), while in z/OS you have tools such as TriTune and Strobe, which offer visibility into the application's efficiency in the z/OS environment. However, two things are truly needed to properly measure, monitor, and tune application efficiency: first, a global, end-to-end view of the entire process and second, the ability to transform bulk monitoring data into a business function view.
Imagine you're a highly trained mechanic at a BMW dealership. You've tuned the motor on a new M5 to the point where it's running perfectly and you send it out the door to be delivered to a smiling new buyer. Within five days, an angry sales manager calls you demanding to know why the customer is reporting that the car is idling at 5000 rpm, getting only 4 miles to the gallon and stalling at every red light.
The motor was perfectly tuned to the exact specifications required when it left your shop, so if the performance is that bad, something else must be amiss. Much like in a real-life IT environment, your first step when troubleshooting the problem is to look for the "usual suspects." In this case, you wonder: Who's driving the car? What kind of tires are they using? Under what road conditions? Are they on the highway, or driving in the city? What kind of fuel are they using?
Of course, as in real life, the sales manager has no idea. All he knows is that he's got an unhappy customer - and it's up to you to fix it and make everyone happy again.
To a certain extent, this is the problem faced daily by those tasked with tuning and managing the mainframe when applications that originate in the cloud of a distributed environment begin to invoke mainframe resources. Without being able to talk directly to the end users, or the processes that occur between them and the mainframe, efficiently diagnosing performance issues is very difficult.
In many large companies, critical legacy applications and databases live on the mainframe and are either technically or logistically impractical to replace. To make the most of this configuration, applications are created at the edge of the distributed environment - "out in the cloud" - and reach into the mainframe to use CICS, DB2 and other resources (see Figure 1). Too often these applications are designed without concern for the resources required, and they consume far more mainframe CPU than needed without anyone being aware. As more transactions are kicked off and more requests are made that use the mainframe, the savvy tuning expert recognizes that utilization, performance, and efficiency in the mainframe are beginning to suffer. He or she will likely begin to investigate the source. Unfortunately, in most networks, anything beyond the mainframe is out there "in the cloud" and remains a virtual mystery: "The Black Box."
Figure 1: The mainframe tuning expert identifies an extensive use of resources within the mainframe but cannot see beyond the cloud to better understand the source. Meanwhile, IT managers use monitoring and tuning software to keep their particular segments of the network running smoothly, unaware of the resources their applications are consuming in the mainframe.
So why isn't anyone else noticing? There are plenty of talented IT professionals using a variety of excellent measurement, monitoring, and tuning tools that will detect even the slightest variation in performance at virtually every step in a distributed environment.
The primary function of these monitoring and management applications is to provide a snapshot of how a particular application or database is performing in one leg of its journey. To a certain extent, monitoring and measurement tools in the distributed world also have a border that they can't cross. As soon as the application or a request that comes from a distributed application hits z/OS, these tools are blind.
However, by not providing a full end-to-end view, these applications usually wind up serving a secondary function: keeping the IT staff free of liability in their particular bailiwick. As long as response times remain within acceptable ranges at each tier, there is no problem -- at least as far as the person who is responsible for that segment of the network is concerned. The unfortunate element of this situation is its short-sightedness. It's just in the best interest of that particular IT manager to make sure that their segment of the network is functioning properly, even if it means others may not be.
Assuming that most mainframes have the processing power to compensate for and accommodate even a poorly written script, there will be no performance issues apparent to the end user. The resources shifted to accommodate this particular request, however, may be dramatically more expensive than should be required, and eventually something ends up suffering. Usually it's the mainframe administrator who is left wondering why his utilization rates are so terrible and why he constantly has to add more CPUs when there should be ample processing power. Oftentimes he winds up taking drastic actions to isolate resources or to forbid usage of dynamic SQL in J2EE applications.
Because traditional performance analysis tools focus on only one application instance or execution at a time, they rarely provide visibility beyond a small segment of the network. A true end-to-end solution would allow IT managers and technical analysts to aggregate and analyze performance data for mainframe transactions invoked by SOAs, .NET and J2EE-based web applications executing across all Sysplexes and LPARs.
Ideally, this aggregate data would be compiled into one central repository that would be easily accessible to all of IT in a web-based format. The central repository should feature easy-to-use reporting capabilities that can generate both regular and customized reports on demand, and a dashboard that can be used to inform decisions at both the business management and IT management levels.
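To make the idea of aggregating bulk monitoring data into a business function view a little more concrete, here is a minimal sketch in Python. It is only an illustration: the record layout (`TxnRecord`), the field names, and the sample values are all hypothetical and do not correspond to any particular monitoring product or z/OS record format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TxnRecord:
    """One mainframe transaction measurement (hypothetical layout)."""
    sysplex: str
    lpar: str
    origin_app: str         # assumed: the distributed application that issued the request
    business_function: str  # assumed: a label assigned by the business, e.g. "order lookup"
    cpu_seconds: float


def cpu_by_business_function(records):
    """Sum CPU seconds per (origin_app, business_function) across all Sysplexes and LPARs.

    Returns a list of ((origin_app, business_function), total_cpu) pairs,
    largest consumer first.
    """
    totals = defaultdict(float)
    for r in records:
        totals[(r.origin_app, r.business_function)] += r.cpu_seconds
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample data, for illustration only.
records = [
    TxnRecord("PLEX1", "LPAR1", "web-store", "order lookup", 0.5),
    TxnRecord("PLEX1", "LPAR2", "web-store", "order lookup", 0.25),
    TxnRecord("PLEX2", "LPAR3", "hr-portal", "payroll query", 0.125),
]

report = cpu_by_business_function(records)
for (app, function), cpu in report:
    print(f"{app:10s} {function:15s} {cpu:.3f} CPU s")
```

The point of the grouping key is that it ignores where on the mainframe the work ran and asks instead which application and business function caused it, which is the question a report or dashboard in the central repository would need to answer.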
Providing an end-to-end view of the performance of the entire lifecycle of an application is the ultimate goal. A tool that would offer visibility from the time a request is made somewhere in the IP cloud, through the z/OS environment and back to the original end user is one that any IT manager would gladly welcome. It would enable them to see through the cloud and quickly and easily identify potential problems before they manifest themselves.
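One way to picture such an end-to-end view is to stitch together the timings that each tier already records for the same request. The sketch below assumes, purely for illustration, that every tier tags its measurements with a shared request identifier; the article does not say how requests would be correlated, and the tier names and numbers are invented.

```python
from collections import defaultdict


def end_to_end(timings):
    """Combine per-tier timings into one summary per request.

    `timings` is an iterable of (request_id, tier, elapsed_ms) tuples.
    The shared request_id is an assumption of this sketch.
    """
    per_request = defaultdict(dict)
    for request_id, tier, elapsed_ms in timings:
        tiers = per_request[request_id]
        tiers[tier] = tiers.get(tier, 0) + elapsed_ms

    summary = {}
    for request_id, tiers in per_request.items():
        summary[request_id] = {
            "total_ms": sum(tiers.values()),
            "slowest_tier": max(tiers, key=tiers.get),
        }
    return summary


# Made-up sample data: each tier reports only its own leg of the journey.
timings = [
    ("req-1", "web", 40),
    ("req-1", "app-server", 60),
    ("req-1", "z/OS DB2", 900),
    ("req-2", "web", 35),
    ("req-2", "app-server", 50),
    ("req-2", "z/OS CICS", 20),
]

summary = end_to_end(timings)
for request_id, info in summary.items():
    print(request_id, info["total_ms"], "ms total; slowest tier:", info["slowest_tier"])
```

Viewed one tier at a time, the web and application-server numbers for the first request look unremarkable; only the combined view shows that nearly all of the elapsed time is spent on the mainframe side.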
The benefits of an end-to-end solution percolate throughout an organization on many levels. From an economic standpoint, it allows IT organizations to continue to leverage their existing mainframe investments, saving valuable CPU resources and eliminating (or at least postponing) costly upgrades. From an IT point of view, it helps reduce the load on already overworked IT staff, who are freed from the task of troubleshooting and can focus on tuning and performance management. From a network and application performance standpoint, applications continue to run at peak performance, providing the end user - regardless of where they are - a better experience. And finally, from a management perspective, a complete end-to-end view offers dramatically better aggregate reporting capabilities, allowing for more precise planning and budgeting.