Open Source Capacity & Performance Management Tools for Windows & Unix Systems
by Brian Johnson
Windows and UNIX both come with native utilities that can be used to collect performance data and report it in a simple text format. Many firms, including mine, have developed scripts, processes and tools to collect this data and store it for the purpose of capacity planning and after-the-fact performance analysis.
A Native Tools Strategy
A large number of firms have independently adopted a strategy of collecting data using the native tools provided by Windows and UNIX for those systems that are only of casual interest.
Once the data is collected it needs to be stored and then converted into a form that's useful for reporting and analysis. For most firms this has involved creating and maintaining a collection of utilities that parse the raw data files and write the data to files that can be manipulated by standard tools such as SAS or Excel.
Many firms have midrange systems numbering in the hundreds or thousands and frequently a large number of those systems are running as "appliances" requiring little attention in terms of ongoing detailed capacity planning or performance analysis. While it makes sense to invest in sophisticated products for performance data collection, retrieval, archiving, reporting and modeling for the relatively small number of mission critical systems with dynamic load growth characteristics, the rest of the servers need at least a minimal level of instrumentation.
Unfortunately, most third-party vendor capacity planning and performance monitoring products tend over time toward the "feature rich" end of the spectrum in order to compete, and license and maintenance fees usually track the growth in features. Absent à la carte pricing, the cost of these products, both in terms of dollars and the work effort required to learn, maintain and use them, is considerable.
There are a couple of vendors at the feature-poor end of the spectrum who provide products to handle just the basic functions of collection, retrieval and reporting. At least one of them is agent-less. Although the per-system licensing fees of the low-end products tend to be low, anything multiplied by hundreds or thousands produces a number that attracts the attention of management come budget or cost-cutting time, even if the fee expressed as a percentage of the system cost is low.
Compounding the problem is the fact that many other products that use some of the kernel metrics to perform their core function, such as enterprise management software, eventually suffer from scope creep when their marketing folks say "Hey, we already have access to kernel and process metrics so why don't we add capacity and performance to our list of product claims." That leads to queries from Engineering and Operations about why we can't use the data from their product instead of installing our own. An example of why this usually doesn't work came a few years back when I received five minutes of data from a server to see if it was usable. What I got was a text file in which every one of a hundred different metric data points was individually time stamped, with time values that drifted over a span of about ten seconds, making it impossible to correlate the values with each other.
The final motivator is that the vendors of third-party products are on an endless quest to add features and functionality. The new features invariably have a few bugs, as all new software does, so maintenance releases are sent to fix the bugs. Just about the time that all the bugs are fixed, along comes a feature and functionality release starting the cycle all over again. If your firm has a relatively rigorous Quality Assurance and Change Management cycle, then a large effort is required to deploy the release to the hundreds or thousands of systems. Worse yet, in some situations it requires a sign-off from the multitudes of application groups or lines of business that "own" the systems before any changes can be made. All of this represents a tremendous amount of labor expended that could be better spent doing capacity planning and performance analysis instead of tool maintenance.
My firm began to implement the described scheme in 1997 and continues to support and enhance it to this day. As of today there are systems that were instrumented back in 1997, have not been revisited since, and are still having their data retrieved.
For Windows the data collection is based on the Performance Data Logger that was supplied as part of the Windows NT 4 Resource Kit and the Windows 2000 Performance Logs and Alerts service.
For UNIX the data collection is based on a modified version of the "standard" sys cron job that by default collects system activity data. The modifications include:
- Changing the collection interval from twenty minutes to five minutes and the collection span from 08:00 to 18:00 Monday through Friday to 24 hours a day, seven days a week.
- Adding a script to log iostat, netstat and vmstat data at five minute intervals.
- Adding a script to log per-process statistics at five minute intervals using ps.
- Adding a script to log disk file system space utilization at one hour intervals using df.
- Adding any platform-dependent scripts (e.g., prtdiag on Solaris systems to log the hardware configuration daily).
All of the scripts delete any of their data files older than seven days.
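The seven-day cleanup rule can be sketched as follows. This is only an illustration of the logic in Python; the scripts described here rely solely on executables supplied by the platform vendor, and the function name, directory argument and `now` parameter are hypothetical.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def prune_old_files(data_dir, max_age_days=7, now=None):
    """Delete regular files in data_dir last modified more than
    max_age_days ago and return the names that were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    deleted = []
    for path in sorted(Path(data_dir).iterdir()):
        # Only plain files are candidates; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return deleted
```

Because each script prunes only its own data directory, a system that is never revisited cannot fill its disk with old collection files.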
How the data is retrieved depends on security considerations. For systems on the trusted network and the DMZ the data is retrieved regionally by Managing Workstations (MWSs) that use FTP to retrieve any files not already retrieved. For firewalls an FTP script is used to push the data to an MWS. Except for the method by which the data is copied to the MWS the scripts for systems on the trusted network and DMZ are identical to the scripts used on the firewalls. And yes, Information Security reviewed the scripts and found that they presented no risk by virtue of the fact that all executables were supplied by the platform vendor.
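A minimal sketch of the pull side, "retrieve any files not already retrieved", might look like the following. It uses Python's standard `ftplib`; the function names and the directory layout are hypothetical. Treating "a local copy exists" as "already retrieved" is also an assumption, since the actual bookkeeping is not described here.

```python
from ftplib import FTP
from pathlib import Path


def files_to_fetch(remote_names, local_dir):
    """Return the remote file names that have no local copy yet, sorted."""
    have = {p.name for p in Path(local_dir).iterdir() if p.is_file()}
    return sorted(name for name in remote_names if name not in have)


def retrieve_new_files(host, user, password, remote_dir, local_dir):
    """Pull every file in remote_dir that is not already in local_dir."""
    local = Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        fetched = files_to_fetch(ftp.nlst(), local)
        for name in fetched:
            # Download under a temporary name so that an interrupted
            # transfer is never mistaken for a complete file.
            tmp = local / (name + ".part")
            with open(tmp, "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)
            tmp.rename(local / name)
    return fetched
```

Since the target systems keep seven days of files, a retrieval that fails one night is simply made up on the next run.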
Retrieval of the data files from the systems on the trusted networks and DMZ is scheduled to occur just after midnight local time at the target systems. For example, data from the UK servers is retrieved by the California MWS starting at 16:10 Pacific Time. Ten minutes after midnight is used to avoid any other jobs on the system that are scheduled to run at exactly midnight (a popular time to run daily scripts such as user accounting logs).
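The scheduling arithmetic can be sketched as below. For simplicity the sketch assumes fixed UTC offsets (0 for the UK, -8 for California); a real scheduler would need a time zone database to handle daylight saving changes. The function name and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def retrieval_time_at_mws(target_utc_offset_hours, mws_utc_offset_hours,
                          minutes_after_midnight=10):
    """Return the MWS wall-clock time (HH:MM) that corresponds to
    'minutes_after_midnight' past local midnight at the target system."""
    target_tz = timezone(timedelta(hours=target_utc_offset_hours))
    mws_tz = timezone(timedelta(hours=mws_utc_offset_hours))
    # Any date works because the offsets are fixed; only the time of day matters.
    local_start = (datetime(2003, 1, 15, 0, 0, tzinfo=target_tz)
                   + timedelta(minutes=minutes_after_midnight))
    return local_start.astimezone(mws_tz).strftime("%H:%M")


# UK servers retrieved by a California MWS.
print(retrieval_time_at_mws(0, -8))  # 16:10
```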
Once the data has been retrieved it is compressed using effectively lossless compression schemes, checked for validity and stored in a hierarchical archive directory structure organized by application hierarchy. Finally a program scans the files using a simple threshold-based alerting strategy and emails any alerts to the staff for processing when they arrive in the morning. The most common alerts are excessive average CPU, runaway processes, memory leaks or contention, excessive system calls and excessive context switching.
Currently we retrieve and process the data files at rates exceeding 400 servers per hour. Your mileage may vary depending on the MWS and on the speed and length of the circuit between your MWS and the system whose data is being retrieved. We use Windows 2000 desktop systems fitted with SCSI adapters and external SCSI disk enclosures for our MWSs. Using Windows drive compression allows us to store approximately three months' worth of data for about 1100 UNIX systems on a single 38 GB drive.
There is no automated report or graph generation other than an alert email. That's intentional; our philosophy is that the value that we bring to the table lies in our ability to analyze and interpret the data, not in our ability to generate pretty pictures. Generating thousands of graphs each night would also require a significant amount of processing just so that a handful of them could be reviewed the next morning.
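A simple threshold-based scan of this kind might be sketched as follows. The metric names and limits are hypothetical, since the actual thresholds are not given here, and the sketch covers only the "excessive average" style of alert.

```python
# Hypothetical limits; the real values are site-specific.
THRESHOLDS = {
    "cpu_busy_pct": 80.0,
    "context_switches_per_sec": 20000.0,
}


def check_thresholds(server, samples, thresholds=THRESHOLDS):
    """Compare each metric's average over the day's samples with its
    limit and return one alert string per metric that exceeds it."""
    alerts = []
    for metric, limit in thresholds.items():
        values = samples.get(metric)
        if not values:
            continue  # metric not collected for this server
        avg = sum(values) / len(values)
        if avg > limit:
            alerts.append(
                f"{server}: average {metric} {avg:.1f} exceeds {limit:.1f}"
            )
    return alerts
```

The returned strings would then be gathered into the morning alert email. Keeping the test to a plain comparison against a fixed limit follows the "no heuristics" approach described in this article.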
We freely publish the data to our Engineering, Operations and Application teammates as read-only shares along with a copy of the folder containing the conversion and analysis tools. All we ask of them is that they consult us before they make any decisions based on their own analyses.
In order to prepare the data for graphing, reporting and analysis we have a collection of utility programs that convert the raw data files to .csv files, which can in turn be opened in Excel and graphed or fed to a graphing utility such as the freeware Multi Router Traffic Grapher (MRTG).
As an example, we have a program that can scan any number of sar data files, either a single date for multiple servers or a single server for a number of dates, limited only by the number of days' worth of sar files available, and produce .csv files that can be graphed or otherwise analyzed.
Here are a few of the lessons that we've learned along the way.
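The core of such a converter can be sketched as follows. This is not the program described above; it assumes the Solaris-style text layout of a `sar -u` report, and the function name, sample report and output column names are illustrative.

```python
import csv
import io

# Illustrative report text in the Solaris "sar -u" layout.
SAMPLE_REPORT = """\
SunOS host01 5.8 Generic sun4u    06/23/03

00:00:00    %usr    %sys    %wio   %idle
00:05:00       2       1       0      97
00:10:00       3       1       0      96

Average        2       1       0      97
"""


def sar_cpu_to_csv(server, date, sar_text):
    """Convert the text of one CPU utilization report to CSV,
    tagging every row with the server name and date."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    columns = None
    for line in sar_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line
        if fields[1].startswith("%"):
            # Column title line; emit the CSV header only once.
            if columns is None:
                columns = [f.lstrip("%") for f in fields[1:]]
                writer.writerow(["server", "date", "time"] + columns)
            continue
        if columns is None or fields[0] == "Average":
            continue  # banner line or summary row
        if len(fields) != len(columns) + 1:
            continue  # malformed row
        writer.writerow([server, date] + fields)
    return out.getvalue()
```

Calling `sar_cpu_to_csv("host01", "2003-06-23", SAMPLE_REPORT)` yields a header row followed by one row per five-minute sample. Concatenating the rows from several calls, minus the repeated header, covers the multi-server and multi-date cases.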
- Keep it simple. No heuristics, no pattern recognition based alerts, no modeling (if you need that then buy a product to do it but leave these tools installed).
- If at all possible avoid creating executables to run on the systems; it complicates the release cycle and invariably leads to a need for periodic updates.
- Retain the raw data files in perpetuity in case a question arises as to whether the consolidation or normalizing utilities corrupted or misinterpreted the original data. It also allows for redoing a detailed analysis weeks, months or years later (that's happened to us).
- Use compression schemes on the data files that do not result in data loss (e.g., whitespace compression, removing blank lines, removing null/insignificant data points, and removing redundant title lines). The space savings are huge as the raw data files contain lots of white space and redundant title lines.
- Make the back-end processing utilities support a common set of options (e.g., "/Begin=10:30 /End=11:45").
- Make the back-end processing utilities portable.
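The compression lesson above can be illustrated with a short sketch that collapses runs of whitespace, drops blank lines and keeps only the first copy of each repeated title line. It assumes that a title line is any line containing no purely numeric field, which is a heuristic chosen for this example. Removal of null or insignificant data points is not shown.

```python
def _is_number(token):
    """Return True if the token parses as a number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def compact_report(text):
    """Collapse whitespace runs, drop blank lines and remove repeated
    title lines while keeping every data row."""
    seen_titles = set()
    kept = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # blank line
        squeezed = " ".join(tokens)
        # Assumption: a line with no numeric field is a title line.
        if not any(_is_number(tok) for tok in tokens):
            if squeezed in seen_titles:
                continue  # redundant copy of a title already kept
            seen_titles.add(squeezed)
        kept.append(squeezed)
    return "\n".join(kept) + "\n"
```

Column alignment is lost, but every field value survives, so the back-end utilities can still parse the compacted file by splitting on whitespace.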
OSM Organizational Issues
One of the lessons learned by other OSM efforts is that sometimes a benevolent dictatorship is preferable to a pure democracy.
In some cases, such as Linux, there is a sole dictator. By the terms of his contract with the Open Source Development Labs, where he is a fellow, Linus Torvalds retains veto authority with respect to the Linux kernel architecture and the copyright remains with him personally (eWeek magazine, June 23, 2003, issue).
In other cases, such as the Internet Engineering Task Force, the dictatorship consists of a large number of co-dictators.
The best results come from somewhere in between those two examples, probably closer to three co-dictators.
For the majority of servers in many environments it's only necessary to collect the system-wide metrics and use a simple threshold-based automated analysis scheme to identify systems needing further scrutiny or more advanced tools.
Many firms have already adopted a "Native Tools" data collection strategy and developed a back-end processing tool set resulting in a lot of duplicated effort.
The ideal solution is an "install it and forget it" scheme that can be made part of the standard build process and that eliminates the need to revisit systems once they are instrumented. Such a facility exists for both Windows and UNIX systems.
All that remains is to implement some simple back-end processing tools to manage the retrieval, archiving, analysis and reporting functions. Because the data is common to all the systems there is no reason why a common set of back-end tools to accomplish this can't be produced in such a manner that they can be made freely available. The Open Source model of software creation and distribution provides a mechanism for doing this.
All that may be missing is the commitment to make it happen.