|
SQL Server - A Capacity Management Data Repository?
May, 2004
by Adrian Heald
1 INTRODUCTION
An outsourced Department I have had some association with has recently completed negotiation of a contract extension with its service provider. As part of this new contract the capacity management function has moved back in house. Well, part of the capacity management function anyway. The major function of capacity planning is now in house while the functions of system tuning and service delivery remain outsourced. As application development is an in-house function, the task of application tuning also remains in-house. The Department has always managed invoice verification functions and has had some limited IT infrastructure reporting capabilities.
Up until this new contract the majority of capacity management functions were either the purview of the service provider or were not done, as the Department did not have visibility of the necessary data. The new contract stipulates that the Department will now have visibility of any necessary data required for these functions.
As part of the transition to the new contract, a project was put in place to build the necessary systems required for the capacity management function. These systems were to be responsible for collecting the data supplied by the service provider, storing the data and generating any necessary reports. We saw this as a great opportunity to provide information to the Department that had been sadly lacking in past years and to better plan for the future directions of the Department. As the Department had existing mainframe SAS skills and the major ticket item requiring planning was the mainframe, it was decided that the systems would best be constructed using the SAS system. At the time of planning the project the Department had mainframe licenses for SAS BASE and SAS GRAPH. We had envisaged that the primary data store would be mainframe SAS based with SAS GRAPH being utilised to generate nightly graphs. These would be published to the Department’s Intranet.
Who was it that said, "The best laid plans of mice and men"? On commencing the project we discovered that the SAS GRAPH license had lapsed and was not to be renewed. The cost of this product was the primary reason for this action by the Department. This was a real set back as we definitely needed graphical output and there were no other mainframe tools that we could readily utilise.
2 THE ENVIRONMENT AND THE VARIOUS SOURCES OF DATA
It is useful to examine the Department’s environment and the various sources of data that were to be included in the reporting database. The Department utilises an IBM mainframe running OS390 and LINUX under VM. The LINUX environment is being put in-place to consolidate the Department’s Windows file and print server farm. In addition the Department is moving to a thin client environment with applications being served from a CITRIX METAFRAME farm of 50 or so Windows servers. As per the contract we had limited access to the actual data collection tools but were to get total visibility of any data we required. This data was presented to us in various forms. The diagram depicts the major sources of data and generally how it is to be presented.
2.1 Some requirements of the data
Our original concept of mainframe SAS based databases was still looking good. This solution satisfied some basic data requirements of:
- Backup;
- Retention;
- Security;
- Centrally managed; and
- Central data scrubbing;
It also made sense as initially at least, the majority of data would come from the mainframe via SMF records and using SAS and MXG to process this data is almost mandatory.
The reporting database was yet to be determined!
3 CHOOSING SQL SERVER
It is something of a sad note to say it, but the choice of SQL Server was made almost solely on the cost. The capacity management project had/has a very small budget (as is often the case) and we hadn’t factored in the possibility of needing any tools other than those that were available when we put the project together. The loss of SAS GRAPH meant spending money for an alternative that we just didn’t have. The Department has an existing desktop SQL Server license and we felt that we could utilise this to provide a workstation based reporting database that would give existing reporting tools such as: the Microsoft office products, CapTell, and COGNOS visibility of the data for reporting. There were some other factors that we considered:
Generating reports in the desktop environment would reduce the mainframe CPU costs as the mainframe CPU is billed on a per minute basis but the SQL Server platform is a one off cost;
Analysis of the data via tools such as SQL Server analysis services or COGNOS would be possible;
For the Department LAN based storage is cheaper than mainframe based storage;
Software license fees would be reduced;
Changes to LAN based systems are easier than changes to mainframe based systems; and
Possibility of utilising email to circulate some reports (though email is now being installed on the Department’s mainframe).
So SQL Server it was going to be. We now needed a platform to run as our server. As the data we were planning on storing in SQL Server was somewhat undefined and we were going to build the system using something of a prototyping methodology, we felt that the further we were away from a production SQL Server environment the better. We were not going to be responsible for causing any production problems. It looked like we might be able to squeeze enough money out of our budget to buy a dedicated server. This though would take some time. (At the time of writing, we are still waiting to finalise the acquisition of server hardware).
4 HARDWARE AND SOFTWARE REQUIRMENTS
4.1 Development hardware
To facilitate development, we installed SQL Server on a desktop platform with the following configuration:
133Mhz Pentium processor;
128mb Ram; and
80Gb disk.
While this may be ok for development, clearly it would be unsuitable for any sustained production environment; or so we thought. At the time of writing, we are still waiting for an adequately sized server platform. However, we effectively load data to our database on a daily basis, generate regular reports from this data and are about to launch a Cost Awareness System, "Mainframe Cost Awareness" [Heald 2003] based on data provided from this database.
Limited resources on this platform have significantly delayed the delivery of the systems, due mainly to having to deal with situations that would not normally arise on a fully configured server. Desktop SQL Server has a governor that slows processing when more than 5 active batch workloads are running. With two users accessing this system, we often exceeded the limit and processing virtually ground to a halt. The lack of adequate disk space has meant additional work to ensure processing continues. Several months ago we made the decision to run without backups on the server as they were taking too much space. We can recover from data held on the mainframe but that will take several days.
4.2 Production hardware
What was required was something in the order of:
2 Ghz+ Dual processor;
2 Gb+ Ram; and
180 Gb Disk Storage.
We envisaged that 180Gb of disk storage would satisfy our requirements for storing data from the mainframe. (We retain 25 months of detailed data). Additional storage would be required to store data from the other platforms.
4.3 Software
Software requirements were somewhat more readily satisfied. We required:
- A SQL Server license;
- Office products for adhoc analysis and reporting;
- CapTell for cyclic paper based reporting; and
- ASP for web based reporting.
All of these products were available at no additional cost to our project. For the SQL Server license we decided to utilise one of the Department’s desktop licenses as these were already paid for. Initially we didn’t think this would be a problem. However, as the volume of data and the number of reports we are asked to produce increases, we are often reaching the previously mentioned Desktop SQL Server governor limit when 5 or more active batch tasks are executing.
4.4 Problems and benefits of the inadequate processing environment
The inadequate processing hardware and using the desktop version of SQL Server was to prove frustrating. We load millions of records into our SQL Server database on a daily basis. The limited memory and processor speed simply meant that the load process ran for hours. This, coupled with the desktop SQL Server governor, meant that sometimes simply opening the database would take several minutes; a very frustrating time for all concerned.
The major benefit of this though is that it has forced us to look at making our system as efficient as possible. Where, on a fully-fledged server platform, a table join would have performed adequately we had to find other solutions. Some of these solutions are presented later in the paper.
5 OVERVIEW OF THE PROCESSING SYSTEM
At this stage it is useful to get an overview of our processing system and to examine why we do what we do. To satisfy our requirement for backup and data security etc., we decided that ALL data, irrespective of source, would be sent through to the mainframe. This provided a single monthly database that would contain all data for a particular month. Very little processing occurs on the mainframe other than simple data cleansing and storage. Jobs on the mainframe also convert the necessary data to comma-separated variable (CSV) format for subsequent FTP to SQL Server. The following diagram depicts the general data flow. The mainframe portion of the processing is scheduled to start automatically shortly after midnight each day.
The three key processes here are:
-
SPIN which manages the accumulation of data to the SPIN database on the mainframe.
SPIN simply uses SAS APPEND processing to append new data to existing tables in the SPIN database (the SPIN database is a SAS multi volume database, currently 9000 cylinders). The tables are sorted to remove duplicates.
-
ARCHIVE which removes older data from the SPIN database into the ARCHIVE databases.
The ARCHIVE process splits up data it finds in SPIN into monthly chunks. Monthly datasets are dynamically allocated, one for each table, and the data appended to existing tables in the monthly datasets. Any data older that 2 months is processed in this way. Using this procedure we are able to add several months data from a new data source and ARCHIVE will automatically distribute it across several monthly archive databases.
-
LOAD SQL SERVER, which is responsible for getting new data into the SQL Server database.
LOAD SQL SERVER for the most part automatically loads CSV data into the SQL Server database using dynamically created partitioned tables.
5.1 Partitioned tables
Within SQL Server, data can be split up into separate tables based on the value of columns within the data. These tables are called partitioned tables. The use of partitioned tables in the capacity management data repository is mandatory to obtain the necessary level of performance. When using SQL to reference a row in a table via a WHERE clause that refers to a column without an index, SQL Server will read every row in the table. This is called a full table scan. As we generally filter our data on date and some other columns, having the data partitioned on date restricts SQL Server to only scanning the rows in the particular partition referred to by the date value.
The simplest method of partitioning is to base the partition on date, one month per partition table. The following diagram illustrates partitioned tables. Update operations referencing the VIEW result in SQL Server splitting the data across the appropriate tables.
The use of partitioned tables also simplifies culling old data, simple delete the table part to be culled and recreate the view.
5.2 LOAD SQL SERVER
The LOAD SQL SERVER procedure is a SQL Server stored procedure that can reference a comma separated text file. It creates a new load table in SQL Server based on the structure of the text file. It then loads the text file into this new load table. Once loaded, the data is examined to determine the date ranges and new monthly partitions are created. A SQL Server view is constructed which UNIONS all the monthly tables together to form a single view of the data. The data from the load table is then loaded into the appropriate monthly tables via the view.
The LOAD SQL SERVER process is automatically started at 05:00 using the SQL Server job scheduling facility.
5.2.1 PROBLEMS WITH THIS TECHNIQUE
While this facility works relatively well there are a number of issues:
-
There are some bugs in Microsoft’s OLEDB which is used to access the text files. These seem dependant on release and service level of both SQL Server and the operating system;
-
It doesn’t recognise date and times properly. When first creating the load table, the column types are determined by examining the CSV file. SQL Server does this but unfortunately date and time values result in VARCHAR columns. This means that after the first load for a new data source the table definition needs to be tweaked to change these VARCHAR columns to DATETIME format;
-
By default SQL Server defines all character columns as NVARCHAR which uses two bytes to store every character. Part of the first time run tweaking is to change these columns to VARCHAR (1 byte per character); and
-
Having the LOAD SQL SERVER process run at a certain time has caused some minor problems when the mainframe processing has not yet completed. This will be rectified when we have email facilities installed on the mainframe. When this is complete the mainframe processing can email SQL Server, which will then initiate the load process.
6 SQL TEXT SERVER
Access to the FTP’d text data by SQL Server is relatively simple with a linked server. This makes CSV and TXT files in a directory visible to SQL Server as if they were tables. SQL Server is able to determine column types (for the most part anyway, see "problems with this technique") and can read the rows of the CSV file using SQL queries etc.
To create a text server on your SQL Server simply execute the "sp_addlinkedserver
" stored procedure. The following adds a linked server called TXTSRV to reference CSV and TXT files in the C:\TEXT FILES directory.
7 ISSUES WITH DATES AND TIMES
One major area of difficulty with this processing lies in the way SQL Server handles, and the way we want to process, DATES and TIMES.
We utilise time values often as parts of primary keys. Under normal circumstances this was working fine. Then we started to get duplicate key problems. It transpired that the issue was with the accuracy of the time values. Two CICS transactions for the same user started within the same thousandth of a second. The time value stored in SAS showed a difference beyond this accuracy. Unfortunately SQL Server only maintains time accuracy to 3 decimal places. In this case we removed the format from the time value and included this unformatted value as part of the primary key.
When indexing on a SQL Server DATETIME value, SQL Server will not use the index if you are referring to either the date part or the time part. To resolve this issue we create separate columns for date and time.
To simplify processing, any measures of time such as CPU time were stored as floating-point numbers, representing the number of seconds.
8 THE NEXT STEP
As of the time of writing this paper our SQL Server system is up and running, still on a desktop with grossly inadequate resources, but it is running. The paper "Mainframe Cost Awareness" [Heald 2003] describes a cost awareness system that relies on data from the SQL Server data repository and we are currently producing several report documents from the system. However, there is still along way to go.
Firstly, we will obtain (beg, borrow or steal) a dedicated server platform. Free space on our desktop will be depleted within a couple of months. It is therefore mandatory that this is resolved. We are also very hesitant to put any additional load on the processing resources we currently have. Once adequate hardware is available there are several exciting projects to be considered.
8.1 OLAP
SQL Server Analysis Services provides OLAP services. This will greatly increase our ability to analyse the stored data and will, in many cases, simplify reporting tasks.
8.2 Intranet access
Providing a Departmental intranet access to some of the data will empower our users, particularly the development staff, to develop efficient cost effective systems. It will also provide management with a better understanding of the actual costs of running some of the IT systems.
8.3 New Data
We currently have only included mainframe-based data. There remains vast amounts of data from the Department’s Metaframe environment, LINUX, Windows Servers etc. Additional data from Asset Management Systems, Help Desk Systems, Business data and Telephony Systems will complete the data collection phase. The disparate sources of this data will pose no problems for the Data Repository as all facilities from the Mainframe through to the SQL Server platform have been designed to support virtually any data source.
9 IN SUMMARY
This paper has looked at one instance of using a somewhat non-standard data repository to store capacity management data. We were forced to seek a new product for graphical reporting and analysis. This, coupled with our limited budget, resulted in our choice of SQL Server as the data repository.
Through the development of a sound infrastructure, we have been able to build a system that can accommodate virtually any data with virtually no additional coding or development work. Utilising the mainframe and SAS for backup, bulk data storage, and data cleansing makes good use of its abilities, while reporting using desktop tools such as Office, Cognos and CapTell greatly enhances our productivity and report format.
On the whole, bearing in mind that we still are using inadequate hardware and that the project is not yet finished, I believe that it has successfully shown that SQL Server can be used in this way to great advantage.
10 REFERENCES
Heald Adrian, "Mainframe Cost Awareness", Presented at CMGA conference 2003
Last Updated 06/05/09
Home |
Conference |
Groups |
National |
Members |
Links |
Site Map
|