MeasureIT – Issue 3.11 – mBrace – Part I by Michael Kok

mBrace - Part I
A pragmatic model driven approach for performance engineering of transaction processing systems
November, 2005
by Michael Kok

1 Introduction

The changed business supported by a newly built information system was supposed to yield savings of $200,000 per month. The system was to be rolled out to 700 users within three months. But the roll out stalled at 85 users. Response times were in the order of minutes where they should have been seconds. It took six months to sort out the problem and correct it. The damage amounted to $1.2 million in lost benefits, an unknown amount of further losses and $300,000 for hired expertise to correct the situation. There are several ways this disaster could have been prevented, and the cost of prevention would not have exceeded $50,000.

Problems with responsiveness, i.e. poor response times, can occur at various stages in the lifecycle of an information system. Such problems can inflict tremendous damage. When implementing a new information system there is much at stake. A large investment in business transformation can be wasted because of poor response times of the supporting information system. Reducing such risks by securing responsiveness is a beneficial challenge. The mBrace approach is a model driven approach particularly developed to secure responsiveness in the commissioning stage of an information system.

The name mBrace is an acronym that is intended to refer loosely to model based control of responsiveness. It is an approach that defines activities, techniques and deliverables.

This paper describes a number of interesting elements of the approach in two parts. The first part, published this month, outlines the approach and describes the way the model is used for performance analysis in a proactive improvement effort.

The second part, published next month, will illustrate how to create the model and obtain the necessary data to be filled into the model. For each project conducted with the mBrace approach the model is tailored to the specific situation. The model is then used in that project but is not a commercially available product by itself.

First the benefit of the approach is outlined, then a brief overview is presented and it is positioned against various alternatives. Then, an example case shows how the model is applied for analysis and proactive improvement of system performance.

2 Positioning the mBrace approach

2.1 Performance analysis and stages in the life cycle of an information system

For anyone dedicated to delivering and managing information systems with good performance, it is worthwhile to look at the lifespan of a typical system. A system may be viewed as having four stages that are essential with respect to system performance engineering:

  1. Planning
  2. Development
  3. Commissioning
  4. Exploitation

Figure 1 depicts the life span of a typical information system from a system performance engineering perspective. Quality of response times, also termed responsiveness, is mapped on the y-axis. The bold black line describes the way responsiveness develops along the life span of the information system.

At cut over, responsiveness should be at the norm, but notice that it is still far below. After quite some effort, quality rises until it meets the norm, and then the system is operationally ready. Often this point is reached only several months after cut over. For some time afterwards quality meets the norm. At some point, however, responsiveness may drop below the norm again for some reason.

Figure 1

When trying to control responsiveness, the approach and its effectiveness are not the same in each stage. Each of the stages of the life span sets its own requirements for a performance engineering approach. In the first stage, Planning, senior management must legitimise the investment and estimate the future running costs of the projected system in order to fit it into the business case. The cost estimates have limited reliability and must be repeatedly reaffirmed throughout the development stage. The client who ordered the system usually prefers to choose between high and low quality/cost options before the start of the development. This includes making a choice between higher performance at higher cost and lower performance at lower cost, in order to balance business performance and running costs. It seems to be difficult, if not impossible, to provide this kind of insight. To do so, the supplier of the system must know the performance - cost curve (figure 3) of the system to be developed. If this is not feasible in the first stage, then in which stage can answers to such questions be provided?

At stage 2, Development, developers lay the basis for the system's performance. The application developers can test each transaction for its single user performance in the test environment. If necessary, they can investigate the transaction performance further with the help of a profiler. The infrastructure architects influence future performance when balancing state of the art and proven technology in their choice of infrastructure components.

At stage 3, Commissioning, the system is prepared for cut over, and after that it shows its real life performance while being rolled out to an increasing number of users. Figure 1 depicts a case that occurs too often. The detail in figure 2 stresses the core of the problem. Responsiveness is far below the target at cut over. Consequently the rollout stalls at an early stage. The company is in panic. All experts are mobilised and start trying to find the cause. The solution is typically searched for by trial and error. When it looks as if the solution has been found and implemented, the rollout restarts, but a couple of weeks later the problem may recur and the cycle repeats. This way, it commonly takes several months until the system reaches operational readiness. The man-hours that all the experts spend correcting the problem may be an expensive affair. Additionally, lost business benefits may cost many times more than that.

Figure 2. Responsiveness should but often does not meet the norm at cutover

Figure 2

At stage 4, Exploitation, the system performs at full load. Especially where various applications share a common infrastructure there is the risk that a minor change can cause too much load. At the Exploitation stage this is less difficult to prevent. Much of the data for securing performance can be made easily available. The transaction volumes are known, the transaction mix is clear and resource usage can be routinely measured.

Figure 3. The responsiveness - costs relationship curve

Figure 3

It is extremely beneficial to secure a system's performance in stage 3, Commissioning, before the system goes live. What insight can performance analysts provide in order to secure response times before the system goes live? At this stage, there is less information available than in stage 4, Exploitation. Expected transaction volumes, transaction mixes and resource usage are not known yet. There is no baseline yet. So stage 3 sets rather demanding requirements for the approach used and is the most challenging stage for the performance analyst.

The mBrace approach is applicable at stages 2 (Development), 3 (Commissioning) and 4 (Exploitation), but is especially created for meeting the challenges of stage 3. Moreover, the approach is well suited to gain control over the responsiveness - costs relationship curve.

2.2 The mBrace process

mBrace is an approach consisting of a process, techniques and deliverables. The sketch in figure 4 gives an impression of the process and deliverables of the mBrace approach.

Figure 4

Figure 4. The mBrace process

2.3 Alternative options

There are several ways to secure responsiveness of new information systems, such as over sizing the infrastructure, performance testing and performance analysis.

Over sizing the infrastructure

Over sizing the infrastructure means buying much more capacity, hoping that this covers all potential performance problems. It does not provide any insight into the performance behaviour of the system. During the first phases of the rollout of the system, measurements may be done for analysis. If a problem occurs, there is little time left to solve it. This type of approach can be costly and it is questionable whether the extra capacity will resolve the most likely bottlenecks.

Performance testing

The purpose of a performance test is to reveal whether the system's performance meets its target. If it meets the target: OK; if it does not: pity. This approach is popular.

An increasing transaction volume is offered to the information system using a simulator or a record and playback tool. At the same time, response times are measured. When the system collapses, the maximum transaction volume and the response times to expect are known. This is very important insight; however, in many cases more insight is needed. If the maximum feasible transaction volume is not sufficient to meet expectations, there is no additional information about the cause or about what to do about it. For example, when a system should perform for 2,000 users and it collapses at 1,200, the test results do not provide any answer to the question of what to do next. In that case, trial and error changes with repeated performance testing are performed. This is a procedure that can be time consuming and expensive.

Compared to the option of over-sizing or to the damage of poor response times, performance testing is cheap and quick. An important plus of performance testing is that it really tests the system. There is less room for surprises.

The test can be completed in a couple of days; however, diagnosing and solving performance problems can take several weeks to months.

Performance analysis

The purpose of a performance analysis is not only to reveal future performance, but also to identify the measures to improve it. Performance analysis involves collecting and analysing information in a structured way. First a model is made, and from that a measurement plan is derived and implemented. The measurement data are parsed and fed into the model together with business statistics. As a result, detailed insight into the time behaviour and resource behaviour of the information system is gained.

Comparing performance testing and performance analysis, the first thing to remark is that the latter requires scarcer skills. If a performance test proves at once that the system meets its performance requirements, then it is marginally cheaper than performance analysis. If the system does not meet its requirements in a performance test, then performance analysis works out much cheaper, because it shows up front what has to be done and what the effects are. Performance analysis, though a more complicated affair, is very effective and relatively cheap. The cause of any performance problem is revealed, and the measures to be taken as well as their effects are clear in advance.

Combined performance test and analysis

Integrating performance test and performance analysis delivers the best results. The measurement data for a performance analysis are gathered in a performance test. The data gathered at high transaction volumes are used to validate the model.

The mBrace approach covers performance analysis, either in combination with performance testing or without it.

3 The example case

The approach is outlined with a fictional example. The example is based on real life experience from a number of assignments in this field, in which the mBrace approach had a vital role. It concerns a system under development to be used in a sales support process for insurance products of average complexity.

3.1 IT infrastructure

In this paper the term information system is used to describe the combination of an IT infrastructure and an application. The application runs on the infrastructure depicted in figure 5. It consists of the following infrastructure components: PC’s, Servers, a Storage System, LAN’s and a WAN. The components may have one or more resources; e.g. a server has at least the following resources: processors, (access to) a storage system or local disks and memory. If transactions spend enough time at a resource to be noticed by the users, that resource is considered critical.

There are on-line interfaces that provide connections to the General Ledger and Customers functions. Resources can be horizontally and/or vertically scaled. Access to disks can be scaled vertically by replacing them with faster ones and horizontally by increasing their number. Processors can be scaled vertically by replacing them with faster ones and horizontally by increasing their number (in the case of symmetric multiprocessing).

Resources are designated critical or not in an arbitrary way. Also software resources may be identified as critical, e.g. a lock or a fixed listening time.

Figure 5

Figure 5. The infrastructure

In the example, the LAN’s and the incoming side of the WAN do not contribute significantly to the response times, so they are considered non-critical resources.

3.2 Application

The sales support application is called the observed application. In the example, it is in the process of being developed. In total the application has 389 transaction types. However since only 111 are frequently used, we focus on these. This group of transactions is called the transaction focus. Apart from the sales support application, there are other applications that share the same infrastructure. These are named the environmental applications. Observed and environmental applications influence each other’s response times. Commonly the environmental applications are already fully operational. Measurement data of the environmental applications are restricted to operationally measured resource utilisations at peak times. Without the environmental applications, the response times of the observed application would be shorter.

There are transaction types and transactions. A transaction is an occurrence of a transaction type. However, in this paper the word transaction may be used even if we mean transaction type. Therefore be careful about the context in which these terms are used.

A transaction is fired by the user and is recognisable by the user. The user triggers a transaction by a mouse click, hitting the enter-key or a function-key. A transaction ends when the output on the display is complete.

One of the challenges is that application developers do not always bother to identify the transaction types. There are exceptions such as in the case of CICS where the use of a unique four-letter label is enforced for each transaction type. It is important to identify the transaction types in a way linked to the business and if there is no identification available they are labelled ad hoc.

3.3 Response time breakdowns

Transactions spend time using resources or waiting for resources before they can use them. In a response time breakdown the time a transaction spends on each resource is visualised. The mBrace approach graphically produces a response time breakdown for each transaction type in one chart. In the chart there is a multi coloured horizontal bar for each transaction type. Each coloured part shows the contribution to the response time by the corresponding resource. In general only resources that noticeably contribute to the response time are included in the chart.
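The structure of such a breakdown can be sketched in a few lines of Python. This is an illustration only and not the mBrace spreadsheet: the transaction labels and all times below are invented, and the resource labels merely imitate the naming used in this paper, where a "Q" in the name marks time spent waiting for a resource.

```python
# Hypothetical response time breakdowns, in seconds per resource.
# Labels with a "Q" (e.g. ASQCPU) stand for waiting for a resource,
# labels without it (e.g. ASCPU) stand for using the resource.
breakdowns = {
    "TX_A": {"ASQCPU": 0.25, "ASCPU": 0.5, "DSCPU": 1.0, "DSSTG": 0.75},
    "TX_B": {"ASCPU": 0.25, "DSCPU": 0.5},
    "TX_C": {"ASQCPU": 1.5, "ASCPU": 0.5, "DSSTG": 2.0},
}


def response_time(parts):
    """The response time is the sum of the times spent per resource."""
    return sum(parts.values())


def slowest(all_breakdowns, n=25):
    """Return the n transaction types with the longest response times,
    sorted by decreasing response time."""
    ranked = sorted(
        all_breakdowns,
        key=lambda tx: response_time(all_breakdowns[tx]),
        reverse=True,
    )
    return ranked[:n]


def waiting_share(parts):
    """Fraction of the response time spent waiting for resources."""
    waiting = sum(t for label, t in parts.items() if "Q" in label)
    return waiting / response_time(parts)


for tx in slowest(breakdowns, n=2):
    print(tx, response_time(breakdowns[tx]), waiting_share(breakdowns[tx]))
```

Each inner dictionary corresponds to one horizontal bar of the chart, and each entry to one coloured part of that bar.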

The data collected on each transaction type of the transaction focus is kept in a transaction bookkeeping. The transaction bookkeeping is maintained in a spreadsheet together with the model and its dashboard (see next section: The user interface of the model), etc.

3.4 Business volumes and transaction volumes

The transaction volume is an important item. A response time value is only meaningful at a given transaction volume. Being in control of response times therefore means being in control of them at specified transaction volumes.

To forecast transaction volumes is a complicated task for a number of reasons. It is much easier to forecast business volumes. In most cases the management of a company knows its business volume fairly well. In our example the target business volume is 60,000 sales orders per year. This is called the base volume. From this it is possible to derive the transaction volume in the peak hour. It is important to preserve the relationship between business volume and transaction volume. The base volume is the central parameter.

The sales process for handling one sale in this example can be divided into five use cases. For each use case it is determined which system transactions are executed, e.g. for the first use case 12 transactions are executed. For each use case it is not too difficult to estimate how many times it is executed on average for each sales order. In our example each use case is executed once per sale. Each time the user works through the five use cases in order to process a sale, he or she fires 295 transactions on average. This determines both the system transaction volume and the transaction mix. In this example 17.7 million transactions are processed for the targeted 60,000 sales orders per year, resulting in 12.6 transactions per second in a peak hour. The peak hour is taken on the busiest day of the year. A peak day happens to see twice the average number of sales processed, and a peak hour sees twice as many sales as an average hour. Consequently peak traffic in this example is four times as high as the average. In practice peak values are seldom that heavy. If possible the peak-average factor should be obtained from observing the business.
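The derivation from business volume to peak transaction rate can be retraced with a small Python sketch. The 60,000 sales orders, the 295 transactions per sale and the two peak factors of 2 come from the text above. The number of working hours per year is not stated in this paper; the 1,560 hours used below (for instance 195 working days of 8 hours) is purely an assumption, chosen because it reproduces a peak rate of roughly 12.6 transactions per second.

```python
SECONDS_PER_HOUR = 3600


def transactions_per_year(sales_orders_per_year, transactions_per_sale):
    """Derive the yearly transaction volume from the business volume."""
    return sales_orders_per_year * transactions_per_sale


def peak_transactions_per_second(tx_per_year, working_hours_per_year,
                                 peak_day_factor=2.0, peak_hour_factor=2.0):
    """Average rate over the working hours, multiplied by both peak factors."""
    average_tps = tx_per_year / (working_hours_per_year * SECONDS_PER_HOUR)
    return average_tps * peak_day_factor * peak_hour_factor


yearly = transactions_per_year(60_000, 295)

# ASSUMPTION: 1,560 working hours per year; this figure is not in the paper.
peak = peak_transactions_per_second(yearly, working_hours_per_year=1560)

print(f"{yearly:,} transactions per year")
print(f"{peak:.1f} transactions per second in the peak hour")
```

Because the base volume is the only business input, changing the first argument of `transactions_per_year` is enough to rescale the whole calculation, which is why the relationship between business volume and transaction volume is worth preserving.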

The concept of transaction may need further exploring. A transaction may be built up from sub-transactions. Examples can be found in most web applications. To fill a screen of a browser for one transaction a number of items are retrieved and filled into an HTML page. Analysing the performance of one transaction may require the analysis of more than 10 sub-transactions.

3.5 The norm for response times

The norm set for response times is:

    100% of the system transactions processed should be below 2 seconds, however at full transaction volume only 95% need to be faster than 2 seconds.
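As an illustration, the norm can be written down as a simple check in Python. This is a sketch under assumptions: the list of response times is invented, every observation is treated as one processed transaction, and the nearest-rank method is just one of several common ways to compute a percentile.

```python
import math

NORM_SECONDS = 2.0


def percentile(values, pct):
    """Nearest-rank percentile: the smallest observation such that at least
    pct percent of all observations are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_norm(response_times, full_volume):
    """At full transaction volume 95% of the transactions must be faster
    than the norm, otherwise all of them must be."""
    required_pct = 95 if full_volume else 100
    within = sum(1 for t in response_times if t < NORM_SECONDS)
    # Integer comparison avoids floating point rounding at the boundary.
    return within * 100 >= required_pct * len(response_times)


# Hypothetical measurements in seconds: 19 of the 20 are below the norm.
sample = [0.5] * 15 + [1.0] * 3 + [1.5] + [3.0]

print(percentile(sample, 95), max(sample))
print(meets_norm(sample, full_volume=True))
print(meets_norm(sample, full_volume=False))
```

With this sample the check passes at full transaction volume but fails below it, because one transaction out of twenty exceeds 2 seconds.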

4 The user interface of the model

The user interface of the model is described here; the model itself is described next month in Part II. The user interface consists of the dashboard to control the model and to show the outcomes. It plays an important role. Figure 6 shows an overview of its structure.

The left side of the dashboard shows the parameters and detailed outcomes of the model. In the middle there are two graphical charts. The larger one shows the response time breakdowns of the transactions. In the dashboard only a selection of the 25 transaction types with the highest response times is shown. The transaction types are sorted by decreasing response time. One must be aware that the scale of the chart may vary with the results.

The smaller chart down at the right shows the utilisations of the critical resources.

Figure 6

Figure 6. The dashboard is a crucial element of the model

4.1 The parameters and outcomes

The upper left part consists of a number of general parameters and outcomes, including transaction volume, peak volume factors, the average value of the response time, its 95-percentile and its maximum.

Further down there are sections with parameters and outcomes for terminal servers, the application server, the database server and the Wide Area Network. Within these sections there is a subsection headed by a colour-marked bar for each critical resource.

The outcomes

Mixed with the parameters there are outcomes such as average response time and 'required' and 'realised' resource utilisations. Later in this paper, in the section "The resource utilisations", the terms required and realised utilisations are explained.

The servers

At a minimum, for each server there are sub-sections for CPU, Disk storage and dynamically used memory. Each section has some parameters that can be used to alter the capacities of the server to show the effects of horizontally or vertically scaling the resources. The outcomes presented include resource utilisations from the observed applications that we analyse in detail and utilisations from environmental applications. Environmental applications share the infrastructure with the observed application but are not analysed in detail. It is mainly the utilisations they cause that we need to know for our performance analysis. As to memory, only the dynamically used part of memory is of interest, i.e. the amount of memory available and temporarily used by the transactions.

The terminal servers have a special position. They are considered to be a section of a larger server farm. The number of allocated terminal servers can be easily altered according to the need. The model calculates the number of terminal servers required for a particular transaction volume and adjusts that number automatically.

The Wide Area Network

In the example, the Wide Area Network is modelled with an input and an output part. When this is of interest the model for the WAN can be detailed much further. The WAN can be both horizontally and vertically scaled by increasing the number of parallel channels or changing the speed per channel.

4.2 The response time breakdowns

The dashboard shows the breakdown of a selection of the 25 transactions with the longest response times in a chart sorted by decreasing response time. The legend in the upper part of the chart defines the colours. The following table explains the abbreviations used in that box:

Figure 7

The response times in the example are broken down into 18 parts that are related to 11 resources. The intermediate results determine which resources are involved in the response time breakdowns.

The chart in the dashboard shows a selection of 25 out of all 111 transactions of the transaction focus that have been investigated. Since the transactions with the largest response times are the most interesting, a selection of 25 transactions is adequate for the dashboard. When necessary the other transaction types can be viewed in other charts.

4.3 The resource utilisations

Figure 8 shows two examples of the charts with the utilisations of the critical resources. For each critical resource there is a vertical bar with four possible colours.

Figure 8a

Figure 8a. The critical resources with normal utilisations


Figure 8b

Figure 8b. Most of the critical resources are overloaded

The red part (Environment) shows the utilisation caused by environmental applications that share the same resource. The environmental applications are already in the Exploitation stage and are not investigated in detail now. The utilisations they cause on the resources are known from operational measurements. Yellow is the idle part of the resource. The dark blue part indicates the resource utilisation caused by the transaction types of the observed application (Application X). Light blue (Not realised util) shows the utilisation of the observed application in excess of 100%. Of course a utilisation above 100% is physically impossible, but it indicates what the utilisation of the resource would be if it had sufficient capacity. In the example DSCPU, the CPU’s of the database server, shows a total of realised and not realised utilisation, i.e. a required utilisation, of 195%. This means that if the CPU capacity increases by a factor of 1.95 its utilisation becomes 100%. So in order to operate at a target of 60% the CPU capacity must increase by a factor of 1.95 / 0.6 = 3.25. This concept of not realised utilisation is very useful, because it reveals exactly how the resource must be scaled.

The exact resource utilisations can be found among the detailed outcomes in the dashboard, in the section of the corresponding infrastructure component.
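The arithmetic of the DSCPU example can be captured in a one-line function. The Python sketch below is only an illustration of that calculation; the function name is hypothetical, and it assumes that the demand on the resource stays the same while its capacity is scaled.

```python
def capacity_factor(required_utilisation, target_utilisation):
    """Factor by which the capacity must grow so that a resource with the
    given required utilisation operates at the target utilisation.
    Both utilisations are fractions, e.g. 1.95 for 195%."""
    return required_utilisation / target_utilisation


# DSCPU from the example: a total of 195%.
to_full_load = capacity_factor(1.95, 1.00)  # just enough for 100%
to_target = capacity_factor(1.95, 0.60)     # enough to operate at 60%

print(f"{to_full_load:.2f} {to_target:.2f}")
```

A factor below 1 would indicate that the resource has more capacity than the target utilisation requires.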

5 Performance analysis and proactive improvement

In this section the use of the model for analysing and improving the information system is demonstrated. The model has been filled with real life data obtained from measurements and has been validated. The performance of the information system is shown with help of the model. The effects of measures for performance improvement are shown with the dashboard. In this way the effects of these measures can be tested with the model before they are implemented. So the best measures for improvement can be chosen.

5.1 Single user performance

With only one active user, the base volume of 484 sales orders processed per year results in a negligible 0.1 transactions per second. So there are no waiting times for resources and each transaction can be executed without concurrency of other transactions. As a consequence only the items 1.2 TSCPU, 1.4 TSSTG, 2.2 ASCPU, 2.4 ASSTG, 2.6 ASINT, etc., the items without a "Q" in their names, are visible. Looking at the response time breakdowns at single user load (figure 9) reveals much of the application’s performance. Several things are worth noting.

Figure 9. The dashboard with single user load at the start.

Figure 9

The chart shows the response times in seconds. The dashboard shows that at least 25 transaction types exceed the norm of 2 seconds. There are three transactions that have excessive response times: UC005TX008 with 13.7 seconds, UC005TX006 with 12.9 seconds and UC005Tx011 with 9 seconds. For UC005TX008, it is the CPU usage at the Database Server (3.2 DSCPU) with 7.5 seconds that dominates. For UC005TX006, it is the 9.5 seconds spent accessing the storage system by the Database Server (3.4 DSSTG) that is dominant. Further, this transaction has a CPU usage (3.2 DSCPU) on the Database Server of 2.2 seconds. CPU usage (2.2 ASCPU, 3.1 seconds) and access to storage at the Application Server (2.4 ASSTG, 3.5 seconds) dominate transaction UC005Tx011. A magenta component describes the contribution of the interfaces (2.6 ASINT). Remember that the observed application interfaces to the Customers and General Ledger applications (see figure 5). Since these are outside the scope of the study, only the total time spent on the interfaces was determined. Their breakdowns were not analysed in detail. Thus only the overall time the interfaces take in the test environment is known. Consequently, there is no information on how the times of the interfaces develop at higher transaction volumes.

The average response time equals 1.8 seconds, which is acceptable. However, a 95-percentile value of 4.9 seconds and a maximum of 13.7 seconds are far above what is acceptable.

The Application Development group was invited to pay attention to these findings.

5.2 Application improved

After Application Development improved some transaction types, fixed a bug in the DBMS and solved a problem with the interfaces, new measurement data were collected and fed into the model. As can be seen from figure 10, the transaction types UC005TX008, UC005Tx006 and UC005Tx011 were drastically improved and the interfaces had their contributions to the response times drastically reduced.

Figure 10

Figure 10. After improving the application the transaction types have been sorted by decreasing response times again.

The transaction types UC004Tx018 and UC005Tx020, with their large contributions for the CPU (blue-grey) of the Database Server (3.2 DSCPU), have not changed yet, however, and are now the transactions with the longest response times. Application Development plans to improve them at a later stage.

The average response time is now 0.9 seconds, the 95-percentile value is estimated at 1.7 seconds and the largest response time equals 3.9 seconds.

Though not all transactions show response times that meet the norm yet, there is now a basis for the roll out to the 124 users. So now we can study what will happen when production is raised to the target base volume of 60,000 sales orders per year.

5.3 Load from environmental applications introduced in the model

After introducing the utilisations of environmental applications, the response times are slightly longer but the changes are not spectacular.

Figure 11

Figure 11. The performance after introducing the utilisations of environmental applications

5.4 Transaction volume increased to target volume

In Figure 12, the number of users has increased to the target of 124 (in the model of course) and consequently the base volume increased over the target of 60,000 sales per year. Obviously there are a few challenges to be met before the system is ready for use. The model reports that the maximum value of the response time has changed from 4.0 seconds to over 150 seconds!

The analysis in the next steps focuses on the capacities of the infrastructure and is purely done with the model. No new measurements are collected.

All resources except those of the Terminal Servers are now overloaded. The number of Terminal Servers was simply increased from 3 to 6. The utilisation of the dynamically used part of the memory of the Terminal Servers is even low at 25%.

The chart of the Utilisations of critical resources reveals all bottlenecks at once. It shows that the resources of the Application Server and the Database Server as well as the outgoing side of the Wide Area Network are overloaded. Some of their required utilisations even exceed 200%. The accurate values of the required utilisations can be found in the left part of the dashboard. The scale of the resource chart is clipped to 200%.

Figure 12

Figure 12. The system is not yet ready for use. Notice that the scale of the chart of the response time breakdowns has changed again.

The colour pattern has completely changed. This is an important feature of the dashboard. The dominant parts of the breakdowns take so much space that they tend to squeeze the other colours out of the figure. The response times now consist mainly of waiting times for the overloaded resources: 2.1 ASQCPU, 2.3 ASQSTG, 2.5 ASQMEM, 3.1 DSQCPU, 3.3 DSQSTG, 3.5 DSQMEM and 9.3 QNETWOUT. They dominate the figure to such an extent that the times spent on the resources themselves are hardly visible anymore. This helps to focus on the important aspects.

The model reveals in a practical way all corrective measures to be taken in one stroke. The real system would not allow for much increase in transaction volume after the first resource becomes overloaded. For the model this is not a problem. It keeps providing fair estimates of waiting times and provides a useful picture even when all resources are overloaded. Now the parameters can be altered for scaling the overloaded resources one by one until acceptable response times are realised.

5.5 Application server and Database server upgraded

After increasing the capacities of the Application and Database Server a new state is reached. The CPU’s were scaled vertically by increasing the relative processor speed from 1 to 5. This large increase leads to under-utilisation of the CPU’s again. Therefore the number of Application Server CPU’s was safely reduced from 4 to 2. Together these measures bring the CPU utilisations down to a more moderate 50% and 39%. The chart shows that this eliminates most of the time spent waiting for the CPU’s (2.1 ASQCPU and 3.1 DSQCPU). Further, the time spent on the CPU’s is reduced by 80% by increasing the speed of the processors (this can’t be seen in the chart).

Adding 5 disks at the Application Server and 40 at the Database Server increases the access capacity of storage. Of course the traffic and thus the data have to be spread over all the disks again. This reduces the utilisation of the storage to a more moderate 63% and 60% respectively and, as the model shows, eliminates most of the time spent on waiting for storage (2.3 ASQSTG and 3.3 DSQSTG).

Figure 13

Figure 13. Outcomes after increasing capacity of the Application Server

The utilisation of the dynamically used part of memory and the corresponding wait time have decreased considerably, but are still too high with a reported wait time (2.5 ASQMEM) of 12 seconds and a required utilisation of 476% at the Application Server. Since we still have delays at the Wide Area Network that influence memory demand, increasing memory can wait for the time being until the WAN is corrected.

Obviously all response times are still far above the norm of 2 seconds.

5.6 Capacity of Wide Area Network increased

The outgoing part of the WAN is expanded by increasing its speed from 2 Mbits/sec to 6 Mbits/sec. This brings the required utilisation of 174% down to an acceptable 58%.
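
The 58% follows directly from the capacity ratio if the offered traffic is assumed to stay the same, so that utilisation is simply load divided by capacity. A minimal sketch of that calculation (the function name is illustrative, not part of the mBrace model):

```python
def utilisation_after_scaling(required_utilisation: float,
                              old_capacity: float,
                              new_capacity: float) -> float:
    """Rescale a required utilisation to a new capacity at unchanged load."""
    return required_utilisation * old_capacity / new_capacity


# Outgoing WAN: required utilisation 174% at 2 Mbits/sec, scaled to 6 Mbits/sec.
wan_utilisation = utilisation_after_scaling(1.74, 2.0, 6.0)
print(f"WAN utilisation after scaling: {wan_utilisation:.0%}")
```

The same function can be used the other way round to test candidate capacities until the utilisation falls below a chosen target.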

Figure 14. The outgoing side of the WAN scaled from 2 Mbits / sec to 6 Mbits / sec.

As a result, not only is the utilisation of the WAN back to normal, but the utilisation of the memory of the Application Server is also reduced to 67% without specific intervention. Waiting times for the WAN (9.3 QNETWOUT) and for the memory of the Application Server (2.5 ASQMEM) have consequently disappeared.

Because the CPU’s of the Application Server and the Database Server, as well as the Wide Area Network, have been scaled vertically, the response times have even improved. The average response time at the target transaction volume is now 0.8 seconds, the 95-percentile value is estimated at 1.3 seconds and the maximum equals 2.2 seconds. Only 2 transactions still exceed the 2-second norm. This is better than the norm of 95% below 2.0 seconds, so there is a basis for starting the roll out, but it is better to look for some extra reserve. Thus it is important to see what measures should be taken to further improve performance. The slowest four transactions of the application are still candidates for improvement.
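
The comparison with the norm can be made explicit with a small calculation. The sketch below counts transaction types and assumes that the 111 transaction types of the transaction focus (section 3.2) form the denominator; the model itself may well weight transactions by volume, so this is an approximation for illustration only.

```python
def share_within_norm(total_types: int, types_over_norm: int) -> float:
    """Fraction of transaction types whose response time is within the norm."""
    return (total_types - types_over_norm) / total_types


NORM_SHARE = 0.95  # norm: 95% of response times below 2.0 seconds

# Assumption: 111 transaction types in focus, 2 of them above 2 seconds.
share = share_within_norm(111, 2)
meets_norm = share >= NORM_SHARE

print(f"{share:.1%} of transaction types are within the 2-second norm")
print("Norm met" if meets_norm else "Norm not met")
```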

For many transactions the terminal servers contribute noticeably to the response times. This means that the client side of the application could use some extra attention. Also the terminal servers themselves may be candidates for expansion.

The analysis demonstrated that improvement of the network also eliminated the over-utilisation of the memory of the Application Server (sections 5.5 and 5.6). This shows that the memory of the Application and Database Servers, which closely collaborate, is sensitive to changing response times. Any small increase in response time may cause an increase in memory utilisation at the servers, which causes long queuing times, which in turn increase response times again, and so on. In this way an upward spiral is started that ends in a performance collapse of the two servers. At approximately 60%, the memory utilisations of the Application Server and the Database Server are therefore on the high side, and the target utilisation for the dynamically used memory of both servers is set at 30%.
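
One way to see why memory reacts so strongly to response times is Little's law: the average number of transactions in flight equals throughput times the time each transaction stays in the system. The sketch below is not taken from the mBrace model. It assumes that every in-flight transaction holds a fixed amount of dynamically used memory, and the parameter values in the first example are placeholders chosen only to show the proportionality.

```python
def memory_utilisation(throughput_per_sec: float,
                       residence_time_sec: float,
                       mem_per_transaction_mb: float,
                       dynamic_memory_mb: float) -> float:
    """Required utilisation of dynamically used memory.

    Little's law gives the average number of transactions in flight as
    throughput times residence time; each one is assumed to hold a fixed
    amount of memory for as long as it is in flight.
    """
    in_flight = throughput_per_sec * residence_time_sec
    return in_flight * mem_per_transaction_mb / dynamic_memory_mb


# Placeholder values: same load and memory, residence time doubled.
fast = memory_utilisation(10.0, 1.0, 4.0, 200.0)
slow = memory_utilisation(10.0, 2.0, 4.0, 200.0)
ratio = slow / fast

# Figures reported for the Application Server in sections 5.5 and 5.6.
implied_residence_factor = 4.76 / 0.67

print(f"Doubling residence time multiplies memory demand by {ratio:.1f}")
print(f"476% -> 67% would imply a residence time about "
      f"{implied_residence_factor:.1f} times shorter")
```

Under this simplification memory demand rises in direct proportion to the time transactions spend in the system, which is consistent with the spiral described above and with the choice of a cautious target utilisation.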

5.7 Testing extendibility: further increase of volume

Finally the number of active users is increased from 124 to 200 in order to get some feeling for the extendibility of the system. The base volume increases accordingly from 60,000 to over 96,000 sales per year.
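
The new base volume can be reproduced with a simple proportion, assuming that the number of sales grows linearly with the number of active users (the function name is illustrative):

```python
def scaled_volume(base_volume: float, old_users: int, new_users: int) -> float:
    """Scale the base volume in proportion to the number of active users."""
    return base_volume * new_users / old_users


# 60,000 sales per year with 124 active users, scaled to 200 active users.
volume = scaled_volume(60_000, 124, 200)
print(f"Base volume at 200 active users: {round(volume):,} sales per year")
```

This gives roughly 96,800 sales per year, in line with the "over 96,000" used in the model.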

Figure 15. The performance when the number of active users is increased from 124 to 200

The previous procedure is followed again, which reveals that the following alterations would be needed:

  • The Terminal Server CPU’s are replaced by CPU’s that are twice as fast.
  • The number of disks of the application server increases from 10 to 17.
  • The processors of the database server horizontally scale from 2 to 3 CPU’s.
  • The number of disks of the database server increases from 60 to 90.
  • Dynamically used memory increases to 400 Mbyte at both the Application and the Database Server.
  • The WAN vertically scales to 9 Mbits/second.

The resource utilisations are then reasonably balanced, although the utilisations of the terminal server and application server CPU’s are a bit low.

The response times would be quite acceptable, with an average of 0.8 seconds and a 95-percentile value of 1.3 seconds. Thus supporting an increase in base volume from 60,000 to over 96,000 sales per year only requires scaling a few resources.

6 Summary and conclusions

The mBrace approach covers the information system end-to-end in terms of both application and infrastructure. In order to predict response times it also covers forecasting the volume of the transactions.

In the example case the focus was first on the application. Transactions that needed improvement were identified. As a consequence, application development optimised some application transaction types and resolved problems at the interfaces. Next, after the application was sufficiently improved, the model assisted in scaling the infrastructure for the target base volume.

The response times were estimated at the target business volume in order to assess the viability of the roll out. The extra capacities for the resources of the infrastructure needed to secure these response times were determined.

The mBrace model driven approach shows the performance of every individual transaction type and allows for performance analysis even in a test environment that is not particularly production-like. The model enables accurate capacity planning.

The graphics of the dashboard appear to have strong visualisation ability. The dashboard provides:

  • An immediate overall view of the performance of the entire information system.
  • Clear visualisation of resource utilisations, response times and the causes of the response times.
  • A better sense of what is important and what is not.
  • An overview of all bottlenecks with one keystroke.

The model effectively helps to gain an overview of the measures that must be taken to secure performance:

  • In practice it happens quite often that after one bottleneck is eliminated the next bottleneck is found the hard way a month or two later. This may even repeat several times. With the help of the model we can avoid this because it shows all bottlenecks and supports determining all necessary interventions at once.
  • Normally it is not easy to determine what to do to secure response times for a release with some 400 transaction types. Which transaction types should receive attention and which are OK? A lot of work can be wasted on the wrong transactions. In the example shown, the scope of work first narrows from 389 transaction types to the transaction focus of 111 transaction types (a given fact in the example case, section 3.2). Subsequently only a relatively small number of transaction types, fewer than 10, were identified as candidates for corrective action.

Stalling of the roll out can be prevented with the mBrace approach and, since all interventions are known, the necessary costs for improvement can be made clear in advance. The cost-performance curve (figure 3), showing the direct costs of better or poorer response times, can be determined. However, adequate accuracy is only achieved in stage 2 of the system’s lifecycle, Commissioning. Controlling this matter in earlier stages remains a challenge.

The mBrace approach supports both creating and using the model with the purpose of controlling response times and capacity. This first part of the paper outlines the usage of the model as it supports a performance analysis and proactive improvement effort. This corresponds to process steps 6, Analyse application performance, and 7, Analyse infrastructure capacity (see figure 4).

Part II of this paper, which will be published in the next edition of MeasureIT, will cover measuring, creating and validating the model.

