MeasureIT

Benchmarking Blunders and Things That Go Bump in the Night: Part I
June, 2006
by Neil J. Gunther

About the Author
Neil J. Gunther, Performance Dynamics ConsultingSM

Neil Gunther, M.Sc., Ph.D. is an internationally recognized consultant who founded Performance Dynamics Company (www.perfdynamics.com) in 1994. Prior to that, Dr. Gunther applied his training in theoretical physics to research and management positions at San Jose State University, JPL/NASA (Voyager and Galileo missions), Xerox PARC and Pyramid/Siemens Technology. His computer performance analysis and capacity planning classes have been given at both corporate and academic institutions including AOL, Boeing, FedEx, Motorola, Stanford University, Sun Microsystems, SAGE-Australia and Thales Group (Holland).

Dr. Gunther is the author of numerous papers on computer performance, as well as three books: THE PRACTICAL PERFORMANCE ANALYST, McGraw-Hill (1998), ANALYZING COMPUTER SYSTEM PERFORMANCE WITH PERL::PDQ, Springer-Verlag (2005), and GUERRILLA CAPACITY PLANNING, Springer-Verlag (2006). He is well-known to CMG (Computer Measurement Group) audiences for his presentations since 1993, and his very popular articles in the CMG MeasureIT online magazine. In 1996 Dr. Gunther was awarded Best Technical Paper at CMG and in 1997 he was nominated for the A.A. Michelson Award.

Performance Dynamics has recently embarked on joint research into QIT (Quantum Information Technology) and Dr. Gunther has developed a theory of "qubit bifurcation", which is being tested experimentally. Since Dr. Gunther has Dirac number 2 (his M.Sc. supervisor was Prof. C. J. Eliezer; one of Dirac's few research students) and his Ph.D. was awarded in the UK for studies in quantum field theory and phase transition phenomena, he is well-equipped to explore the new frontier between classical IT and QIT.

Dr. Gunther was born in Melbourne, Australia and is a member of the AMS, APS, ACM, CMG, IEEE, and INFORMS.


Benchmarking, by which I mean any computer platform that is driven by a controlled workload using tools like Mercury's LoadRunner, IBM's TPNS, or Microsoft's WAS Tool, is the ultimate performance simulator. Benchmarking is often used to test application software performance just one step away from going live. It is also used to evaluate the performance of hardware platforms from competing vendors during a procurement cycle. Because benchmarking is actually a very complex simulation, it also affords countless opportunities for blunders, both in how the workloads are executed and in how the resulting measurements are interpreted. Right test, wrong conclusion is a ubiquitous mistake that happens because performance engineers tend to treat benchmark data as divine. Such reverence is not only misplaced, it can also be a sure ticket to hell once the application finally goes live. Over the course of these two articles, I will try to show you how such mistakes can be avoided, using two war stories as the vehicle:

  1. Using the "psychic hotline" to resolve benchmark inconsistencies, and
  2. Understanding how benchmarks can go flat with too much Java juice.

In each case the vital but missing element is a simple performance model against which the measurements can be assessed. If you are not set up to make such assessments, how are you going to know when you're wrong? In general, you can only avoid such mistakes by recognizing that data comes from the Devil, only models come from God.


Capacity planning is about setting expectations. Even wrong expectations are better than no expectations!
---Entry from the online Guerrilla Manual.

1 Introduction

Benchmarking is the ultimate in performance analysis, viz., workload simulation. It is often made more difficult because it takes place in a competitive environment: be it vendors competing against each other publicly using the TPC (Transaction Processing Performance Council) and SPEC (Standard Performance Evaluation Corporation) benchmarks, or a customer assessing vendor platform performance by running their own application as part of a procurement cycle. In the latter case, visiting customer engineers are kept unaware of vendor performance improvements that are introduced onto the benchmark rig under the stealth of night (see footnote 1). So much for the "rigorous" testing of the customer's application. Notwithstanding the fact that benchmarking is a form of institutionalized cheating (which everybody knows but won't admit to publicly), there are countless opportunities for blunders in the way the workloads are constructed and the system is tuned. Most discussions about how benchmarks should be conducted revolve around such tuning issues.

A far more significant problem, and one that everyone seems to be blissfully unaware of, is interpreting the benchmark measurements. Right test, wrong conclusion is a much more common blunder than many people realize. This happens because test engineers tend to treat performance data as something divine. The huge cost and effort required to set up a benchmark system can lead to a false sense of security based on the unstated premise that the more complex the system, the more accurate it must be. As a consequence, whatever data it generates is considered correct by design and to even suspect otherwise verges on the sacrilegious. Such reverence for performance data is not only misplaced, it often guarantees a free trip to hell when the application finally goes live.

In Part I, I will show by example how such benchmark blunders arise and, more importantly, how they can be avoided with simple performance models that provide the correct conceptual framework in which to interpret the data.

2 Canonical Curves

Let's begin by reviewing some canonical performance characteristics that occur in all benchmark measurements.

Figure 1: Canonical throughput characteristic.

Fig. 1 shows the canonical system throughput characteristic X (the dark curve). This curve is generated by taking the statistical average of the instantaneous throughput measurements at successive client load points N once the system has reached steady state as explained in Fig. 2 as well as [Gunther, 2005], [Zadrozny et al., 2002], and [Joines et al., 2002].

Figure 2: Relationship between the instantaneous throughput data X(i) (shown as insets) over the course of measurement runs from t = 0 to t=T under a load of N = 20 and N = 100 users, respectively. The extracted steady-state values (shown as diamonds on the larger curve) are produced by calculating the time-averaged throughput as
$$X(N) = \frac{1}{T} \sum_{i}^{T} X_N(i)$$
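The averaging step in Fig. 2 can be sketched in a few lines of Python. This is only an illustration: the function name, the `warmup` parameter (the number of leading samples discarded to approximate steady state), and the sample values are assumptions, not part of the original measurements.

```python
from statistics import fmean


def steady_state_throughput(samples, warmup=0):
    """Time-averaged throughput X(N) from instantaneous samples X_N(i).

    samples -- instantaneous throughput readings taken at equal intervals
    warmup  -- number of leading samples to drop before averaging
               (an assumed way of approximating steady state)
    """
    steady = samples[warmup:]
    if not steady:
        raise ValueError("no samples left after discarding warm-up")
    return fmean(steady)


# Hypothetical readings for a single load point (N = 20 users).
samples_n20 = [0.0, 40.0, 75.0, 98.0, 101.0, 99.0, 102.0, 100.0]
x_n20 = steady_state_throughput(samples_n20, warmup=3)
print(f"X(20) = {x_n20:.1f} TPS")
```

Repeating the calculation at each load point N yields one diamond per run on the larger curve.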

The dashed lines in Fig. 1 represent bounds on the throughput characteristic. The horizontal dashed line is the ceiling on the achievable throughput Xmax. This ceiling is controlled by the bottleneck resource in the system, which is the resource with the longest service time Smax. The two variables are related inversely by the formula:
$$X_{max} = \frac{1}{S_{max}} \qquad (1)$$
which tells us that the bigger the bottleneck, the lower the maximum throughput, which is why we worry about bottlenecks. The point on the N-axis where the two bounding lines intersect is a first-order indicator of the optimal load Nopt. In this case, Nopt = 21 VUsers.

The sloping dashed line in Fig. 1 shows the best-case throughput if there were no contention for resources in the system. It represents equal bang for the buck, an ideal case that cannot be achieved in reality.
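The two bounding lines can be written down directly. The sketch below assumes the standard asymptotic bounds for a closed system: the ceiling is 1/Smax from eqn. (1), and the no-contention line is N / (Rmin + Z), where Rmin is the uncontended response time and Z the think time. The function names and the numerical parameters are hypothetical, chosen only so that the lines cross at N = 21.

```python
def throughput_bounds(n, s_max, r_min, z=0.0):
    """Return (ceiling, no_contention_line, tighter_bound) in requests/second.

    n      -- number of virtual users
    s_max  -- service time of the bottleneck resource, in seconds
    r_min  -- minimum response time with no contention, in seconds
    z      -- average think time, in seconds
    """
    ceiling = 1.0 / s_max              # eqn. (1): horizontal dashed line
    no_contention = n / (r_min + z)    # sloping dashed line
    return ceiling, no_contention, min(ceiling, no_contention)


def optimal_load(s_max, r_min, z=0.0):
    """Load at which the two bounding lines intersect."""
    return (r_min + z) / s_max


# Hypothetical parameters, not taken from any measured system.
S_MAX, R_MIN = 0.25, 5.25
for n in (5, 21, 50):
    ceiling, slope, bound = throughput_bounds(n, S_MAX, R_MIN)
    print(f"N={n:3d}  ceiling={ceiling:.2f}  no-contention={slope:.2f}  bound={bound:.2f}")
print("Nopt =", optimal_load(S_MAX, R_MIN))
```

Measured throughput should fall on or below the smaller of the two values at every load; a data point above it is a signal to check the measurements.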

Figure 3: Canonical delay characteristic. NOTE: This is R vs. N

Similarly, Fig. 3 shows the canonical system response time characteristic R (the dark curve). This shape is often referred to as the response hockey stick. It is the kind of curve that would be generated by taking time-averaged delay measurements in steady state at successive client loads.

The dashed lines in Fig. 3 also represent bounds on the response time characteristic. The horizontal dashed line is the floor of the achievable response time Rmin. It represents the shortest possible time for a request to get through the system in the absence of any contention. The sloping dashed line shows the worst-case response time once saturation has set in. These bounds will constitute our principal performance models.
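The response-time bounds can be sketched the same way. The code assumes the standard asymptotic form for a closed system, in which the sloping line is N × Smax − Z; that formula is not spelled out in the text above, and the function name and parameter values are hypothetical.

```python
def response_time_bounds(n, s_max, r_min, z=0.0):
    """Return (floor, saturation_line, larger_of_the_two) in seconds.

    floor           -- horizontal dashed line, Rmin
    saturation_line -- sloping dashed line, N * Smax - Z (assumed form)
    """
    floor = r_min
    saturation_line = n * s_max - z
    return floor, saturation_line, max(floor, saturation_line)


# Same hypothetical parameters as a system whose lines cross at N = 21.
S_MAX, R_MIN = 0.25, 5.25
for n in (10, 21, 40):
    floor, slope, bound = response_time_bounds(n, S_MAX, R_MIN)
    print(f"N={n:3d}  floor={floor:.2f}  saturation={slope:.2f}  bound={bound:.2f}")
```

Below the crossing point the floor dominates; above it the sloping line takes over, which gives the hockey stick its handle.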

Figure 4: Non-canonical throughput-delay curve. NOTE: This is R vs. X. (Cf. Fig. 3)

In passing, it is worth noting that the throughput X(N) and response time R(N) data can be combined into a single plot like Fig. 4 (see, e.g., [Splaine and Jaskiel, 2001]). Although useful in some contexts (e.g., network packet performance), the combined plot has a limitation: the location of Nopt cannot be determined from it.

3 Miami Vice and the Psychic Hotline

With the basics under our belt, let's move on to the first war story.

3.1 Background

Dateline: Miami, Florida, sometime in the dim, distant past. Names and numbers have been changed to protect the guilty. I was based in San Jose, California, at the time, and all communications occurred over the phone. I never actually went to Florida (although I was fully expecting to) and I never saw the benchmark system, which involved multiple servers including an IBM mainframe running a CICS (Customer Information Control System) application. The benchmark workload simulator was built using TPNS.

During the prior eighteen months, a large benchmarking rig had been set up to test the functionality and performance of a complex third-party application. Using this platform, the test engineers had consistently measured system throughput at 300 TPS (transactions per second) with an average think time of Z = 10 seconds between sequential transaction requests. Moreover, during the course of development, the test engineers had managed to have the application specially instrumented so they could see where time was being spent internally when the application was running. This is a good thing and a precious rarity today! The instrumented application had therefore logged internal processing times. So far, so good.

3.2 Benchmark Results

In the subsequent discussion we'll suppose that there were three sequential processing stages. In reality, there were many more but the point should become clear with this simplified example. During the course of my conversation with the engineers in Miami, I enquired about typical processing times logged by their instrumented application. As part of their response to this question, I was given a list of average times which I shall represent here by just the three token values: 3.5, 5.0, and 2.0 milliseconds. On seeing these numbers, I immediately responded, "Something is rotten in Denmark ... err, I mean .... Florida!"

3.3 The Psychic Insight

This is our first benchmarking blunder. I can see from the reported instrumentation data that the largest time is 5 milliseconds, or 0.005 seconds. That means the bottleneck processing stage has a service time of Smax = 0.005 seconds and, applying the bounds analysis of Sect. 2, I can further deduce:
$$X_{max} = \frac{1}{0.005} = 200 \text{ TPS} \qquad (2)$$
In other words, this piece of instrumentation data (if it's correct) tells me that the overall system throughput cannot be greater than 200 TPS. Incidentally, I can also predict the optimal user load as:
$$N_{opt} = \frac{3.5 + 5.0 + 2.0 + (10.0 \times 1000)}{5} = 2002.1 \text{ VUsers} \qquad (3)$$
which is the ratio of the total round-trip time (including the think time at the client driver) to the bottleneck time [Gunther, 2005]. Note that I've replaced Z = 10 seconds in eqn. (3) by Z = 10,000 milliseconds to keep the units consistent. Nopt in eqn. (3) corresponds to the point on the horizontal axis of a throughput plot (like Fig. 1) where the two dashed lines cross each other. In Fig. 1, Nopt ≈ 22 because it is not a plot of the Florida data (I was never in possession of a complete throughput profile). Knowing Nopt can be important for the following reason. If the user load (N) is too far below this point (i.e., N < Nopt), system capacity will be underutilized. If the user load is too far above this point (i.e., N > Nopt), the system will be driven into saturation and the user response time begins to climb dramatically (i.e., linearly) with a profile similar to Fig. 3.
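The arithmetic behind eqns. (2) and (3) is short enough to script. This sketch uses the three token service times and the 10-second think time quoted above; the variable names are illustrative.

```python
# Token stage times from the instrumented application, in milliseconds.
stage_ms = [3.5, 5.0, 2.0]
think_ms = 10.0 * 1000          # Z = 10 seconds, expressed in milliseconds

s_max_ms = max(stage_ms)        # the bottleneck stage has the longest service time

# eqn. (2): throughput ceiling, converted from per-millisecond to per-second.
x_max_tps = 1000.0 / s_max_ms

# eqn. (3): total round-trip time divided by the bottleneck service time.
n_opt = (sum(stage_ms) + think_ms) / s_max_ms

print(f"Smax = {s_max_ms} ms")
print(f"Xmax = {x_max_tps:.0f} TPS")
print(f"Nopt = {n_opt:.1f} VUsers")
```

Keeping every time in the same unit before dividing is the whole trick; mixing seconds and milliseconds here would move Nopt by three orders of magnitude.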

But the Florida performance engineers were claiming 300 TPS as their maximum measured throughput. Naturally, they were not enthusiastic about my psychic intrusion into their hard-won eighteen months of performance measurements.


Busy work does not accrue enlightenment.
---Entry from the online Guerrilla Manual.

Nonetheless, this vital performance data they had collected so diligently was trying to tell them something. Either:
  • the 300 TPS measurement was wrong or,
  • the instrumentation measurements were wrong.
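The same inconsistency can be seen from the other direction by inverting eqn. (1): a claimed throughput implies a largest permissible service time. A minimal sketch, with hypothetical function and variable names:

```python
def max_service_time_ms(throughput_tps):
    """Largest bottleneck service time compatible with a given throughput,
    obtained by inverting eqn. (1) and converting seconds to milliseconds."""
    return 1000.0 / throughput_tps


claimed_tps = 300.0             # throughput reported by the test engineers
logged_bottleneck_ms = 5.0      # longest stage time in the instrumentation data

implied_ms = max_service_time_ms(claimed_tps)
consistent = logged_bottleneck_ms <= implied_ms

print(f"300 TPS requires every stage to take at most {implied_ms:.2f} ms")
print(f"Logged bottleneck: {logged_bottleneck_ms} ms -> consistent: {consistent}")
```

A stage that takes 5 ms cannot sit inside a pipeline that completes a transaction roughly every 3.33 ms, so at least one of the two measurements must be wrong.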

If only they had constructed a simple performance model against which to compare their data, they would've seen this message themselves. Once this inconsistency was pointed out to them, the performance engineers decided to thoroughly review their measurement methodology (see footnote 2). We got off the phone with the agreement that we would resume the discussion the following day. During the night (remember the title?), the other shoe dropped. The engineers realized that the Devil was in the data (not just the details).

Each think time value used in the driver script is generated randomly on the fly in such a way that, when averaged over a sufficient number of calls, it approaches the average value (Z) designated in the script. In this case, the average think time was designated as Z = 10 seconds. The individual values that conform to the statistical average are called random variates. The client-side scripts contained an if() statement to calculate each variate between transactions. The Miami engineers discovered that this branch of the script code was not being executed properly (I'm not sure I ever understood the precise explanation) and consequently the average think time was effectively zero (i.e., Z = 0 seconds). In essence, it was as though the transaction throughput measured at the client-driver side, Xclient, comprised two contributions such that:
$$X_{client} = X_{actual} + X_{errors}$$
where Xactual = 200 TPS (consistent with the application instrumentation data) and Xerrors = 100 TPS. That's a whopping 50% error between the reported and actual database throughput, but only if you're aware of it.

In other words, with zero think time, the test platform was being over-stressed in batch mode, and that caused a request rate that was too high for the CICS application to handle. (See the interesting discussion in [Zadrozny et al., 2002] on this point.) A database transaction that could not be completed correctly was not being marked as a failure; rather, an ACK was returned to the driver script, which then scored it as a successful transaction.
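The think-time mechanism described above, together with the sanity check that would have exposed the fault, can be sketched as follows. The actual TPNS script and its distribution are not known; the exponential distribution, the function names, and the 5% tolerance are all assumptions made for illustration.

```python
import random


def think_times(mean_z, count, seed=None):
    """Generate `count` random think-time variates with mean `mean_z` seconds.

    An exponential distribution is assumed here; the original driver
    script may have used something else.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_z) for _ in range(count)]


def mean_matches(variates, mean_z, tolerance=0.05):
    """Return (ok, sample_mean), where ok is True when the sample mean
    lies within `tolerance` (as a fraction) of the designated mean."""
    sample_mean = sum(variates) / len(variates)
    return abs(sample_mean - mean_z) <= tolerance * mean_z, sample_mean


Z = 10.0  # designated average think time, in seconds

healthy = think_times(Z, count=100_000, seed=1)
ok, observed = mean_matches(healthy, Z)
print(f"healthy script: mean = {observed:.2f} s, matches Z: {ok}")

# Emulate the faulty branch: the variate is never applied, so Z is effectively 0.
broken = [0.0] * 100_000
ok_broken, observed_broken = mean_matches(broken, Z)
print(f"broken script:  mean = {observed_broken:.2f} s, matches Z: {ok_broken}")
```

Logging the think times actually applied by the driver and comparing their mean with the designated Z is a cheap check. With the three token service times used earlier, setting Z = 0 in eqn. (3) gives Nopt = 10.5 / 5 = 2.1, so under that simple model a rig with no think time saturates at barely two users.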

Even after eighteen months' hard labor, the test engineers remained blissfully unaware of this large error margin because they had not constructed a simple performance model like that presented in Sect. 2. Once they were apprised of the inconsistency, however, they were very quick to assess and correct the problem. The engineers in Florida did all the work; I just read the message in the medium. Dionne Warwick and her Psychic Friends Network probably would've been proud of me.

Next month, the second war story: Falling Flat on Java Juice!


References
[Gunther 2005]
Gunther, N. J. (2005). Analyzing Computer System Performance with Perl::PDQ. Springer-Verlag, Heidelberg, Germany.
[Joines et al. 2002]
Joines, S., Willenborg, R., and Hygh, K. (2002). Performance Analysis for Java Web Sites. Addison-Wesley, Boston, Mass.
[Splaine and Jaskiel 2001]
Splaine, S. and Jaskiel, S. P. (2001). The Web Testing Handbook. STQE Publishing, Inc., Orange Park, Florida.
[Zadrozny et al. 2002]
Zadrozny, P., Aston, P., and Osborne, T. (2002). J2EE Performance Testing with BEA WebLogic Server. Expert Press, Birmingham, UK.

Footnotes:

1. The second part of the title for this article.

2. The more likely incentive is that they urgently wanted to prove me wrong.

 

Last Updated 06/20/06

