CMG Home

Site Map Links Members Only National CMG Groups Measure IT International Conference

MeasureIT
 In This Issue
 
From the Editors

Articles >

Forecast Generation

I/O Virtualization

Measurement for Maturity (Part 2)

Capacity Utilisation

CMG News >

'07 Program Update

Press Release (05/31/2007)

Press Release (06/18/2007)

Region News >

Philadelphia

New York

Events >

Calendar

 Article Database
 Resources
 Industry Articles
 Submit Article
 SubscribeIT
 RemoveIT
 Letter to Editor
 About MeasureIT
 Contact Us
 
MeasureIT

Benchmarking Blunders and Things That Go Bump in the Night: Part II
August, 2006
by Neil J. Gunther

About the Author
Neil J. Gunther, Performance Dynamics ConsultingSM

Neil J. Gunther, M.Sc., Ph.D., SMIEEE, is an internationally recognized IT researcher and author who founded Performance Dynamics Company (www.perfdynamics.com) in 1994. He is well-known to CMG audiences for his conference presentations since 1993 and his popular articles in CMG MeasureIT. Dr. Gunther was awarded Best Technical Paper at CMG'96 and received the A.A. Michelson Award at CMG'08.

Prior to founding Performance Dynamics, Dr. Gunther held teaching, research and management positions at San Jose State University, JPL/NASA, Xerox PARC and Pyramid/Siemens Technology. His "Guerrilla Capacity Planning" training-classes have been presented world wide at both corporate and academic institutions including: AOL, Boeing, FedEx, Motorola, Nokia, Stanford Univ. and UCLA. He is a member of AMS, APS, ACM and SPIE. More details are available on his Wiki page.

[Hide]

1  Reprise

In Part I, some simple performance relationships pertaining to benchmark measurements were presented and applied to a case study we called, Miami Vice and the Psychic Hotline. In this second installment, we show how one of the simplest and most fundamental performance relationships, Little's law, can reveal incorrect measurements and help prevent benchmark blunders.

Benchmarking, by which I mean any computer platform that is driven by a controlled workload is the ultimate performance simulator. Benchmarking is often used to test application software performance just one step away from going live. It is also used to evaluate the performance of hardware platforms from competing vendors during a procurement cycle. Because benchmarking is actually a very complex simulation, it also affords countless opportunities for making blunders including: how the workloads are executed, and how the resulting measurements are interpreted. Right test, wrong conclusion is a ubiquitous mistake that happens because performance engineers tend to treat benchmark data as divine. Such reverence is not only misplaced, it can also be a sure ticket to hell once the application finally goes live. In these two articles, I try to show you how such mistakes can be avoided. The vital but missing element is a simple performance model against which the measurements can be assessed. If you are not set up to make such assessements, how are you going to know when you're wrong?

The following examples are taken from erroneous benchmark measurements reported in various publications. Since these data were published, it tells you that authors actually thought their measurements were correct. This unfortunate situation can only occur when the collected data are not compared against simple performance truths, such as Little's law.

2  Little's Law Means a Lot

Little's law provides a connection between the measured throughput X and the measured response time R; two quantities commonly reported by benchmarking and load testing tools like Mercury's LoadRunner, IBM's TPNS, or Microsoft's WAS Tool.

2.1  A Little Intuition

Suppose you've just pulled up behind a line of cars at one of the tollway entrances to a freeway. As a performance person, while you're waiting, you look directly across at the line next to you and see that a yellow Volkswagon corresponds to your position in that line. You count off 16 cars ahead of the Volkswagon (including the driver currently paying), and in the next 30 seconds 5 more cars arrive behind the Volkswagon. How long will you spend at the tollway?


Time at the tollbooth =    16 cars ahead of you

10 new cars per minute
  = 1.6 minutes
(1)

You just used Little's law: the time in the system (R) is determined by the number of cars already in the system (N) divided by the rate at which new cars arrive into the system (?). More commonly, (1) is written as a product rather than a fraction by rearranging the terms:
N = ? R

(2)
The 1.6 minutes is just an estimate because its value rests on the following assumptions:
  • the system is in steady-state so that N, ?, R are meaningful statistical averages
  • the measurements are carried out over a long enough period to see steady-state

It could have been that the cars you counted at the toolbooth queue were part of an instantaneous fluctuation that was bigger or smaller than the long-term average, so we can't take the 1.6 minutes too seriously. But at least you made good use of the waiting time by practicing your performance analysis skills. If you repeated your estimate each day, you might improve it.

Moreover, (2) is quite remarkable in that it is not even based on queues or tollways. In fact, John Little (the who came up with the mathematical proof in 1961) is a professor in the business school at MIT, not the computer science department. Queueing sytems are merely a subset of the systems to which Little's law can be applied. We use it most often in the queueing context as performance analysts.

2.2  Demonized Performance Measurements



With Little's law in hand, let's look at another benchmark blunder that could have been avoided if the author had understood (2).
The load test measurements in Fig. 1 were made on a variety of HTTP demons [McGrath and Yeager, 1996].

 
Figure 1: Measured throughput (conn / s) for a suite of HTTPd servers.

Each curve roughly exhibits the expected throughput characteristic as described in Part I. The slightly odd feature, in this case, is the fact that the servers under test appear to saturate rapidly for user loads between 2 and 4 clients. Turning to Fig. 2, we see similar features in those the curves.

 
Figure 2: Measured response times (ms) of the same suite of HTTPd servers.

Except for the bottom curve, the other four curves appear to saturate between N = 2 and 4 client generators. Beyond the knee one dashed curve even exhibits retrograde behavior; something that would take us too far afield to discuss here.
But these are supposed to be response time characteristics, not throughput characteristics. According to Part I this should never happen! This is our second benchmarking blunder. These data defy the laws of queueing theory [Gunther, 2005]. Above saturation, the response time curves should start to climb up a hockey stick handle with a slope determined by the bottleneck service time Smax. But this is not an isolated mistake.

3  Falling Flat on Java Juice!



In their book on Java performance analysis, Wilson and Kesselman [2000] refer to the classic convex response time characteristic discussed in Part I as being undesirable for good scalability. I quote ...

(Fig. 3 Left) "isn't scaling well" because response time is increasing "exponentially" with increasing user load.

They define scalability rather narrowly as "the study of how systems perform under heavy loads." This is not necessarily so. As I have just shown in Sect. 2.2, saturation may set in with just a few active users. Their conclusion is apparently keyed off the incorrect statement that the response time is increasing "exponentially" with increasing user load. No other evidence is provided to support this claim.

 
Figure 3: Comparison of java scalability taken from [Wilson and Kesselman, 2000]. "Undesirable" on the left, "desirable" on the right.

Not only is the response time not rising exponentially, the application may be scaling as well as it can on that platform. We know from Part I, that saturation is to be expected and growth above saturation is linear, not exponential. Moreover, such behavior does not, by itself, imply poor scalability. Indeed, the response time curve may even rise super-linearly in the presence of thrashing effects but this special case is not discussed either. These authors then go on to claim that a response time characteristic of the type shown in Fig. 2 is more desirable.

(Fig. 3 Right) "scales in a more desirable manner" because response time degradation is more gradual with increasing user load.

Assuming these authors did not mislabel their own plots (and their text indicates that they did not), they have failed to comprehend that the flattening effect is a signal that something is wrong. Whatever the precise cause, this snapping of the hockey stick handle should be interpreted as a need to scrutinize their measurements, rather than hold them up as a gold standard for scalability. This is not only a benchmarking blunder but, to paraphrase the physicist Wolfgang Pauli:

This analysis is so bad, it's not even wrong!


3.1  Threads That Throttle

The right approach to analyzing sublinear response times has been presented by Buch and Pentkovski [2001] (They even got a CMG Best Paper Award).


 
Figure 4: Wintel WAS Throughput.

The context for their measurements was a three-tier e-business application comprising:

  1. Web services
  2. Application services
  3. Database backend

and using Microsoft's Web Application Stress (WAS) tool as the test harness. The measured throughput in Fig. 4 exhibits saturation in the range 100 < Nwas < 150 clients. The corresponding response time data in Fig. 5 (measured in milliseconds) exhibits sublinear behavior of the type discussed in Sects. 2.2 and 3.


Figure 5: Response time as measured by the WAS tool.

In Table 1, Nwas is the number of client threads that are assumed to be running. The number of threads that are actually executing can be determined from the WAS data using Little's law [Gunther, 2005] in the form:
Nrun = Xwas × Rwas
(3)
with the timebase converted to seconds.


Table 1:

Nwas	Xwas	Rwas	Nrun	Nidle
1 24 40 0.96 0.04 5 48 102 4.90 0.10 10 99 100 9.90 0.10 ... ... ... ... .... 120 423 276 116.75 3.25 200 428 279 119.41 80.59 300 420 285 119.70 180.30 400 423 293 123.94 276.06

We see immediately in the fourth column of Table 1 that no more than 120 threads (shown in bold) are ever actually running (Fig. 6) on the client CPU even though up to 400 client processes have been initiated.

 
Figure 6: Plot of Nrun determined by applying Little's law to the data in Table 1.

In fact, there are Nidle = Nwas - Nrun threads that remain idle in the pool. This throttling by the size of the client thread pool shows up in the response data of Fig. 5 and also accounts for the sublinearity discussed in Sects. 2.2 and 3. The complete analysis of this and similar results are presented in [Gunther, 2005]. Because the other authors were apparently unaware of the major performance model expressed in Little's law, they failed to realize that their benchmark was broken.

4  Conclusion

Performance data is not divine, even in tightly controlled benchmark environments. Misinterpreting performance measurements is a common source of problems; some of which may not appear until the application has been deployed. One needs a conceptual framework within which to interpret all performance measurements. I have attempted to demonstrate by example how simple performance models can fulfill this role. These models provide a way of expressing our performance expectations. In each of the cited cases, it is noteworthy that when the data did not meet the expectations set by its respective performance model, it was not the model that needed to be corrected but the data. The corollary to this is, having some expectations is better than having no expectations.

5  Acknowledgements

The idea for reviving this article (which I originally wrote several years ago) is due to the editor, Rick Ralston; one of the unsung heroes of CMG who tirelessly seeks out material to maintain Measure IT as a valuable online communication vehicle.

References

[Buch and Pentkovski 2001]
Buch, D. K. and Pentkovski, V. M. (2001). "Experience in characterization of typical multi-tier e-Business systems using operational analysis". In Proc. CMG Conference, pages 671-681, Anaheim, California.
[Gunther 2005]
Gunther, N. J. (2005). Analyzing Computer System Performance with Perl::PDQ. Springer-Verlag, Heidelberg, Germany.
[McGrath and Yeager 1996]
McGrath, R. and Yeager, N. (1996). Web Server Technology: The Advanced Guide for World-Wide Web Information Providers. Morgan Kaufmann, San Francisco, California.
[Wilson and Kesselman 2000]
Wilson, S. and Kesselman, J. (2000). Java Platform Performance: Strategies and Tactics. SunSoft Press, Indianapolis, Indiana.

 

Last Updated 06/05/09


Home | Conference | Groups | National | Members | Links | Site Map

Computer Measurement Group