Benchmarking Blunders and Things That Go Bump in the Night: Part II
by Neil J. Gunther
In Part I, some simple performance relationships pertaining to benchmark
measurements were presented and applied to a case study we called
Miami Vice and the Psychic Hotline. In this second installment, we
show how one of the simplest and most fundamental performance
relationships, Little's law, can reveal incorrect measurements and
help prevent benchmark blunders.
Benchmarking, by which I mean driving any computer platform with a
controlled workload, is the ultimate performance simulator. Benchmarking
is often used to test application software performance just one step away
from going live. It is also used to evaluate the performance of hardware
platforms from competing vendors during a procurement cycle. Because
benchmarking is actually a very complex simulation, it also affords
countless opportunities for making blunders, including how the workloads
are executed and how the resulting measurements are interpreted. Right
test, wrong conclusion is a ubiquitous mistake that happens because
performance engineers tend to treat benchmark data as divine. Such
reverence is not only misplaced, it can also be a sure ticket to hell
once the application finally goes live. In these two articles, I try to
show you how such mistakes can be avoided. The vital but missing element
is a simple performance model against which the measurements can be
assessed. If you are not set up to make such assessments, how are you
going to know when you're wrong?
The following examples are taken from erroneous benchmark measurements
reported in various publications. The fact that these data were published
tells you that the authors thought their measurements were correct.
This unfortunate situation can only occur when the collected data are not
compared against simple performance truths, such as Little's law.
2 Little's Law Means a Lot
Little's law provides a connection between the measured throughput X and
the measured response time R; two quantities commonly reported by
benchmarking and load testing tools like Mercury's LoadRunner.
2.1 A Little Intuition
Suppose you've just pulled up behind a line of
cars at one of the tollway entrances to a freeway. As a performance
person, while you're waiting, you look directly across at the line next
to you and see that a yellow Volkswagen corresponds to your position in
that line. You count off 16 cars ahead of the Volkswagen (including the
driver currently paying), and in the next 30 seconds 5 more cars arrive
behind the Volkswagon. How long will you spend at the tollway?
Time at the tollbooth = (16 cars ahead of you) / (10 new cars per minute)
                      = 1.6 minutes                                   (1)
You just used Little's law: the time in the system (R) is determined by
the number of cars already in the system (N) divided by the rate at which
new cars arrive into the system (λ). More commonly, (1) is written as a
product rather than a fraction by rearranging the terms:
N = λ R                                                               (2)
The 1.6 minutes is just an estimate because its value rests on these
assumptions:
- the system is in steady state, so that N, λ, and R are meaningful
- the measurements are carried out over a long enough period to see
the long-term average behavior
It could have been that the cars you counted at the tollbooth queue
were part of an instantaneous fluctuation that was bigger or smaller than
the long-term average, so we can't take the 1.6 minutes too seriously.
But at least you made good use of the waiting time by practicing your
performance analysis skills. If you repeated your estimate each day, you
might improve it.
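If you like to check such back-of-the-envelope numbers in code, the tollbooth arithmetic takes only a few lines of Python (the figures below are just the ones from the example above):

```python
# Little's law at the tollbooth: R = N / lambda
cars_ahead = 16             # N: cars already queued ahead of you
arrivals = 5                # cars arriving behind the Volkswagen ...
interval_minutes = 0.5      # ... during a 30-second observation
arrival_rate = arrivals / interval_minutes    # lambda = 10 cars/minute
residence_time = cars_ahead / arrival_rate    # R = N / lambda
print(residence_time)       # 1.6 minutes

# Rearranged as the product form N = lambda * R, the same
# numbers are recovered:
assert arrival_rate * residence_time == cars_ahead
```

The same two lines of arithmetic work for any system in steady state, whether the "customers" are cars, HTTP connections, or client threads.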
Moreover, (2) is quite remarkable in that it is
not even based on queues or tollways. In fact,
John Little (who came up with the mathematical proof in 1961) is
a professor in the business school at MIT, not the computer science
department. Queueing systems are merely a subset of the systems to which
Little's law can be applied. As performance analysts, we use it most
often in the queueing context.
2.2 Demonized Performance
With Little's law in hand, let's look at another
benchmark blunder that could have been avoided if the authors had
checked their measurements against it. The load test measurements in
Fig. 1 were made on a variety of HTTP demons [McGrath and Yeager, 1996].
Figure 1: Measured throughput (conn/s) for a suite of HTTPd servers.
Each curve roughly exhibits the expected
throughput characteristic as described in Part I. The slightly odd
feature, in this case, is the fact that the servers under test appear to
saturate rapidly for user loads between 2 and 4 clients. Turning to Fig.
2, we see
similar features in those curves.
Figure 2: Measured response times (ms) of the same suite of HTTPd
servers.
Except for the bottom curve, the other
four curves appear to saturate between N = 2 and 4 client
generators. Beyond the knee one dashed curve even exhibits
retrograde behavior; something that
would take us too far afield to discuss here.
But these are supposed to be response time characteristics, not
throughput characteristics. According to Part I this should never happen!
This is our second benchmarking blunder. These data defy the laws of
queueing theory [Gunther, 2005]. Above
saturation, the response time curves should start to climb up a hockey
stick handle with a slope determined by the bottleneck service time
Smax. But this is not an isolated mistake.
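To make the expected shape concrete, here is a minimal Python sketch of the asymptotic bounds from Part I. The parameter values (Smax, Rmin, Z) are hypothetical numbers chosen for illustration, not the HTTPd measurements:

```python
# Hockey-stick response time bounds for a closed system
# (hypothetical parameters, for illustration only).
Smax = 0.20    # bottleneck service demand (s) -- assumed value
Rmin = 0.50    # total service demand with no queueing (s) -- assumed
Z    = 0.0     # client think time (s)

# Saturation knee: where the flat floor meets the rising handle.
Nstar = (Rmin + Z) / Smax    # 2.5 clients with these numbers

def response_time(n):
    """Asymptotic bounds on R(N): flat floor below the knee,
    then a linear hockey-stick handle with slope Smax above it."""
    return max(Rmin, n * Smax - Z)

for n in (1, 2, 4, 8, 16):
    print(n, response_time(n))
# Above the knee, each added client costs a constant Smax seconds:
# linear growth, not the flattening seen in Fig. 2.
```

The point of the sketch is the slope: doubling N above saturation doubles the queueing portion of R, so a response time curve that flattens out instead is telling you something is wrong with the measurement.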
3 Falling Flat on Java
In their book on Java performance analysis, Wilson and Kesselman
[Wilson and Kesselman, 2000] refer to the classic convex response time
characteristic discussed in Part I as being undesirable for good
scalability. I quote:
- (Fig. 3 Left) "isn't scaling
well" because response time is increasing "exponentially"
with increasing user load.
They define scalability rather narrowly as "the study of
how systems perform under heavy loads." This is not necessarily so.
As I have just shown in Sect. 2.2, saturation
may set in with just a few active users. Their conclusion is apparently
keyed off the incorrect statement that the response time is increasing
"exponentially" with increasing user load. No other evidence is
provided to support this claim.
Figure 3: Comparison of Java scalability taken from
[Wilson and Kesselman, 2000]. "Undesirable" on the left, "desirable"
on the right.
Not only is the response time not rising
exponentially, the application may be scaling as well as it can on that
platform. We know from Part I that saturation is to be expected and
that growth above saturation is linear, not exponential. Moreover, such
behavior does not, by itself, imply poor scalability. Indeed, the
response time curve may even rise
super-linearly in the presence of
thrashing effects but this special case is not discussed either.
These authors then go on to claim that a response time characteristic of
the type shown in Fig. 2 is more desirable:
- (Fig. 3 Right) "scales in a more
desirable manner" because response time degradation is more gradual
with increasing user load.
Assuming these authors did not mislabel their own plots (and their
text indicates that they did not), they have failed to comprehend that
the flattening effect is a signal that something is wrong. Whatever
the precise cause, this snapping of the hockey stick handle should be
interpreted as a need to scrutinize their measurements, rather than hold
them up as a gold standard for scalability. This is not only a
benchmarking blunder but, to paraphrase the physicist Wolfgang Pauli:
- This analysis is so bad, it's not even wrong!
3.1 Threads That Throttle
The right approach to analyzing sublinear response times has been
presented by Buch and Pentkovski [Buch and Pentkovski, 2001] (they even
got a CMG Best Paper Award).
Figure 4: Wintel WAS Throughput.
The context for their measurements was a three-tier e-business
application comprising:
- Web services
- Application services
- Database backend
and using Microsoft's Web Application Stress (WAS) tool as the test
harness. The measured throughput in Fig. 4 exhibits saturation in the
range 100 < Nwas < 150 clients. The corresponding response time data in
Fig. 5 (measured in milliseconds) exhibits sublinear behavior of the
type discussed in Sects. 2.2 and 3.
Figure 5: Response time as measured by the WAS tool.
In Table 1,
Nwas is the number of client
threads that are assumed to be running. The number of threads that are
actually executing can be determined from the WAS data using Little's law
[Gunther, 2005] in the form:
Nrun = Xwas × Rwas
with the timebase converted to seconds.
Nwas    Xwas    Rwas (ms)    Nrun      Nidle
   1      24       40          0.96      0.04
   5      48      102          4.90      0.10
  10      99      100          9.90      0.10
 ...     ...      ...           ...       ...
 120     423      276        116.75      3.25
 200     428      279        119.41     80.59
 300     420      285        119.70    180.30
 400     423      293        123.94    276.06
Table 1: WAS measurements and the thread counts derived via Little's law.
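The Nrun and Nidle columns can be reproduced directly from the measured Xwas and Rwas values; a short Python script makes the Little's law calculation explicit (Rwas converted from milliseconds to seconds):

```python
# Apply Little's law N = X * R to the WAS measurements in Table 1.
# Tuples: (offered client threads, throughput per s, response time ms).
was_data = [
    (1,    24,  40),
    (5,    48, 102),
    (10,   99, 100),
    (120, 423, 276),
    (200, 428, 279),
    (300, 420, 285),
    (400, 423, 293),
]
for n_was, x_was, r_was_ms in was_data:
    n_run = x_was * (r_was_ms / 1000.0)   # threads actually executing
    n_idle = n_was - n_run                # threads idle in the pool
    print(f"{n_was:4d} {n_run:8.2f} {n_idle:8.2f}")
# n_run never exceeds roughly 120-124 no matter how many client
# threads are offered: the client thread pool is the throttle.
```

Running this reproduces the fourth and fifth columns of Table 1 to two decimal places.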
We see immediately in the fourth column of Table
1 that no more than 120 threads (shown in
bold) are ever actually running (Fig. 6)
on the client CPU even though up to 400 client processes have been
configured.
Figure 6: Plot of Nrun determined by applying Little's law to the
data in Table 1
In fact, there are Nidle =
Nwas - Nrun threads that remain idle in the pool. This
throttling by the size of the client thread pool shows up in the response
data of Fig. 5 and also accounts for the
sublinearity discussed in Sects. 2.2 and 3. A
complete analysis of this and similar results is presented in [Gunther,
2005]. Because the other authors were apparently unaware of the major
performance model expressed in Little's law, they failed to realize that
their benchmark was broken.
Performance data are not divine, even in tightly controlled benchmark
environments. Misinterpreting performance measurements is a common source
of problems, some of which may not appear until the application has been
deployed. One needs a conceptual framework within which to interpret all
performance measurements. I have attempted to demonstrate by example how
simple performance models can fulfill this role. These models provide a
way of expressing our performance expectations. In each of the cited
cases, it is noteworthy that when the data did not meet the expectations
set by its respective performance model, it was not the model that needed
to be corrected but the data. The corollary is that having some
expectations is better than having no expectations.
The idea for reviving this article (which I originally wrote several
years ago) is due to the editor, Rick Ralston; one of the unsung heroes
of CMG who tirelessly seeks out material to maintain Measure IT as a
valuable online communication vehicle.
References
[Buch and Pentkovski 2001]
- Buch, D. K. and Pentkovski, V. M. (2001). "Experience in
characterization of typical multi-tier e-Business systems using
operational analysis". In Proc. CMG Conference, pages 671-681.
[Gunther 2005]
- Gunther, N. J. (2005). Analyzing Computer System Performance with
Perl::PDQ. Springer-Verlag, Heidelberg, Germany.
[McGrath and Yeager 1996]
- McGrath, R. and Yeager, N. (1996). Web Server Technology: The
Advanced Guide for World-Wide Web Information Providers. Morgan
Kaufmann, San Francisco, California.
[Wilson and Kesselman 2000]
- Wilson, S. and Kesselman, J. (2000). Java Platform Performance:
Strategies and Tactics. SunSoft Press, Indianapolis, Indiana.
Last Updated 06/05/09