Benchmarking Blunders and Things That Go Bump in the Night: Part I
by Neil J. Gunther
Benchmarking, by which I mean driving a computer platform with a controlled
workload using tools like Mercury's LoadRunner, IBM's TPNS, or similar
load-testing tools, is the ultimate performance simulator. Benchmarking is often used to
test application software performance just one step away from going live. It is
also used to evaluate the performance of hardware platforms from competing
vendors during a procurement cycle. Because benchmarking is actually a very
complex simulation, it also affords countless opportunities for making blunders
including how the workloads are executed and how the resulting measurements
are interpreted. Right test, wrong conclusion is a ubiquitous mistake that
happens because performance engineers tend to treat benchmark data as divine.
Such reverence is not only misplaced, it can also be a sure ticket to hell once
the application finally goes live. Over the course of these two articles, I will
try to show you how such mistakes can be avoided using the vehicle of two war stories:
- Using the "psychic hotline" to resolve benchmark inconsistencies, and
- Understanding how benchmarks can go flat with too much Java juice.
In each case the vital but missing element is a simple
performance model against which the measurements can be assessed. If you are not
set up to make such assessments, how are you going to know when you're wrong?
In general, you can only avoid such mistakes by recognizing that data comes from
the Devil, only models come from God.
Capacity planning is about setting expectations. Even wrong
expectations are better than no expectations!
---Entry from the online Guerrilla Manual
Benchmarking is the
ultimate in performance analysis, viz., workload simulation. It is often made
more difficult because it takes place in a competitive environment: be it
vendors competing against each other publicly using the TPC (Transaction Processing Performance Council)
and SPEC (Standard Performance Evaluation
Corporation) benchmarks, or a customer assessing vendor platform performance by
running their own application as part of a procurement cycle. In the latter
case, visiting customer engineers are kept unaware of vendor performance
improvements that are introduced onto the benchmark rig under the stealth of
night.1 So much for
the "rigorous" testing of the customer's application. Notwithstanding the fact
that benchmarking is a form of institutionalized cheating (which everybody knows
but won't admit to publicly) there are countless opportunities for blunders in
the way the workloads are constructed and the system is tuned. Most discussions
about how benchmarks should be conducted revolve around such tuning issues.
A far more significant problem, and one that everyone
seems to be blissfully unaware of, is interpreting the benchmark measurements.
Right test, wrong conclusion is a much more common blunder than many people
realize. This happens because test engineers tend to treat performance data as
something divine. The huge cost and effort required to set up a benchmark system
can lead to a false sense of security based on the unstated premise that the
more complex the system, the more accurate it must be. As a consequence,
whatever data it generates is considered correct by design and to even suspect
otherwise verges on the sacrilegious. Such reverence for performance data is not
only misplaced, it often guarantees a free trip to hell when the application
finally goes live.
In Part I, I will show by example how such benchmark
blunders arise and, more importantly, how they can be avoided with simple
performance models that provide the correct conceptual framework in which to
interpret the data.
2 Canonical Curves
Let's begin by reviewing some canonical performance
characteristics that occur in all benchmark measurements.
Figure 1: Canonical throughput characteristic.
Fig. 1 shows the
canonical system throughput characteristic X (the dark curve). This curve is
generated by taking the statistical average of the instantaneous throughput
measurements at successive client load points N once the system has reached
steady state as explained in Fig. 2
as well as [Gunther,
2005], [Zadrozny et al.,
2002], and [Joines
et al., 2002].
Figure 2: Relationship between the instantaneous throughput
data X(i) (shown as insets) over the course of measurement runs from t = 0 to
t=T under a load of N = 20 and N = 100 users, respectively. The extracted
steady-state values (shown as diamonds on the larger curve) are produced by
calculating the time-averaged throughput as
X(N) = ∑i XN(i) / T.
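To make the time-averaging in the caption concrete, here is a minimal Python sketch (not from the article) that extracts a steady-state value X(N) from instantaneous throughput samples; the sample values, the one-second sampling interval, and the ramp-up handling are all assumptions made purely for illustration.

# Minimal sketch: extract a steady-state throughput X(N) from instantaneous
# samples X_N(i), as in the Fig. 2 caption. Sample data are hypothetical.

def steady_state_throughput(samples, ramp_up=0):
    """Time-average the instantaneous throughput samples, optionally
    discarding an initial ramp-up period, so that the effective
    measurement window T is the number of retained samples."""
    steady = samples[ramp_up:]           # keep only steady-state samples
    return sum(steady) / len(steady)     # X(N) = sum_i X_N(i) / T

# Hypothetical per-second throughput samples for a run at N = 20 VUsers.
x_20 = [0.0, 14.2, 19.8, 20.3, 19.9, 20.1, 20.0, 19.7]
print(steady_state_throughput(x_20, ramp_up=2))   # ~20 TPS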
The dashed lines in Fig. 1 represent bounds on the throughput characteristic. The
horizontal dashed line is the ceiling on the achievable throughput
Xmax. This ceiling is controlled by the bottleneck resource in the
system, which is also the resource with the longest service time Smax. The variables
are related inversely by the formula:
Xmax = 1 / Smax,     (1)
which tells us that the bigger the bottleneck, the lower the maximum throughput; which is why we worry about
bottlenecks. The point on the N-axis where the two bounding lines intersect is a
first order indicator of optimal load Nopt. In this case,
Nopt = 21 VUsers.
The sloping dashed line in Fig. 1 shows the best-case throughput if there were no
contention for resources in the system. It represents equal bang for the buck,
an ideal case that cannot be achieved in reality.
Figure 3: Canonical delay characteristic. NOTE: This is R vs. N.
Similarly, Fig. 3
shows the canonical system response time characteristic R (the dark curve). This
shape is often referred to as the response hockey stick. It is the kind
of curve that would be generated by taking time-averaged delay measurements in
steady state at successive client loads.
The dashed lines in Fig. 3 also represent bounds on the response time
characteristic. The horizontal dashed line is the floor of the achievable
response time Rmin. It represents the shortest possible time for a
request to get through the system in the absence of any contention. The sloping
dashed line shows the worst case response time once saturation has set in. These
bounds will constitute our principal performance models.
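To make these bounds concrete, the following Python sketch computes Xmax, Rmin, and Nopt from per-stage service times S and a think time Z; the numerical values are hypothetical and were simply chosen so that Nopt lands near the 21 VUsers quoted for Fig. 1.

# Minimal sketch of the bounding lines in Figs. 1 and 3. The service times
# and think time below are assumed values, not measurements from the article.
S = [0.015, 0.050, 0.035]      # per-stage service times (seconds), assumed
Z = 0.95                       # client think time (seconds), assumed

Smax = max(S)                  # bottleneck service time
Xmax = 1.0 / Smax              # throughput ceiling (horizontal bound, Fig. 1)
Rmin = sum(S)                  # contention-free response time (floor, Fig. 3)
Nopt = (Rmin + Z) / Smax       # load at which the two bounds intersect

def x_bound(n):
    """Best-case throughput at load n: linear equal-bang-for-the-buck
    growth capped by the bottleneck ceiling."""
    return min(n / (Rmin + Z), Xmax)

def r_bound(n):
    """Response-time bounds of Fig. 3: the Rmin floor and the linear
    saturation asymptote."""
    return max(Rmin, n * Smax - Z)

print(Xmax, Rmin, Nopt)        # approx. 20 TPS, 0.1 s, 21 VUsers

Plotting x_bound and r_bound against the load n reproduces the dashed lines of Figs. 1 and 3 for these assumed values.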
Figure 4: Non-canonical throughput-delay curve. NOTE: This is R vs. X.
(Cf. Fig. 3)
In passing, it is worth noting that the throughput
X(N) and response time R(N) data can be combined into a single plot like
Fig. 4 [see, e.g., Splaine and Jaskiel, 2001]. Although useful in some
contexts (e.g., network packet performance), the combined plot suffers from the
limitation of not being able to determine the location of Nopt.
3 Miami Vice and the Psychic Hotline
With the basics under our belt, let's move on to the first war story. It takes place in
Florida, sometime in the dim distant past. Names and numbers have been changed
to protect the guilty. I was based in San Jose, California, at the time, and all
communications occurred over the phone. I never actually went to Florida
(although I was fully expecting to) and I never saw the benchmark system, which
involved multiple servers including an IBM mainframe running a CICS (Customer Information
Control System) application. The benchmark workload simulator was built using TPNS (Teleprocessing Network Simulator).
During the prior eighteen months, a large benchmarking
rig had been set up to test the functionality and performance of a complex
third-party application. Using this platform, the test engineers had
consistently measured system throughput at 300 TPS (transactions per second)
with an average think time of Z = 10 seconds between sequential transaction
requests. Moreover, during the course of development, the test engineers had
managed to have the application specially instrumented so they could see where
time was being spent internally when the application was running. This is a good
thing and a precious rarity today! The instrumented application had therefore
logged internal processing times. So far, so good.
3.2 Benchmark Results
For the subsequent discussion, we'll suppose that there were three sequential
processing stages. In reality, there were many more but the point should become
clear with this simplified example. During the course of my conversation with
the engineers in Miami, I enquired about typical processing times logged by
their instrumented application. As part of their response to this question, I
was given a list of average times which I shall represent here by just the three
token values: 3.5, 5.0, and 2.0 milliseconds. On seeing these numbers, I
immediately responded, "Something is rotten in Denmark ... err, I mean, Miami."
3.3 The Psychic Insight
This is our first benchmarking blunder. I can see from the
reported instrumentation data that the largest time is 5 milliseconds or 0.005
seconds. That means the bottleneck processing stage has a service time of
Smax = 0.005 seconds and, applying the bounds analysis of
Sect. 2, I can further deduce:
Xmax = 1 / Smax = 1 / 0.005 = 200 TPS.     (2)
In other words, this piece of
instrumentation data (if it's correct) tells me that the overall system
throughput cannot be greater than 200 TPS. Incidentally, I can also predict the
optimal user load as:
Nopt = (3.5 + 5.0 + 2.0 + (10.0 × 1000)) / 5.0 = 2002.1 VUsers,     (3)
which is the ratio of the total
round-trip time (including the thinktime at the client driver) to the bottleneck
time [Gunther, 2005]. Note that
I've replaced Z = 10 seconds in eqn.(3) by Z = 10,000
milliseconds for the purpose of keeping the arithmetic straight. Nopt
in eqn.(3) corresponds to the point on the
horizontal-axis of a plot of throughput (like Fig. 1) where the two dashed lines cross each other. In
Fig. 1, Nopt ≈ 21 because it is not a
plot of the Florida data (I was never in possession of a complete throughput
profile). Knowing Nopt can be important for the following reason. If
the user load (N) is too far below this point (i.e., N < Nopt),
system capacity will be under utilized. If the user load is too far above this
point (i.e., N > Nopt), the system will be driven into saturation
and the user response time begins to climb dramatically (i.e., linearly) with a
profile similar to Fig. 3.
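For readers who want to check the arithmetic, here is a short Python sketch that reproduces the deductions in eqns. (2) and (3); only numbers reported in the story are used, and the variable names are my own.

# Back-of-the-envelope deduction for the Florida benchmark, using the three
# instrumented stage times and the scripted think time from the story.
stage_ms = [3.5, 5.0, 2.0]                # instrumented processing times (ms)
Z_ms = 10.0 * 1000                        # think time: 10 s expressed in ms

Smax_ms = max(stage_ms)                   # bottleneck stage: 5.0 ms
Xmax = 1000.0 / Smax_ms                   # throughput ceiling, eqn. (2)
Nopt = (sum(stage_ms) + Z_ms) / Smax_ms   # optimal load, eqn. (3)

print(Xmax)   # 200.0 TPS
print(Nopt)   # 2002.1 VUsers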
But the Florida performance engineers were claiming
300 TPS as their maximum measured throughput. Naturally, they were not
enthusiastic about my psychic intrusion into their hard-won eighteen months of testing effort.
Nonetheless, this vital performance data they had collected so
diligently was trying to tell them something. Either:
- the 300 TPS measurement was wrong, or
- the instrumentation measurements were wrong.
Busy work does not accrue enlightenment.
---Entry from the online Guerrilla Manual
If only they
had constructed a simple performance model against which to compare their data,
they would've seen this message themselves. Once this inconsistency was pointed
out to them, the performance engineers decided to thoroughly review their
measurement methodology.2 We got off the phone with the agreement that
we would resume the discussion the following day. During the night (remember the
title?), the other shoe dropped. The engineers realized that the Devil was in
the data (not just the details).
Each thinktime value used in the driver-script is
generated randomly on the fly in such a way that when averaged over a sufficient
number of calls, it approaches the average value (Z) designated in the script. In
this case, the average thinktime was designated as Z = 10 seconds. The
individual values that conform to the statistical average are called random
variates. The client-side scripts contained an if() statement to
calculate each variate between transactions. The Miami engineers discovered that
this branch of the script code was not being executed properly (I'm not sure I
ever understood the precise explanation) and consequently the average think time
was effectively zero (i.e., Z = 0 seconds). In essence, it was as though the
transaction throughput measured at the client-driver side, Xclient, comprised
two contributions such that:
Xclient = Xactual + Xerrors,
with Xactual = 200 TPS (consistent with the application instrumentation
data) and Xerrors = 100 TPS. That's a whopping 50% error between the
reported and actual database throughput; but only if you're aware of it.
In other words, with zero think time, the test
platform was being over-stressed in batch mode, which caused a
request rate that was too high for the CICS application to handle. (See the
interesting discussion in [Zadrozny
et al., 2002] on this point.) A database transaction that could not
be completed correctly was not marked as a failure; rather, an ACK
was returned to the driver script, which then scored it as a successful transaction.
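I never saw the TPNS driver scripts, so the Python sketch below is purely illustrative: it draws think-time variates from a distribution whose mean is the scripted Z (an exponential distribution is my assumption, not something stated in the article) and shows how a branch that never executes silently collapses the think time to zero, turning the driver into a batch-mode hammer.

import random

Z = 10.0   # scripted mean think time (seconds)

def think_time(branch_taken=True):
    """Return one think-time variate. If the branch is skipped, as the
    Miami engineers discovered, the effective think time is zero."""
    if branch_taken:                        # the if() branch in the script
        return random.expovariate(1.0 / Z)  # variates with mean Z
    return 0.0                              # Z = 0: requests fired back-to-back

# Averaged over enough calls, the variates approach the designated Z ...
print(sum(think_time() for _ in range(100_000)) / 100_000)   # ~10 s
# ... but with the branch disabled the client offers work with no pause,
# inflating the client-side count to Xclient = Xactual + Xerrors.
print(sum(think_time(False) for _ in range(10)) / 10)         # 0.0 s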
Even after eighteen months of hard labor, the test
engineers remained blissfully unaware of this large error margin because they
had not constructed a simple performance model like that presented in
Sect. 2. Once they were apprised of the
inconsistency, however, they were very quick to assess and correct the problem.
The engineers in Florida did all the work, I just read the message in the
medium. Dionne Warwick and her Psychic Friends Network probably would've been
proud of me.
Next month, the second war story: Falling Flat on Java Juice.
References
- [Gunther 2005]
- Gunther, N. J. (2005). Analyzing Computer System Performance with
Perl::PDQ. Springer-Verlag, Heidelberg, Germany.
- [Joines et al. 2002]
- Joines, S., Willenborg, R., and Hygh, K. (2002). Performance Analysis
for Java Web Sites. Addison-Wesley, Boston, Mass.
- [Splaine and Jaskiel 2001]
- Splaine, S. and Jaskiel, S. P. (2001). The Web Testing
Handbook. STQE Publishing, Inc., Orange Park, Florida.
- [Zadrozny et al. 2002]
- Zadrozny, P., Aston, P., and Osborne, T. (2002). J2EE Performance
Testing with BEA WebLogic Server. Expert Press, Birmingham, UK.
1. The second part of the title for this article refers to just such nocturnal activity.
2. The more likely incentive is that they urgently wanted to prove me wrong.