Visualizing Virtualization
March, 2007
by Dr. Neil J. Gunther, Guest Editor
This issue of CMG MeasureIT focuses on the hot topic of Virtualization. It includes a selection of previously
published MeasureIT articles, one paper reproduced from the CMG 2006 conference proceedings, as well as a late-breaking
paper on consolidation under virtual machines in a Microsoft Windows environment. Each of these selected papers, along
with some additional papers cited in the References section below, highlights an important aspect of virtualization from
the standpoint of performance analysis and capacity planning.
Virtualization is about creating illusions. Although this concept first appeared on mainframe computers (because they
had the horsepower to support it, and still do),
all computer systems today are sufficiently powerful to present users with the illusion of a single physical machine
appearing as multiple virtual machines (VMs). This multiplicity takes on a number of different guises, e.g.,
hyperthreaded virtual processors or guest operating systems (OS) running as VMs on a hypervisor OS, also
called a virtual machine monitor (VMM). Many readers will already have been exposed to at least one type of VM.
Less apparent to many readers is the notion that virtualized services or hyperservices, such as GRID computing and
peer-to-peer (P2P) services like
BitTorrent,
also rely on VM-style architectures. I shall return to this point shortly.
Moreover, as evidenced by the presentations at CMG conferences over the past several years, each of these VM types is
usually thought of as being quite distinct and unrelated. In fact, I held this same view until I came to write the
chapter on the Fundamentals of Virtualization for my new book
Guerrilla Capacity Planning.
Based on some presentations at CMG 2004, I already suspected that hypervisors might be based on something called a
fair-share (FS) scheduler [2], a topic I had already discussed at CMG in 1999 [1]. Whereas a
time-share (TS) scheduler provides each user with the illusion that they are the only user of the physical processors,
FS provides each user (or group of users) with the illusion that they possess their own VM whose service rate is scaled
according to the allocated resource entitlement [1]. The system administrator prorates entitlement by allocating
different numbers of shares to different users and groups, just like owning equity shares in a corporation. The
greater your share entitlement, the greater your maximal allowed resource consumption.
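
The proration described above can be sketched in a few lines of Python. This is only an illustration: it assumes that a user's entitlement is simply that user's fraction of all allocated shares, and the function name `entitlements`, the user names and the share counts are hypothetical rather than taken from any particular FS scheduler.

```python
def entitlements(shares):
    """Map each user's share allocation to a fraction of total capacity.

    Assumption: entitlement = user's shares / total allocated shares.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total shares must be positive")
    return {user: n / total for user, n in shares.items()}


# Hypothetical allocation; names and share counts are illustrative only.
shares = {"oltp": 2000, "batch": 1000, "dev": 1000}
for user, fraction in entitlements(shares).items():
    print(f"{user}: entitled to {fraction:.0%} of capacity")
```

Doubling one user's shares does not double that user's entitlement unless the other allocations stay fixed, because the entitlement depends on the total number of shares outstanding.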
This hunch became reality while I was perusing the references in Gene Fernando's comprehensive CMG 2005
paper [3]. In particular, he cites an
online document
entitled "ESX Server Performance and Resource Management for CPU-Intensive Workloads" from VMware, Inc. The section
on "Allocating CPU Resources via Shares" (starting on p. 14) explicitly discusses how the choice of share allocations
can significantly impact the performance of a guest VM (the default allocation is 1000 shares per guest).
They employ the 164.gzip benchmark code from the
SPEC suite
of integer-based benchmarks as the test workload. This is a purely CPU-intensive workload, but it also makes it easy
to understand the impact of share allocations.
I also knew that the FS
scheduler uses a polling protocol to provide physical service for the allocated virtual services. The FS polling
interval is typically on the order of five seconds [1].
Armed with this insight, I then went back and reviewed some earlier CMG articles on hyperthreading.
These included two MeasureIT articles: one by Mark Friedman entitled
"Hyperthreading - Two For The Price Of One?"
and another by Ellen Friedman entitled
"Tales from the Lab: Best Practices in Application Performance Testing",
in which they discuss, respectively, how hyperthreading works and how difficult it is to measure. In addition, a CMG conference paper by
Scott Johnson [4] provided valuable performance data generated by
carefully controlled laboratory measurements using multi-threaded workloads.
These data enabled me to build some elementary performance models in
PDQ
from which it was clear that some of the overhead in hyperthreading was also due to polling, albeit at a much higher rate
than is true for FS polling. If your application thread is sitting in the buffer that is currently not being serviced,
it gets to wait. This appears to the OS as service-time stretching for that application [4].
It is noteworthy that Fernando [3] reports a similar effect in certain BMC Patrol 2000 data.
Frustrated at being broadsided by the virtual, Fernando suggests disabling HTT if you are serious about server capacity planning.
I came to call this effect the Missing MIPS paradox [5]. More on this in a minute.
Generalizing this insight led me to construct a unified framework, the VM Spectrum [5], by which
the variety of VMs could be classified. The continuous electromagnetic spectrum can be broken into three primary regions: the
visible region (VR) with frequencies that our eyes respond to, the ultraviolet region with radiation
frequencies much higher than our eyes can detect, and the infrared region with frequencies much
lower than our eyes can detect. This choice of regions is arbitrary in that it is biased by what our eyes
can see. Similarly, the VM spectrum can be defined in terms of three primary regions:
- Micro region (like UV) involving high-frequency polling VMs such as hyperthreading, e.g., Xeon processor
- Meso region (like VR) involving intermediate-frequency polling VMs such as hypervisors, e.g., VMware and Xen
- Macro region (like IR) involving low-frequency polling VMs such as hyperservices, e.g., GRIDs and P2P
The implication that Meso-VMs are somewhat more "visible" to the capacity planner is intentional in that there are
more tuning knobs available through features like share allocation, whereas Micro- and Macro-VMs remain "invisible" in
the sense of black boxes. One difference, of course, is that the VM spectrum is discrete rather than continuous
because each VM type possesses its own characteristic polling frequency. A polling system is already less
efficient than a simple queueing system. Those of you who have taken my classes will recall that I often introduce
queueing concepts using the familiar example of a grocery store checkout. The cashier is the server and customers form a
single waiting line at the checkout to have their groceries rung up. A more efficient queueing system has multiple
cashiers servicing the single waiting line because this introduces a weak form of parallelism. I know of a Safeway store
in Melbourne, Australia that implements a multiserver queue with six cashiers for its Express Lane. By analogy, a
polling system is more like a grocery store with only one cashier to service all the checkout stands!
Sound insane? Well, it can make sense for the case where, e.g., each customer only has one item to check out. In fact, most
operating systems implement round-robin priority queues in this way, and polling is used by certain high-speed packet
switches [6]. The important point here is to recognize that polling represents another dimension for
performance trade-offs in VMs.
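
The claim that several cashiers serving one waiting line beat the same cashiers each serving a separate line can be checked with textbook queueing formulas. The sketch below is a minimal illustration, not a model of polling itself: it assumes Poisson arrivals and exponential service times (M/M/1 and M/M/m queues), and the arrival rate and service time are made-up numbers.

```python
def erlang_c(m, offered_load):
    """Probability that an arriving customer must wait in an M/M/m queue."""
    if offered_load >= m:
        raise ValueError("unstable queue: offered load must be below m")
    b = 1.0  # Erlang B blocking probability with zero servers
    for k in range(1, m + 1):
        b = offered_load * b / (k + offered_load * b)
    rho = offered_load / m  # per-server utilization
    return b / (1.0 - rho * (1.0 - b))


def residence_time(arrival_rate, service_time, m=1):
    """Mean time in system (waiting plus service) for an M/M/m queue."""
    offered_load = arrival_rate * service_time
    rho = offered_load / m
    wait = erlang_c(m, offered_load) * service_time / (m * (1.0 - rho))
    return service_time + wait


# Illustrative numbers only: one minute per customer, each cashier 80% busy.
S = 1.0     # mean service time (minutes)
lam = 0.8   # arrivals per minute per cashier

separate_lines = residence_time(lam, S, m=1)    # six independent checkouts
shared_line = residence_time(6 * lam, S, m=6)   # six cashiers, one line
print(f"separate lines: {separate_lines:.2f} min")
print(f"shared line:    {shared_line:.2f} min")
```

At the same per-cashier utilization, the shared line gives the shorter mean time in the store, which is the weak form of parallelism mentioned above.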
Peg McMahon's MeasureIT article discusses the difficulties of capacity planning for
GRID hyperservices.
It is likely (and hoped) that we will see more CMG papers on the performance management of Macro-VM hyperservices (e.g.,
so-called Service Oriented Architectures [7]) in the future. To make life even more exciting for system
analysts and capacity planners, it is possible to build Macro-VMs that run on top of Meso-VMs, which run on top of
Micro-VMs. Reminds me of the old rhyme: Big bugs have little bugs upon their backs to bite 'em. And little bugs
have littler bugs and so ad infinitum. This can really mess with your head if you do not appreciate the commonalities
between these VMs as defined by the VM spectral regions. In fact, Meso-on-Micro is probably the most ubiquitous VM configuration
discussed (so far) in CMG presentations on virtualization.
The Missing MIPS paradox [3,4] can now be understood in terms of how Micro-VMs poll
for threads under certain conditions. Hyperthreading, known as Hyper-Threading Technology (HTT) in Intel parlance and
more generally as Simultaneous Multi-Threading (SMT), is primarily a way to saturate a single physical execution unit (EU) by
soaking up any remaining idle cycles. A processor like the Xeon has two ports (AS registers in Intel parlance)
available to the same EU. When HTT is disabled, only one AS register is accessible to the OS run-queue, so TS scheduling
works the same way as it does for a single physical CPU with time-slicing. However, when HTT is enabled, the OS has to know how to schedule work onto both AS registers. These two registers act like 1-deep thread buffers. Provided different software
applications are appropriately threaded (and that's often a big assumption), one set of application threads can be
scheduled onto one of the AS buffers (say AS0), while another set of application threads can be scheduled onto the
other AS buffer (say AS1). When a thread stalls on AS0, the EU would normally become idle, but with HTT enabled
the EU can service the AS1 buffer until AS0 becomes ready again.
That is analogous to the single cashier switching between two checkout stands in a grocery store.
Actually, Intel does not tell us what the
exact scheduling discipline is inside the Xeon. Anyway, this is all well and good if you're on AS0, but not so wonderful if
you are on AS1 because you may spend a lot of time waiting for AS0 to stall. Moreover, from the standpoint of capacity
analysis for servers using Xeon parts, the OS gets fooled into thinking AS0 and AS1 represent two
virtual CPUs or VPUs [5] which potentially offer twice the CPU capacity of a non-HTT processor, viz.,
200%. In reality, this 200% capacity cannot be realized because the physical EU is generally more than 50% busy. Best
case controlled measurements [4] show that the EU only has about 25% idle cycles (1/4 × 100%)
available to service register AS1, and therefore the OS never sees more than 2 × 75% = 150%
(virtual) capacity
consumed. As reported in many CMG presentations, this virtual arithmetic (propagated to performance tools via the OS) has often led system
analysts and capacity planners on a wild goose chase looking for the
remaining 1/4 × 200% = 50% of virtual cycles, which never existed in the first place.
Similar effects have been measured on Meso-VMs [5,8,9].
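
The Missing MIPS arithmetic for a single hyperthreaded processor can be reproduced with a short Python sketch. It simply restates the percentages quoted above; the function name `virtual_capacity` is hypothetical, and the 25% idle fraction is the best-case figure attributed to [4], not a general constant.

```python
def virtual_capacity(idle_fraction, ports=2):
    """Return (nominal, visible, missing) virtual capacity in percent.

    nominal: what the OS believes it has (100% per port)
    visible: the most the OS ever sees consumed, per the article's arithmetic
    missing: the virtual cycles that were never there
    """
    nominal = ports * 100.0
    visible = ports * (1.0 - idle_fraction) * 100.0
    return nominal, visible, nominal - visible


# Best-case idle fraction of 1/4, as quoted in the text.
nominal, visible, missing = virtual_capacity(idle_fraction=0.25)
print(f"nominal {nominal:.0f}%, visible {visible:.0f}%, missing {missing:.0f}%")
```

Changing `idle_fraction` shows how sensitive the apparent shortfall is to how busy the physical EU already is.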
The plot on the left shows controlled measurements for a WebLogic-J2EE production application performed by
my colleagues at RSA Security.
It is quite apparent that approximately 25% of the expected throughput (X) is missing relative to the
PDQ model, and that, in turn, is due to the knee occurring at N = 6 threads running rather than the expected
N = 8 threads running. The platform was a Dell PowerEdge 1750 with dual 3 GHz Xeon processors.
This effect was isolated to listen-thread contention in WebLogic. Dual Xeons (i.e., 2 EUs) with HTT enabled are
equivalent to 4 VPUs. If there were 2 threads per VPU, we would expect to see 8 listen-threads executing. In
fact, only 6 threads appear to be executing, i.e., 2−ports × 150% = 3 VPUs instead of the expected 2−ports × 200% = 4 VPUs. Once again, we see the "missing" 50% signature.
(Details can be found in [5]).
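
The position of the knee in the WebLogic measurements follows from the same bookkeeping, scaled up to two execution units. The sketch below is only a restatement of the numbers in the text: the helper `thread_knee` is hypothetical, and it assumes that the saturation point scales in proportion to the number of effective VPUs.

```python
def thread_knee(execution_units, capacity_per_eu, threads_per_vpu):
    """Return (effective VPUs, thread count at which throughput saturates).

    capacity_per_eu is the virtual capacity of one EU as a multiple of 100%,
    e.g. 2.0 for the nominal 200% and 1.5 for the observed 150%.
    """
    vpus = execution_units * capacity_per_eu
    return vpus, vpus * threads_per_vpu


expected_vpus, expected_knee = thread_knee(2, 2.0, 2)  # nominal HTT capacity
observed_vpus, observed_knee = thread_knee(2, 1.5, 2)  # Missing MIPS capacity

shortfall = 1.0 - observed_knee / expected_knee
print(f"expected knee at N = {expected_knee:.0f} threads on {expected_vpus:.0f} VPUs")
print(f"observed knee at N = {observed_knee:.0f} threads on {observed_vpus:.0f} VPUs")
print(f"throughput shortfall: {shortfall:.0%}")
```

Under that proportionality assumption, the ratio of the two knees accounts for the roughly 25% of expected throughput that appears to be missing.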
Salsburg, Karnazes and Maimone employ the 164.gzip benchmark
to measure the processor overhead of a VMware hypervisor under various settings running on an
8-way Unisys ES7000-540-G3 using 3 GHz Xeons. As mentioned earlier, this is the same CPU-intensive benchmark workload
used by VMware Inc. to discuss the performance impact of share allocations on their
ESX Server.
The interested reader should compare these two reports. The original Xen development team, in addition to
constructing an FS scheduler (see Section 3 in [8]), performed a number of controlled measurements on both Xen
and VMware hypervisors using workloads ranging from Linux builds to OLTP database benchmarks (see Section 4
in [8]). For general applications that invoke significant amounts of I/O, memory and network activity, their
results show clearly that the overheads can be far greater than the simple Missing MIPS problem for
CPU-intensive applications [9]. Thus, from a performance perspective, server consolidation using Meso-VMs may
be regarded either as a many-to-one advantage or a many-headed
hydra.
Finally, the question arises: should you adopt VM technologies or not? As Friedman points out [9], you
need to remain cognizant that VMs can introduce significant performance penalties. On the other hand [5], you
may still choose to implement your applications on Micro-VMs or Meso-VMs for reasons other than performance, e.g.,
improved security enforcement, power reduction
or just satisfying internal politics.
McMahon
astutely remarks that users need to apply pressure on all the commercial vendors to make more VM performance statistics
available to the OS and to the human system analyst. Indeed, this situation is slowly starting to improve
with the advent of internal hardware-state information such as the PURR register in the IBM Power5 processor.
PURR stands for Processor Utilization Resource Register. Hardware registers of this type will help to ameliorate
performance conundrums like the Missing MIPS problem in Micro-VMs. Additional instrumentation is still needed at the
Meso-VM and Macro-VM levels to help elucidate where service times are being stretched between the guest VM and the
hypervisor running on the physical platform. The message to commercial VM vendors is clear:
Constructing illusions by hiding physical information from users is
one thing, but propagating that illusion to the system analyst by hiding
vital performance information is considered harmful and ultimately bad for business.
Perhaps their watchword should be fewer bells, more whistles.
I hope that the central concept of polling and the organization of the VM spectrum that follows from it will help you
to better appreciate the presentations of the CMG authors compiled here and, even better,
help you to write your own paper on virtualization for
CMG 2007.
References
- [1] N. J. Gunther, "Capacity Planning for Solaris Resource Manager: All I Ever Wanted was My Unfair Advantage (And Why You Can't Get It!)," CMG Proceedings (on CD), Reno, Nevada, 1999
- [2] J. Kay and P. Lauder, "A Fair Share Scheduler," Communications of the ACM, 31, 44-55, 1988
- [3] G. Fernando, "To V or Not to V: A Practical Guide To Virtualization," CMG Proceedings (on CD), Orlando, Florida, 2005
- [4] S. Johnson, "Measuring CPU Time from Hyper-Threading Enabled Intel Processors," CMG Proceedings (on CD), Dallas, Texas, 2003
- [5] N. J. Gunther, "The Virtualization Spectrum from Hyperthreads to GRIDs," CMG Proceedings (on CD), Reno, Nevada, 2006
- [6] N. Gunther, K. Christensen and K. Yoshigoe, "Characterization of the Burst Stabilization Protocol for the RR/CICQ Switch," IEEE Conference on Local Computer Networks, October 20-24, Bonn, Germany, 2003
- [7] A. W. Shum and J. P. Buzen, "Achieving Business Agility with SOA: Governance and SLA Management of Shared Service Ecosystems," CMG Proceedings (on CD), Reno, Nevada, 2006
- [8] P. T. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt and A. Warfield, "Xen and the Art of Virtualization," SOSP (ACM Symposium on Operating Systems Principles), 164-177, 2003
- [9] M. Friedman, "The Reality of Virtualization for Windows Servers," CMG Proceedings (on CD), Reno, Nevada, 2006
Last Updated 03/20/07