Introduction.
Virtualization appeals to computer professionals as a means to address the
proliferation of under-utilized Windows servers, which is widely perceived
as a significant system management problem. An ambitious server consolidation
initiative that relies on virtualization can achieve significant administrative
and other cost efficiencies. In this article we hope to inject a healthy dose
of realism about the performance of current virtualization technology into
a planning process that has high expectations for its value proposition. In
practice, virtualization can be deployed to considerable benefit, but we worry
that these benefits can also be oversold. For instance, we take the somewhat
contrarian view that under-utilized Windows servers are not the profligate
waste of resources that the evangelists for virtualization suggest they are.
Nor is virtualization the most direct approach to solving that problem.
This article consolidates our research into the performance of virtualization
technology, focusing on the popular VMware ESX product. It is based primarily
on material from two recent papers [1, 2]
presented at the annual CMG Conference in 2006, which themselves synthesize
the results of numerous other investigators that we have borrowed from freely.
Anyone interested in pursuing this topic further is advised to consult the
original sources referenced in the bibliography that is provided.
Software developers were among the earliest adopters of the virtualization
technology that is available today for Windows. They faced a common problem,
namely, the need to subject new releases of software to rigorous testing on
a wide variety of platforms on a tight budget for both time and materials.
Virtualization software allows a single machine to be configured to run multiple
operating system images that can then be used to ensure that the software
being developed functions correctly in diverse configurations. Software development
and Quality Assurance testing remains the one area where virtualization can
be deployed with the greatest unqualified chance of success.
So long as the virtualization technology being deployed was confined primarily
to assisting with software development and testing, the technology raised
few pressing capacity planning or performance concerns. Running the sort of
application stress-testing workload where performance actually mattered could
always be diverted to a dedicated machine. (See, for example, [6]
for a thorough discussion of these issues.) It was only when this same virtualization
technology was re-positioned as a way to achieve server consolidation that
serious capacity planning and performance considerations began to surface.
To the degree that generously-sized current hardware capabilities often lead
to machines that appear to be massively over-provisioned, sizing the host
machine ought, in principle, to be relatively easy.
Capacity planning for virtualization is best described as an n:1 folding
problem. The capacity planner must assure that the guest workloads can fit
into the one physical machine managed by the virtual machine host. This is
a multi-dimensional problem where the capacity planner must assure that the
processor, disk, memory and network bandwidth of the combined guest machines
– plus some allowance for virtual machine “overheads” – does not exceed the
physical capabilities of the underlying hardware. This basic problem becomes
difficult in the Windows Server environment due to having only limited measurement
data that can be used reliably to estimate the amount of virtualization overhead
to expect in advance for a given workload.
Discussing the sources of various virtualization “overheads” inevitably leads
to considering the performance issues that currently arise when Windows runs
as a guest machine. We mention two significant problems that have not been
discussed much by other commentators. One concern is the technique used to
schedule virtual machines that has a serious performance impact on I/O bound
workloads. A second problem area involves the measurement perturbations that
occur on the Windows platform after virtualization technologies are
introduced.
After providing a realistic assessment of the virtues of current virtualization
technology, we discuss recent hardware improvements and their prospects for
resolving these performance problems in a satisfactory manner.
Reining in the Server Farm.
Mainly out of concern for easing the burden of system administration, Windows
servers are usually configured to run a single, homogenous workload. Some
of the historical reasons for preferring isolated Windows servers running
single applications are (1) increased stability, (2) better reliability, and
(3) simpler problem-solving. As capacity planners, we’d like to be able to
say that simpler capacity planning is another worthy goal associated with
this policy, but it seldom plays a significant role in the decision-making
process.
A powerful argument for running homogenous, single workload Windows machines
was increased stability. Indeed, one of the present authors has been one
of the forceful public advocates for this strategy to deploy Windows servers
successfully on a large scale. [3] The practices
and procedures forged in the crucible of experience that led in the past to
successful deployment of large numbers of Windows servers running mission
critical applications have an understandable resilience in the face of change.
At the present time, however, it is worth considering whether the historical
conditions that favored configuring Windows machines to run a single, homogenous
workload are still present today. This argument is made in more detail in
[2].
Current industry best practices recommend that the Technical Support group
responsible for Windows servers and desktops adopt a stable image
of the operating system and application software that it intends to maintain
and support going forward after a lengthy period of concentrated acceptance
testing. The image is then cloned every time there is an organizational requirement
to support expansion of this application or support it in a new location.
This practice has deep roots. Consider that distributed computing initiatives
allowed pockets of control by various departments of what was perceived as
their hardware and applications. “Gold image” strategies played into that
history, making it easier for distributed administration to be centralized
as cost drivers increasingly brought the distributed world back into the data
center. This re-centralization often occurred with little or no involvement from
capacity planning professionals.
The widespread practice of cloning operating system and application software
images that are certified by sysadmins for stability and reliability does
not alone lead to massive over-provisioning of Windows servers. The hardware
to run Windows keeps getting more powerful in leaps and bounds. Current generation
hardware from Intel and AMD offers 64-bit addressing, massive amounts of RAM,
multi-core processors, and hyper-threading. These machines offer more processing
power than many single workload application servers will ever need. Out of
necessity, IT organizations continuously update their inventories due to depreciation
and lease replacement, as well as both operating system and server application
evolution. This process increases the number of these larger, more powerful
servers that are deployed for the Windows server platform. Over-provisioning
inevitably occurs when a disciplined capacity management approach is not coupled
with the hardware replacement strategy.
The lack of involvement in capacity planning has positioned IT departments
where they now find themselves: with abundant opportunities to save administrative
costs by reducing the total number of machines that need to be managed. In
this environment, server consolidation makes evident good economic sense,
with virtualization being one obvious path to server consolidation.
In the case of a remote field office operation, for example, an accomplished
system administrator might think it is necessary to supply a minimum of three
separate OS images: one to provide the essential Messaging application like
MS Exchange or Lotus Notes to tie employees at the remote office to the corporate
e-mail network; one to provide Active Directory-based security and authentication
services, and a third to supply data protection (i.e., back-up) and data recovery.
Conventional practice would be to supply three separate, dedicated machines
to perform these functions, all of which would likely be severely under-utilized.
Virtualization is an appealing option in this instance because it allows
all three machine images to be packaged together and run inside a single box
without losing the isolation of multiple operating system environments.
We are not ready to concede that this over-provisioning is nearly as big
a problem as IT professionals reared on the frugality required to manage expensive
mainframe technology cost-effectively presume it is. In the face of the changing
economics of Information Technology, it is certainly worthwhile to examine
the assumptions behind this presumption that delivery of service always requires
separate machines. An application of rudimentary capacity planning practices
and procedures would sharply reduce the degree of over-provisioning that is
occurring today.
Why virtualization. This brings us to the heart of the matter. Servers
running single application workloads are often massively over-provisioned.
Virtualization promises to enable current hardware to be used more efficiently.
Many IT organizations also view virtualization technology as a viable solution
to the evident problem that the organization has way too many machines to
manage. In theory, at least, virtualization technology is positioned to address
the inefficiencies of running machines that are severely over-provisioned.
Virtualization does provide a mechanism that allows system administrators
to utilize current hardware more effectively, while retaining all the administrative
advantages of isolating workloads on dedicated servers. Using virtualization,
it is, for example, possible to configure and run two or more virtual machines
– each devoted to running a single, isolated workload – on a single hardware
platform. In the multiple machine scenario described above, instead of provisioning
three separate machines to run the mail server application, the domain controller,
and the back-up server, all three server machines can be consolidated on a
single hardware platform running virtualization software.
Virtualization allows the system administrator to supply all three essential
services discussed above on a single piece of hardware, which is certainly
a simplification along that important dimension. Moreover, the virtualization
solution has the additional benefit that it appears not to require major changes
in current system administration practice. (The recognition that virtualization
itself might add significant complexity to the operation is something that
usually only surfaces later with experience. See, for example [4].)
Potential performance concerns with virtualization are deemphasized, based
on the assumption that the hardware used to consolidate these workloads is
so much more powerful than what its OS guests demand. But, as will be discussed
below in more detail, this working assumption is naive. Having witnessed a
number of large server consolidation efforts that relied on virtualization
that came up far short of their ambitious consolidation goals, we believe
it is important to raise a warning flag about the potential performance problems
that arise even in the face of what appears to be massive over-provisioning.
Curiously, the quaint possibility that multiple workloads can be readily
consolidated today and delivered by a single operating system image does not
seem to occur to most Windows Technical Support professionals. Yet while
many of the system management deficiencies of early versions of the Windows
NT platform have been addressed, the “best practices” associated with deploying
single application servers have barely taken notice. As an alternative to virtualization,
server consolidation through consolidation of multiple workloads under a single,
native OS image actually reduces the number of machine images that are deployed
and permits these workloads to run at native execution speeds. We concede
that executing multiple workloads inside a single OS image requires the application
of performance management and capacity planning skills that are in woefully
short supply for the Windows platform. Nevertheless, the application of this
core systems management discipline, in principle at least, is far simpler
in the native environment.
Virtualization technology today.
Virtualization is accomplished, as illustrated in Figure 1, by installing
a virtual machine host on the bare metal that is then capable of running numerous
virtual machine guest operating system images beneath it; in practice, as
many guest machines as will fit.

Figure 1. The architecture of VMware ESX version 3.
Figure 1 illustrates several key architectural features of the most popular
virtualization solution for server consolidation, which is the VMware ESX
Server product. (The discussion that follows is based mainly on ESX version
3.) Note that the VMware ESX software functions as the primary operating system
(OS) supervisor that interacts directly with the physical hardware – the processor,
RAM, the disks, the network, the video display, etc. Due to its status as
a base platform that runs the virtual machines, the VM Host software layer
that you install the virtual machine OS on top of is sometimes called the
hypervisor [7]. For an academic audience, the VM
Host software is known as the Virtual Machine Monitor [8]. Both are terms for an operating system supervisor
that is very limited in scope.
Unlike a more general purpose operating system, the functions that the VM
Host software performs are narrowly delineated to those that are required
to define the virtual machine guests and sustain them once they are activated.
Since VM Host software for Windows originated with Independent Software Vendors
(ISVs) who had no preferred access to Windows internals, an additional goal
of the 3rd party developers who created the VM Host software was
to run Windows guest machines transparently.
In the ESX architecture the VM Host software is responsible for all native
devices attached to the machine. The requirement that VMware be able to provide
native device drivers is a major encumbrance, the burden of which VMware attempts
to minimize by using a Linux-compatibility module. This Linux interoperability
makes it relatively easy to adapt existing Linux device driver modules so
that they can be re-compiled into the VMware Host kernel. (Due to the GNU
Open Source licensing restrictions, VMware is careful to say that ESX was
not derived directly from Linux, despite outward evidence of its family resemblance.)
In practice, ESX supports a wide variety of disk, network, and SAN-attached
devices (see http://www.vmware.com/pdf/esx_io_guide.pdf
for reference), similar to the range of devices that can be attached to most
Linux servers.
Relying on the VM Host software to provide native device drivers to support
all attached physical devices is not the only way to achieve virtualization’s
goals. The VMware GSX and Workstation products, as well as the Microsoft
Virtual Server 2005 product based on the software Microsoft originally acquired
from Connectix in 2003, provide virtualization software that is installed
on top of a standard Windows OS installation. This approach allows you to
install and run native Windows device drivers, which are more widely available
for some peripherals than Linux drivers. Native Windows device drivers typically
exploit Windows Plug-and-Play technology during installation and set-up. They
frequently also have more elaborate feature sets and user interfaces than
their Linux counterparts. On the other hand, the ESX dedicated Virtual Machine
Monitor approach permits a greater degree of vm guest isolation, such that
a problem with a device driver on one virtual machine guest is less likely
to impact other vm guests that offer shared access to the same device. The
ESX Host software also supplies a dedicated Scheduler service, as illustrated
in Figure 1, to dispatch VM guests, rather than rely on the standard Windows
priority-based thread Scheduler which was never designed with virtualization
in mind.
Figure 1 also illustrates the paravirtualization approach where VM-aware
device drivers may be installed on the guest OS to improve performance.[20]
In VMware ESX, a prominent example of this approach is the vmxnet virtual
network adapter. According to VMware documentation, the vmxnet driver “implements
an idealized network interface that passes through network traffic from the
virtual machine to the physical cards with minimal overhead.”
Both the VMware Host and Virtual Server 2005 also inject one or more service
processes into the guest Windows machine. These are used to facilitate communication
between the Host and the Guest. In the case of ESX, the Windows guest OS
runs VMWareService, plus the VMWareTray and VMWareUser processes. Communication
between the VM Host and the Windows Guest OS occurs across a virtual NIC that
simulates a generic AMD PCNET Family Ethernet adapter (Host-guest machine
communication in Virtual Server 2005 is similar). From a performance monitoring
perspective, whenever you can detect that the VMWareService process is active,
then you can safely deduce that the machine in question is a virtual machine
guest.
Sizing virtualization environments.
In principle, capacity planning to size the machine hosting two or more virtual
machines should be quite simple. It corresponds to the problem of folding
n virtual machines into one container, the machine that hosts the VM
Monitor. It does require an optimal solution over time that factors in whether
individual workload peaks overlap or not (if you can consolidate non-overlapping
workloads, you can achieve significantly more efficient operations). Some
care must also be taken to ensure that the solution is optimal across multiple
dimensions, where each dimension corresponds to utilization of some physical
resource that the host machine must apportion among the guest machines – the
processor, RAM, the disks, and the network interfaces. In addition, capacity
planning discipline would forecast this resource demand for the applications
over some specified planning horizon.
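The folding check described above can be sketched in a few lines of Python. This is a minimal illustration rather than a sizing tool: the resource names, the two-interval demand profiles, and the helper functions are all hypothetical, and virtualization overhead is deliberately left out so that only the effect of overlapping peaks is visible.

```python
from typing import Dict, List

# Demand profile: resource name -> demand per interval (for example, per hour).
Profile = Dict[str, List[float]]

def combined_peaks(guests: List[Profile]) -> Dict[str, float]:
    """Sum guest demand interval by interval, then take the peak per resource."""
    peaks: Dict[str, float] = {}
    for resource in guests[0]:
        totals = [sum(values) for values in zip(*(g[resource] for g in guests))]
        peaks[resource] = max(totals)
    return peaks

def fits_on_host(guests: List[Profile], capacity: Dict[str, float]) -> bool:
    """True only if the combined peak stays within capacity in every dimension."""
    peaks = combined_peaks(guests)
    return all(peaks[r] <= capacity[r] for r in peaks)

# Hypothetical guests: one peaks in the first interval, the other in the second.
mail = {"cpu_pct": [60.0, 10.0], "ram_mb": [900.0, 700.0]}
backup = {"cpu_pct": [5.0, 70.0], "ram_mb": [300.0, 800.0]}
host = {"cpu_pct": 100.0, "ram_mb": 2048.0}

print(combined_peaks([mail, backup]))
print(fits_on_host([mail, backup], host))
```

Summing interval by interval, instead of adding each guest's individual peak, is what rewards consolidating workloads whose peaks do not overlap.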
Processor overheads. The processor, for example, should be large enough
to handle the sum of the processor demand from each configured virtual machine,
plus additional headroom to accommodate the inevitable virtual
machine management “overhead” (to be dissected in somewhat more detail below):
$$\text{Physical CPU Capacity} > \text{VMM management overhead} + \sum_{n} \left(\text{VM-Guest}_n\ \text{CPU} + \text{Overhead}_n\right)$$
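The inequality above translates directly into a check. In this sketch the function name, the units (percent of one processor, so a four-way machine offers 400), and every number are our own illustrative assumptions, not measurements.

```python
from typing import List, Tuple

def cpu_capacity_ok(physical_capacity: float,
                    vmm_overhead: float,
                    guests: List[Tuple[float, float]]) -> bool:
    """guests holds (guest CPU demand, per-guest overhead) pairs, same units."""
    demand = vmm_overhead + sum(cpu + overhead for cpu, overhead in guests)
    return physical_capacity > demand

# Hypothetical four-way host: 400% capacity, 20% VMM management overhead.
guests = [(90.0, 15.0), (120.0, 25.0), (60.0, 10.0)]
print(cpu_capacity_ok(400.0, 20.0, guests))
```

The hard part in practice is not the arithmetic but obtaining trustworthy values for the overhead terms.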
In sizing the processor, at least three major sources of VM overhead can
be identified:
· Virtual machine Scheduler overhead
· Privileged instruction emulation overhead
· Duplication of the I/O driver code path by the native VM Host device drivers
Scheduler overhead. In [10], Gunther identifies
the VMM Scheduler that is responsible for either round-robin or weighted dispatching
of virtual machine guests as one source of overhead. Gunther relies on a
VMware-published ESX benchmark [11] that shows that this management overhead
is minimal and well-behaved. It appears to scale linearly with the number
of VM guests that are defined. The ESX benchmark data from [11] is summarized in Figure 2 for a four-way machine.
The benchmark workload in [11] is severely CPU-constrained.
It is also designed to minimize the other two major sources of VMware management
overhead. Notice that overall throughput tails off slightly as more virtual
machines are configured to run than there are physical processors available
to run them. The difference between the dotted horizontal line
at the top of the chart identified as “theoretical” and the actual Completion
rate represents the processor scheduling overhead. The scheduling overhead
per guest machine can be calculated as:
$$\frac{(\text{Theoretical throughput} - \text{Actual throughput}) \times 100}{\text{Theoretical throughput} \times \text{number of VMs}}$$

Figure 2. ESX scalability on a CPU-bound workload. Taken
from benchmark results published by the VMware Corporation. [11]
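As a worked example, the per-guest scheduling overhead calculation can be expressed as a small function. The throughput figures below are invented for illustration and are not taken from the benchmark in [11].

```python
def scheduler_overhead_pct_per_vm(theoretical: float,
                                  actual: float,
                                  n_vms: int) -> float:
    """Percentage of theoretical throughput lost to scheduling, per guest."""
    return (theoretical - actual) * 100.0 / theoretical / n_vms

# Hypothetical readings: 100 units theoretical, 92 measured, 8 guests defined.
print(scheduler_overhead_pct_per_vm(100.0, 92.0, 8))
```

Dividing by the number of guests only makes sense if the overhead scales roughly linearly with the number of guests, as the benchmark data suggests it does.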
Privileged instruction emulation. In [12], Menascé identifies a second source of virtualization
processor “overhead,” which arises because the guest OS executes in User mode
(Ring 3 on Intel processors). Under VMware, every time an OS function
inside the guest machine attempts to issue a privileged instruction, a hardware
exception is raised. The VM Host interrupt handler has to trap this exception
and recover from it. It does this by emulating the privileged instruction
issued by the guest OS that failed. In practice, the emulation routine can
be quite involved, depending on the function that VMware must mimic.
To illustrate this process, let’s look, for example, at what happens when
the guest OS needs to perform an I/O operation to disk. In the course of
generating an I/O request, the Windows kernel-mode I/O Manager and the physical
disk device’s associated driver code normally operate exclusively in Privileged
mode, or Ring 0. In the virtualization environment, all of this code is executed
in User mode, or Ring 3. Whenever any kernel mode instruction that is only
valid in Privileged mode is issued in User mode, the instruction fails. At
this point, the VM Host software intervenes. After trapping the invalid instruction
interrupt that occurs, the VM Monitor runs an emulation routine that mimics
the original intent of the guest OS, and then returns control to the guest
machine. Menascé characterizes this overhead in modeling terms as an execution
delay, which it certainly is. In the virtualization environment, each attempt
to execute a privileged instruction by the guest OS is replaced by an interrupt,
the execution of the VM Host interrupt handler, and, finally, the execution
of the emulation routine. The instruction path length associated with the
function increases enormously. Unfortunately, the full extent of the associated
delay is impossible to characterize accurately without measurements taken
by the VM Host on the number of privileged instructions emulated. Note that
this important performance data is not currently provided by VMware.
One approach is to base the estimate of the overhead associated with emulating
instructions that require privileged mode on the amount of time the guest
OS spends in kernel-mode (% Privileged Time). This suggestion is helpful,
but woefully imprecise. Very few of the instructions executing in kernel-mode
are actually privileged instructions. Depending on the OS function being executed,
we might be able to pursue this line of reasoning deeper. Device-driver
functions that issue instructions that reference physical addresses (not virtual
ones) require Privileged mode to succeed. Major OS functions related to I/O
processing in general, including the Cache Manager, the Workstation and Server
services, processing within the TCP/IP stack, and the kernel-mode http.sys
driver in IIS 6.0, all make extensive use of physical addressing mode.
The performance of all functions that rely on memory buffers pinned in the
Nonpaged Pool suffers in the virtualization environment, in some cases prohibitively
so. Menascé [8] suggests that the cycles lost to
privileged instruction emulation are at least partially offset by running
the virtual machine on a correspondingly faster processor. This planning assumption
is far too optimistic. If you are consolidating workloads running on previous
generation hardware onto new machines where a virtualization engine is installed,
the newer machines are likely to have a clock rate that is twice as fast.
However, the virtualization cost of emulating guest OS privileged instructions
is on the order of hundreds, or even thousands, of additional instructions
to be executed, for each privileged instruction that fails.
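A back-of-the-envelope calculation shows why a faster clock is so easily swamped. The sketch below assumes, purely for illustration, that every instruction takes the same time and that a known fraction of the guest's instructions are privileged; neither the function names nor the numbers come from VMware or from measurements.

```python
def relative_run_time(privileged_fraction: float,
                      emulation_instructions: float,
                      clock_speedup: float) -> float:
    """Run time under virtualization relative to native on the old hardware.

    Each privileged instruction is assumed to cost emulation_instructions
    additional instructions; the new machine is clock_speedup times faster.
    """
    path_length_factor = 1.0 + privileged_fraction * emulation_instructions
    return path_length_factor / clock_speedup

def break_even_fraction(emulation_instructions: float,
                        clock_speedup: float) -> float:
    """Privileged fraction at which the faster clock is entirely consumed."""
    return (clock_speedup - 1.0) / emulation_instructions

# Hypothetical case: 1,000 extra instructions per trap, clock twice as fast.
print(relative_run_time(0.001, 1000.0, 2.0))
print(break_even_fraction(1000.0, 2.0))
```

Under these assumptions, a workload in which just one instruction in a thousand is privileged already gives back the entire benefit of the doubled clock rate.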
Duplication of the I/O driver code. A third source of virtual machine
management overhead is a doubling of the number of instructions to both initiate
I/O operations and service I/O completion interrupts. When a hardware-related
device interrupt occurs, the native device driver code running in the VM Host
software layer is driven initially. Once the native device driver services
the interrupt, the VM Host software must determine which guest OS initiated
the request and how to map the physical request into the appropriate virtualized
context. Once this is accomplished, the VM Host software queues a virtual
interrupt for the guest OS, which must then await dispatching by the VM Host
guest machine Scheduler before the guest machine can detect that a device
interrupt has occurred and process it. (This leads to delays in I/O interrupt
servicing that are discussed in the section entitled “I/O interrupt delays.”)
Long delays of this kind have been experienced, sometimes dramatically, by
I/O intensive applications at known sites. When the device interrupt is
received by the guest OS, its version
of the interrupt handler is then dispatched to deal with it. Clearly, two
similar sets of code are traversed, where only one set would be executed in
a native run-time environment.
For Windows guest machines running under VMware, the sum of
% Interrupt Time and % DPC Time recorded at the Processor level when the
system was running in native mode, multiplied by two, provides a good estimate
of the amount of additional CPU time that the VM Host software will require
to process device interrupts on behalf of the guest machine. The factor of
two accounts for the device driver code path to initiate an I/O request,
which is not broken out of the % Privileged Time measurement. Presumably,
the native device driver code that the VM Host runs to initiate I/O requests
and process I/O interrupts has performance characteristics that are comparable
to the native Windows device driver.
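The rule of thumb is simple enough to state as code. The counter readings below are hypothetical; in practice they would be taken from the Processor object while the workload runs natively.

```python
def vm_host_interrupt_cpu_pct(interrupt_time_pct: float,
                              dpc_time_pct: float) -> float:
    """Estimated extra host CPU (%) for device I/O on behalf of one guest.

    Doubling covers the I/O initiation path as well as interrupt servicing.
    """
    return 2.0 * (interrupt_time_pct + dpc_time_pct)

# Hypothetical native measurements: 3% Interrupt Time, 4.5% DPC Time.
print(vm_host_interrupt_cpu_pct(3.0, 4.5))
```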
Sizing RAM. VMware must provide a shadow copy of every page table
entry (PTE) that is present on each guest OS (note that page tables are built
per process). The VM Host software must intervene to maintain the consistency
of the shadow PTEs any time their status changes on the guest OS due to routine
memory management.[21] In practice, the overhead
of virtualized memory management can be a serious performance factor that
is not limited to configurations that are memory-constrained.
In the case of sizing RAM, the memory requirements of a Windows guest can
usually be reliably estimated by subtracting the Memory/Available Bytes counter
from the size of RAM when the machine runs natively. VMware advises allowing
for an additional 32 MB of RAM per virtual machine, plus the VM Host software
itself, which requires about 400 MB of RAM.
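Putting these figures together, a first-cut RAM estimate for the host might look like the sketch below. The function names and the sample guest sizes are hypothetical; the 32 MB and 400 MB defaults are the allowances quoted above, and any savings from page sharing are ignored.

```python
from typing import List

MB = 1024 * 1024

def guest_ram_mb(installed_ram_bytes: int, available_bytes: int) -> float:
    """Native memory requirement: installed RAM minus Memory/Available Bytes."""
    return (installed_ram_bytes - available_bytes) / MB

def host_ram_mb(guest_requirements_mb: List[float],
                per_vm_overhead_mb: float = 32.0,
                vm_host_mb: float = 400.0) -> float:
    """Guest requirements plus per-VM overhead plus the VM Host itself."""
    return sum(g + per_vm_overhead_mb for g in guest_requirements_mb) + vm_host_mb

# Hypothetical guest: 2 GB installed, 1,200 MB available when running natively.
first_guest = guest_ram_mb(2048 * MB, 1200 * MB)
print(first_guest)
print(host_ram_mb([first_guest, 600.0, 500.0]))
```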
VMware exploits the shadow PTE mechanism to wring some counter-balancing
efficiencies from the virtual memory management process. Guest machines that
have memory resident pages that are identical are able to share a single,
memory-resident copy of the page. This feature is more noteworthy for the effort
involved in identifying pages that are eligible to be shared than for the result,
which is limited when very heterogeneous servers are consolidated. In the
case of homogenous workloads, Waldspurger in [19]
reports an impressive level of savings, nearly 40% in the case where 10 guest
machines all running Windows NT 4.0 were defined. Even in less homogenous
workloads, in almost every case we have observed, the memory footprint of
guest machines under ESX was smaller than the standalone RAM requirements
of those machine workloads. Nevertheless, consolidating the workloads under
a single OS image where possible remains the superior approach. In Virtual
Server 2005, a full complement of RAM on the host machine must be provided
to the guest OS or else the guest machine simply will not boot.
The VMM maintains a single set of page tables that the virtual address translation
mechanism in the hardware recognizes, which essentially duplicates the virtual
address mapping information that each guest OS must itself maintain in virtual
memory. It should go without saying that you would not want to configure
a memory-constrained guest OS that had high paging rates. VMware uses a technique
called ballooning [19] where it injects a device
driver into a guest OS that ties up large amounts of physical memory in order
to force the guest OS to trim unused virtual memory. Ballooning thus allows
VMware to defer making page replacement decisions to the guest OS, which is
in a far better position to make them intelligently. The ability of Windows
Server 2003 to communicate directly with server applications that perform
their own memory management (predominantly in support of I/O buffering) [15]
suggests ballooning could be quite effective in this environment. The VMware
Host software can determine which memory locations the OS has freed up by
examining the guest OS page tables. Then VMware can re-distribute these available
pages to other guest machines.
I/O interrupt delays. The foregoing
catalog of the performance issues impacting the scalability of virtualization
technology would not be complete without some mention of significant interrupt
processing delays that can easily arise. Significant interrupt delays are
likely whenever there are more virtual machines defined than there are physical
processors to run them. The problem can grow acute when some of the virtual
machine workloads are I/O bound. This potential impact should be treated as
a capacity planning issue during consolidation and deployment.
In a virtualization environment, servicing a high priority device interrupt
becomes a two stage process. The VM Host driver software services the native
device interrupt on demand as a necessarily high priority operation that preempts
any lower priority task that is currently dispatched. One effect of this
is that a guest machine that is currently dispatched can be interrupted by
an I/O completion that was initiated by a different virtual machine. After
the VM Host software services the interrupt, it queues a virtual interrupt
to be processed by the initiating virtual machine. The virtual machine waiting
on the interrupt may also be waiting to be dispatched on the VM Host queue.
The guest machine cannot service the interrupt until the next VMware Host
Scheduler interval in which it is scheduled to run. Any dispatchability delays
effectively increase the time it takes the virtual machine to process device
interrupts. This interrupt processing delay can be considerable. This delay
may be reduced by limiting the number of guest machines and by limiting the
number of processors being dispatched by each guest (based on the individualized
needs of the guest).
Past experience with virtualization technology in the mainframe world (see,
for example [13]) shows that I/O bound virtual
machine workloads are prone to a secondary effect if they endure extended
periods when they are ineligible to be dispatched to service the interrupts
they are waiting for. I/O interrupts that are queued to be serviced can only
be processed when the virtual machine is finally dispatched. The effect is
to stagger interrupt processing at the guest machine in a manner that leads
to a skewed arrival rate distribution, a worst case that maximizes the queuing
delays that are experienced. Ultimately, this secondary effect proved so
powerful that the dispatcher mechanisms in mainframe partitioning schemes
had to be modified to counteract it. The same behavior is currently evident
in the virtualization solutions available for the Windows platform.
The performance of virtual machines during I/O intensive operations like
back-up illustrates the problem. Suppose the virtual machine running the back-up
task is one of two virtual machines vying for a processor. When the VM is
able to execute, it usually has a backlog of I/O interrupts to service. After
the interrupts are serviced and the next I/Os in the sequence are initiated,
there may be little or no other processor-oriented work that needs to be done.
The virtual machine idles its way through the remainder of the time slice
that it is eligible to run. When its time slice expires, the virtual machine
waits. Meanwhile, some of the I/Os that were initiated during its last cycle
of activity complete, but they cannot be serviced because the VM is not eligible
to be dispatched. At the next interval where the VM is dispatched, the cycle
repeats itself. Compared to running native, the I/O throughput of the virtual
machine is slashed by 50% or more.
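The cycle just described can be illustrated with a toy model. The Python sketch below is not based on any measurement; it assumes a workload that keeps exactly one I/O outstanding, a fixed device service time, and a guest that alternates between equal-length periods of being dispatched and waiting. The function name and all parameter values are hypothetical.

```python
def ios_completed(total_ms, io_ms, slice_ms=None):
    """Count I/Os finished in total_ms by a workload with one I/O outstanding.

    slice_ms=None models a native machine. Otherwise the guest is assumed to
    run in alternating time slices: dispatched for slice_ms, then waiting for
    slice_ms while a competing guest owns the processor.
    """
    now = 0
    done = 0
    while True:
        finish = now + io_ms                 # the device completes the I/O
        if slice_ms is not None:
            phase = finish % (2 * slice_ms)
            if phase >= slice_ms:            # guest not dispatched: interrupt stays pending
                finish += 2 * slice_ms - phase
        if finish > total_ms:
            break
        done += 1                            # interrupt serviced, next I/O initiated
        now = finish
    return done


native = ios_completed(10_000, io_ms=5)
guest = ios_completed(10_000, io_ms=5, slice_ms=50)
print(native, guest, guest / native)
```

With these assumed values the guest completes half as many I/Os as the native run, because every I/O that finishes while the guest is waiting sits pending until the next time slice begins.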
The best way to minimize the impact of this scheduling delay currently is
to configure an I/O bound workload so it has guaranteed access to at least
one dedicated physical processor.
Performance expectations.
You are apt to be disappointed if you commence a server consolidation
effort without some reasonable expectations about the performance of your
applications as you start running them under virtualization. The previous
section discussed some of the important architectural features of the virtualization
software available for the Windows platform that have a major resource capacity
impact. Focused primarily on the popular VMware ESX package, it identified
virtualization overheads that need to be factored into an initial server sizing
effort. Several main issues were raised in the previous section that impact
capacity. In this section we focus on the performance impact of these architectural
features.
- Virtual machine guests are subject to either round-robin or time-weighted
dispatching by the VM Host software on the physical processors assigned
for their use. (In VMware ESX version 3, a guest machine can be assigned
to use from 1 to 4 physical processors.) The VMware Scheduler overhead used
to switch the processor between virtual machines is minimal, but scales
linearly with the number of virtual machines that are defined.
For any given processor workload, instruction execution throughput is degraded
in proportion to its scheduling weight and the contention for the physical
processors from other virtual machines. VMware claims that ESX can detect when
a virtual machine is idle, and will dispatch a different eligible virtual
machine when it does so. This is critical to the performance of Windows guests
because Windows does not issue a processor HLT instruction when it has no
work to do [15].
Using a periodic timer check, currently set to fire every 1.5 milliseconds
by default, the ESX Host Scheduler has an opportunity to preempt an idle virtual
machine, assuming it can detect the system state reliably. As reported in
[15], the Windows Idle loop is a succession of
no-op instructions issued from the HAL. On machines that support ACPI power
management, the HAL transfers control from the Idle loop to a model-dependent
processr.sys driver module to take appropriate action, which may include slowing
down or powering off the processor. In practice, it is critical that VMware
ensure that idle Windows guests do not waste processor cycles. A utility called
the Idler Service is available for the Windows platform to help customers
improve idle loop detection.
Using a series of benchmarks, Shelden tried to characterize ESX processor
dispatcher queuing delay in [16], which he discovered
could be substantial in ESX version 2. He found, for example, that when a
virtual machine was eligible to execute on two processors, VMware waited until
it could dispatch the guest OS on both processors simultaneously. This technique,
which VMware calls co-scheduling, provides a consistent multiprocessor
environment for the guest OS, but it also means potentially significant dispatching
delays when there is contention among virtual machines for the processor.
Shelden’s results suggest that co-scheduling virtual machines causes significant
performance degradation when there is CPU contention among the OS guests.
Performance analysts need to be on the lookout for runaway application threads
that are looping, something that sites can blithely ignore in many dedicated
server application environments. A guest machine running a thread stuck in
a loop has the potential to create significant contention for any processors
it shares with other guest machines.
- The guest OS is run in User mode. All attempts by the guest OS code to
issue privileged instructions are trapped by the VM Host software and emulated.
The additional interrupts that are generated (due to the failed instructions)
and the emulation code that substitutes for the original instruction add
significant path length to routine attempts by the guest OS to execute privileged
instructions.
Server applications that rely on physical addresses are likely to be the
most vulnerable to this delay. The mechanism used in Windows to make physical
addresses available to device driver and other system modules is for them
to allocate memory in the system’s Nonpaged Pool. This should make it relatively
easy to identify applications that are subject to this delay. Any Windows
server application that makes extensive use of physical addresses is potentially
at risk.
Windows has a number of subsystems that remain in kernel-mode so that they
can work directly with DMA controllers and utilize physical addresses allocated
from the Nonpaged Pool. These include the MDL Cache interface (MDL stands
for Memory Descriptor List) used by the file Server service and IIS.
High performance Fibre Channel and SCSI device drivers also make extensive
use of physical addressing and MDLs. When issued by a virtual machine guest
running in User mode, the privileged instruction to disable paging in order
to utilize physical addresses or to re-enable paging when it is safe to do
so fails. The failed instruction must then be emulated by the VM Host software.
For example, the http.sys driver module introduced in IIS 6.0 can process
Get Requests for static html objects entirely in kernel-mode. For the sake
of efficiency, this process relies on instructions that are accessing physical
memory addresses directly. (The popular Apache web server application that
runs primarily on Linux has comparable facilities.) In IIS 6.0, TCP/IP, also
running in kernel mode, queues an HTTP Get Request object to an http.sys worker
thread. The worker thread has access to a physical-memory resident look-aside
buffer (known as the Web Service Cache) where fully rendered HTTP Response
objects are cached. If a cache hit occurs, the HTTP Response object can then
be transferred from the cache to a physical memory-resident MDL output buffer
for transmission by the NIC back to the requestor’s IP address without leaving
kernel mode. Extensive intervention by the VMware Host, including emulation
of the failed privileged instructions, is required to complete these functions.
- The guest OS code to initiate and service I/Os is executed twice, once
by the guest OS and a second time by the native device drivers running on
the VM Host. This slows down the execution of both disk and network operations.
This strongly suggests that I/O bound workloads with critical performance
requirements – and that includes network I/O of all types – should be run
in native mode, not as virtual machine guests. Workloads that are not primarily
I/O bound may still have periods where I/O performance is critical. It may
be necessary, for example, to migrate guest machine back-up operations to
a VM Host native back-up operation if back-ups cannot complete in the window
that is available. The performance trade-off here is that VM Host native back-up
operations view the guest machine’s disk storage as a container file that
must be backed up or restored in its entirety. Application-level file back-ups,
like those performed on SQL Server or Exchange databases, must be run on the
platform that runs the application. Of course, this type of application server
may fall into the category of I/O bound workloads that need to run natively
anyway.
The larger concern for I/O bound workloads is where there is processor contention
among virtual machine guests and I/Os to the guest OS remain pending during
relatively long periods when the virtual machine is not eligible to be dispatched,
as discussed in the section entitled "I/O interrupt delays".
- The balloon technique which forces page replacement back to the OS in
various guest machines, along with the ability to share common pages among
homogeneous guests, makes memory management relatively efficient in the virtualization
environment. Whatever additional memory is required in the VM Host
software and on VM guests to make virtualization work is usually more than
offset by this bag of clever memory management tricks.
- The simple virtual machine processor dispatching method used in VMware,
along with the co-scheduling of processors to virtual machines defined to
run on multiprocessors, leads to long delays in I/O interrupts processed
by the guest OS. This, in turn, can lead to an artificial staggering in
the rate of I/O initiation and completion. This secondary effect of the
virtualization engine’s Scheduler can have a large negative impact on even
modestly I/O bound workloads.
The cumulative effect of these negative performance impacts can be substantial.
Shelden’s benchmark testing [16], which was designed
to isolate the performance impact of the VMware ESX processor scheduler, showed
significant degradation due to co-scheduling. In a test case where two virtual
machines were active, each with two active threads, each capable of consuming
approximately 55% of a processor in a conventional environment, overall utilization
of each physical processor in the 2-way VM Host machine managed to reach only
about 65% busy, instead of an expected 100% busy. One explanation is that
the VM Host software was unable to detect correctly when a Windows processor
is idling. Another possibility that Shelden explores is that processor co-scheduling
requires that physical processors be assigned to virtual machines in tandem.
When Shelden changed his simulation to use a Dispatch in Pairs scheduling
algorithm, instead of the expected Dispatch When Ready, his simulation results
were a significantly better fit to the actual data. Experiences in real world
environments have demonstrated that reducing the number of defined processors
(from 2 to 1) increases the efficiency of scheduling and therefore guest image
performance. This is discussed further in [1].
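The difference between the two dispatching policies can be made concrete with a small calculation. The Python sketch below is a toy model only and does not reproduce Shelden's simulation or his measured results. It assumes two guests with two virtual processors each on a 2-way host, that each virtual processor independently has work in a given scheduling interval with probability 0.55, and that no backlog carries over between intervals. The function name and policy labels are hypothetical.

```python
from itertools import product


def expected_utilization(p_ready, policy):
    """Expected fraction of 2 physical CPUs kept busy in one scheduling interval.

    Toy assumptions: 2 guests with 2 virtual CPUs each; every virtual CPU
    independently has work with probability p_ready; no backlog carries over.
    """
    total = 0.0
    for states in product((0, 1), repeat=4):     # ready flags: guest A (2), guest B (2)
        prob = 1.0
        for ready in states:
            prob *= p_ready if ready else 1.0 - p_ready
        ready_a, ready_b = sum(states[:2]), sum(states[2:])
        if policy == "when_ready":
            busy = min(ready_a + ready_b, 2)     # any ready virtual CPU may take a free processor
        elif policy == "in_pairs":
            busy = max(ready_a, ready_b)         # one guest receives both processors together
        else:
            raise ValueError(policy)
        total += prob * busy / 2
    return total


for policy in ("when_ready", "in_pairs"):
    print(policy, round(expected_utilization(0.55, policy), 3))
```

Under these assumptions, dispatching in pairs always leaves more processor capacity idle than dispatching when ready, because a processor handed to a guest whose second virtual processor has no work cannot be given to the other guest.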
In [9], the principal researchers behind the development
of Xen compare and contrast the performance of native Linux, Xen, and VMware
Workstation on a series of benchmarks. Figure 3 below summarizes three of
the comparisons: a CPU stress test running specint; a database-oriented benchmark;
and the SpecWeb99 test suite that is a very comprehensive test of web server
capabilities (it stresses the CPU, the network and the disks). On the CPU-bound
workload, VMware results keep pace. But on the database and web server workloads,
VMware lags native performance considerably. On the web server tests, the
Xen developers observed, “VMware suffers badly due to the increased proportion
of time spent emulating ring 0 code while executing the guest OS kernel.”

Figure 3. Benchmark results comparing native Linux, Xen,
and VMware on three different workloads. From a paper written by the principal
developers of Xen at Cambridge University. [9]
In a battle of the benchmarks, VMware has recently published its own set
of head-to-head comparisons with Xen that shows a markedly different picture
[22]. Showcasing the performance of the VMware-aware
vmxnet network driver, VMware reports that ESX achieves performance on the
Netperf benchmark at close to the level of a native OS, with Xen able to drive
network traffic at less than 10% of the native configuration. Unfortunately,
there are no published reports that provide an apples-to-apples comparison
where both virtualization products are running in their optimal configurations.
Monitoring the virtualization environment.
Performance monitoring for virtual machines running under VMware ESX is complicated
today because many of the familiar guest OS measurements become difficult
to interpret and cannot be relied upon. It is necessary to augment guest OS
monitoring procedures with measurements from the VM Host software, especially
regarding the use of the processor by various guest machines. Fortunately,
VMware ESX does provide some necessary processor utilization statistics, but
it lacks measurements that can give you a clear picture of what is going wrong
when performance issues surface.
One side-effect of the two-step process in which device interrupts are handled
by the VM Host software is that the timing mechanisms on a Windows guest machine
are not reliable. This affects all timer-based measurements that any performance
monitoring software running inside the Windows guest makes, including the
% Processor Time utilization measurements at the Thread, Process, and Processor
level and the Avg. Disk Secs/Transfer response time measurements for both
Logical and Physical Disk.
The Processor utilization measurements in Windows are derived from samples
taken once every clock interval. (See [15] for details.) The precision of the system clock is
maintained as if ticks occur every 100 nanoseconds. In reality, the system
clock time is updated much more slowly, normally about once every 15 milliseconds.
(In official Microsoft documentation, this duration between clock ticks is
sometimes referred to as the periodic interval.) The mechanism used
to maintain the system time is a timer interrupt that is set to fire regularly
every periodic interval.
When the timer interrupt occurs, the Windows OS advances the system clock.
It also samples the state of the machine at the time of the interrupt and
determines what thread from what process was running when the clock interrupt
occurred. All of the processor utilization measurements at the thread, process,
and processor level are based on this sampling technique.
The timer is a peripheral device on the Intel platform – it normally resides
on one of the supporting chip sets external to the processor. The timer functions
performed by the HPET and ACPI Timers require the equivalent of a very fast
I/O operation to access the current clock value. Intel-compatible processors
beginning with the Pentium do contain a Timestamp Counter (TSC) that reflects
the internal frequency of the microprocessor and the TSC can be used very
efficiently to time the duration of events. VMware ESX, in fact, uses the
TSC for its measurements of processor utilization. But the TSC can only be
used to tell time reliably relative to a single processor core. The TSCs on
different processor cores in a multiprocessor are likely to be out of sync.
In addition, on older Intel and AMD microprocessors with power-saving features,
the TSC does not maintain a constant uniform clock rate when it is executing
in reduced power mode.
The high priority clock interrupt that Windows relies upon to keep time internally
and for all its measurements of processor utilization is subject to interrupt
pending delays in a virtualization environment. The periodic interval that
Windows relies upon to keep accurate time is subject to erratic behavior,
as documented in [17]. Consequently, the processor
utilization samples that Windows gathers are no longer uniformly distributed
in time. This should not invalidate the measurements completely, especially
when you are looking at long enough intervals with sufficient samples to minimize
normal sampling error. However, Windows performance monitoring software does
not have direct access to the number of processor utilization samples. The
utilization metrics the software calculates are based on the assumption that
the intervals between samples are uniform. Obviously, this is not correct
in the virtualization environment. The resulting calculation of % Processor
Time is subject to major error because VMware does not deliver virtual clock
interrupts at uniform time intervals.
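The effect of non-uniform clock interrupts on a sampled utilization figure can be sketched with a toy model in Python. The numbers are assumptions chosen for illustration, not measurements: the guest is dispatched for 50 ms out of every 100 ms, it is busy for the first 30 ms of each time slice (working off its backlog), and any clock tick raised while the guest is waiting is delivered at the instant it is next dispatched. All names are hypothetical.

```python
def sampled_utilization(tick_times_ms, is_busy):
    """Estimate utilization the way a sampler does: busy samples / all samples."""
    samples = [is_busy(t) for t in tick_times_ms]
    return sum(samples) / len(samples)


CYCLE_MS = 100   # assumed: guest dispatched for 50 ms, then waits 50 ms
SLICE_MS = 50
BUSY_MS = 30     # assumed: guest is busy for the first 30 ms of each slice


def is_busy(t_ms):
    return t_ms % CYCLE_MS < BUSY_MS


def delivered(t_ms):
    """Time at which the guest actually sees a tick raised at t_ms."""
    phase = t_ms % CYCLE_MS
    if phase < SLICE_MS:
        return t_ms                    # guest is running: delivered on time
    return t_ms + CYCLE_MS - phase     # held until the guest is next dispatched


ticks = range(0, 3000, 15)             # one tick per 15 ms periodic interval
uniform = sampled_utilization(ticks, is_busy)
stacked = sampled_utilization([delivered(t) for t in ticks], is_busy)
print(uniform, stacked)
```

In this model, uniformly spaced samples report 30% busy, while the stacked-up samples report 80% busy, because every delayed tick lands at the start of a time slice where the guest is always found working.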
A number of knowledgeable observers have puzzled over the interpretation
of the processor utilization measurements that rely on this timing mechanism
when they are taken in a virtualization environment. They have had little
success trying to make sense of the normally reliable processor utilization
measurements provided by the Windows guest machine and even less correlating
them with measurements taken by the ESX host software. See for example, [5] and [16]. The problem is
that it is difficult to know what to make of any of the Windows % Processor
Time measurements (all counters of type PERF_100NSEC_TIMER are impacted) as
calculated by the Windows guest OS. The way interrupts in general, the clock
interrupt included, are stacked up waiting for service when the virtual machine
is finally dispatched undermines the uniform sampling methodology that Windows
relies on in its % Processor Time calculations.
Direct measurements taken by VMware ESX are necessary to augment the guest
OS measurements. (Generous portions of ESX version 3 and its Infrastructure
add-ons appear to be devoted to performance monitoring.) ESX currently provides
an interface to allow programs running on the VM guest to pull performance
metrics, including the Total CPU Seconds used by the VM (in milliseconds).
At least one third party performance monitoring product currently supplies
measurements of VM Guest CPU % Used and % Ready (purportedly, the percentage
of time a guest OS was Ready to run on the VM Host, but was unable to). Consolidating
this information with data from various guest machines is always a challenge,
of course.
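As a rough illustration of how such metrics can be derived, the Python sketch below converts the change in two cumulative millisecond counters over a measurement interval into percentages. It is a sketch under stated assumptions: the counter names are hypothetical, and the existence of a cumulative ready-time counter alongside the CPU time counter is assumed rather than taken from VMware documentation.

```python
def cpu_percentages(used_ms_delta, ready_ms_delta, interval_ms, vcpus=1):
    """Return (% Used, % Ready) for one guest over a measurement interval.

    used_ms_delta  -- growth of a cumulative CPU-time counter (hypothetical name)
    ready_ms_delta -- growth of a cumulative ready-time counter (assumed to exist)
    interval_ms    -- wall-clock length of the measurement interval
    vcpus          -- number of virtual processors defined for the guest
    """
    capacity_ms = interval_ms * vcpus
    if capacity_ms <= 0:
        raise ValueError("interval_ms and vcpus must be positive")
    pct_used = 100.0 * used_ms_delta / capacity_ms
    pct_ready = 100.0 * ready_ms_delta / capacity_ms
    return pct_used, pct_ready


# Example: 15 s of CPU time and 6 s of ready time in a 60 s interval.
print(cpu_percentages(15_000, 6_000, 60_000))
```

Because both values come from the host's own accounting, they do not depend on the guest's clock interrupts being delivered at uniform intervals.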
Even though Windows presents system timer values in ticks of a 100 nanosecond
clock, the system time value is actually only adjusted every periodic interval.
The actual granularity of clock intervals is about 15 milliseconds, as discussed
above. This granularity proved too coarse for many performance-oriented measurements.
So, beginning in Windows 2000, a new timer facility based on the TSC was introduced.
This newer timing facility, also known as a high precision clock, is accessed
using the QueryPerformanceCounter Win32 API call, somewhat inappropriately
named because it is a general purpose interface. In Windows XP and Server
2003, the QueryPerformanceCounter Win32 API call issues an RDTSC instruction.
In Windows Vista and the forthcoming Longhorn Server, QueryPerformanceCounter
was changed to use the HPET instead.
In VMware, the TSC is virtualized and the RDTSC instruction, which can only
be issued in privileged mode, is emulated [17]. In [17], VMware warns:
Reading the TSC takes a single instruction (rdtsc) and is fast on real hardware,
but in a virtual machine this instruction incurs substantial virtualization
overhead. Thus, software that reads the TSC very frequently may run more slowly
in a virtual machine. Also, some software uses the TSC to measure performance,
and such measurements are less accurate using apparent time than using real
time. In a virtualization environment, there is no guarantee that the I/O
interrupt from the clock will be issued or serviced by the guest OS in a timely
manner. [Emphasis added.]
The Windows performance measurements that are impacted by this anomaly are
the Logical and Physical Disk % Idle Time counters that measure disk utilization,
one of several disk performance counters of type PERF_PRECISION_100NS_TIMER.
The accuracy of the Avg. Disk secs/Transfer counter that furnishes disk response
time measurements is also suspect in a VMware environment. Both the Logical
and Physical Disk measurement layers in the I/O Manager stack issue a call
to QueryPerformanceCounter at the start and end of every I/O operation. Unfortunately,
native ESX only measures disk and network throughput per virtual guest, which
does not close the gap that not having these important guest OS disk performance
statistics leaves.
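The start-and-end timestamp pattern behind a response time counter can be sketched in a few lines of Python. This is an illustration of the general technique, not the code in the Windows I/O Manager. The function name is hypothetical. Python's time.perf_counter is used as the clock because, on Windows, it is documented to be backed by QueryPerformanceCounter.

```python
import time


def avg_secs_per_transfer(operations, clock=time.perf_counter):
    """Average elapsed time per operation, timed with a start and end timestamp.

    operations -- iterable of zero-argument callables standing in for I/O requests
    clock      -- timestamp source; replaceable so the arithmetic can be checked
    """
    total = 0.0
    count = 0
    for operation in operations:
        start = clock()            # timestamp when the request is issued
        operation()
        total += clock() - start   # timestamp when the request completes
        count += 1
    return total / count if count else 0.0


# Usage: time a stand-in for an I/O request three times.
print(avg_secs_per_transfer([lambda: sum(range(100_000))] * 3))
```

Every measurement depends on two readings of the clock, so any overhead or inaccuracy in reading the clock inside a virtual machine feeds directly into the reported average.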
The Future of Virtualization.
In 2005 both Intel and AMD announced strikingly similar new hardware designed
with virtualization in mind. The Intel initiative is called VT, while the
AMD product is called Pacifica. This new hardware first became available
in 2006. In both vendors’ specifications, a new privilege level is added where
the virtualization monitor runs. That will allow the virtualization monitor
to run a guest OS at its normal ring 0 privilege level, rendering current
privileged instruction emulation techniques obsolete. The new hardware also
includes some new instructions to allow the Virtual Machine Monitor to launch
a guest virtual machine and facilitate communication between a guest OS and
the VM Host software layer.
A new generation of virtualization software is required to exploit the new
hardware. All the leading vendors of virtualization software report that
they are actively working on new versions that will support the new hardware.
Preliminary reports indicate the new hardware addresses some, but not all,
of the performance problems that plague current virtualization solutions.
Adams and Agesen in [21] offer this somewhat over-harsh
judgment, “While the new hardware removes the need for BT [just-in-time Binary
Translation] and simplifies VMM design, in our experiments it rarely improves
performance.” Drilling down into the details of the series of micro-benchmark
runs they report that compare native performance to existing VMware software
and a VMware prototype that supports the new hardware, Adams and Agesen found
- both virtualization approaches were equivalent in being able
to execute compute-bound workloads at near native levels of performance
- the hardware approach is notably superior on any workload that
is rich in system calls because it is no longer necessary to emulate the privileged
instructions issued
- the software approach runs an order of magnitude slower during
processing of page faults and the processing of other machine exceptions
- the current VMware software approach outperforms the hardware-enabled
prototype on I/O bound workloads, ones where processes are frequently created
and destroyed, and ones with frequent address space context switches
Moreover, Adams and Agesen were discouraged to find that the 1st generation
virtualization hardware support provides no mechanism to assist
the Virtual Machine Monitor in maintaining a coherent set of shadow PTEs on
behalf of guest OS machines, a major source of overhead currently. They advocate
that the hardware vendors consider further extensions to their initial virtualization
specification that would aid in memory management, similar to the Start Interpretive
Execution (SIE) instruction that was added to the IBM 370 instruction set
to boost the performance of preferred guest machines under its VM operating
system. This is, in fact, something both AMD and Intel have done in specifying
a 2nd generation virtualization interface [23, 24]. For example, AMD’s Secure
Virtual Machine specification provides
- faster switching between the VMM and guest OS, including a new
VMMCALL instruction that can be used by a guest to call the VMM explicitly
- the ability for the VMM to trap or intercept selected instructions
executed in the guest OS, like RDTSC or IOIO instructions to specific ports,
and events like addressing exceptions
- a Nested Paging Facility that provides for two levels of address
translation, eliminating the need for the VMM to maintain shadow page tables
- external (DMA) access protection for memory
- assists for interrupt handling and virtual interrupt support
- a guest/host tagged virtual address Translation Lookaside Buffer
(TLB) to reduce memory management overhead
Once it is supported by systems software, the 2nd generation hardware
environment would allow preferred guest machines to access native devices
for better performance.
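The difference between shadow page tables and two levels of address translation can be shown with a deliberately simplified Python sketch. It uses single-level dictionaries in place of real multi-level page tables, and all table contents and names are hypothetical. It is meant only to show which party does the work, not how the hardware implements it.

```python
PAGE_SIZE = 4096

# Maintained by the guest OS: guest virtual page -> guest "physical" page.
guest_page_table = {0: 7, 1: 3}
# Maintained by the VMM: guest "physical" page -> host physical page.
host_page_table = {7: 42, 3: 19}


def nested_translate(guest_virtual_addr):
    """Two levels of translation, as nested paging performs on each lookup."""
    page, offset = divmod(guest_virtual_addr, PAGE_SIZE)
    guest_physical_page = guest_page_table[page]
    host_physical_page = host_page_table[guest_physical_page]
    return host_physical_page * PAGE_SIZE + offset


def build_shadow_table():
    """Software alternative: the VMM precomputes guest virtual -> host physical.

    The VMM must rebuild or patch this table whenever the guest changes its
    own page table, which is the overhead nested paging is meant to remove.
    """
    return {page: host_page_table[gpp] for page, gpp in guest_page_table.items()}


shadow = build_shadow_table()
addr = 1 * PAGE_SIZE + 5
print(nested_translate(addr), shadow[1] * PAGE_SIZE + 5)
```

Both paths yield the same host address. The difference is that the shadow table must be kept coherent by the VMM in software, while the nested lookup follows the guest's own table directly.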
It should be apparent from this discussion that virtualization technology
is in its infancy in the Windows environment. Server consolidation using current
virtualization technology has a number of serious performance limitations.
It needs to be deployed judiciously. If you are really interested in reducing
the total number of OS images you are managing, consider consolidating workloads
under a single OS image instead. In either approach, there is no substitute
for a rigorous application of traditional performance monitoring and capacity
planning techniques.
References.
[1] Marksamer, S. and Weilnau,
P., “Real World Adventures in Server Virtualization”, CMG Proceedings
2006.
[2] Friedman, M., “The reality
of virtualization for Windows Servers,” CMG Proceedings 2006.
[3] Friedman, M., “Can Windows
NT be tuned?” CMG Proceedings 1996.
[4] Friedman, M. and Pentakalos,
O., Windows 2000 Performance Guide, Boston, MA: O’Reilly Associates,
2002.
[5] Fernando, G., “To V or not
to V: a practical guide to virtualization,” CMG Proceedings 2005.
[6] Friedman, E., “Tales from
the lab: Best Practices in application performance testing,” CMG MeasureIT,
November 2005, available at http://www.cmg.org/measureit/issues/mit27/m_27_10.html.
[7] Seawright, L., and MacKinnon,
R., “VM/370 – a study of multiplicity and usefulness,” IBM Systems Journal,
1979, p. 4-17.
[8] Intel Virtualization Technology
Specification for the IA-32 Intel Architecture, ftp://download.intel.com/technology/computing/vptech/C97063-002.pdf.
[9] Barham, et al., “Xen and
the Art of Virtualization,” University of Cambridge Computer Laboratory, published
at SOSP 2003, available at http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2003-xensosp.pdf.
[10] “ESX server performance and
resource-management for CPU-intensive workloads.” Available at www.vmware.com.
[11] Gunther, N., “The Virtualization
Spectrum from Hyperthreads to Grids,” CMG Proceedings 2006.
[12] Menasce, D., “Virtualization:
concepts, applications and performance modeling,” CMG Proceedings 2005.
[13] Young, D., “Partitioning
Large Processors,” CMG Proceedings, 1988, p. 67-74.
[14] “VMware ESX Server 2: Architecture
and performance implications.” Available at www.vmware.com.
[15] Friedman, M., Windows
Server 2003 Performance Guide, a volume in the Microsoft Windows Server
2003 Resource Kit, Microsoft Press, 2005.
[16] Shelden, W., “Modeling VMware
ESX performance,” CMG Proceedings 2005.
[17] “Time-keeping in VMware virtual
machines,” Available at www.vmware.com.
[18] Robin, J.S., and Irvine,
C.E., “Analysis of the Intel Pentium’s Ability to Support a Secure Virtual
Machine Monitor,” Proceedings of the 9th USENIX Security Symposium,
2000. Available at http://www.cs.nps.navy.mil/people/faculty/irvine/publications/2000/VMM-usenix00-0611.pdf.
[19] Waldspurger, C., “Memory
Resource Management in VMware ESX Server,” Proc. Fifth Symposium on Operating
Systems Design and Implementation (OSDI ’02), Dec. 2002. Available at
http://www.waldspurger.org/carl/papers/esx-mem-osdi02.pdf.
[20] “Network throughput in a
virtual infrastructure.” Available at http://www.vmware.com/pdf/esx_network_planning.pdf.
[21] Adams, K. and Agesen, O.,
“A comparison of software and hardware techniques for x86 virtualization,”
ASPLOS Oct. 2006. Available at http://www.vmware.com/pdf/hypervisor_performance.pdf.
[22] “A performance comparison
of hypervisors.” Available at http://www.vmware.com/pdf/hypervisor_performance.pdf.
[23] Neiger, G., et al., “Hardware
support for efficient processor virtualization,” Intel Technology Journal,
Vol. 10, No. 3, August, 2006.
[24] AMD64 Architecture Programmer’s
Manual, Volume 2: System Programming.