MeasureIT

A Realistic Assessment of the Performance of Windows Guest Virtual Machines
March, 2007
by Mark Friedman and Stephen Marksamer

About the Author
Mark Friedman, Demandtech

Mark is the co-author of "Windows 2000 Performance Guide," published by O’Reilly in 2001. He is a frequent contributor to CMG conferences on performance and tuning topics, concentrating on Microsoft Windows NT in 1996. Also in 1996, he began developing Performance SeNTry, performance monitoring software for Windows NT and 2000 also known as NTSMF.

He edited an industry newsletter entitled "Mark Friedman on Storage Management," published by Demand Technology. In 1994, he founded OnDemand Software, which developed and sold the award-winning WinInstall software distribution package for Windows networks. From 1987-1991 he worked at Landmark Systems where he was the architect and led the development team that built The Monitor for MVS.


Introduction.

Virtualization appeals to computer professionals as a means to address the proliferation of under-utilized Windows servers, which is widely perceived as a significant system management problem. An ambitious server consolidation initiative that relies on virtualization can achieve significant administrative and other cost efficiencies. In this article we hope to inject a healthy dose of realism about the performance of current virtualization technology into a planning process that has high expectations for its value proposition. In practice, virtualization can be deployed to considerable benefit, but we worry that these benefits can also be oversold. For instance, we take the somewhat contrarian view that under-utilized Windows servers are not the profligate waste of resources that the evangelists for virtualization suggest they are. Nor is virtualization the most direct approach to solving that problem.

This article consolidates our research into the performance of virtualization technology, focusing on the popular VMware ESX product. It is based primarily on material from two recent papers [1, 2] presented at the annual CMG Conference in 2006, which themselves synthesize the results of numerous other investigators that we have borrowed from freely. Anyone interested in pursuing this topic further is advised to consult the original sources referenced in the bibliography that is provided.

Software developers were among the earliest adopters of the virtualization technology that is available today for Windows. They faced a common problem, namely, the need to subject new releases of software to rigorous testing on a wide variety of platforms on a tight budget for both time and materials. Virtualization software allows a single machine to be configured to run multiple operating system images that can then be used to ensure that the software being developed functions correctly in diverse configurations.  Software development and Quality Assurance testing remain the one area where virtualization can be deployed with the greatest unqualified chance of success.

So long as the virtualization technology being deployed was confined primarily to assisting with software development and testing, the technology raised few pressing capacity planning or performance concerns. Running the sort of application stress-testing workload where performance actually mattered could always be diverted to a dedicated machine. (See, for example, [6] for a thorough discussion of these issues.)  It was only when this same virtualization technology was re-positioned as a way to achieve server consolidation that serious capacity planning and performance considerations began to surface. To the degree that generously-sized current hardware capabilities often lead to machines that appear to be massively over-provisioned, sizing the host machine ought, in principle, to be relatively easy.

Capacity planning for virtualization is best described as an n:1 folding problem.  The capacity planner must ensure that the guest workloads can fit into the one physical machine managed by the virtual machine host. This is a multi-dimensional problem where the capacity planner must ensure that the processor, disk, memory and network bandwidth requirements of the combined guest machines – plus some allowance for virtual machine “overheads” – do not exceed the physical capabilities of the underlying hardware. This basic problem becomes difficult in the Windows Server environment because only limited measurement data is available that can be used reliably to estimate, in advance, the amount of virtualization overhead to expect for a given workload.

Discussing the sources of various virtualization “overheads” inevitably leads to considering the performance issues that currently arise when Windows runs as a guest machine.  We mention two significant problems that have not been discussed much by other commentators.  One concern is the technique used to schedule virtual machines, which has a serious performance impact on I/O bound workloads. A second problem area involves the measurement perturbations that occur on the Windows platform after virtualization technologies are introduced.

After providing a realistic assessment of the virtues of current virtualization technology, we discuss recent hardware improvements and their prospects for resolving these performance problems in a satisfactory manner.

Reining in the Server Farm.

Mainly out of concern for easing the burden of system administration, Windows servers are usually configured to run a single, homogenous workload.  Some of the historical reasons for preferring isolated Windows servers running single applications are (1) increased stability, (2) better reliability, and (3) simpler problem-solving. As capacity planners, we’d like to be able to say that simpler capacity planning is another worthy goal associated with this policy, but it seldom plays a significant role in the decision-making process.

A powerful argument for running homogenous, single workload Windows machines was increased stability.  Indeed, one of the present authors has been one of the forceful public advocates for this strategy to deploy Windows servers successfully on a large scale. [3] The practices and procedures forged in the crucible of experience that led in the past to successful deployment of large numbers of Windows servers running mission critical applications have an understandable resilience in the face of change. At the present time, however, it is worth considering whether the historical conditions that favored configuring Windows machines to run a single, homogenous workload are still present today. This argument is made in more detail in [2].

Current industry best practices recommend that, after a lengthy period of concentrated acceptance testing, the Technical Support group responsible for Windows servers and desktops adopt a stable image of the operating system and application software that it intends to maintain and support going forward. The image is then cloned every time there is an organizational requirement to support expansion of this application or support it in a new location.

This practice has deep roots. Consider that distributed computing initiatives allowed pockets of control by various departments of what was perceived as their hardware and applications. “Gold image” strategies played into that history, making it easier for distributed administration to be centralized as cost drivers increasingly brought the distributed world back into the data center. This re-centralization often occurred with little or no involvement from capacity planning professionals.

The widespread practice of cloning operating system and application software images that are certified by sysadmins for stability and reliability does not alone lead to massive over-provisioning of Windows servers.  The hardware to run Windows keeps getting more powerful in leaps and bounds. Current generation hardware from Intel and AMD offers 64-bit addressing, massive amounts of RAM, multi-core processors, and hyper-threading. These machines offer more processing power than many single workload application servers will ever need. Out of necessity, IT organizations continuously update their inventories due to depreciation and lease replacement, as well as both operating system and server application evolution. This process increases the number of these larger, more powerful servers that are deployed for the Windows server platform.  Over-provisioning inevitably occurs when a disciplined capacity management approach is not coupled with the hardware replacement strategy.

The lack of involvement in capacity planning has positioned IT departments where they now find themselves:  with abundant opportunities to save administrative costs by reducing the total number of machines that need to be managed. In this environment, server consolidation makes good economic sense, with virtualization being one obvious path to server consolidation.

In the case of a remote field office operation, for example, an accomplished system administrator might think it is necessary to supply a minimum of three separate OS images: one to provide the essential Messaging application like MS Exchange or Lotus Notes to tie employees at the remote office to the corporate e-mail network; one to provide Active Directory-based security and authentication services, and a third to supply data protection (i.e., back-up) and data recovery. Conventional practice would be to supply three separate, dedicated machines to perform these functions, all of which would likely be severely under-utilized.  Virtualization is an appealing option in this instance because it allows all three machine images to be packaged together and run inside a single box without losing the isolation of multiple operating system environments.

We are not ready to concede that this over-provisioning is nearly as big a problem as IT professionals reared on the frugality required to manage expensive mainframe technology cost-effectively presume it is.  In the face of the changing economics of Information Technology, it is certainly worthwhile to examine the assumptions behind this presumption that delivery of service always requires separate machines.  An application of rudimentary capacity planning practices and procedures would sharply reduce the degree of over-provisioning that is occurring today.

Why virtualization. This brings us to the heart of the matter. Servers running single application workloads are often massively over-provisioned. Virtualization promises to enable current hardware to be used more efficiently.

Many IT organizations also view virtualization technology as a viable solution to the evident problem that the organization has way too many machines to manage.  In theory, at least, virtualization technology is positioned to address the inefficiencies of running machines that are severely over-provisioned. Virtualization does provide a mechanism that allows system administrators to utilize current hardware more effectively, while retaining all the administrative advantages of isolating workloads on dedicated servers. Using virtualization, it is, for example, possible to configure and run two or more virtual machines – each devoted to running a single, isolated workload – on a single hardware platform.  In the multiple machine scenario described above, instead of provisioning three separate machines to run the mail server application, the domain controller, and the back-up server, all three server machines can be consolidated on a single hardware platform running virtualization software.

Virtualization allows the system administrator to supply all three essential services discussed above on a single piece of hardware, which is certainly a simplification along that important dimension. Moreover, the virtualization solution has the additional benefit that it appears not to require major changes in current system administration practice. (The recognition that virtualization itself might add significant complexity to the operation is something that usually only surfaces later with experience. See, for example [4].)

Potential performance concerns with virtualization are deemphasized, based on the assumption that the hardware used to consolidate these workloads is so much more powerful than what its OS guests demand.  But, as will be discussed below in more detail, this working assumption is naive. Having witnessed a number of large server consolidation efforts that relied on virtualization that came up far short of their ambitious consolidation goals, we believe it is important to raise a warning flag about the potential performance problems that arise even in the face of what appears to be massive over-provisioning.

Curiously, the quaint possibility that multiple workloads can be readily consolidated today and delivered by a single operating system image does not seem to occur to most Windows Technical Support professionals.  Yet while many of the system management deficiencies of early versions of the Windows NT platform have been addressed, the “best practices” associated with deploying single application servers have barely taken notice. As an alternative to virtualization, server consolidation through consolidation of multiple workloads under a single, native OS image actually reduces the number of machine images that are deployed and permits these workloads to run at native execution speeds. We concede that executing multiple workloads inside a single OS image requires the application of performance management and capacity planning skills that are in woefully short supply for the Windows platform. Nevertheless, the application of this core systems management discipline, in principle at least, is far simpler in the native environment.

Virtualization technology today.

Virtualization is accomplished, as illustrated in Figure 1, by installing a virtual machine host on the bare metal that is then capable of running numerous virtual machine guest operating system images beneath it; in practice, as many guest machines as will fit.

 

Figure 1.  The architecture of VMware ESX version 3.

Figure 1 illustrates several key architectural features of the most popular virtualization solution for server consolidation, which is the VMware ESX Server product. (The discussion that follows is based mainly on ESX version 3.) Note that the VMware ESX software functions as the primary operating system (OS) supervisor that interacts directly with the physical hardware – the processor, RAM, the disks, the network, the video display, etc.  Due to its status as a base platform that runs the virtual machines, the VM Host software layer that you install the virtual machine OS on top of is sometimes called the hypervisor [7]. For an academic audience, the VM Host software is known as the Virtual Machine Monitor [8].  Both are terms for an operating system supervisor that is very limited in scope.

Unlike a more general purpose operating system, the functions that the VM Host software performs are narrowly delineated to those that are required to define the virtual machine guests and sustain them once they are activated.  Since VM Host software for Windows originated with Independent Software Vendors (ISVs) who had no preferred access to Windows internals, an additional goal of the 3rd party developers who created the VM Host software was to run Windows guest machines transparently.

In the ESX architecture the VM Host software is responsible for all native devices attached to the machine. The requirement that VMware be able to provide native device drivers is a major encumbrance, the burden of which VMware attempts to minimize by using a Linux-compatibility module.  This Linux interoperability makes it relatively easy to adapt existing Linux device driver modules so that they can be re-compiled into the VMware Host kernel. (Due to the GNU Open Source licensing restrictions, VMware is careful to say that ESX was not derived directly from Linux, despite outward evidence of its family resemblance.)  In practice, ESX supports a wide variety of disk, network, and SAN-attached devices (see http://www.vmware.com/pdf/esx_io_guide.pdf for reference), similar to the range of devices that can be attached to most Linux servers.

Relying on the VM Host software to provide native device drivers to support all attached physical devices is not the only way to achieve virtualization’s goals.  The VMware GSX and Workstation products, as well as the Microsoft Virtual Server 2005 product based on the software Microsoft originally acquired from Connectix in 2003, provide virtualization software that is installed on top of a standard Windows OS installation.  This approach allows you to install and run native Windows device drivers, which are more widely available for some peripherals than Linux drivers.  Native Windows device drivers typically exploit Windows Plug-and-Play technology during installation and set-up.  They frequently also have more elaborate feature sets and user interfaces than their Linux counterparts. On the other hand, the ESX dedicated Virtual Machine Monitor approach permits a greater degree of vm guest isolation, such that a problem with a device driver on one virtual machine guest is less likely to impact other vm guests that offer shared access to the same device. The ESX Host software also supplies a dedicated Scheduler service, as illustrated in Figure 1, to dispatch VM guests, rather than rely on the standard Windows priority-based thread Scheduler which was never designed with virtualization in mind.

When you set up a Windows server guest machine to run under either VMware ESX or MS Virtual Server 2005, the guest OS sees only the virtual disk, network, and video devices exposed by the VM Host software. The VM Host software exposes a limited set of generic devices that the guest OS detects, configures, and uses.  In the case of ESX, it also limits the number of logical processors that the guest OS is able to detect. This is both an upper and lower limit; i.e., the number of configured processors (raised to a maximum of four CPUs per guest machine in ESX version 3) is always available while the guest OS is dispatched.

Figure 1 also illustrates the paravirtualization approach where VM-aware device drivers may be installed on the guest OS to improve performance.[20] In VMware ESX, a prominent example of this approach is the vmxnet virtual network adapter. According to VMware documentation, the vmxnet driver “implements an idealized network interface that passes through network traffic from the virtual machine to the physical cards with minimal overhead.”

Both the VMware Host and Virtual Server 2005 also inject one or more service processes into the guest Windows machine. These are used to facilitate communication between the Host and the Guest.  In the case of ESX, the Windows guest OS runs VMWareService, plus the VMWareTray and VMWareUser processes. Communication between the VM Host and the Windows Guest OS occurs across a virtual NIC that simulates a generic AMD PCNET Family Ethernet adapter (Host-guest machine communication in Virtual Server 2005 is similar).  From a performance monitoring perspective, whenever you can detect that the VMWareService process is active, you can safely deduce that the machine in question is a virtual machine guest.

Sizing virtualization environments.

In principle, capacity planning to size the machine hosting two or more virtual machines should be quite simple.  It corresponds to the problem of folding n virtual machines into one container, the machine that hosts the VM Monitor.  It does require an optimal solution over time that factors in whether individual workload peaks overlap or not (if you can consolidate non-overlapping workloads, you can achieve significantly more efficient operations).  Some care must also be taken to ensure that the solution is optimal across multiple dimensions, where each dimension corresponds to utilization of some physical resource that the host machine must apportion among the guest machines – the processor, RAM, the disks, and the network interfaces. In addition, capacity planning discipline would forecast this resource demand for the applications over some specified planning horizon.
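The folding check described above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name `fold`, the resource names, the flat overhead allowance, and all of the numbers are assumptions made up for the example. Each guest supplies one demand figure per time interval for each resource, and the function reports every interval and resource where the combined demand exceeds what the host can supply.

```python
def fold(guests, host_capacity, overhead=None):
    """Return (interval, resource) pairs where combined guest demand,
    plus an assumed flat overhead allowance, exceeds host capacity.

    guests        -- list of dicts: resource name -> list of demand per interval
    host_capacity -- dict: resource name -> capacity, in the same units
    overhead      -- dict: resource name -> flat allowance (hypothetical values)
    """
    overhead = overhead or {}
    violations = []
    for resource, capacity in host_capacity.items():
        intervals = len(guests[0][resource])
        for t in range(intervals):
            total = sum(g[resource][t] for g in guests) + overhead.get(resource, 0.0)
            if total > capacity:
                violations.append((t, resource))
    return violations


# Hypothetical workloads: CPU in percent of one host, RAM in GB, two intervals.
day_peaker = {"cpu": [80, 10], "ram": [2, 2]}
night_peaker = {"cpu": [10, 80], "ram": [2, 2]}
host = {"cpu": 100, "ram": 8}
allowance = {"cpu": 5, "ram": 1}

print(fold([day_peaker, night_peaker], host, allowance))  # peaks do not overlap
print(fold([day_peaker, day_peaker], host, allowance))    # peaks coincide
```

With these made-up figures, the two guests whose peaks fall in different intervals fit, while two guests that peak together overrun the processor in the first interval. That is the sense in which consolidating non-overlapping workloads is more efficient.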

Processor overheads.  The processor, for example, should be large enough to handle the sum of the processor demand from each configured virtual machine, plus additional headroom to accommodate the inevitable virtual machine management “overhead” (to be dissected in somewhat more detail below):

$$\text{Physical CPU Capacity} > \text{VMM management overhead} + \sum_{n} \left( \text{VM-Guest}_n\ \text{CPU} + \text{Overhead}_n \right)$$
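As a rough sketch, the processor sizing inequality can be evaluated directly once per-guest figures are in hand. The function name `cpu_fits` and the sample values are hypothetical; in practice the per-guest overhead terms are the hard part to estimate, as the discussion that follows explains.

```python
def cpu_fits(physical_capacity, vmm_overhead, guests):
    """Evaluate: physical capacity > VMM overhead + sum(guest CPU + guest overhead).

    guests -- iterable of (guest_cpu, guest_overhead) pairs, in the same
              units as physical_capacity (here, percent of one processor).
    Returns (fits, total_demand).
    """
    demand = vmm_overhead + sum(cpu + ovh for cpu, ovh in guests)
    return physical_capacity > demand, demand


# Hypothetical four-way host (400% of one processor) and three guests.
fits, demand = cpu_fits(400, 10, [(60, 15), (120, 30), (40, 10)])
print(fits, demand)
```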

In sizing the processor, at least three major sources of VM overhead can be identified:

·         Virtual machine Scheduler overhead

·         Privileged instruction emulation overhead

·         Duplication of the I/O driver code path by the native VM Host device drivers

Scheduler overhead. In [10], Gunther identifies the VMM Scheduler that is responsible for either round-robin or weighted dispatching of virtual machine guests as one source of overhead.  Gunther relies on a VMware-published ESX benchmark [11] that shows that this management overhead is minimal and well-behaved.  It appears to scale linearly with the number of VM guests that are defined.  The ESX benchmark data from [11] is summarized in Figure 2 for a four-way machine.

The benchmark workload in [11] is severely CPU-constrained.  It is also designed to minimize the other two major sources of VMware management overhead.  Notice that overall throughput tails off slightly as more virtual machines are configured to run than there are physical processors available to run them.  Therefore, the difference between the dotted horizontal line at the top of the chart identified as “theoretical” and the actual Completion rate represents the processor scheduling overhead.  The scheduling overhead per guest machine can be calculated as:

(Theoretical – Actual throughput) * 100 / Theoretical throughput / # of VMs

Figure 2. ESX scalability on a CPU-bound workload. Taken from benchmark results published by the VMware Corporation. [11]
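The per-guest scheduling overhead formula given above translates directly into code. The throughput figures in this sketch are invented for illustration and are not taken from the benchmark in [11].

```python
def scheduler_overhead_per_vm(theoretical, actual, num_vms):
    """Percent of theoretical throughput lost to scheduling, per guest machine."""
    return (theoretical - actual) * 100.0 / theoretical / num_vms


# Hypothetical: 1000 completions/hour in theory, 940 observed, 8 guests defined.
print(scheduler_overhead_per_vm(1000, 940, 8))
```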

Privileged instruction emulation. In [12], Menascé identifies a second source of virtualization processor “overhead,” which arises because the guest OS executes in User mode (Ring 3 on Intel processors).  Under VMware, every time an OS function inside the guest machine attempts to issue a privileged instruction, a hardware exception is raised.  The VM Host interrupt handler has to trap this exception and recover from it.  It does this by emulating the privileged instruction issued by the guest OS that failed.  In practice, the emulation routine can be quite involved, depending on the function that VMware must mimic.

To illustrate this process, let’s look, for example, at what happens when the guest OS needs to perform an I/O operation to disk.  In the course of generating an I/O request, the Windows kernel-mode I/O Manager and the physical disk device’s associated driver code normally operate exclusively in Privileged mode, or Ring 0. In the virtualization environment, all of this code is executed in User mode, or Ring 3.  Whenever any kernel mode instruction that is only valid in Privileged mode is issued in User mode, the instruction fails.  At this point, the VM Host software intervenes.  After trapping the invalid instruction interrupt that occurs, the VM Monitor runs an emulation routine that mimics the original intent of the guest OS, and then returns control to the guest machine.  Menascé characterizes this overhead in modeling terms as an execution delay, which it certainly is.  In the virtualization environment, each attempt to execute a privileged instruction by the guest OS is replaced by an interrupt, the execution of the VM Host interrupt handler, and, finally, the execution of the emulation routine.  The instruction path length associated with the function increases enormously. Unfortunately, the full extent of the associated delay is impossible to characterize accurately without measurements taken by the VM Host on the number of privileged instructions emulated.  Note that this important performance data is not currently provided by VMware.

One suggestion is that the overhead associated with emulating instructions that require privileged mode can be estimated from the amount of time the guest OS spends in kernel-mode (% Privileged Time). This suggestion is helpful, but woefully imprecise.  Very few of the instructions executing in kernel-mode are actually privileged instructions. Depending on the OS function being executed, we might be able to pursue this line of reasoning deeper.  Device-driver functions that issue instructions that reference physical addresses (not virtual ones) require Privileged mode to succeed.  Major OS functions related to I/O processing in general, including the Cache Manager, the Workstation and Server services, processing within the TCP/IP stack, and the kernel-mode http.sys driver in IIS 6.0, all make extensive use of physical addressing mode.

The performance of all functions that rely on memory buffers pinned in the Nonpaged Pool suffers in the virtualization environment, in some cases prohibitively so. Menascé [8] suggests that the cycles lost to privileged instruction emulation are at least partially offset by running the virtual machine on a correspondingly faster processor. This planning assumption is far too optimistic. If you are consolidating workloads running on previous generation hardware onto new machines where a virtualization engine is installed, the newer machines are likely to have a clock rate that is twice as fast. However, the virtualization cost of emulating guest OS privileged instructions is on the order of hundreds, or even thousands, of additional instructions to be executed, for each privileged instruction that fails.
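A back-of-the-envelope model shows why a doubled clock rate offers so little protection. The model is a deliberate simplification and an assumption of this sketch, not something measured: a fraction of the guest's instructions are privileged, each of those costs a fixed number of instructions to trap and emulate, and the new processor retires instructions faster by the clock ratio alone. The function names are hypothetical.

```python
def relative_elapsed_time(priv_fraction, emulation_cost, clock_ratio):
    """Elapsed time under virtualization relative to the old native machine.

    priv_fraction  -- fraction of guest instructions that are privileged
    emulation_cost -- instructions executed to emulate one privileged instruction
    clock_ratio    -- new clock rate divided by old clock rate
    """
    inflated_path = (1.0 - priv_fraction) + priv_fraction * emulation_cost
    return inflated_path / clock_ratio


def break_even_fraction(emulation_cost, clock_ratio):
    """Privileged-instruction fraction at which the faster clock is fully consumed."""
    return (clock_ratio - 1.0) / (emulation_cost - 1.0)


# Assumed: 1000 instructions per emulated instruction, clock twice as fast.
print(break_even_fraction(1000, 2.0))
print(relative_elapsed_time(0.01, 1000, 2.0))
```

Under these assumed figures, roughly one privileged instruction in a thousand is enough to cancel the entire benefit of the faster clock. At one in a hundred, the workload runs several times slower than it did on the older native hardware.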

Duplication of the I/O driver code. A third source of virtual machine management overhead is a doubling of the number of instructions to both initiate I/O operations and service I/O completion interrupts.  When a hardware-related device interrupt occurs, the native device driver code running in the VM Host software layer is driven initially.  Once the native device driver services the interrupt, the VM Host software must determine which guest OS initiated the request and how to map the physical request into the appropriate virtualized context.  Once this is accomplished, the VM Host software queues a virtual interrupt for the guest OS, which must then await dispatching by the VM Host guest machine Scheduler before the guest machine can detect that a device interrupt has occurred and process it. (This leads to delays in I/O interrupt servicing that are discussed in the section entitled “I/O interrupt delays.”)  Sites running I/O intensive applications have experienced dramatically long delays as a result.  When the device interrupt is received by the guest OS, its version of the interrupt handler is then dispatched to deal with it. Clearly, two similar sets of code are traversed, where only one set would be executed in a native run-time environment.

For Windows guest machines running under VMware, a good estimate of the additional CPU time that the VM Host software will require to process device interrupts on behalf of the guest machine is twice the sum of % Interrupt Time and % DPC Time, recorded at the Processor level when the system was running in native mode. The sum is doubled to account for the device driver code path that initiates I/O requests, which is not broken out of the % Privileged Time measurement. Presumably, the native device driver code that the VM Host runs to initiate I/O requests and process I/O interrupts has performance characteristics that are comparable to the native Windows device driver.
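The estimate amounts to a single line of arithmetic. The sketch below assumes the two Processor counters were averaged over a representative native-mode interval; the function name and the sample values are hypothetical.

```python
def host_io_cpu_estimate(pct_interrupt_time, pct_dpc_time):
    """Estimated extra host CPU (percent) for device I/O on behalf of one guest.

    Doubles the native-mode sum of % Interrupt Time and % DPC Time to allow
    for the I/O initiation path as well as interrupt servicing.
    """
    return 2.0 * (pct_interrupt_time + pct_dpc_time)


# Hypothetical native-mode measurements for one server.
print(host_io_cpu_estimate(3.0, 4.5))
```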

Sizing RAM. VMware must provide a shadow copy of every page table entry (PTE) that is present on each guest OS (note that page tables are built per process). The VM Host software must intervene to maintain the consistency of the shadow PTEs any time their status changes on the guest OS due to routine memory management.[21] In practice, the overhead of virtualized memory management can be a serious performance factor that is not limited to configurations that are memory-constrained.

In the case of sizing RAM, the memory requirements of a Windows guest can usually be reliably estimated by subtracting the Memory/Available Bytes counter from the size of RAM when the machine runs natively.  VMware advises allowing for an additional 32 MB of RAM per virtual machine, plus the VM Host software itself, which requires about 400 MB of RAM.
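A minimal sketch of this RAM arithmetic follows, using the 32 MB per virtual machine and 400 MB host figures quoted above as defaults. The function names and the guest measurements are hypothetical, and the result ignores any savings from page sharing.

```python
MB = 1024 * 1024


def guest_ram_mb(installed_ram_bytes, available_bytes):
    """Guest requirement: installed RAM minus Memory\\Available Bytes, measured natively."""
    return (installed_ram_bytes - available_bytes) / MB


def host_ram_mb(guest_requirements_mb, per_vm_overhead_mb=32, host_software_mb=400):
    """Total host RAM: guest requirements + per-VM allowance + VM Host software."""
    guests = list(guest_requirements_mb)
    return sum(guests) + per_vm_overhead_mb * len(guests) + host_software_mb


# Hypothetical: a 4 GB server showing 1.5 GB available, plus two smaller guests.
first = guest_ram_mb(4096 * MB, 1536 * MB)
print(first, host_ram_mb([first, 1024, 512]))
```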

VMware exploits the shadow PTE mechanism to wring some counter-balancing efficiencies from the virtual memory management process. Guest machines that have memory resident pages that are identical are able to share a single, memory-resident copy of the page. This feature is more noteworthy for the effort involved in identifying pages that are eligible to be shared than for the result, which is limited when very heterogeneous servers are consolidated.  In the case of homogenous workloads, Waldspurger in [19] reports an impressive level of savings, nearly 40% in the case where 10 guest machines all running Windows NT 4.0 were defined. Even in less homogenous workloads, in almost every case we have observed, the memory footprint of guest machines under ESX was smaller than the standalone RAM requirements of those machine workloads. Nevertheless, consolidating the workloads under a single OS image where possible remains the superior approach. In Virtual Server 2005, a full complement of RAM on the host machine must be provided to the guest OS or else the guest machine simply will not boot.

The VMM maintains a single set of page tables that the virtual address translation mechanism in the hardware recognizes, which essentially duplicates the virtual address mapping information that each guest OS must itself maintain in virtual memory.  It should go without saying that you would not want to configure a memory-constrained guest OS that had high paging rates.  VMware uses a technique called ballooning [19] where it injects a device driver into a guest OS that ties up large amounts of physical memory in order to force the guest OS to trim unused virtual memory.  Ballooning thus allows VMware to defer making page replacement decisions to the guest OS, which is in a far better position to make them intelligently. The ability of Windows Server 2003 to communicate directly with server applications that perform their own memory management (predominantly in support of I/O buffering) [15] suggests ballooning could be quite effective in this environment.  The VMware Host software can determine which memory locations the OS has freed up by examining the guest OS page tables.  Then VMware can re-distribute these available pages to other guest machines.

I/O interrupt delays. The foregoing catalog of the performance issues impacting the scalability of virtualization technology would not be complete without some mention of significant interrupt processing delays that can easily arise. Significant interrupt delays are likely whenever there are more virtual machines defined than there are physical processors to run them. The problem can grow acute when some of the virtual machine workloads are I/O bound. This potential impact should be treated as a capacity planning issue during consolidation and deployment.

In a virtualization environment, servicing a high priority device interrupt becomes a two-stage process.  The VM Host driver software services the native device interrupt on demand as a necessarily high priority operation that preempts any lower priority task that is currently dispatched.  One effect of this is that a guest machine that is currently dispatched can be interrupted by an I/O completion that was initiated by a different virtual machine. After the VM Host software services the interrupt, it queues a virtual interrupt to be processed by the initiating virtual machine.  The virtual machine waiting on the interrupt may also be waiting to be dispatched on the VM Host queue. The guest machine cannot service the interrupt until the next VMware Host Scheduler interval in which it is scheduled to run. Any dispatchability delays effectively increase the time it takes the virtual machine to process device interrupts. This interrupt processing delay can be considerable. It may be reduced by limiting the number of guest machines and by limiting the number of processors being dispatched by each guest (based on the individualized needs of the guest).

Past experience with virtualization technology in the mainframe world (see, for example [13]) shows that I/O bound virtual machine workloads are prone to a secondary effect if they endure extended periods when they are ineligible to be dispatched to service the interrupts they are waiting for. I/O interrupts that are queued to be serviced can only be processed when the virtual machine is finally dispatched. The effect is to stagger interrupt processing at the guest machine in a manner that leads to a skewed arrival rate distribution, a worst case that maximizes the queuing delays that are experienced.  Ultimately, this secondary effect proved so powerful that the dispatcher mechanisms in mainframe partitioning schemes had to be modified to counteract it. The same behavior is currently evident in the virtualization solutions available for the Windows platform.

The performance of virtual machines during I/O intensive operations like back-up illustrates the problem. Suppose the virtual machine running the back-up task is one of two virtual machines vying for a processor.  When the vm is able to execute, it usually has a backlog of I/O interrupts to service.  After the interrupts are serviced and the next I/Os in the sequence are initiated, there may be little or no other processor-oriented work that needs to be done.   The virtual machine idles its way through the remainder of the time slice that it is eligible to run. When its time slice expires, the virtual machine waits. Meanwhile, some of the I/Os that were initiated during its last cycle of activity complete, but they cannot be serviced because the vm is not eligible to be dispatched. At the next interval where the vm is dispatched, the cycle repeats itself. Compared to running native, the I/O throughput of the virtual machine is slashed by 50% or more.
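
The cycle described above can be reproduced with a small time-stepped simulation. This is a sketch under simplifying assumptions that are ours, not measurements: the back-up keeps a fixed number of I/Os outstanding, the device services one I/O at a time with a constant service time, and the virtual machine alternates between equal-length slices in which it is and is not dispatched. The function name and all parameter values are hypothetical.

```python
def completed_ios(total_ms, service_ms, depth, slice_ms, shared):
    """Count I/Os fully serviced by the guest in total_ms milliseconds.

    shared=False models native execution (the OS can always run).
    shared=True models a guest that is dispatched only every other slice.
    """
    queued = depth          # I/Os handed to the device, not yet complete
    pending = 0             # completions waiting for the guest to run
    remaining = service_ms  # time left on the I/O at the head of the queue
    done = 0
    for t in range(total_ms):
        running = (not shared) or ((t // slice_ms) % 2 == 0)
        if running:
            # The guest services every pending completion and immediately
            # issues a replacement I/O for each one.
            done += pending
            queued += pending
            pending = 0
        if queued > 0:
            # The device keeps working whether or not the guest is running.
            remaining -= 1
            if remaining == 0:
                queued -= 1
                pending += 1
                remaining = service_ms
    return done


native = completed_ios(1000, service_ms=5, depth=1, slice_ms=20, shared=False)
guest = completed_ios(1000, service_ms=5, depth=1, slice_ms=20, shared=True)
print(f"native: {native} I/Os, guest: {guest} I/Os, "
      f"ratio: {guest / native:.2f}")
```

With one I/O outstanding, the device sits idle for nearly the whole interval in which the guest is not dispatched, so throughput falls to roughly half of native. Raising `depth` lets the device drain a few more requests while the guest waits, but it cannot stay busy once the queue empties.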

The best way to minimize the impact of this scheduling delay currently is to configure an I/O bound workload so it has guaranteed access to at least one dedicated physical processor.

Performance expectations.

You are apt to be disappointed if you commence a server consolidation effort without some reasonable expectations about the performance of your applications as you start running them under virtualization. The previous section discussed some of the important architectural features of the virtualization software available for the Windows platform that have a major resource capacity impact. Focused primarily on the popular VMware ESX package, it identified virtualization overheads that need to be factored into an initial server sizing effort. In this section we focus on the performance impact of these architectural features.

  • Virtual machine guests are subject to either round-robin or time-weighted dispatching by the VM Host software on the physical processors assigned for their use. (In VMware ESX version 3, a guest machine can be assigned from one to four physical processors.) The VMware Scheduler overhead used to switch the processor between virtual machines is minimal, but scales linearly with the number of virtual machines that are defined.

For any given processor workload, instruction execution throughput is degraded in proportion to its scheduling weight and the contention for the physical processors from virtual machines.  VMware claims that ESX can detect when a virtual machine is idle, and will dispatch a different eligible virtual machine when it does so. This is critical to the performance of Windows guests because Windows does not issue a processor HLT instruction when it has no work to do [15].  

Using a periodic timer check, currently set to fire every 1.5 milliseconds by default, the ESX Host Scheduler has an opportunity to preempt an idle virtual machine, assuming it can detect the system state reliably. As reported in [15], the Windows Idle loop is a succession of no-op instructions issued from the HAL. On machines that support ACPI power management, the HAL transfers control from the Idle loop to a model-dependent processr.sys driver module to take appropriate action, which may include slowing down or powering off the processor. In practice, it is critical that VMware ensure that idle Windows guests do not waste processor cycles. A utility called the Idler Service is available for the Windows platform to help customers improve idle loop detection.

Using a series of benchmarks, Shelden tried to characterize ESX processor dispatcher queuing delay in [16], which he discovered could be substantial in ESX version 2.  He found, for example, that when a virtual machine was eligible to execute on two processors, VMware waited until it could dispatch the guest OS on both processors simultaneously.  This technique, which VMware calls co-scheduling, provides a consistent multiprocessor environment for the guest OS, but it also means potentially significant dispatching delays when there is contention among virtual machines for the processor.

Shelden’s results suggest that co-scheduling virtual machines causes significant performance degradation when there is CPU contention among the OS guests. Performance analysts need to be on the lookout for runaway application threads that are looping, something that sites can blithely ignore in many dedicated server application environments.  A guest machine running a thread stuck in a loop has the potential to create significant contention for any processors it shares with other guest machines.

  • The guest OS is run in User mode.  All attempts by the guest OS code to issue privileged instructions are trapped by the VM Host software and emulated. The additional interrupts that are generated (due to the failed instructions) and the emulation code that substitutes for the original instruction add significant path length to routine attempts by the guest OS to execute privileged instructions.

Server applications that rely on physical addresses are likely to be the most vulnerable to this delay. The mechanism used in Windows to make physical addresses available to device driver and other system modules is for them to allocate memory in the system’s Nonpaged Pool. This should make it relatively easy to identify applications that are subject to this delay. Any Windows server application that makes extensive use of physical addresses is potentially at risk.

Windows has a number of subsystems that remain in kernel-mode so that they can work directly with DMA controllers and utilize physical addresses allocated from the Nonpaged Pool. These include the MDL Cache interface (MDL stands for Memory Descriptor List) used by the file Server service and IIS. High performance Fibre Channel and SCSI device drivers also make extensive use of physical addressing and MDLs. When issued by a virtual machine guest running in User mode, the privileged instruction to disable paging in order to utilize physical addresses or to re-enable paging when it is safe to do so fails. The failed instruction must then be emulated by the VM Host software.

For example, the http.sys driver module introduced in IIS 6.0 can process Get Requests for static html objects entirely in kernel-mode. For the sake of efficiency, this process relies on instructions that access physical memory addresses directly. (The popular Apache web server application that runs primarily on Linux has comparable facilities.)  In IIS 6.0, TCP/IP, also running in kernel mode, queues an HTTP Get Request object to an http.sys worker thread.  The worker thread has access to a physical-memory resident look-aside buffer (known as the Web Service Cache) where fully rendered HTTP Response objects are cached.  If a cache hit occurs, the HTTP Response object can then be transferred from the cache to a physical memory-resident MDL output buffer for transmission by the NIC back to the requestor's IP address without leaving kernel mode. Extensive intervention by the VMware Host, including emulation of the failed privileged instructions, is required to complete these functions.

  • The guest OS code to initiate and service I/Os is executed twice, once by the guest OS and a second time by the native device drivers running on the VM Host. This slows down the execution of both disk and network operations.

This strongly suggests that I/O bound workloads with critical performance requirements – and that includes network I/O of all types – should be run in native mode, not as virtual machine guests.  Workloads that are not primarily I/O bound may still have periods where I/O performance is critical.  It may be necessary, for example, to migrate guest machine back-up operations to a VM Host native back-up operation if back-ups cannot complete in the window that is available. The performance trade-off here is that VM Host native back-up operations view the guest machine’s disk storage as a container file that must be backed up or restored in its entirety.  Application-level file back-ups, like those performed on SQL Server or Exchange databases, must be run on the platform that runs the application. Of course, this type of application server may fall into the category of I/O bound workloads that need to run natively anyway.

The larger concern for I/O bound workloads arises where there is processor contention among virtual machine guests and I/Os to the guest OS remain pending during relatively long periods when the virtual machine is not eligible to be dispatched, as discussed in the section entitled "I/O interrupt delays".

  • The balloon technique which forces page replacement back to the OS in various guest machines, along with the ability to share common pages among homogenous guests, makes memory management relatively efficient in the virtualization environment. Whatever additional memory that is required in the VM Host software and on VM guests to make virtualization work is usually more than offset by this bag of clever memory management tricks.
  • The simple virtual machine processor dispatching method used in VMware, along with the co-scheduling of processors to virtual machines defined to run on multiprocessors, leads to long delays in I/O interrupts processed by the guest OS. This, in turn, can lead to an artificial staggering in the rate of I/O initiation and completion. This secondary effect of the virtualization engine’s Scheduler can have a large negative impact on even modestly I/O bound workloads.

The cumulative effect of these negative performance impacts can be substantial. Shelden’s benchmark testing [16], which was designed to isolate the performance impact of the VMware ESX processor scheduler, showed significant degradation due to co-scheduling. In a test case where two virtual machines were active, each with two active threads, each capable of consuming approximately 55% of a processor in a conventional environment, overall utilization of each physical processor in the 2-way VM Host machine managed to reach only about 65% busy, instead of an expected 100% busy. One explanation is that the VM Host software was unable to detect correctly when a Windows processor is idling. Another possibility that Shelden explores is that processor co-scheduling requires that physical processors be assigned to virtual machines in tandem. When Shelden changed his simulation to use a Dispatch in Pairs scheduling algorithm, instead of the expected Dispatch When Ready, his simulation results were a significantly better fit to the actual data.  Experience in real world environments has demonstrated that reducing the number of defined processors (from 2 to 1) increases the efficiency of scheduling and therefore guest image performance.  This is discussed further in [1].
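
The "expected 100% busy" figure follows from simple arithmetic on the numbers quoted from [16]. The short calculation below restates it; the variable names are ours, and the idle-capacity figure is derived from the quoted utilization rather than taken from the paper.

```python
# Figures quoted from the benchmark description above.
virtual_machines = 2
threads_per_vm = 2
demand_per_thread = 0.55   # fraction of one processor each thread can consume
physical_processors = 2
observed_busy = 0.65       # measured utilization of each physical processor

# Four threads at 55% each ask for more than the two processors can supply,
# so a work-conserving dispatcher would be expected to keep both saturated.
total_demand = virtual_machines * threads_per_vm * demand_per_thread
expected_busy = min(1.0, total_demand / physical_processors)
idle_capacity = (expected_busy - observed_busy) * physical_processors

print(f"demand: {total_demand:.1f} processors")
print(f"expected busy: {expected_busy:.0%}, observed: {observed_busy:.0%}")
print(f"capacity left idle: {idle_capacity:.1f} processors")
```

In other words, demand exceeded supply by about 10%, yet roughly 0.7 of a processor went unused across the two-way host.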

In [9], the principal researchers behind the development of Xen compare and contrast the performance of native Linux, Xen, and VMware Workstation on a series of benchmarks. Figure 3 below summarizes three of the comparisons: a CPU stress test running specint; a database-oriented benchmark; and the SpecWeb99 test suite that is a very comprehensive test of web server capabilities (it stresses the CPU, the network and the disks). On the CPU-bound workload, VMware results keep pace. But on the database and web server workloads, VMware lags native performance considerably.  On the web server tests, the Xen developers observed, “VMware suffers badly due to the increased proportion of time spent emulating ring 0 code while executing the guest OS kernel.”

Figure 3. Benchmark results comparing native Linux, Xen, and VMware on three different workloads. From a paper written by the principal developers of Xen at Cambridge University. [9]

In a battle of the benchmarks, VMware has recently published its own set of head-to-head comparisons with Xen that shows a markedly different picture. [22] Showcasing the performance of the VMware-aware vmxnet network driver, [22] VMware reports ESX achieves performance on the Netperf benchmark at close to the level of a native OS, with Xen able to drive network traffic at less than 10% of the native configuration. Unfortunately, there are no published reports that provide an apples-to-apples comparison where both virtualization products are running in their optimal configurations.

Monitoring the virtualization environment.

Performance monitoring for virtual machines running under VMware ESX is complicated today because many of the familiar guest OS measurements become difficult to interpret and cannot be relied upon. It is necessary to augment guest OS monitoring procedures with measurements from the VM Host software, especially regarding the use of the processor by various guest machines. Fortunately, VMware ESX does provide some necessary processor utilization statistics, but it lacks measurements that can give you a clear picture of what is going wrong when performance issues surface.

One side-effect of the two-step process in which device interrupts are handled by the VM Host software is that the timing mechanisms on a Windows guest machine are not reliable.  This affects all timer-based measurements that any performance monitoring software running inside the Windows guest makes, including the % Processor Time utilization measurements at the Thread, Process, and Processor level and the Avg. Disk Secs/Transfer response time measurements for both Logical and Physical Disk.

The Processor utilization measurements in Windows are derived from samples taken once every clock interval. (See [15] for details.) The precision of the system clock is maintained as if ticks occur every 100 nanoseconds.  In reality, the system clock time is updated much more slowly, normally about once every 15 milliseconds. (In official Microsoft documentation, this duration between clock ticks is sometimes referred to as the periodic interval.)  The mechanism used to maintain the system time is a timer interrupt that is set to fire regularly every periodic interval.[1] When the timer interrupt occurs, the Windows OS advances the system clock. It also samples the state of the machine at the time of the interrupt and determines what thread from what process was running when the clock interrupt occurred. All of the processor utilization measurements at the thread, process, and processor level are based on this sampling technique.

The timer is a peripheral device on the Intel platform – it normally resides on one of the supporting chip sets external to the processor. The timer functions performed by the HPET and ACPI Timers require the equivalent of a very fast I/O operation to access the current clock value. Intel-compatible processors beginning with the Pentium do contain a Timestamp Counter (TSC) that reflects the internal frequency of the microprocessor and the TSC can be used very efficiently to time the duration of events. VMware ESX, in fact, uses the TSC for its measurements of processor utilization. But the TSC can only be used to tell time reliably relative to a single processor core. The TSCs on different processor cores in a multiprocessor are likely to be out of sync. In addition, on older Intel and AMD microprocessors with power-saving features, the TSC does not maintain a constant uniform clock rate when it is executing in reduced power mode.

The high priority clock interrupt that Windows relies upon to keep time internally and for all its measurements of processor utilization is subject to interrupt pending delays in a virtualization environment. The periodic interval that Windows relies upon to keep accurate time is subject to erratic behavior, as documented in [17]. Consequently, the processor utilization samples that Windows gathers are no longer uniformly distributed in time. This should not invalidate the measurements completely, especially when you are looking at long enough intervals with sufficient samples to minimize normal sampling error. However, Windows performance monitoring software does not have direct access to the number of processor utilization samples. The utilization metrics the software calculates are based on the assumption that the intervals between samples are uniform. Obviously, this is not correct in the virtualization environment. The resulting calculation of % Processor Time is subject to major error because VMware does not deliver virtual clock interrupts at uniform time intervals.
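
A toy model shows how far the sampled figure can drift. It assumes, purely for illustration, that clock ticks falling due while the guest is not dispatched are held and delivered back to back when the guest next runs, and that the monitoring software credits every tick with one full periodic interval of wall-clock time. The function name, the 60 ms slice, and the alternating dispatch pattern are hypothetical.

```python
def guest_percent_busy(total_ms, tick_ms, slice_ms, busy_when_running):
    """% Processor Time as a guest would compute it from clock-tick samples.

    The guest is dispatched only in even-numbered slices of slice_ms.
    """
    busy_ticks = 0
    backlog = 0  # ticks that came due while the guest was not dispatched
    for t in range(0, total_ms, tick_ms):
        running = (t // slice_ms) % 2 == 0
        if running:
            # Held ticks are delivered in a burst, followed by the current
            # one; every one of them samples the thread that is running now.
            delivered = backlog + 1
            backlog = 0
            if busy_when_running:
                busy_ticks += delivered
        else:
            backlog += 1
    # The calculation assumes each sample stands for tick_ms of real time.
    return 100.0 * busy_ticks * tick_ms / total_ms


reported = guest_percent_busy(6000, tick_ms=15, slice_ms=60,
                              busy_when_running=True)
print(f"guest reports {reported:.1f}% busy; "
      f"it held a physical processor about 50% of the time")
```

Under these assumptions a thread that is busy whenever its guest runs appears almost fully busy to the guest, even though the guest was dispatched only half the time. Other delivery patterns would skew the result differently; the point is only that bunched samples break the uniform-interval assumption.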

A number of knowledgeable observers have puzzled over the interpretation of the processor utilization measurements that rely on this timing mechanism when they are taken in a virtualization environment. They have had little success trying to make sense of the normally reliable processor utilization measurements provided by the Windows guest machine and even less correlating them with measurements taken by the ESX host software. See, for example, [5] and [16]. The problem is that it is difficult to know what to make of any of the Windows % Processor Time measurements (all counters of type PERF_100NSEC_TIMER are impacted) as calculated by the Windows guest OS. The way interrupts in general, the clock interrupt included, are stacked up waiting for service when the virtual machine is finally dispatched undermines the uniform sampling methodology that Windows relies on in its % Processor Time calculations.

Direct measurements taken by VMware ESX are necessary to augment the guest OS measurements. (Generous portions of ESX version 3 and its Infrastructure add-ons appear to be devoted to performance monitoring.) ESX currently provides an interface to allow programs running on the VM guest to pull performance metrics, including the Total CPU Seconds used by the VM (in milliseconds). At least one third party performance monitoring product currently supplies measurements of VM Guest CPU % Used and % Ready (purportedly, the percentage of time a guest OS was Ready to run on the VM Host, but was unable to). Consolidating this information with data from various guest machines is always a challenge, of course.

Even though Windows presents system timer values in ticks of a 100 nanosecond clock, the system time value is actually only adjusted every periodic interval. The actual granularity of clock intervals is about 15 milliseconds, as discussed above. This granularity proved too coarse for many performance-oriented measurements. So, beginning in Windows 2000, a new timer facility based on the TSC was introduced. This newer timing facility, also known as a high precision clock, is accessed using the QueryPerformanceCounter Win32 API call, somewhat inappropriately named because it is a general purpose interface. In Windows XP and Server 2003, the QueryPerformanceCounter Win32 API call issues an RDTSC instruction. In Windows Vista and the forthcoming Longhorn Server, QueryPerformanceCounter was changed to use the HPET instead.
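
The effect of a clock that advances only once per periodic interval is easy to see with a little integer arithmetic. The sketch below is illustrative only: the helper name and the 4 ms transfer are hypothetical, and the 15 ms tick is the approximate figure given above.

```python
def coarse_elapsed_ms(start_ms, end_ms, tick_ms=15):
    """Elapsed time as seen through a clock that only advances every tick_ms."""
    return (end_ms // tick_ms - start_ms // tick_ms) * tick_ms


# A hypothetical 4 ms disk transfer, observed at two different start times.
within_one_tick = coarse_elapsed_ms(2, 6)    # no tick boundary is crossed
across_a_tick = coarse_elapsed_ms(13, 17)    # crosses the boundary at 15 ms

print(within_one_tick, across_a_tick)
```

The same 4 ms event is reported as either 0 ms or 15 ms depending on where it falls relative to a tick, which is why events of this duration need a finer-grained timer.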

In VMware, the TSC is virtualized and the RDTSC instruction, which can only be issued in privileged mode, is emulated [17]. In that document, VMware warns:

Reading the TSC takes a single instruction (rdtsc) and is fast on real hardware, but in a virtual machine this instruction incurs substantial virtualization overhead. Thus, software that reads the TSC very frequently may run more slowly in a virtual machine. Also, some software uses the TSC to measure performance, and such measurements are less accurate using apparent time than using real time. In a virtualization environment, there is no guarantee that the I/O interrupt from the clock will be issued or serviced by the guest OS in a timely manner. [Emphasis added.]

The Windows performance measurements that are impacted by this anomaly are the Logical and Physical Disk % Idle Time counters that measure disk utilization, one of several disk performance counters of type PERF_PRECISION_100NS_TIMER. The accuracy of the Avg. Disk secs/Transfer counter that furnishes disk response time measurements is also suspect in a VMware environment. Both the Logical and Physical Disk measurement layers in the I/O Manager stack issue a call to QueryPerformanceCounter at the start and end of every I/O operation. Unfortunately, native ESX only measures disk and network throughput per virtual guest, which does not close the gap left by the loss of these important guest OS disk performance statistics.

The Future of Virtualization.

In 2005 both Intel and AMD announced strikingly similar new hardware designed with virtualization in mind.  The Intel initiative is called VT, while the AMD product is called Pacifica.  This new hardware first became available in 2006. In both vendors’ specifications, a new privilege level is added where the virtualization monitor runs. That will allow the virtualization monitor to run a guest OS at its normal ring 0 privilege level, rendering current privileged instruction emulation techniques obsolete.  The new hardware also includes some new instructions to allow the Virtual Machine Monitor to launch a guest virtual machine and facilitate communication between a guest OS and the VM Host software layer.

A new generation of virtualization software is required to exploit the new hardware.  All the leading vendors of virtualization software report that they are actively working on new versions that will support the new hardware.

Preliminary reports indicate the new hardware addresses some, but not all, of the performance problems that plague current virtualization solutions. Adams and Agesen in [21] offer this somewhat over-harsh judgment, “While the new hardware removes the need for BT [just-in-time Binary Translation] and simplifies VMM design, in our experiments it rarely improves performance.” Drilling down into the details of the series of micro-benchmark runs they report, which compare native performance to existing VMware software and to a VMware prototype that supports the new hardware, Adams and Agesen found that:

  • both virtualization approaches were equivalent in being able to execute compute-bound workloads at near native levels of performance

  • the hardware approach is notably superior on any workload that is rich in system calls because it is no longer necessary to emulate the privileged instructions issued

  • the software approach runs an order of magnitude slower during processing of page faults and the processing of other machine exceptions

  • the current VMware software approach outperforms the hardware-enabled prototype on I/O bound workloads, ones where processes are frequently created and destroyed, and ones with frequent address space context switches

Moreover, Adams and Agesen were discouraged to find that the 1st generation virtualization hardware support provides no mechanism to assist the Virtual Machine Monitor in maintaining a coherent set of shadow PTEs on behalf of guest OS machines, a major source of overhead currently. They advocate that the hardware vendors consider further extensions to their initial virtualization specification that would aid in memory management, similar to the Start Interpretive Execution (SIE) instruction that was added to the IBM 370 instruction set to boost the performance of preferred guest machines under its VM operating system. This is, in fact, something both AMD and Intel have done in defining a 2nd generation virtualization interface. [23, 24] For example, AMD’s Secure Virtual Machine specification provides:

  • faster switching between the VMM and guest OS, including a new VMMCALL instruction that can be used by a guest to call the VMM explicitly

  • the ability for the VMM to trap or intercept selected instructions executed in the guest OS, such as RDTSC or IOIO instructions to specific ports, and selected events, such as addressing exceptions

  • a Nested Paging Facility that provides for two levels of address translation, eliminating the need for the VMM to maintain shadow page tables

  • external (DMA) access protection for memory

  • assists for interrupt handling and virtual interrupt support

  • a guest/host tagged virtual address Translation Lookaside Buffer (TLB) to reduce memory management overhead

Once it is supported by systems software, the 2nd generation hardware environment would allow preferred guest machines to access native devices for better performance.

It should be apparent from this discussion that virtualization technology is in its infancy in the Windows environment. Server consolidation using current virtualization technology has a number of serious performance limitations. It needs to be deployed judiciously. If you are really interested in reducing the total number of OS images you are managing, consider consolidating workloads under a single OS image instead. In either approach, there is no substitute for a rigorous application of traditional performance monitoring and capacity planning techniques.

References.

[1]   Marksamer, S. and Weilnau, P., “Real World Adventures in Server Virtualization”, CMG Proceedings 2006.

[2]   Friedman, M., “The reality of virtualization for Windows Servers,” CMG Proceedings 2006.

[3]   Friedman, M., “Can Windows NT be tuned?” CMG Proceedings 1996.

[4]   Friedman, M. and Pentakalos, O., Windows 2000 Performance Guide, Boston, MA: O’Reilly Associates, 2002.

[5]   Fernando, G., “To V or not to V: a practical guide to virtualization,” CMG Proceedings 2005.

[6]   Friedman, E., “Tales from the lab: Best Practices in application performance testing,”  CMG MeasureIT, November 2005, available at http://www.cmg.org/measureit/issues/mit27/m_27_10.html.

[7]   Seawright, L., and MacKinnon, R., “VM/370 – a study of multiplicity and usefulness,” IBM Systems Journal, 1979, p. 4-17.

[8]   Intel Virtualization Technology Specification for the IA-32 Intel Architecture,  ftp://download.intel.com/technology/computing/vptech/C97063-002.pdf.

[9]   Barham, et al., “Xen and the Art of Virtualization,” University of Cambridge Computer Laboratory, published at SOSP 2003, available at http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2003-xensosp.pdf.

[10] “ESX server performance and resource-management for CPU-intensive workloads.” Available at www.vmware.com.

[11] Gunther, N., “The Virtualization Spectrum from Hyperthreads to Grids,” CMG Proceedings 2006.

[12] Menasce, D., “Virtualization: concepts, applications and performance modeling,” CMG Proceedings 2005.

[13] Young, D., “Partitioning Large Processors,” CMG Proceedings, 1988, p. 67-74.

[14] “VMware ESX Server 2: Architecture and performance implications.” Available at www.vmware.com.

[15] Friedman, M., Windows Server 2003 Performance Guide, a volume in the Microsoft Windows Server 2003 Resource Kit, Microsoft Press, 2005.

[16] Shelden, W., “Modeling VMware ESX performance,” CMG Proceedings 2005.

[17] “Time-keeping in VMware virtual machines,” Available at www.vmware.com.

[18] Robin, J.S., and Irvine, C.E., “Analysis of the Intel Pentium’s Ability to Support a Secure Virtual Machine Monitor,” Proceeding 9th USENIX Security Symposium, 2000. Available at http://www.cs.nps.navy.mil/people/faculty/irvine/publications/2000/VMM-usenix00-0611.pdf.

[19] Waldspurger, C., “Memory Resource Management in VMware ESX Server,” Proc. Fifth Symposium on Operating Systems Design and Implementation (OSDI ’02), Dec. 2002. Available at http://www.waldspurger.org/carl/papers/esx-mem-osdi02.pdf.

[20]  “Network throughput in a virtual infrastructure.” Available at http://www.vmware.com/pdf/esx_network_planning.pdf.

[21]  Adams, K. and Agesen, O., “A comparison of software and hardware techniques for x86 virtualization,” ASPLOS Oct. 2006. Available at http://www.vmware.com/pdf/hypervisor_performance.pdf.

[22]  “A performance comparison of hypervisors.” Available at http://www.vmware.com/pdf/hypervisor_performance.pdf.

[23]  Neiger, G., et al., “Hardware support for efficient processor virtualization,” Intel Technology Journal, Vol. 10, No. 3, August, 2006.

[24]  AMD64 Architecture Programmer’s Manual, Volume 2: System Programming.



[1] The timer is a peripheral device on the Intel platform – it normally resides on one of the supporting chip sets external to the processor. The timer functions performed by the HPET and ACPI Timers require an I/O operation to access the current clock value. Intel-compatible processors beginning with the Pentium do contain a Timestamp Counter (TSC) that reflects the internal frequency of the microprocessor and the TSC can be used very efficiently to time the duration of events. VMware ESX, in fact, uses the TSC for its measurements of processor utilization. But the TSC can only be used to tell time reliably relative to a single processor core. The TSCs on different processor cores in a multiprocessor are likely to be out of sync. On older Intel and AMD microprocessors with power-saving features, the TSC does not maintain a constant uniform clock rate.

 

Last Updated 03/20/07



Computer Measurement Group