Introduction
Server virtualization technology is currently emerging and taking hold of
commercial computing. Its ramifications and impact appear to be transformational,
and perhaps as important as the emergence of TCP/IP technology two decades
ago. In the future, deployment of servers without virtualization will probably
be considered as quaint as deployment of memory without virtual memory.
On the surface, server virtualization allows multiple operating systems to
execute on the same physical server. Many of the benefits provided by server
virtualization are discussed in [08]. But the implications of this capability
are far-reaching. Since the hardware platform is now virtualized, the entire
execution environment, including the operating system, the middleware and
applications, can be moved from one system to another without interruption.
The implication here is that “planned downtime” for hardware changes, which
is by far the largest factor in system unavailability, could be significantly
reduced. A running environment can be moved to another server, leaving the
old platform available for hardware changes.
The continuing debate about the benefits of various operating systems may
fade away. Virtual appliances are already starting to emerge where the image
of the executing system can be “popped” into a virtualized environment and
run without regard for the embedded OS. The provisioning of new virtual servers
will be as simple as reading a file into the virtualization environment and
enabling the virtual machine. The operational costs per OS image will be
reduced significantly.
The current virtualization technology, like TCP/IP, is just the first stage.
Derivative technologies (like the emergence of the Internet) will mark the
truly transformational nature of server virtualization.
Given all of this, we, as performance and capacity experts, need to have
a deep understanding of the performance implications of virtualization.
For example, what is the basic overhead of running a hypervisor on which the
OS images dwell? How does this overhead change as a function of the number
of virtual machines and physical CPUs? Can we accurately predict the effects
of queueing both at the physical and virtual CPU levels? What is the impact
of I/O activity? How about the impact of allocating specific quanta of CPU
cycles to each machine? How is performance affected by the selection of a
specific virtual technology? Finally, what are the “gotchas” that will emerge
as the glamour of the technology wears off and reality sets in?
This paper has been written to illuminate answers to these questions, as
well as provide a basic foundation for a performance model. The benchmarks
are described so that future research can duplicate our results and extend
our findings to create a larger body of knowledge. A number of articles have
been published previously reporting various findings. VMware performance is
discussed in [01], [09] and [10]. Xen performance measurement is discussed in
[06].
Although the proposed model is independent of the virtualization technology
and includes considerations for I/O overhead, the benchmark results in this
paper only present our findings regarding processor overhead for the VMware
virtualization solution. We expect to publish future papers that will cover
Xen virtualization as well as I/O-intensive benchmark results.
Model Definition
In order to provide a framework for our research, a simple performance model
was developed. Modeling is a cornerstone of science. It frames the discussion
and offers a focal point as well as providing realistic predictions of behavior.
A model should be simple and clearly described. It should offer a path so
that researchers can continue to add to the body of knowledge. Of course,
when there are a lot of unknowns, the model can only be a hypothesis that
needs to be proved through empirical measurement. Early models of the universe
proposed the Earth as the center. Empirical evidence eventually changed that
model. Models do not have to replicate the real world to be useful. By focusing
on specific attributes of the real-world system, simple models can be very
powerful. Consider the Periodic Chart. It does not reveal anything about
the physical trajectories of electrons, but it is extremely useful in describing
specific behaviors.
This is also true of the queueing network models used to predict computer
performance. They are deceptively simple, but have been successfully used
for four decades to predict computer performance in terms of application delay
times and queueing.
One of the earliest computer performance models was proposed by Kleinrock
[04] in 1964. It assumed that there were a number of “customers” in the model,
where a customer is today considered to be a workload that serially requests
services from resources. In his simple model, the requests arrived as a Poisson
process and then entered a queue from which a single CPU would process requests
in a First Come First Served (FCFS) manner. Each request received a quantum
of service, ∆s. This was then expanded to include a second cycle
where a “customer” would proceed to request service from an I/O device after
the completion of the CPU request. But the original systems did not have
infinite “customers”. In fact, in the beginning, an entire computer system
could only process one customer at a time in a manner we would consider as
batch-oriented processing. When interactive computing, using terminals, arose,
the total number of “customers” in the system was limited to the number of
terminals. The number of independent customers was referred to as the “multiprogramming”
level. A number of mathematicians worked on these models, but the clearest
breakthrough occurred with the “Central Server Model”, as developed and described
by Buzen [02]. The graphic below shows the model. It assumes a finite multiprogramming
level, MP, where requests cycle endlessly through the system. After
using the CPU, a request has probabilities P1, P2, …, Pn
of requesting a specific disk for service. After service, it again returns
to request another quantum of service from the CPU until the full processing
request is completed. At that time, there is a delay where the person at
the terminal waits a random amount of time and then requests service (the
wait time is not shown in the graphic).

The Central Server Model
In this model, a very small number of parameters were used to predict performance
as a function of the multiprogramming level, the quanta of CPU service requests,
the probabilities of using devices and the wait time in the devices. As technologies
made the computing environments more and more complex, this seminal model
successfully endured and was used to size the largest computer complexes throughout
the world. For our investigation of server virtualization, we focused first
on the impact of virtualization on overall delays as a function of processor
time, queueing and virtualization overhead. We started with a well-defined
CPU-intensive benchmark that normally runs in a few minutes, completes and
reports its time. The benchmark will be described in more detail in the next
section. The graphic below shows a simplification of the central server model
where the I/O activity is restricted to two parameters. After using the CPU,
there is a probability of an I/O (P1), and, if the I/O occurs,
I/O1 is the average delay time for the I/O.

Simplified Central Server Model
In essence, a CPU-intensive benchmark will start at the left and continually
cycle around, using the CPU and with zero probability of an I/O until the
benchmark is completed and then it will exit at the right.
Now, we need to step back for a second and understand the role of the operating
system in all of these performance models. First, there is no physical entity
in the hardware called the “CPU Queue”. The CPU queue is a software queue,
maintained by the operating system, in which CPU requests are managed.
If there is only one CPU-intensive workload,
with no I/O interruptions, the benchmark’s process will continually use the
CPU until it completes. There will be occasional interruptions for OS services,
but they should be minimal on a system that is devoted to the benchmark.
Once we introduce multiple processes into the model, we need to determine
how the OS will allow these processes to share the CPU. This is accomplished
by establishing a quantum time that each process can use the CPU. The Linux
2.6 kernel uses a base time quantum of 100 ms for default static priority
processes. Therefore, if there are two CPU-intensive processes sharing one CPU, they will
each use 100 ms of CPU alternately until they complete. If a single instance
of the benchmark completes in 100 seconds, we could expect two instances of
the benchmark, running in parallel, to equally share the CPU and therefore
each take 200 seconds to complete.
With server virtualization, the circle in the graphic that represents CPU
service is replaced by the virtual server. The graphic below shows the detail
within the original circle.

Central Server Model – With Virtual Server
The original CPU queue is replaced by the virtual machine queue, vmq1.
A request is then routed to the virtualization layer that is now maintaining
the queue for the physical CPU, pmq. Note that, for multi-CPU systems,
the single queue can be serviced by one of many physical CPUs that together
provide the physical machine resource, pm. The quantum amount of service
within the virtualization layer is smaller than a typical operating system
to avoid time-out problems at the OS level. Although the actual quantum value
is not published, anecdotal information points to a 20 ms quantum.
This has not been verified. So a single request from the OS can result in
a number of smaller requests within the virtualization layer. An additional
two parameters are introduced to handle virtualization attributes such as
quotas or limits that are specified regarding the percentage of total CPU
that a single virtual machine can use. For example, it can be specified that
only 50% of the CPU will be used by a virtual machine. Therefore, we need
to introduce the probability that, after completion of CPU service, the request
will be put in a wait state by the virtualization layer. P11
is that probability. W1 is the average wait time.
Even with a single virtual machine, there are actually two workloads on the
CPU. L1 is the virtual machine workload and L0
is the workload imposed by the virtualization software itself. It is assumed
that CPU requests to manage the virtualization will take higher priority and
preempt the virtual machine requests.
The following graphic shows how two virtual machines would share the physical
machine resources. This could be extended for any number of virtual machines.

Central Server Model – Two Virtual Machines
A series of benchmarks were run and analyzed to determine how these parameters
can be estimated, along with validating the accuracy of the overall model.
Benchmark Description
For the benchmarks reported in this paper,
we focused on purely CPU activity. All of the literature that we have seen
implies that CPU activity has the least amount of virtualization overhead,
compared to I/O activity such as networking and storage activity. But I/O
overhead can only make sense once the CPU overhead is understood.
We chose an industry-standard benchmark,
SPECint, and selected the first of the benchmark tests, the 164.gzip test from
SPEC CPU 2000 v1.2. gzip (GNU zip) is a popular data compression program.
The benchmark performs no file I/O other than reading the input; all compression
and decompression happens entirely in memory.
The following description, which explains
the performance metric, can be found at:
http://en.wikipedia.org/wiki/SPECint
SPEC defines a base
runtime for each of the 12 benchmark programs. For SPECint2000, that number
ranges from 1000–3000 seconds. The timed test is run on the system, the time
of the test system is compared to the reference time, and a ratio is computed.
That ratio, multiplied by 100, becomes the SPECint score for that test.
As an example, we'll use the Sun Microsystems Fire V20z, with an AMD Opteron 252 CPU,
running at 2600 MHz. Let's use the 164.gzip benchmark, which has a 1400 s reference
time. We time how long it takes the 164.gzip benchmark to execute on our V20z
test system, and find it takes 90.4 s. 1400/90.4 * 100 = 1548. Thus our 'base
ratio' is quoted as '1548'. There are no units.
More information is available regarding
SPEC benchmarks in [03]. The benchmark that we used actually runs the test
3 times and then reports its elapsed time. The elapsed time is not the same
as the SPECint score, but it is extremely useful for our analysis. By accurately
measuring the elapsed time, we can then determine the CPU overhead for virtualization
as the system becomes over-committed and the benchmark’s CPU requests are
forced to wait for competing requests from the other VMs.
Hardware Configuration
The system used to run all benchmark tests
was a Unisys ES7000-540-G3 containing eight 3GHz Intel® Xeon® CPUs with hyper-threading support.
Intel hyperthreading technology allows a single processor to execute two independent
threads simultaneously by sharing resources between the threads. The sharing
of resources limits the increase in throughput to a level much lower than
would be achieved with two separate processors. Since the virtualization
layer can use each hyperthread as a virtual machine, hyperthreading was disabled
to allow consistent scaling measurements as additional physical processors
were added. The benchmark test system was configured with 32GB of memory
and approximately 60GB of internal SCSI storage, which consisted of two 36GB disks
striped as RAID 0.
The system is managed by a separate service
system which runs Unisys’ Server Sentinel management software. The service
system is capable of monitoring the health of the ES7000 and performing configuration
tasks. The ES7000 was configured to use only two processors for all of the
benchmark runs performed for this investigation. This allowed for the creation
of many more virtual machines than physical processors with a relatively small
number of tests.
Software Configuration
The OS used was Linux, based on Fedora Core 5. The benchmark was run under two separate
scenarios:
- “Bare Metal” - Red Hat Linux Fedora Core 5 2.6.15-1.2054_FC5smp
- VMware virtual environment - VMware ESX 3.0 RC1 (build 23447) with Red Hat Linux Fedora Core 5 2.6.15-1.2054_FC5 within the individual VMs
In both cases, all non-essential services were
turned off to eliminate unnecessary overhead in the operating system, which
resulted in a very quiet system, as will be described in the next section. The
benchmark, 164.gzip from SPEC CPU2000, was compiled using GCC 4.1.0.
Benchmark Results – “Bare Metal”
To get a baseline, we measured the performance
of gzip running without virtualization. The Linux OS that we used was extremely
“quiet”. We measured the overhead on a quiescent system, i.e., without
the benchmark executing. The CPU utilization, which can range from 0% to
100%, averaged over the 2 CPUs, was:
0.286% ± 0.033%
Note that all confidence intervals presented
in this paper are calculated using a confidence level of 0.95.
We started with one instance of the benchmark
and then increased the instances to 2, 3, 4, 6, 8 and 12.
The following table shows the raw elapsed
times for the 7 benchmarks, which include the OS overhead.
| Instances (bare metal) | Raw Elapsed Time (s) | Conf Int |
|---|---|---|
| 1 | 210 | |
| 2 | 215 | |
| 3 | 302 | ±173.3 |
| 4 | 430 | ±2.3 |
| 6 | 649 | ±2.0 |
| 8 | 868 | ±1.3 |
| 12 | 1308 | ±2.3 |
Table #1
The difference between the first and second
benchmark cannot be totally explained by the CPU overhead. We suspect that
the increase is caused by fewer cache hits within the CPU cache since two
different workloads are sharing the cache. This has not been verified yet.
Next, we noticed that the 3-instance benchmark had significant variance –
the first process that started completed in less time than the other two.
These were the raw results:
First instance 223
Second instance 326
Third instance 356
The confidence intervals were calculated
using each instance’s results as a sample within the benchmark.
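As a worked check, the ±173.3 interval shown for the three-instance run in Table #1 can be reproduced from the three raw results. This sketch assumes a two-sided Student's t interval with n − 1 = 2 degrees of freedom at the 0.95 level (critical value 4.303); the method is an assumption, since it is not named in the text.

```latex
\bar{x} = \frac{223 + 326 + 356}{3} \approx 301.67,
\qquad
s = \sqrt{\frac{\sum_{k=1}^{3} (x_k - \bar{x})^2}{3 - 1}} \approx 69.76

\bar{x} \pm t_{0.975,\,2}\,\frac{s}{\sqrt{n}}
  = 301.67 \pm 4.303 \times \frac{69.76}{\sqrt{3}}
  \approx 301.67 \pm 173.3
```

The wide interval follows directly from having only three samples, one of which (223) is far from the other two.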
Given the nature of this benchmark, the
queueing time experienced by the multiple instances of the benchmark is quite
easy to estimate. For two instances, given 2 CPUs, the amount of queueing
is near zero. For 3 instances, the following is used for the estimate.
After an instance gets its quantum of
CPU time, it will go back into the queue. At that time, the other two instances
will be busy using the two CPUs. One instance will have just started while
the other instance will have some random residual time. Since the quanta
are a constant number, we can expect that the residual time will be the mean
of a uniform distribution, or half the quantum time. Therefore, the average
queueing time will be half of the total CPU time for an instance. If qn
is defined as the average number in the queue, qn = 0.5.
We believe that the disparate results
we observed were probably due to the fact that the three benchmarks were really
not entering the queue in a random order. The first instance always seemed
to have a “head start” over the others. Once we ran more than three instances,
the observations were more predictable.
For 4 instances, we can expect the queueing
time to be equal to the CPU elapsed time, or qn = 1. For
six, it will be twice (qn=2) and so on.
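In general, we can expect the elapsed
time, as a function of the number of instances, to be:
$$ET_i = ET_2 \times (1 + q_n)$$
Given the above simplified queueing process,
we have:
$$q_n = \frac{i - P}{P}$$
where i is the number of instances,
P is the number of processors (in this case, 2),
i ≥ P, and ET_2 is the elapsed time measured with i = P = 2 instances.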
The following table shows the estimate
compared to the measured statistics.
| Instances | Measured (s) | Estimated (s) |
|---|---|---|
| 1 | 210 | |
| 2 | 215 | 215 |
| 3 | 302 | 322 |
| 4 | 430 | 430 |
| 6 | 649 | 645 |
| 8 | 868 | 860 |
| 12 | 1308 | 1290 |
Table #2
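The Estimated column can be reproduced with a few lines of C. This is a minimal sketch, assuming the elapsed time scales as the two-instance time multiplied by (1 + qn), with qn = (i − P) / P, which agrees with the qn values of 0.5, 1 and 2 given above for 3, 4 and 6 instances. The function names `queue_length` and `estimate_elapsed` are illustrative and do not come from the benchmark code.

```c
#include <assert.h>

/* Average number of instances waiting in the CPU queue, q_n = (i - P) / P,
 * for i CPU-bound instances on P processors. Only valid for i >= P. */
double queue_length(int instances, int processors)
{
    assert(processors > 0 && instances >= processors);
    return (double)(instances - processors) / (double)processors;
}

/* Estimated elapsed time: et_p * (1 + q_n), where et_p is the elapsed time
 * measured with as many instances as processors (215 s on bare metal). */
double estimate_elapsed(double et_p, int instances, int processors)
{
    return et_p * (1.0 + queue_length(instances, processors));
}
```

With `et_p = 215.0` and `processors = 2`, the function returns 322.5, 430, 645, 860 and 1290 for 3, 4, 6, 8 and 12 instances, in line with Table #2.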
The predictable behavior of the benchmark
and the simplicity of the elapsed time estimate gave us a firm foundation
from which to observe the virtual machine benchmark results.
Benchmark Results – VMware ESX 3.0
First, the overhead of running a virtualization
layer was measured. We included in this overhead the overhead of the Linux
OS that was running within the VM (which was very small).
The overhead for virtualization and the
OS running in the VMs was measured on a quiescent system after each of the
1-, 2-, 3-, 4-, 6- and 8-instance benchmarks was run. Table #3 shows the measured
CPU utilization for these benchmarks, along with the confidence intervals.
The results were fitted to the general linear model using the LINEST function
in Excel:
$$y = a + b\,x$$
where x is the number of instances and y is the overhead in percent CPU utilization.
We found
a = 1.064
b = 1.022
The Estimate column in Table #3 indicates
the line predicted with these parameters. The coefficient of determination
for this linear estimate was:
R² = 0.975819
which shows a high correlation between
the estimated and observed values.
| Instances | Overhead (% CPU) | Conf Int | Estimate (% CPU) |
|---|---|---|---|
| 1 | 1.68 | ±0.09 | 2.09 |
| 2 | 3.19 | ±0.17 | 3.11 |
| 3 | 4.00 | ±0.11 | 4.13 |
| 4 | 5.94 | ±0.92 | 5.15 |
| 6 | 7.11 | ±0.15 | 7.20 |
| 8 | 8.99 | ±0.52 | 9.24 |
Table #3
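For readers without a spreadsheet, the same straight-line fit can be sketched in C with ordinary least squares, which is the calculation LINEST performs for a single predictor. The `linear_fit` struct and `fit_line` function are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Result of fitting y = a + b * x by ordinary least squares. */
struct linear_fit {
    double intercept;   /* a */
    double slope;       /* b */
    double r_squared;   /* coefficient of determination */
};

struct linear_fit fit_line(const double *x, const double *y, size_t n)
{
    assert(n >= 2);

    /* Means of x and y. */
    double mean_x = 0.0, mean_y = 0.0;
    for (size_t k = 0; k < n; k++) {
        mean_x += x[k];
        mean_y += y[k];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* Sums of squares and cross products about the means. */
    double sxx = 0.0, sxy = 0.0, syy = 0.0;
    for (size_t k = 0; k < n; k++) {
        double dx = x[k] - mean_x;
        double dy = y[k] - mean_y;
        sxx += dx * dx;
        sxy += dx * dy;
        syy += dy * dy;
    }
    assert(sxx > 0.0 && syy > 0.0);

    struct linear_fit fit;
    fit.slope = sxy / sxx;
    fit.intercept = mean_y - fit.slope * mean_x;
    fit.r_squared = (sxy * sxy) / (sxx * syy);
    return fit;
}
```

Applied to the Instances and Overhead columns of Table #3, this gives a slope of about 1.022, an intercept of about 1.06 and an R² of about 0.976. Small differences from the reported values are to be expected, because the table entries are rounded.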
Obviously, the overhead increased with
the number of guests, even if they were not running any processes.
We then ran a series of benchmarks similar
to the “bare metal” ones, except now each instance was run in a virtual machine.
Table #4 shows our first results.
| Instances | Elapsed Time (s) | Estimate (s) |
|---|---|---|
| 1 | 222 | |
| 2 | 236 | 236 |
| 3 | 354 | 354 |
| 4 | 460 | 472 |
| 6 | 469 | 708 |
| 8 | 473 | 944 |
Table #4 - Miraculous Results
These unexpected results gave us pause.
The benchmark was somehow finding spare CPU cycles that didn’t exist. After
further examination, we found the following situation.
The benchmark itself reported its elapsed
time by calling a function to find the system time at both the beginning and
end of the benchmark. The elapsed time reported by the benchmark was less
than the wall-clock elapsed time. We hypothesize that, due to the
unrelenting CPU consumption by the benchmarks, the virtualization layer was
unable to keep the guest clock updated with the virtual CPU clock ticks. This phenomenon
is mentioned in [10] and [11], but we feel that this type of CPU workload severely
exaggerates the situation.
To work around this problem, we replaced
the call in the benchmark with a call to the daytime port on another Linux
server. The code replacement was implemented in the following way:
procedure gettime() in tools/src/specinvoke.c:
int mygettime(void);    /* implemented in mygettime.c */

void gettime(spec_time_t *t) {
    struct timeval tv;
    struct timezone tz;
    /* Need to make sure that everybody can take NULL for tz */
    // Original code commented out
    // gettimeofday(&tv, &tz);
    // t->sec = tv.tv_sec;
    // t->nsec = tv.tv_usec * 1000;
    t->sec = mygettime();   // calls external server daytime
    t->nsec = 0;            // only need seconds
}
In Makefile.in, the specinvoke make target (to compile and link in mygettime.c):
specinvoke: specinvoke.o getopt.o mygettime.o
	$(CC) $(LDFLAGS) -o $@ specinvoke.o getopt.o mygettime.o
New function mygettime() in file mygettime.c:
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>     /* atoi */
#include <string.h>     /* memset */
#include <unistd.h>

/* Returns the time of day in seconds since midnight, as reported by the
   daytime service on an external host. Returns 0 on failure. */
int mygettime(void) {
    int sockfd;
    int len, result;
    struct sockaddr_in address;
    char buffer[128];
    char hh[3];
    char mm[3];
    char ss[3];
    int hours, mins, secs;
    int totalsecs;

    totalsecs = 0;  // plan for the worst
    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd == -1) {
        perror("problem with: getdate");
        return(0);
    }
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_port = htons(13);                            // use daytime service
    address.sin_addr.s_addr = inet_addr("xxx.xxx.xxx.xxx");  // on external host
    len = sizeof(address);
    result = connect(sockfd, (struct sockaddr *) &address, len);
    if (result == -1) {
        perror("problem with: getdate");
        close(sockfd);
        return(0);
    }
    // leave room for the terminating '\0'
    result = read(sockfd, buffer, sizeof(buffer) - 1);
    if (result > 19) {  // characters 12..19 must be present
        buffer[result] = '\0';
        // the reply is expected to carry hh:mm:ss at characters 12..19
        hh[0] = buffer[12];
        hh[1] = buffer[13];
        hh[2] = '\0';
        mm[0] = buffer[15];
        mm[1] = buffer[16];
        mm[2] = '\0';
        ss[0] = buffer[18];
        ss[1] = buffer[19];
        ss[2] = '\0';
        hours = atoi(hh);
        mins = atoi(mm);
        secs = atoi(ss);
        totalsecs = 60*60*hours + 60*mins + secs;
    }
    close(sockfd);
    return(totalsecs);
}
…….
Note that, by hypothesizing a model and
using that model to evaluate observations, we were quickly alerted to a problem.
Once the code was changed, we measured
the following results.
| Instances | Elapsed Time (s) | Conf Int |
|---|---|---|
| 1 | 219 | |
| 2 | 229 | |
| 3 | 346 | ±1.13 |
| 4 | 468 | ±2.39 |
| 6 | 719 | ±2.92 |
| 8 | 969 | ±3.03 |
| 12 | 1514 | ±3.64 |
Table #5
The behavior of the three-instance benchmark
was much more consistent and predictable than we observed with the bare metal benchmark.
For the next table, two estimates are
shown. Estimate1 was calculated using the same formula as used in the non-virtualized
case.
| Instances | Measured (s) | Estimate1 (s) | Estimate2 (s) |
|---|---|---|---|
| 1 | 219 | | |
| 2 | 229 | 229 | 234 |
| 3 | 346 | 344 | 354 |
| 4 | 468 | 458 | 477 |
| 6 | 719 | 687 | 729 |
| 8 | 969 | 916 | 991 |
| 12 | 1514 | 1374 | 1542 |
Table #6
Obviously, the impact of multiple VMs
needed to be included in the estimate.
The following estimate is proposed:
$$ET_i = ET_2 \times (1 + q_n) \times (1 + O \times i)$$
For O, we use the slope of the linear estimate
that we fitted earlier to show how overhead on a quiescent system is a
function of the number of instances.
Therefore, O was set to 0.01022,
or a 1.022% increase in overhead per instance.
Estimate2 in the above table shows that
this approach approximates our observations.
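The Estimate2 column can be sketched in C as follows. The form used here, the Estimate1 value multiplied by (1 + O × i), is inferred from the figures in Table #6 rather than taken from a stated formula, so it should be read as an assumption. It reproduces the Estimate2 entries to within about one second. The function name `estimate_elapsed_vm` is illustrative.

```c
#include <assert.h>

/* Elapsed-time estimate for i CPU-bound virtual machines on P processors:
 *   et_p * (i / P) * (1 + overhead_per_vm * i)
 * et_p is the elapsed time measured with i == P (229 s under VMware here);
 * overhead_per_vm is the fractional quiescent overhead added per VM
 * (0.01022 in this paper). Only valid for i >= P. */
double estimate_elapsed_vm(double et_p, int instances, int processors,
                           double overhead_per_vm)
{
    assert(processors > 0 && instances >= processors);
    double queueing = (double)instances / (double)processors;   /* 1 + q_n */
    double overhead = 1.0 + overhead_per_vm * (double)instances;
    return et_p * queueing * overhead;
}
```

Setting `overhead_per_vm` to 0 recovers Estimate1, which makes it easy to see how much of the measured elapsed time the per-VM overhead term accounts for.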
Benchmark Conclusions
The above section provides a rigorous
look at a very simple benchmark. Any benchmark will have positive and negative
attributes. The negative attribute for gzip is that it does not represent
interactive, data-intensive workloads or, for that matter, a workload with
any I/O at all. Also, the code in the benchmark does not stress the VMware
emulation for privileged code, which would incur additional overhead. We
expect, in future benchmarks, to see that Xen and other “hypervisors” do not
incur additional overhead for privileged instructions, but we cannot quantify
these assumptions at this time.
The positive attributes are that gzip
is an industry-wide standard benchmark. Its definition is unambiguous and
it has been benchmarked throughout the world to understand CPU performance.
Another attribute is that it is highly repeatable, with consistent results.
Before benchmarking more elaborate workloads,
we felt it imperative to get good, concise estimates for CPU behavior. As
can be seen from Table #6, we have succeeded in the construction of a very
simple analytical model to predict the total elapsed time (CPU and queueing
time) of pure CPU workloads with 2 parameters. These are:
- ET2 – the performance of the benchmark with the number of VMs equal to the number of CPUs
- O – the linear overhead seen in a quiescent system as the number of VMs is increased
Using VMware to Monitor Performance
Appendix A provides a set of screen shots
showing how performance was monitored for our VMware benchmarks. The VMware
Virtual Infrastructure management software includes a Client with which we
monitored the physical and virtualized resources remotely from the machine
on which we executed the benchmarks.
Figure A1 shows the selection of the performance
statistics of the physical resources that we wanted to export. Figure A2
shows how to select detailed statistics for the CPU resource.
Figure A3 shows a screen shot that was
taken during our testing. Note that each of the two CPUs is shown, along
with the consolidated view. All CPU utilizations are normalized to 100%.
Figure A4 shows the VMs and their current
state (including CPU and memory utilization).
All performance statistics can be exported
into a comma-delimited file, which was used to collect the detailed statistics
from which we reported the averages and confidence intervals.
Conclusions
Virtual server technology is an emerging
technology that we believe to be transformational. It is imperative that
we establish a firm grounding for measuring and understanding its performance
behavior. As usual, accurate benchmarking and measurement took longer
than expected. Our order of priority was to establish measurements for CPU-intensive
behavior first, since all other measurements rest on the accuracy of predicting
this overhead. We started with VMware and will in the future evaluate Xen
as well as other hypervisors.
Once we have established the benchmark
techniques and verified the results, we will turn to benchmarking the overhead
of I/O – both storage and networking. We suspect that the overhead for virtualization
of I/O will be quite high. Some non-verified estimates are that, for I/O-dominated
workloads, the CPU overhead could climb to 50% of the system. But this is,
so far, anecdotal.
We have tried to accurately show our measurements
and describe how they were achieved. We invite other researchers to verify
and expand our collective body of knowledge in the brave new world of server
virtualization.
Bibliography
[01] Bolker, Ethan and Ding, Yiping, “Virtual performance won’t do: Capacity planning for virtual systems”, CMG 2006.
[02] Buzen, J. P., “Computational Algorithms for Closed Queueing Networks with Exponential Servers”, Commun. ACM 16 (9), September 1973.
[03] Henning, John L., “SPEC CPU2000: Measuring CPU Performance in the New Millennium”, Computer, July 2000.
[04] Kleinrock, L., “Analysis of a Time-Shared Processor”, Naval Res. Logist. Quart. 11 (1964).
[05] MacDougall, M. H., Simulating Computer Systems, The MIT Press, 1987.
[06] Menon, A. et al., “Diagnosing Performance Overheads in the Xen Virtual Machine Environment”, VEE ’05, 2005.
[07] Mesquite Software, User’s Guide: CSIM19 Simulation Engine, 2006.
[08] Salsburg, Michael, “Virtually Everything”, CMG Proceedings 2003.
[09] Sheldon, W. L., “Modeling VMware ESX Server Performance”, CMG 2006.
[10] Weilnau, P., “Measuring Up for Server Virtualization”, CMG 2006.
[11] VMware (no listed authors), Timekeeping in VMware Virtual Machines, http://www.vmware.com/pdf/vmware_timekeeping.pdf