A Proposed Benchmark for Virtualized Systems

September, 2007
by Michael A. Salsburg, D. V. (Doctor of Virtualization)

About the Author
Michael A. Salsburg, Ph.D., Technology Director, Unisys Corporation

Michael is responsible for the strategic direction of the overall server product line, with a focus on System Performance and overall Architecture.

He received his Bachelor of Science in Mathematics (magna cum laude) from the University of Pittsburgh in August 1972. He received his Master of Science in Computer Science (Software Engineering and Artificial Intelligence) from the University of Delaware in 1982. He received his Doctor of Philosophy in Mathematics (Probability and Statistics) from Drexel University in 1992.

He has been awarded two international patents in the area of performance modeling algorithms and software. In addition, he has published dozens of papers and has lectured world-wide on the topic of computer performance evaluation. His current interest is in the area of High Performance Computing.

In June, I presented an article, "The Battle of the Brands". It discussed a number of individual benchmarks that were run to compare VMware to Xen. That was fun reading, wasn't it? Well, the good doctor is, as he writes this, winging his way to the site of CMG 2007 in San Diego for official CMG business and "fun in the sun". This leaves plenty of time for reading and rumination. The article I have just read is called "VMmark: A Scalable Benchmark for Virtualized Systems". It was written by a number of technicians within VMware and is Technical Report VMware-TR-2006-002. It was published in September, 2006 (before the incendiary benchmark results that were discussed in "The Battle of the Brands").

Well, this older VMware paper is not as entertaining as the one published a few months later, but it does propose a reasonable way to benchmark the efficiency of server virtualization. These folks would, of course, like the industry to follow their lead. And the good doctor does not see any reason why other virtualization companies would object. The proposed benchmark appears to be realistic and a good start at understanding how multiple workloads will behave together in a virtual environment. This article presents the main ideas from the paper, but you are welcome to refer to the original at:

http://www.vmware.com/pdf/vmmark_intro.pdf

Workload Tiling

What makes this type of benchmarking different from any before it is that multiple operating systems are running simultaneously on the same server. Each operating system runs within its own virtual machine, and a group of these virtual machines is referred to as a workload "Tile". Prescribed benchmarks are run within each tile. Each benchmark has a single-value "score" that is measured. If more tiles can be run without slowing down the existing tiles, then the score increases. Even though each tile does not use 100% of the CPU, the aggregate of workload tiles is intended to use all available CPU cycles. If one of the workload tiles has a lot of disk wait time, then a higher performance disk can be added to improve the score of that tile, but this may change the score of other tiles. Get the idea?
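
The tiling idea can be illustrated with a toy model. The Python sketch below is not taken from the VMmark paper. The function names, the assumption that per-tile scores are added together, and the assumption that tiles slow down equally once the server is saturated are all hypothetical, chosen only to show why the overall score would climb as tiles are added and then level off.

```python
def per_tile_score(n_tiles: int, capacity_in_tiles: int) -> float:
    """Score of one tile, where 1.0 means the tile runs unimpeded.

    ASSUMPTION (toy model, not from the paper): the server can run
    `capacity_in_tiles` tiles at full speed; beyond that, all tiles
    slow down equally because they share the same CPU cycles.
    """
    if n_tiles <= capacity_in_tiles:
        return 1.0
    return capacity_in_tiles / n_tiles


def overall_score(n_tiles: int, capacity_in_tiles: int) -> float:
    """ASSUMPTION: the overall score is the sum of the per-tile scores."""
    return n_tiles * per_tile_score(n_tiles, capacity_in_tiles)


if __name__ == "__main__":
    # Hypothetical server that saturates at four tiles.
    for n in (1, 2, 4, 8):
        print(n, overall_score(n, capacity_in_tiles=4))
```

In this toy model the score grows with every tile up to the saturation point and stops growing after that, which is the behavior the paragraph above describes.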

The Workloads

The paper describes 6 different workloads. One may question these and come up with others, but these workloads seem to capture essential functionality that is delivered by today's commodity (Intel or AMD-based) servers.

The Mail Server benchmark is executed using the Microsoft Exchange 2003 mail server. The benchmark itself is a Microsoft-provided utility called "LoadSim". This benchmark was altered to reduce the need for excessive storage and also to ramp up and stress the virtual machine. The OS for this workload was Microsoft Windows Server 2003. For this workload, the score was calculated as "Actions/minute", which is measured within the benchmark.

The Java Server benchmark uses a common industry-wide benchmark, SPECjbb2005. This is a transactional benchmark, based loosely on the TPC-C benchmark. This was run on top of Microsoft Windows Server 2003. Some modifications were made, such as a method to fine-tune the think time between transactions. The score was calculated as "New Orders/second".

The Standby Server benchmark addresses current environments where standby servers are configured to be running in "idle", waiting to run new workloads. There was no score for this one. The OS was Windows Server 2003, Enterprise Edition.

The Web Server benchmark uses an industry standard benchmark, SPECweb2005. This was run on top of the Red Hat Linux 4 operating system. The score was measured within the benchmark as "Accesses/second".

The Database Server benchmark uses a freely available online transaction processing application provided by Oracle. This was run on Red Hat Enterprise Linux 4. The score was measured within the benchmark as "Commits/second".

The File Server benchmark utilized an application called DBench that was derived from an industry-standard benchmark called NetBench. The difference between the two is that NetBench requires a large number of client systems to generate the workload. DBench uses an I/O trace that was gathered from a NetBench execution and replays the trace. In this way, there is no requirement for the client workload generators. DBench was further modified by adding a TCP/IP connection to a client that could be used to control the benchmark. The benchmark was run on Red Hat Enterprise Linux 4. The score was calculated as "MB/second".

Aggregating Scores for VMmark

The benchmarks were brought up during a "warm-up" period. Once steady state was achieved, scores were calculated for contiguous 40-minute intervals. Once the benchmarks were "warmed up", it was found that the scores had very little variation. That is a good sign for benchmark replication and for comparisons to benchmarks run by other technicians. The benchmark results are boiled down to a single value. This overall score appears to be the median score across the three batches. This median score was then normalized to the score during the warm-up period. The following tables, which are copied from the paper, show these results.

Raw Scores

Normalized Scores
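
The aggregation described in this section can be sketched in a few lines of Python. This is a minimal sketch of the steps as this article describes them (median across the measured intervals, then normalization, then the geometric mean mentioned in the Discussion), not VMware's implementation. The function names, the `reference_score` parameter, and the sample numbers are all made up for illustration.

```python
from statistics import geometric_mean, median


def normalized_score(interval_scores, reference_score):
    """Median of a workload's per-interval scores, divided by a reference.

    ASSUMPTION: `reference_score` stands in for whatever baseline the
    paper normalizes against; it is a hypothetical parameter.
    """
    return median(interval_scores) / reference_score


def tile_score(interval_scores_by_workload, reference_by_workload):
    """Geometric mean of the normalized scores of one tile's workloads."""
    normalized = [
        normalized_score(scores, reference_by_workload[name])
        for name, scores in interval_scores_by_workload.items()
    ]
    return geometric_mean(normalized)


if __name__ == "__main__":
    # Made-up numbers for illustration only; NOT results from the paper.
    measured = {
        "mail (Actions/minute)": [980.0, 1000.0, 1010.0],
        "java (New Orders/second)": [195.0, 200.0, 210.0],
    }
    reference = {
        "mail (Actions/minute)": 1000.0,
        "java (New Orders/second)": 200.0,
    }
    print(tile_score(measured, reference))
```

Dividing each workload by its own reference removes the units (Actions/minute, MB/second, and so on), which is what allows such different benchmarks to be combined into one value. The Standby Server has no score, so it would simply be left out of the dictionaries.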

Discussion

First of all, it should be noted that this suite of benchmarks contains only one of the benchmarks that were discussed in the subsequent benchmark paper, "A Performance Comparison of Hypervisors", which I discussed in June. Perhaps the second paper was more like "offensive benchmarking" instead of "objective benchmarking". The benchmarks discussed above appear to be quite thoughtful, reasonable, and easy to set up and duplicate. Three ran on a Microsoft OS while three ran on Red Hat, which seems fair enough. Perhaps the greatest criticism is:

What's with the weird normalization at the end to come up with the final score?

Perhaps the good doctor has encountered too much turbulence during the flight, but the normalization doesn't seem to make much sense. Although the paper claims that running more tiles will increase the score, if everything is normalized (using a geometric mean, which implies that they understand statistics), how can the score actually increase?

Well, we can't all be perfect like the good doctor. I suspect that the next edition of this benchmark will clear up the mystery. In general, I applaud VMware for their vision and deep interest in computer performance, which is right up our alley, so to speak.

Before ending this article, the good doctor has wonderful news. If you were thinking about possibly going to CMG 2007, but have not yet committed, you can easily justify your attendance since I will be presenting a paper that covers many of the details that I have been writing about all year. Check out session # 546, "Beyond the Hypervisor Hype", scheduled for Thursday, Dec. 6, from 2:45 to 3:45. I will be available during the conference for any of your questions and suggestions for future articles. Now, how's that for shameless promotion?

I'm going to CMG'07 in San Diego, California - December 2nd through 7th, 2007