What's All the Fuss About I/O Virtualization?

June, 2007
by Michael A. Salsburg, D. V. (Doctor of Virtualization)

About the Author
Michael A. Salsburg, Ph.D., Technology Director, Unisys Corporation

Michael is responsible for the strategic direction of the overall server product line, with a focus on System Performance and overall Architecture.

He received his Bachelor of Science in Mathematics from the University of Pittsburgh (magna cum laude) in August 1972. He received his Master of Science in Computer Science (Software Engineering and Artificial Intelligence) from the University of Delaware in 1982. He received his Doctor of Philosophy in Mathematics (Probability and Statistics) from Drexel University in 1992.

He has been awarded two international patents in the area of performance modeling algorithms and software. In addition, he has published dozens of papers and has lectured world-wide on the topic of computer performance evaluation. His current interest is in the area of High Performance Computing.

My loyal readers will remember that last month I mentioned the following:

Our own benchmarks at Unisys have indicated that the CPU time per I/O can be between 4-6 times greater with virtualization than on the raw metal. [1]

So, what's behind this increase in CPU time for I/O operations? This month's article will dig deeper into this interesting issue. Much of the information has been synthesized from an Intel article titled "Intel Virtualization Technology for Directed I/O." [2]

From the processor's point of view, the most optimized I/O requires no processing of the data as it is transferred between a device and physical memory. This is often called direct memory access, or DMA. Any type of conversion or replication of the data will incur additional processor cycles. With server virtualization, if data could be transferred directly between the device and the virtual machine's (VM's) memory space, then performance would be close to "bare metal", because the standard device drivers could be used and the virtualization technology itself would not need to get involved.

But if this were the case, the use of virtualization would be severely limited. For one thing, multiple guests could not share a device. The number of viable guests would be limited to the number of physical devices that could be connected to the physical system. Secondly, the "live migration" of a guest would be severely restricted. If a guest is moved from one platform to another, there would be a requirement to have similar disk types, makes and models connected to both platforms.

Finally, the devices are connected to actual hardware. The guest operating systems in the VMs, on the other hand, interact with a virtualized layer that "appears" to be hardware when it is just an abstraction of the hardware. The devices expect to send interrupts to the hardware, "thinking" they will reach the operating system directly, when that operating system is actually running inside a VM. Similarly, today's devices have no mechanism to transfer data directly between the device's memory and the guest's memory without the involvement of the virtualization technology.

The next three figures show this graphically. Figure #1 shows the direct memory access between a device and the address space being used by a process. Four processes are shown. Each process "thinks" it is using physical memory. In reality, it is using virtual memory, with a translation table that maps virtual addresses to physical addresses. In addition, the address could also be in the CPU memory cache, which has a lower latency. The point here is that the technology is mature and the physical processor handles the various translations through hardware-supported translation tables, which have low latency and do not consume processor cycles to handle the translation. Note that "physical memory" is distinguished by the term "Host Physical Address" (HPA).
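The translation in Figure #1 can be pictured with a minimal Python sketch. It assumes 4 KB pages and a single-level table; the names `PAGE_SIZE`, `translate` and `process_a` are illustrative and do not come from the Intel article. Real processors do this lookup in hardware, which is why it costs the software nothing.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


def translate(page_table, virtual_addr):
    """Map a process virtual address to a Host Physical Address (HPA)."""
    page_number, offset = divmod(virtual_addr, PAGE_SIZE)
    if page_number not in page_table:
        # No mapping exists: a real system would raise a page fault here.
        raise LookupError(f"page fault at virtual address {virtual_addr:#x}")
    # The offset within the page is unchanged; only the page is relocated.
    return page_table[page_number] * PAGE_SIZE + offset


# Hypothetical table for one process: virtual page number -> physical frame.
process_a = {0: 7, 1: 3}

hpa = translate(process_a, 0x1010)  # virtual page 1, offset 0x10
```

Each of the four processes in the figure would have its own table, so the same virtual address can land in a different physical frame for each of them.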

Figure #1 - No Virtualization

Figure #2 shows a new level of memory that is being managed by the virtualization technology - the "Guest Physical Address" (GPA). This is displayed as memory that is managed by the Virtual Machine Monitor (VMM), which is part of the hypervisor technology, such as VMware's ESX or Xen. The VMM emulates the hardware beneath it and presents an abstraction of the hardware for each "virtual machine" (VM). Since the devices are using host physical addressing, the direct transfer of data does not reach into the individual guests.

There is no simple translation table support in hardware, since VMMs maintain a separate set of guest physical addresses for each guest. The mapping of host physical addresses to guest physical addresses is quite dynamic, since the VMMs tend to re-assign addresses among the guests. Therefore, as part of its management of guest physical addresses, the VMM also needs to copy the data between host physical memory and the guest's memory and, by the way, do significant checking for errant I/Os. An I/O can either mistakenly or maliciously be instructed to use memory that is "out of bounds" for the process. Such an I/O can bring down the entire virtualized system. So, this management, isolation and containment comes at a cost of processor cycles consumed by the VMM.
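To make the cost concrete, here is a rough Python sketch of the software path just described: the VMM validates the whole guest address range and then copies the data itself. The class `SoftwareVMM`, its methods and the 4 KB page size are assumptions made for illustration, not the design of any particular hypervisor.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


class GuestMemoryError(Exception):
    """Raised when an I/O targets memory outside the guest's mapping."""


class SoftwareVMM:
    def __init__(self, host_memory):
        self.host_memory = host_memory  # bytearray standing in for host RAM
        self.guest_maps = {}            # guest -> {guest page: host frame}

    def map_page(self, guest, guest_page, host_frame):
        self.guest_maps.setdefault(guest, {})[guest_page] = host_frame

    def gpa_to_hpa(self, guest, gpa):
        page, offset = divmod(gpa, PAGE_SIZE)
        frames = self.guest_maps.get(guest, {})
        if page not in frames:
            raise GuestMemoryError(f"{guest}: GPA {gpa:#x} is out of bounds")
        return frames[page] * PAGE_SIZE + offset

    def complete_read(self, guest, gpa, data):
        """Copy device data into guest memory, checking every page first."""
        end = gpa + len(data)
        targets = []
        pos = gpa
        while pos < end:
            # Split the transfer at page boundaries and validate each piece.
            chunk = min(end - pos, PAGE_SIZE - pos % PAGE_SIZE)
            targets.append((self.gpa_to_hpa(guest, pos), chunk))
            pos += chunk
        copied = 0
        for hpa, chunk in targets:
            # This copy is the processor work that direct DMA would avoid.
            self.host_memory[hpa:hpa + chunk] = data[copied:copied + chunk]
            copied += chunk
        return copied
```

The validation loop runs before any byte is copied, so an errant I/O is rejected without touching memory. Both the checking and the copying are done by the CPU, which is where the extra cycles go.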

Figure #2 - Current VMMs

Figure #3 shows the introduction of Intel's Virtualization Technology for Directed I/O (VT-d). Note that this article only discusses Intel technology. AMD provides similar capabilities in its I/O Memory Management Unit (IOMMU).

The VMM can take advantage of the VT-d technology by defining individual DMA protection domains. One or more devices can be assigned to a protection domain. In addition, a protection domain can be assigned to a specific guest, the VMM or a guest-OS driver.

Once the protection domain is defined, an I/O device can be given a view of memory as if it were working with a set of HPAs when, in reality, it is working with GPAs that the VT-d hardware translates to HPAs. The VMM is no longer needed to enforce isolation and containment on each transfer.
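The idea of a protection domain can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class `DmaRemapper`, the device names and the dictionary-based tables are hypothetical, and real VT-d hardware uses in-memory table structures rather than anything like this.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


class DmaFault(Exception):
    """Raised when the remapping logic blocks a DMA request."""


class DmaRemapper:
    def __init__(self):
        self.device_domain = {}  # device id -> protection domain
        self.domain_pages = {}   # domain -> {guest page: host frame}

    def assign_device(self, device, domain):
        self.device_domain[device] = domain
        self.domain_pages.setdefault(domain, {})

    def map_page(self, domain, guest_page, host_frame):
        self.domain_pages.setdefault(domain, {})[guest_page] = host_frame

    def translate_dma(self, device, gpa):
        """Translate the address a device issued (a GPA) into an HPA."""
        if device not in self.device_domain:
            raise DmaFault(f"device {device!r} has no protection domain")
        pages = self.domain_pages[self.device_domain[device]]
        page, offset = divmod(gpa, PAGE_SIZE)
        if page not in pages:
            # The request falls outside the domain, so it is contained.
            raise DmaFault(f"device {device!r}: GPA {gpa:#x} not mapped")
        return pages[page] * PAGE_SIZE + offset


remapper = DmaRemapper()
remapper.assign_device("nic0", "guest-a")
remapper.assign_device("disk0", "guest-b")
remapper.map_page("guest-a", 0, 10)
remapper.map_page("guest-b", 0, 20)
```

Here `nic0` and `disk0` can both issue a DMA to guest address `0x100`, and each lands in its own guest's host frame. A device with no domain, or an address outside its domain, is refused before memory is touched.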

Figure #3 - VMMs with VT-d

Similar to DMA accesses, I/O interrupt requests generated by I/O devices are currently handled by the VMM. When an interrupt occurs, the VMM must present it to the guest. This is done in software rather than hardware, so CPU cycles are consumed and latency is added to each interrupt. VT-d provides an interrupt-remapping architecture. Again, isolation and containment are achieved through the definition of protection domains. The handling of interrupts is fairly involved since, over the years, various types of device interrupts and message signaling technologies have been introduced. VT-d supports remapping of interrupts from all I/O sources, including I/O interrupt controllers (IOAPICs) as well as the MSI and MSI-X interrupts defined by the PCI specifications.
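Interrupt remapping can be thought of as a table lookup that replaces the VMM's software dispatch. The Python sketch below is a deliberately simplified picture: `InterruptRemapper`, `RemapEntry` and the (device, index) key are assumptions for illustration and do not reflect the actual VT-d table format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemapEntry:
    guest: str   # which guest should receive the interrupt
    vector: int  # interrupt vector as seen inside that guest


class InterruptRemapper:
    def __init__(self):
        self.table = {}  # (device id, interrupt index) -> RemapEntry

    def program(self, device, index, guest, vector):
        """Set up by the VMM ahead of time, not on every interrupt."""
        self.table[(device, index)] = RemapEntry(guest, vector)

    def deliver(self, device, index):
        """Return the target for an interrupt, or None if it is blocked."""
        return self.table.get((device, index))


irq = InterruptRemapper()
irq.program("nic0", 0, guest="guest-a", vector=0x41)

target = irq.deliver("nic0", 0)    # routed to guest-a, vector 0x41
blocked = irq.deliver("disk0", 0)  # no entry programmed, so not delivered
```

The point of the sketch is the division of labor: the VMM programs the table once, and the per-interrupt lookup no longer needs VMM code to run.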

The VT-d architecture allows hardware implementations to cache frequently used remapping-structure entries. There are three types of caches:

  • Context-Entry Table - This is used to map devices to protection domains.
  • Page Directory Entry - These are used by the multi-level page table approach to mapping GPA to HPA.
  • I/O Translation Look-aside Buffer - This holds the results of page walks that occur as the multiple page tables are traversed.

The implementation includes the responsibility of keeping these three caches consistent by invalidating stale entries. For example, if the VMM changes a protection domain definition, the associated entries have to be purged.
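The consistency requirement is easiest to see with a toy translation cache. In the Python sketch below, `CachedTranslator`, its `walks` counter and the flat per-domain tables are hypothetical simplifications; they stand in for the I/O Translation Look-aside Buffer and the page tables it caches.

```python
class CachedTranslator:
    def __init__(self, page_tables):
        self.page_tables = page_tables  # domain -> {guest page: host frame}
        self.iotlb = {}                 # (domain, guest page) -> host frame
        self.walks = 0                  # number of slow table walks performed

    def lookup(self, domain, guest_page):
        key = (domain, guest_page)
        if key not in self.iotlb:
            # Cache miss: walk the tables and remember the result.
            self.walks += 1
            self.iotlb[key] = self.page_tables[domain][guest_page]
        return self.iotlb[key]

    def remap(self, domain, guest_page, new_frame):
        """Change a mapping and purge the cached entry that is now stale."""
        self.page_tables[domain][guest_page] = new_frame
        self.iotlb.pop((domain, guest_page), None)


translator = CachedTranslator({"guest-a": {0: 10}})
first = translator.lookup("guest-a", 0)   # miss, walks the table
second = translator.lookup("guest-a", 0)  # hit, served from the cache
translator.remap("guest-a", 0, 99)
third = translator.lookup("guest-a", 0)   # miss again, sees the new frame
```

Without the purge in `remap`, the third lookup would keep returning the old frame, which is exactly the kind of stale entry that would let a device write into memory it no longer owns.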

As virtualization is used on large-scale servers for mission-critical data processing, the issue of scaling arises. The number of devices and protection domains can require caches of significant size. The VT-d technology defines a cooperative approach to caching that allows the I/O devices themselves to maintain their own caches. Address Translation Services (ATS) are being defined as extensions of the PCI Express protocol so that devices can cooperatively cache address translations.

So, one may wonder how all of this technology affects the bottom line of performance. At the beginning, I mentioned that the CPU time per I/O can increase by a factor of 4 to 6. When looking in more detail, the "CPU time" is really time that the CPU is "stalled" - waiting for an interrupt to complete. Although there is still a dearth of real-world performance measurements, anecdotal information implies that this stall time can reduce throughput for data-intensive workloads by up to 50%, whereas the throughput of CPU-intensive workloads is reduced by 10% or less. With VT-d, one should expect the reduction for data-intensive workloads to be much closer to the 10% value, when compared to execution on the "bare metal".
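A little arithmetic shows what these factors mean. The Python sketch below uses a made-up baseline of 25 microseconds of CPU time per I/O, chosen only to make the numbers round; it is not a measurement from Unisys or anyone else. The 4x and 6x factors and the 50% and 10% reductions are the figures quoted above.

```python
def cpu_bound_io_rate(baseline_cpu_us_per_io, overhead_factor):
    """I/Os per second that one fully busy CPU could drive."""
    return 1_000_000 / (baseline_cpu_us_per_io * overhead_factor)


def relative_throughput(reduction_percent):
    """Throughput as a fraction of bare metal after a given reduction."""
    return 1.0 - reduction_percent / 100.0


BASELINE_US = 25  # hypothetical bare-metal CPU time per I/O, in microseconds

bare_metal = cpu_bound_io_rate(BASELINE_US, 1)   # 40,000 I/Os per second
virtual_low = cpu_bound_io_rate(BASELINE_US, 4)  # 10,000 I/Os per second
virtual_high = cpu_bound_io_rate(BASELINE_US, 6)

without_vtd = relative_throughput(50)  # data-intensive, worst case quoted
with_vtd = relative_throughput(10)     # the hoped-for value with VT-d
```

Under that assumed baseline, a CPU that could drive 40,000 I/Os per second on bare metal is limited to between roughly 6,700 and 10,000 when each I/O costs 4 to 6 times as much.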

This 10% goal cannot be achieved by the VT-d technology alone. In order to fully realize optimized I/O performance in a virtual environment, the PCI vendors are establishing standards in partnership with Intel's VT-d technology roadmap and AMD's IOMMU roadmap. PCI assistance in virtualization optimization will be the subject of a subsequent article.

So, until next month....

Feel free to send me e-mails with your comments and suggestions at:


And remember - I'm not a Real Doctor, I'm a Virtual one....

References:

1. http://www.cmg.org/measureit/issues/mit41/m_41_1.html

2. http://www.intel.com/technology/itj/2006/v10i3/2-io/1-abstract.htm