MeasureIT

Hyperthreading - Two for the price of one?
Simultaneous Multithreading
March 1, 2003
by Mark Friedman

About the Author
Mark Friedman, Demandtech

Mark is the co-author of "Windows 2000 Performance Guide," published by O’Reilly in 2001. He is a frequent contributor to CMG conferences on performance and tuning topics and began concentrating on Microsoft Windows NT in 1996. Also in 1996, he began developing Performance SeNTry, performance monitoring software for Windows NT and 2000, also known as NTSMF.

He edited an industry newsletter entitled "Mark Friedman on Storage Management," published by Demand Technology. In 1994, he founded OnDemand Software, which developed and sold the award-winning WinInstall software distribution package for Windows networks. From 1987 to 1991 he worked at Landmark Systems, where he was the architect and led the development team that built The Monitor for MVS.


The Intel Hyperthreading© technology incorporated in recent Pentium 4 Xeon and the newest high-end Pentium 4 processors (see The Intel Press Release) generates a conundrum for capacity planners. The hyperthreaded flavor of the Pentium 4 presents two logical processors to the operating system for every physical microprocessor chip you buy. When you are considering an upgrade to a hyperthreaded Pentium 4 machine, just how many CPUs are you getting? In this Research Note, we will look at how these hyperthreaded processors work so you can get a better handle on the capabilities of this new flavor of Intel processor. Depending on your workload, you may actually be getting the capacity of two processors for the price of one, but, as usual, your mileage may vary.

Hyperthreading (HT) is Intel's implementation of Simultaneous Multithreading (SMT) technology in the P4, the latest in high performance processor architectures. Intel added SMT support to its 32-bit P4 architecture, which already has an advanced superscalar architecture, plus out-of-order execution capabilities. Superscalar machines process multiple instructions in a single clock cycle by exploiting instruction-level parallelism (ILP). Out-of-order instruction execution is a technique for increasing ILP in the face of the usual pipeline stalls due to cache misses, mispredicted branches, etc. Both add to the internal complexity of the Intel microprocessors found in most desktop PCs, but the benefit is superior performance.

SMT is a relatively new approach to processor design that promises to exploit the thread-level parallelism (TLP) that is inherent in most server workloads. Even though Intel has added SMT capability to chips also destined for workstations, these address workloads that ordinarily do not have the same degree of thread-level parallelism that can be expected in servers. (See, for example) SMT was originally championed by a research team led by Jack Lo and Susan Eggers at the University of Washington. In 1997, the team published an influential article in the September issue of IEEE Micro entitled "Simultaneous multithreading: a platform for next generation processors." That seminal publication compared SMT performance favorably to both superscalar machines and single chip multiprocessors (CMP), like the latest high-end SPARC chips from Sun. In the meantime, Digital pioneered the use of SMT on a version of its 64-bit Alpha microprocessor. When Intel absorbed Alpha manufacturing and engineering in 1997 as part of an antitrust settlement with the old Digital Corporation, it acquired key practical experience in implementing SMT technology.

Xeon chips with SMT began to appear in quantity for servers in the spring of 2002; general purpose P4s with HT came out at the end of 2002. Intel suggests that the 2-way SMT support in the P4 can speed up typical server workloads by 20% or so, while adding only about 5% to the complexity of the P4 Xeon design (1). Because actual performance of an HT P4 will vary based on the degree of thread-level parallelism in the workload, it is difficult to say how much benefit SMT confers in the general case. The most thorough assessment of the performance of an HT P4 that has been published to date is by Duc Vianney of IBM running on a modified Linux 2.4 kernel. See the article here. Vianney concludes, "Intel Xeon Hyper-Threading is definitely having a positive impact on Linux kernel and multithreaded applications. The speed-up from Hyper-Threading could be as high as 30% in stock kernel 2.4.19, to 51% in kernel 2.5.32 due to drastic changes in the scheduler run queue's support and Hyper-Threading awareness."

My personal experience with a 2-way 2.2 GHz Xeon (4 logical CPUs running Windows 2000) executing a highly parallel workload yielded performance consistent with 4 conventional processors running at a similar clock speed. The 100% speedup is a commendable result. However, in other documented cases, performance of an HT P4 can degrade when the workload is not suitable (scientific calculations involving floating point arithmetic - the traditional Achilles heel of the Intel 32-bit processor family - are the culprit). See, for example, this documented result where a hyperthreaded Xeon ran 25% slower compared to the same machine with HT disabled.
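For capacity planning, the speedup figures quoted above can be turned into a rough count of equivalent conventional processors. The following is a minimal sketch rather than a sizing method: the function name `effective_processors` is hypothetical, and it assumes each quoted speedup is a throughput gain relative to the same machine with HT disabled and that physical processors scale linearly.

```python
def effective_processors(physical_cpus, ht_speedup):
    """Return the conventional-CPU equivalent of an HT machine.

    ht_speedup is the fractional throughput gain from enabling HT
    (0.20 means 20% faster). Assumption: physical CPUs scale linearly.
    """
    if physical_cpus < 1:
        raise ValueError("physical_cpus must be at least 1")
    return physical_cpus * (1.0 + ht_speedup)


# Speedup figures quoted in the text, applied to a 2-way server.
for label, speedup in [
    ("Intel estimate", 0.20),
    ("Linux 2.4.19 best case", 0.30),
    ("Linux 2.5.32 best case", 0.51),
    ("highly parallel workload", 1.00),
]:
    print(f"{label}: {effective_processors(2, speedup):.2f} CPUs")
```

On these assumptions, the same 2-way box is worth anywhere from about 2.4 to 4 conventional processors, which is why the workload has to be characterized before the upgrade is sized.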

Figure 1

How SMT works. A hyperthreaded Pentium 4 contains a single processor core, but maintains dual sets of processor registers and other status information to allow two program threads to execute simultaneously. This extra hardware makes a hyperthreaded P4 look like two logical processors to the operating system (Windows 2000 or Linux), which can then schedule two Ready and Waiting program threads to execute. The single processor core fetches IA-32 instructions, alternating between one thread and the other on successive clocks. Then, instructions are decoded into micro-operations (µops) and assigned internal pseudo-registers. Still in the early stages of the execution pipeline, µops from both instruction streams are intermixed in the µop Queue and are scheduled independently during the out-of-order instruction Execute stage. See Figure 1, which illustrates the modifications Intel made to the P4 execution pipeline to support HT.
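The alternating fetch described above can be pictured with a toy model. This sketch is not a description of the actual P4 front end: the function `interleave_fetch` is hypothetical, decoding into µops is not modeled, and the rule that a lone remaining thread receives every fetch slot is a simplifying assumption.

```python
from collections import deque


def interleave_fetch(thread_a, thread_b):
    """Fetch from two instruction streams on alternating clocks.

    Returns the shared queue contents as (thread_id, instruction) pairs.
    Assumption: when one stream runs dry, the other gets every slot.
    """
    streams = [deque(thread_a), deque(thread_b)]
    shared_queue = []
    turn = 0
    while streams[0] or streams[1]:
        if not streams[turn]:
            turn = 1 - turn  # the other thread takes the slot
        shared_queue.append((turn, streams[turn].popleft()))
        turn = 1 - turn  # alternate on the next clock
    return shared_queue


queue = interleave_fetch(["a0", "a1", "a2"], ["b0"])
print(queue)
```

The point of the model is only that instructions from both threads end up intermixed in one queue, tagged by thread, so a later stage can schedule them independently.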

So how good is it? SMT technology appears to be a significant performance breakthrough. One of the crucial ideas behind SMT is that when an instruction from one thread inevitably stalls the execution pipeline - in the case of a mispredicted branch, for example, or a Level 1 cache miss during an instruction or data fetch - the processor core may be able to continue to make forward progress executing an alternate instruction stream. Since the complex execution logic in the processor core of a typical superscalar machine is notoriously underutilized (2), under almost any best case scenario, SMT can boost performance because it can be loaded with instructions that are ready to execute from more than one Ready thread. Under the right circumstances, an HT P4 can achieve double the Instruction Execution Rate (IER) of a single threaded superscalar. SMTs also appear to perform as well as or better than single chip multiprocessors on highly parallel workloads. Another advantage of the SMT approach is that the execution time of a single threaded workload is hardly impacted at all - Intel estimates that a hyperthreaded P4 executes a single thread only about 2% slower than a conventional Pentium 4.
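A back-of-the-envelope model shows why a second instruction stream helps an underutilized core. The sketch below is an illustration under strong assumptions, not a measurement: `core_utilization` is a hypothetical helper, each thread is assumed to have work ready on a given cycle with the same probability and independently of the other, and contention for shared resources is ignored. The 30% starting point is the utilization figure from footnote 2.

```python
def core_utilization(ready_probability, threads):
    """Fraction of cycles in which at least one thread can issue work.

    Assumes each thread is ready on a given cycle with the same
    probability, independently of the others (a simplification).
    """
    if not 0.0 <= ready_probability <= 1.0:
        raise ValueError("ready_probability must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return 1.0 - (1.0 - ready_probability) ** threads


single = core_utilization(0.30, 1)
dual = core_utilization(0.30, 2)
print(f"one thread: {single:.2f}, two threads: {dual:.2f}")
print(f"modeled speedup: {dual / single - 1.0:.0%}")
```

Real workloads will land elsewhere, because threads that share a core also compete for caches and execution units, but the model captures the basic intuition that stalls in one thread leave room for the other.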

SMT does raise several issues for operating system software developers which have both the software folks at Microsoft and the volunteer developers of Linux scrambling a bit. Standard multiprocessor support in an OS like Red Hat 7.2 or Microsoft Windows 2000 will recognize two logical processors for every physical HT processor installed and automatically start scheduling Ready threads to each idle logical engine. But without special consideration for the SMT environment, less than optimal performance can result because SMT has some impact on thread scheduling (3). Windows 2000 currently dispatches an Idle loop instruction stream whenever there is no Ready and Waiting thread to schedule. The Pentium 4 HT processor core cannot differentiate between this Idle loop instruction stream and a real workload, so it goes ahead and tries to execute it in parallel. Intel recommends that the OS issue a HALT (HLT) instruction instead of running an Idle loop when there is no work to schedule on an HT logical processor. When one logical engine is HALTed, the processor core automatically switches all available internal resources, which are normally partitioned between the two logical processors, to support the remaining instruction stream. Support for the HALT instruction is expected in Windows 2003 .Net Server. Another OS optimization that Intel hardware engineers recommend is that the OS attempt to schedule a Ready thread on an idle physical processor before it attempts to add a second instruction stream to an HT engine already executing an instruction stream. This support is also expected in Windows 2003 .Net Server.
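The second recommendation, preferring an idle physical processor over the sibling of a busy logical processor, can be sketched as a placement rule. This is an illustration of the policy only, not the Windows or Linux scheduler: `pick_logical_cpu`, the `topology` mapping, and the CPU numbering are all hypothetical.

```python
def pick_logical_cpu(topology, busy):
    """Choose a logical CPU for a Ready thread, or None if all are busy.

    topology maps a physical processor id to its logical CPU ids;
    busy is the set of logical CPU ids already running a thread.
    """
    fallback = None
    for physical_id in sorted(topology):
        logical_cpus = topology[physical_id]
        idle = [cpu for cpu in logical_cpus if cpu not in busy]
        if idle and len(idle) == len(logical_cpus):
            return idle[0]  # whole physical processor is idle: best choice
        if idle and fallback is None:
            fallback = idle[0]  # sibling of a busy logical CPU
    return fallback


# Two physical HT processors, each presenting two logical CPUs.
topology = {0: [0, 1], 1: [2, 3]}
print(pick_logical_cpu(topology, busy={0}))           # idle physical processor
print(pick_logical_cpu(topology, busy={0, 2}))        # must share a core
print(pick_logical_cpu(topology, busy={0, 1, 2, 3}))  # nothing available
```

A scheduler that simply takes the lowest-numbered idle logical CPU would place the second thread on the busy core in the first case, leaving an entire physical processor unused.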

The Microsoft support for HT that has attracted by far the most attention in the trade press and various discussion groups is software licensing. In a decision that could have gone either way, Microsoft decided that it will license its software by physical processor. However, support for this licensing policy depends on your running an OS version which recognizes and supports HT. This requires Windows XP (for a workstation) or an upcoming release of Windows 2003 .Net Server. On the licensing front, Windows 2003 .Net Server supplies a SYSTEM_LOGICAL_PROCESSOR_INFORMATION structure that a running process can access to determine if HT is active so that applications like SQL Server can figure out how many physical processors are installed and confirm license compliance automatically.
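The license check described above amounts to telling physical processors apart from logical ones. The sketch below shows only that counting step and is not a call into the Windows API: the helper names are hypothetical, and the input format, one bitmask per physical processor with a bit set for each logical CPU it presents, is an assumption made for illustration.

```python
def count_processors(core_masks):
    """Return (physical, logical) counts from per-processor bitmasks.

    Assumption: each mask has one bit set for every logical CPU that
    shares that physical processor.
    """
    physical = len(core_masks)
    logical = sum(bin(mask).count("1") for mask in core_masks)
    return physical, logical


def hyperthreading_active(core_masks):
    """HT is active if any physical processor exposes extra logical CPUs."""
    physical, logical = count_processors(core_masks)
    return logical > physical


# Hypothetical 2-way HT Xeon: logical CPUs 0-1 and 2-3 share processors.
masks = [0b0011, 0b1100]
print(count_processors(masks))
print(hyperthreading_active(masks))
```

Under a per-physical-processor license, an application would compare the first number, not the second, against the licensed processor count.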

As for the caretakers of the Linux bazaar, some Intel software developers apparently patched the Linux kernel to support hyperthreading (see this posting at geek.com) in their development lab, which is certainly one advantage of the Open Source movement. Out of the box, Red Hat Linux 7.2, for example, does not recognize hyperthreading, but a stable kernel that does support hyperthreading is available for the asking. You can also find documentation for installing the HT support in Red Hat Linux at Intel’s web site.

Conclusion. Compared to the complexity that superscalar architectures, for example, add to processor design, SMT promises to be a very cost-effective way to squeeze additional performance out of advanced processors. The Pentium 4 Xeon was one of the first commercial applications of SMT technology, and others have already followed. Exploiting SMT to get higher performance from the Intel legacy IA-32 microarchitecture may even prove to be the undoing of Intel’s enormously expensive gamble on its new 64-bit Itanium processors, which rely on more complicated speculation and predication techniques instead. An SMT-oriented 64-bit addressing extension to the existing IA-32 instruction set may prove to be a much simpler approach to high performance computing in the Intel-compatible world than the complex instruction set Itanium. And if Intel decides not to build it in an attempt to preserve the market for its Itanium chips, you can be sure that rival AMD will.


1. Marr, et al., Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, Vol 6, Issue 1 (Feb 14, 2002).
2. Eggers & Lo’s team at UDub suggests that more than 60% of the execution cycles on a typical super-scalar machine are wasted. Intel’s marketing material associated with the hyperthreaded Xeon product launch reports the processor core on a P4 is otherwise only approximately 30% utilized.
3. See Sujay Parekh, Susan Eggers, and Henry Levy, "Thread-Sensitive Scheduling for SMT Processors." University of Washington Technical Report, 2000.

Last Updated 06/05/09



Computer Measurement Group