May, 2007
by Neil J. Gunther
For the past few years Intel, AMD, IBM, Sun, and other microprocessor vendors have been
aggressively promoting the concept of multicores.
Multicore processors
contain multiple execution units or cores on the same silicon die and are packaged
as a single module. The message has been that more cores are better than a single core or CPU.
Last January, however, Intel and IBM made a highly unorthodox
joint announcement.
What makes this announcement unorthodox is that IBM and AMD have formed a
partnership
in the manufacture of future microprocessors, so IBM and Intel are competitors in this context.
In case you missed it, here's what they said in a nutshell.
Intel and IBM will produce single CPU parts using state-of-the-art
45 nanometer (nm) technology. Intel said it is converting all its manufacturing facilities
(fab lines, in the vernacular) and will produce 45 nm microprocessors (code-named Penryn) by the end
of this year. The IBM and AMD schedule is a bit less aggressive but they expect to ship their respective
Power series and Opteron series microprocessors by the middle of 2008.
Add to this picture the running commentary that Moore's law (see Section 3) is
finished, kaput, dead
(even according to Gordon!),
or possibly risen from the dead.
Any time commercial enemies get together for a public lovefest in such a volatile marketing milieu,
it warrants closer scrutiny because it usually means there is something major
going on. When I saw this announcement, some of the questions that went through my mind were:
Why a joint announcement?
How can Intel upgrade all their fab lines so fast?
Why put all your silicon eggs in the same fab basket?
Are multicores dead?
Has Moore's law been resuscitated?
Since I have some past experience with VLSI technology from the time when I was a researcher at Xerox PARC, I thought you might find my speculative thoughts on these matters of some interest; especially when you come to do your next round of server procurement. I have to say speculative because I don't have any particular insider information, although I do still hear a lot of rumors on the Silicon Valley grapevine.
The answers to these and other questions are very likely to impact the way we do server performance analysis and capacity planning in the near future.
Before I get into dissecting the above issues, I want to emphasize why the joint Intel/IBM announcement is so important. It is important not just from the marketing standpoint, but also for computer performance analysis and, in particular, for how it relates to another well-known law, viz., Amdahl's law. In case you're confused, the conclusion of Gene Amdahl's 1967 paper can be paraphrased as:
When it comes to running commercial workloads on a single CPU or a multiprocessor, the best performance will be achieved by running it on the fastest single CPU you can get your hands on.
Skeptics are invited to see Section 2.10 of my Perl::PDQ book for a more formal discussion on this point. Gene Amdahl never said anything about parallelism (in the sense we mean it today) and he never wrote down the speedup equation that now bears his name. BTW, it's pure marketing genius to get your name attached to an equation you never wrote! Ironically, all of those appellations were incorrectly attributed by others in subsequent years.
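For readers who have not seen it, the speedup equation usually attributed to Amdahl can be sketched in a few lines of Python. This is the common textbook form, not something taken from Amdahl's paper, and the function names and the 10% serial fraction are illustrative assumptions only.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Textbook Amdahl speedup: S(N) = 1 / (s + (1 - s) / N).

    serial_fraction (s) is the share of the work that cannot be spread
    across processors; processors (N) is the number of CPUs or cores.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def speedup_ceiling(serial_fraction: float) -> float:
    """Limit of the speedup as the number of processors grows without bound."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    s = 0.10  # hypothetical workload: 10% of the work is strictly serial
    for n in (1, 2, 4, 8, 64):
        print(f"N={n:3d}  speedup={amdahl_speedup(s, n):.2f}")
    print(f"ceiling={speedup_ceiling(s):.1f}")
```

However many cores are added, the speedup in this model never exceeds the ceiling set by the serial fraction, which is why a faster single CPU helps every workload while extra cores help only the parallel part.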
Like Moore's law, Amdahl's law was based on purely empirical evidence. At that time, Amdahl was trying to dissuade people from using multiprocessors, partly because they were more expensive for Amdahl Corporation to build. Amdahl was correct about that then, and he's correct about it today.
Table 1: A selection of current generation Intel and AMD multicores.
| Intel Corporation | | | | | Advanced Micro Devices | | | | |
| Series | Model | Cores | Clock (GHz) | Fab (nm) | Series | Model | Cores | Clock (GHz) | Fab (nm) |
| Core 2 Duo | E6300 | 2 | 1.83 | 65 | Athlon 64 X2 | 4400 | 2 | 2.3 | 65 |
| Core 2 Duo | E6700 | 2 | 2.66 | 65 | Athlon 64 X2 | 6000 | 2 | 3.0 | 90 |
| Core 2 Quad | Q6600 | 4 | 2.40 | 65 | Athlon 64 | FX-70 | 4 | 2.6 | 90 |
| Core 2 Extreme | QX6700 | 4 | 2.66 | 65 | Athlon 64 | FX-74 | 4 | 3.0 | 90 |
Multiprocessors that are familiar to us as servers in air-cooled boxes are still expensive to manufacture and engineer for reliability. A tremendous amount of effort goes into removing race-conditions and faults in the hardware state-machines, not to mention symmetrizing the operating system to take advantage of the multiple processors in a scalable way. Multicores, like those shown in Table 1, can be thought of as multiprocessor architectures shrunk down onto a silicon die and contained in a single "black box"; the multicore module. Placing too many high-speed cores in the same module can lead to overheating problems. A burning question (if I can use that phrase) for the 45 nm technology is, How will the clock speeds of the new multicores compare with the speed of the fastest single processors? (See Section 5)
Moore's law states:
The number of transistors that can be fabricated on a very large-scale integrated (VLSI) chip doubles every two years.
Just like credit card debt, this is a statement about compounded or exponential growth (Figure 1), so the ramifications can be monumental. Gordon Moore of Intel Corporation made his empirically based pronouncement about 40 years ago. In that time, the area of a silicon chip has remained essentially constant; about the size of a postage stamp. Therefore, to accommodate the 2-year doubling of transistor density predicted by Moore's law, the size of the transistors and their associated "wires" has to be shrunk accordingly. This is possible because the transistor fabrication process is based largely on something called photolithography. Photolithography is a highly specialized form of photography where the transistors and their associated circuits are built up in layers on a silicon wafer substrate. Each layer is imaged or shot onto the silicon wafer in much the same way an image is formed on a photographic plate. Each image is actually a special kind of mask that lets light through in certain regions and not in others. Between each shot, the silicon wafer is cleaned and prepared for the next layer. Intel and IBM use slightly different lithographic processes from one another, but it requires about 50 photolithographic shots of this type to build up today's generation of microprocessors. The key point is, to double the number of transistors on the chip, each mask image only needs to be shrunk by a factor of two in area, which is a factor of about 1.4 (the square root of two) in each linear dimension.

Figure 1: Moore's law for memory chips and microprocessors plotted on a semi-logarithmic scale, which has the effect of making nonlinear exponential curves appear linear. The uppermost purple curve is the Moore projection based on data up to 1975; note the kink correction around 1980, which shows that the so-called law is only an approximation. [Source: Intel Corporation]
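To put numbers on the compounding, here is a minimal Python sketch of a quantity that doubles every two years. The two-year period comes from the statement of Moore's law above; the function names are illustrative, and real transistor counts only follow this curve approximately, as the kink in Figure 1 shows.

```python
import math

DOUBLING_PERIOD_YEARS = 2.0  # taken from the statement of Moore's law above


def growth_factor(years: float, doubling_period: float = DOUBLING_PERIOD_YEARS) -> float:
    """Multiplicative growth after `years` of doubling every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


def log10_growth(years: float, doubling_period: float = DOUBLING_PERIOD_YEARS) -> float:
    """Base-10 logarithm of the growth factor; a straight line in `years`."""
    return (years / doubling_period) * math.log10(2.0)


if __name__ == "__main__":
    for t in (2, 10, 20, 40):
        print(f"{t:2d} years: x{growth_factor(t):,.0f}  (log10 = {log10_growth(t):.2f})")
```

Forty years of two-year doublings is twenty doublings, or a factor of roughly a million. The logarithm grows by the same amount every year, which is why the exponential appears as a straight line on the semi-logarithmic axes of Figure 1.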
In 1985 the feature size was around 1000 nanometers (nm). Today, Intel and other chip makers are fabricating with 65 nm feature sizes. For reference, the wavelength of visible light is about 500 nm, ultraviolet light (used in the transistor fabrication process) is about 200 nm, and the HIV virus is about 100 nm in diameter. Obviously, Moore's shrinkage law cannot go on indefinitely because the scale of an atom is finite, in the range of 0.1-0.5 nm, and the fabrication process used to make transistors today cannot be applied to make transistors out of individual atoms.
Intel has several silicon chip fabrication facilities or fab lines in Oregon, Arizona, Israel, and
elsewhere. Hillsboro, Oregon is a research fab line, rather like a "test kitchen" where new
silicon fabrication recipes are dreamt up. These fab lines are amongst the cleanest rooms in the
world. Your camera, for example, is too "dirty" to gain entry. Each is the area of three football
fields and costs about 3 billion dollars to set up. This astronomical cost represents a huge
barrier to entry for any new vendor wanting to get into the game, and it also explains why there are fewer
and fewer chip vendors year after year. Over and above the sheer cost of the fab line is the time to
design and develop a new microprocessor, which is currently about 5 years.
All of Intel's facilities are fully automated. By that I mean, the robots have taken over. Humans are no longer involved in silicon processing like in the good old days (just 20 years ago). In fact, the picture on the right reminds me of something out of Kubrick's movie 2001: A Space Odyssey. A robot transports the wafers, each of which is 300 mm in diameter (about the size of a dinner plate), between processing steps via an overhead monorail track system.
Each processing step is usually carried out in a separate machine. A robot transfers the wafer-carrier from one processing machine to the next machine and so on. Any of these processes can also have its parametric settings altered on the fly by process engineers using computers outside the clean room. That means there is no guarantee that any two wafers have received identical processing. In fact, there is a numeric zoo for AMD and Intel microprocessors that tries to identify microprocessor chips that received different processing. Processing includes such things as ion implantation (doping), photoresist application and removal, oxide deposition, and most importantly, metal deposition. That's what produces the metal conductors (wires) that interconnect the transistors and circuits. To accommodate all the complex crisscrossing of the interconnect network, there are 8 metal layers using copper interconnect (M1, M2, ..., M8) on the new microprocessors. The bottom layer M1 connects transistors to transistors, while the upper layers are used to connect circuits spanning the chip.
Table 2: Major technology nodes predicted by the SIA 2006 roadmap.
| Year | 2004 | 2007 | 2010 | 2013 | 2016 | 2020 |
| nm | 90 | 65 | 45 | 32 | 22 | 14 |
Each progression in shrinking the technology is identified by a waypoint or "node" (Table 2) on the SIA (Semiconductor Industry Association) roadmap. A technology node is defined primarily by the minimum metal pitch used on any product, for example, the Metal-1 layer (M1) half-pitch in a microprocessor. Here, pitch refers to the spacing between wires. You can see that IBM and Intel are well ahead of the SIA roadmap; another reason the announcement in Section 1 was newsworthy. This also explains those numbers in the Fab (nm) column in Table 1. A lot of people quote those numbers, but very few know where they come from. Now you know too. Apparently, Freescale Semiconductor, Inc. has not yet found a way to get below 90 nm reliably. Since they were "fabbing" the PowerPC G5 microprocessor for the Apple Macintosh, that delay in moving from the 2004 SIA technology node to the 2007 technology node was leaving the Mac behind in the CPU-speed sweepstakes. That's partly why Apple Inc. had to cross over to Intel microprocessors.
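The node values in Table 2 are not arbitrary. Doubling the transistor count on a chip of constant area means halving the area per transistor, so each linear dimension shrinks by the square root of two. The Python sketch below, with illustrative function names, starts from the 90 nm node and compares that idealized progression with the first five entries of Table 2, which are spaced three years apart. The roadmap figures are rounded targets, so the match is close rather than exact.

```python
import math

# First five nodes from Table 2 (2004 through 2016), in nanometers.
TABLE_2_NODES_NM = [90, 65, 45, 32, 22]


def next_node(feature_nm: float) -> float:
    """Feature size after one density doubling: linear shrink by sqrt(2)."""
    return feature_nm / math.sqrt(2.0)


def ideal_nodes(start_nm: float, count: int) -> list:
    """Idealized sequence of `count` nodes, beginning at `start_nm`."""
    nodes = [float(start_nm)]
    for _ in range(count - 1):
        nodes.append(next_node(nodes[-1]))
    return nodes


if __name__ == "__main__":
    ideal = ideal_nodes(TABLE_2_NODES_NM[0], len(TABLE_2_NODES_NM))
    for roadmap, computed in zip(TABLE_2_NODES_NM, ideal):
        print(f"roadmap {roadmap:3d} nm   idealized {computed:5.1f} nm")
```

Two successive shrinks halve the feature size, which is why 90 nm leads to 45 nm and 45 nm leads to roughly 22 nm.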
Moore's law also captures the amazing scale of cost reduction. It explains why you can buy what was a 10,000 MIPS supercomputer a decade ago and have it in your laptop today for a few hundred dollars. From the performance perspective, however, Moore's law is more often associated with increasing MIPS. To understand how Moore's law has run into trouble at 45 nm, we need to understand something about how CMOS transistors work. CMOS stands for Complementary Metal-Oxide Semiconductor and I'll come back to more details about that in Section 6.
How does a transistor work? Imagine your garden hose with the water turned on at a moderate flow from the tap to the nozzle, which is lying in a drainage ditch. The water flowing in the hose is analogous to a current of electrons between a source (analogous to the tap) and a corresponding sink (analogous to the drain). In fact, two of the terminals in a transistor are called the source (labeled S in Figure 2) and the drain (labeled D in Figure 2). Part of the role of a transistor is to control the current flow between the source and drain. When using your garden hose, you normally control the flow of water at the tap (the source). The CMOS transistor does not work that way. Instead, it is more like controlling the water flow by pressing your foot on the hose to stop it (off) and lifting your foot to start it (on). You are likely to be more effective if you are wearing shoes. The transistor terminal that corresponds to your shoe is called the gate and it sits over the top of the source and drain terminals, just like the situation with the hose. The three terminals together form a triode valve, which is the silicon version of the glowing glass tubes you can see in the back of any TV set that is more than ten years old.

Figure 2: Schematic comparison of the standard transistor gate and the new high-K metal gate (shown in blue).
Depending on how the transistor is configured in a circuit, when you push your foot down the current is turned off and that might correspond to a digital '0'. Similarly, when you lift your foot up that might correspond to a digital '1'. This explains how Moore's law is associated with processor speed. As the transistor geometry becomes smaller, the distance between source and drain gets shorter and fewer electrons need to flow under the gate. A major hurdle that has to be overcome in scaling down from 65 nm to 45 nm technology is leakage of electrons between the source and drain. By analogy with the garden hose, trying to stop the water flow with a conventional shoe is no longer effective at these tiny scales, so the difference between a digital '1' and '0' becomes less distinct, and this can cause all sorts of subtle problems. To really pinch the hose tightly, you need to wear metal-capped boots. And very special metal at that, called Hafnium. There are many other special fabrication tricks-of-the-trade required to make the 45 nm fab technology operational, but they need not concern us here.
We fell off the Moore's law curve, not because photolithography collided with limitations due to quantum physics or anything else exotic, but more mundanely because it ran into a largely unanticipated thermodynamic barrier. In other words, Moore's law was stopped dead in its tracks by old-fashioned 19th-century physics. As CMOS feature sizes are reduced and switching speed increased, thermal power dissipation becomes a serious problem because it is directly proportional to the clock frequency used on the chip. A typical 2-3 GHz CPU generates on the order of 100 Watts (same as a household light bulb). And don't forget the huge power transients at the pins of the chip, which are essentially scale-invariant. Moreover, you end up trying to push that heat through a pinhole: the small die size. In other words, it's really the dissipation of power density that kills you. To the degree that this power density cannot be dissipated, the chip failure rate increases dramatically due to thermal degradation. The Apple Mac G5 dual processor (IBM Power 5 chip) has a freon cooling system (à la Cray) together with a spectrophotometer that detects freon leaks. If such a leak is detected, the system shuts down immediately. This is another reason Apple went Intel. Some designers have proposed hetero-speed cores to keep thermal dissipation problems under control. General workloads would run on moderate speed cores and the highest speed cores would only run under specific compiler directives. This is yet another rendition of Amdahl's law, as described in Section 2. As I understand it, this thermal barrier is the reason why multicores have been so heavily promoted by the CPU vendors: keep the cores running at about current clock speeds and compensate for the absence of higher speed by adding more cores per module to provide more aggregate MIPS.
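The trade-off between a faster clock and more cores can be made concrete with a rough Python sketch. It assumes the standard first-order model for CMOS switching power, in which power is proportional to capacitance times voltage squared times frequency. That model is brought in here as an assumption; the text above states only the proportionality to clock frequency. The 20% voltage increase and the die areas are hypothetical numbers chosen for illustration, and leakage power is ignored.

```python
def relative_power(freq_scale: float, voltage_scale: float = 1.0, cores: int = 1) -> float:
    """Switching power relative to one baseline core.

    Assumes P is proportional to C * V^2 * f per core, with the same
    switched capacitance C for every core (a first-order simplification).
    """
    return cores * (voltage_scale ** 2) * freq_scale


def power_density(power_watts: float, die_area_cm2: float) -> float:
    """Average power density in watts per square centimeter."""
    return power_watts / die_area_cm2


if __name__ == "__main__":
    # Double the clock at the same supply voltage.
    print("2x clock, same voltage:   ", relative_power(2.0))
    # Double the clock with a hypothetical 20% voltage increase to sustain it.
    print("2x clock, 1.2x voltage:   ", relative_power(2.0, voltage_scale=1.2))
    # Keep the clock and voltage, add a second core instead.
    print("same clock, two cores:    ", relative_power(1.0, cores=2))
    # The 100 W figure from the text over two hypothetical die areas.
    for area_cm2 in (1.0, 2.0):
        print(f"100 W over {area_cm2} cm^2:", power_density(100.0, area_cm2), "W/cm^2")
```

Under these assumptions, a second core doubles power along with peak throughput, whereas a doubled clock costs more than double once the supply voltage has to rise with it. The same wattage spread over a smaller die also gives a proportionally higher power density.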
Of course, this looks good on paper, but it comes with its own set of problems, which those of us who have been involved with the development of SMPs are familiar with. Quite apart from Moore's law having been thermally defeated, many application programmers have become rather addicted to the thrill of riding Moore's exponential curve. Even a purely single-threaded program will run faster without any additional programming effort. But what happens when you have to run threads across multiple cores? Welcome to the return of concurrent programming as a major performance issue for applications in the foreseeable future. Been there, done that, 15 years ago. But this time it's worse, because the cores are inside a true black-box; the module. Without appropriate hardware registers, any serious performance tuning will likely have to be accomplished in software alone.
In an ironic twist of fate, the new CMOS transistor technology actually hearkens back to the earliest transistor implementations. When I was involved with VLSI design tools at Xerox PARC, we used the Mead-Conway design rule: "poly over silicon" produces a transistor. The word poly refers to polysilicon or amorphous silicon. The Mead-Conway rule was actually shorthand for: a layer of poly-Si, over an implied silicon dioxide insulator, over a doped Si substrate with implied source and drain, leads to a transistor being produced from the photolithographic masks generated by the VLSI CAD tool. Now, recall from Section 4 that the 'M' in CMOS stands for metal. That terminology is already a throwback to the days when transistor gates were made by depositing metal (usually Aluminum) rather than poly-silicon over the silicon substrate. So it's somewhat ironic that, as part of the joint announcement, Gordon Moore was trotted out to publicly proclaim:
"The implementation of high-k and metal gate materials marks the biggest change in transistor technology since the introduction of polysilicon gate MOS transistors in the late 1960s"
I should point out here that the Hafnium metal is used in the so-called high-k gate oxide layer (the yellow strip in Figure 2), not the metal gate itself. Neither Intel nor IBM disclosed what type of metals will be used in the gate. The key issues at each technology node defined in Table 2 are usually the same:
Each of these issues comes into play with different levels of significance at each SIA node. In fact, it turns out that Moore's law has died many little deaths (le petit Moore?) since he first proposed it (but people have short memories). Indeed, Moore's remark could be interpreted as an acknowledgment that the transition to 45 nm was closer to a near death experience than anyone has witnessed before.
Take a look at the picture on the right. No, that's not a Google Earth image of a football field at
the edge of town, that's a photograph of the Penryn microprocessor chip. This means it's real! The
claims for the new Hafnium-oxide/metal-gate technology include:
It remains to be seen whether these attributes really translate into the so-called Moore's law II curve and lead to faster, cooler single processors or just somewhat faster and somewhat cooler multicores.
Finally, let's see if I can answer my own questions.
Why a joint announcement?
Announcing with IBM (the actual fab vendor) draws less of the wrong kind of attention than announcing with AMD, which is the direct competitor. Also, it seems that both IBM and Intel benefitted directly from VLSI research performed several years earlier at Sematech. A joint announcement tends to dissipate any feud over who discovered what processing tricks first. Do I smell a lot of cigar smoke and back room deals here?
How can Intel upgrade all their fab lines so fast?
This question has a rather startling answer. Since all of Intel's fab facilities are fully automated and software controlled, the correct processing parameters and machine schedules are more or less uploaded from their Hillsboro research fab. Think about that; Intel will upload an entire factory!
Why put all your silicon eggs in the same fab basket?
I expect this is primarily a response to the tremendous competitive pressure Intel finds itself under. Since they have little to lose, they're going for broke. Let's hope there are no bugs in that fab software they upload from Hillsboro.
Are multicores dead?
Here's where I suspect Intel is not going to put all their eggs in the same basket. They don't have to. They can keep the multicore option open in case the expected efficiencies at 45 nm don't pan out or Moore's law dies another little death.
Has Moore's law been resuscitated?
To answer this question, we'll just have to wait and see. But the wait is only about 6 months, according to Intel.
"No exponential is forever, but we can delay 'forever'."—Gordon Moore