Intel's Stoakley platform and 45nm Xeons

When AMD’s “Barcelona” Opterons made their debut last Monday, we couldn’t tell you about a sleek, black box nestled in among the other test systems in Damage Labs. Housed inside of it: an example of Intel’s brand-new “Stoakley” dual-processor platform, complete with a pair of Xeons based on 45nm process technology. These Xeons are the first members of the Penryn family of 45nm CPUs to reach our test labs, and they offer a tantalizing look at how Intel will counter AMD’s new CPU design with a substantially revised version of its own potent Core microarchitecture.

These new CPUs and the platform that supports them promise marked improvements in performance, thanks to a bevy of tweaks and updates. In fact, although the new Xeons are more a minor refresh than a major overhaul, the gains they’ve attained are formidable. Today, we can show you how these processors perform.

The contest between next-generation CPU architectures has begun in earnest. Read on to see how Intel’s 45nm Xeons match up with AMD’s quad-core Opterons.

Goin’ to Harpertown
Following hardware developments these days requires navigating a virtual minefield of overlapping codenames, and Intel proudly leads the world in codename generation. The new Xeons have several names attached. “Penryn” is the codename for the family of processors based on Intel’s 45nm fab process, and this same silicon will serve a number of markets in various configurations. For the server and workstation markets, the bread-and-butter Penryn derivative will be “Harpertown,” a dual-chip, quad-core product that supersedes the current quad-core “Clovertown” Xeons. Intel also has plans for a single-chip, dual-core variant known as “Wolfdale.”

All Penryn derivatives will be manufactured via Intel’s 45nm high-k chip fabrication process, which the company has hailed as a breakthrough and a fundamental restructuring of the transistor. Despite the fanfare, the change brings gains that were once considered fairly conventional for process shrinks. Intel says the 45nm high-k process has twice the transistor density, a 20% increase in switching speed, and a 30% reduction in switching power versus its 65nm process. Improvements of that order are nothing to scoff at these days, nor is Intel’s manufacturing might. The firm already has two fabs making the 45nm conversion in the second half of 2007, Fab D1D in Oregon and Fab 32 in Arizona. Fab 28 in Israel will follow in the first half of next year, along with Fab 11X in New Mexico in the second half of ’08. 45nm processors should make up the majority of its output by then.

Harpertown Xeons and their Penryn-based cousins are not just die-shrunk versions of current chips, but they do retain the same basic layout. The quad-core parts are comprised of two dual-core chips situated together in a single LGA771-style package. This two-chip arrangement isn’t as neatly integrated as AMD’s “native quad-core” Opterons—the two chips can communicate with one another only by means of the relatively slow front-side bus—but it has the advantage of making chips easier to manufacture. The approximately 463 million transistors of AMD’s Barcelona are packed into an area that’s 283 mm² via AMD’s 65nm SOI fab process. That’s a relatively large area over which AMD must avoid defects. By contrast, current 65nm Xeons are based on two chips, each roughly 341 million transistors and measuring just 143 mm². Each chip in a Harpertown Xeon crams 410 million transistors into an even smaller 107 mm² area. One can argue that AMD’s approach to quad-core processors is more elegant, but it’s hard to argue with the Penryn family’s tiny die area.
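
To put those die figures in perspective, here is a back-of-the-envelope sketch that computes transistor density from the numbers quoted above. The dictionary and function names are made up for illustration, and the comparison is rough at best, since cache-heavy dies pack transistors more tightly than logic-heavy ones.

```python
# Transistor counts (millions) and die areas (mm^2) as quoted in the text.
DIES = {
    "Barcelona (65nm, single die)": (463, 283),
    "Clovertown (65nm, per die)": (341, 143),
    "Harpertown (45nm, per die)": (410, 107),
}


def density(transistors_millions, area_mm2):
    """Millions of transistors per square millimeter."""
    return transistors_millions / area_mm2


for name, (transistors, area) in DIES.items():
    print(f"{name}: {density(transistors, area):.2f} M transistors/mm^2")
```

Note that a quad-core Harpertown package carries two of those 107 mm² dies, so its total silicon area is still well under Barcelona's 283 mm², and each die presents a much smaller target for defects.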

A wafer of Harpertown 45nm Xeons

The small die belies big changes, though. The most obvious of those is a larger (6MB) and smarter (24-way set associative) L2 cache shared between the two cores on each chip. That adds up to 12MB of L2 cache per socket, for those who prefer to count that way. Harpertown Xeons can better feed that cache thanks to front-side bus speeds of up to 1.6GHz.

Penryn’s CPUs themselves may need the extra bandwidth, thanks to a handful of tweaks. One of the most prominent: a new, faster divider capable of handling both integer and floating-point numbers. This new radix-16-based design processes four bits per cycle, versus two bits in prior designs, and includes an optimized square root function. An early-out algorithm in the divider can lead to lower instruction latencies in some cases, as well. Penryn also extends the Core microarchitecture’s 128-bit single cycle SSE capabilities to shuffle operations, doubling execution throughput there. This is not a new instruction but an optimization for existing instructions, so no software changes are required to take advantage of this capability. The faster shuffle should be useful in formatting and setting up data for use in other SSE-based vector operations.
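
The jump from two quotient bits per cycle to four is easier to appreciate with a toy model. The sketch below is a plain software digit-recurrence divider, not Intel's circuit: the function name and parameters are hypothetical, the digit selection uses ordinary integer division where hardware would use lookup logic, and the early-out behavior is not modeled at all.

```python
def divide_digit_recurrence(dividend, divisor, width=32, bits_per_step=4):
    """Unsigned division that retires `bits_per_step` quotient bits per iteration.

    bits_per_step=4 models a radix-16 divider, bits_per_step=2 a radix-4 one.
    Assumes 0 <= dividend < 2**width. Returns (quotient, remainder, iterations).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if width % bits_per_step:
        raise ValueError("width must be a multiple of bits_per_step")
    mask = (1 << bits_per_step) - 1
    quotient = remainder = iterations = 0
    # Walk the dividend from its most significant digit down to its least.
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & mask)
        digit = remainder // divisor  # always fits in bits_per_step bits
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        iterations += 1
    return quotient, remainder, iterations


print(divide_digit_recurrence(1000, 7, bits_per_step=4))
print(divide_digit_recurrence(1000, 7, bits_per_step=2))
```

Both calls produce the same quotient and remainder, but the radix-16 version needs half as many iterations for a 32-bit operand (8 versus 16), which is the basic source of the latency savings.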

Speaking of SSE and new instructions, SSE4 is finally here in Penryn. These aren’t just the Supplemental SSE3 instructions supported in the first rev of the Core microarchitecture, but 47 all-new instructions aimed at video acceleration, basic graphics operations (including dot products), and the integration and control of coprocessors over PCIe. These instructions will, of course, require updated software support.

Harpertown Xeons pack some additional Penryn goodness, such as store forwarding and virtualization improvements, but they do not have the nifty “dynamic acceleration tech” intended for desktop Penryn derivatives. Those chips will have the ability to raise their clock speeds beyond their stock ratings, while staying within their appointed thermal envelopes, when one core is idle and the other is busy with a heavily single-threaded workload. Such trickery may be too fancy for the button-down world of servers and workstations, at least in its first-generation form.

Interestingly, Intel is toying with another, more permanent possibility for some future Xeon products: disabling one core on each of the two chips in a package in order to yield a dual-core solution that has 6MB of dedicated L2 cache per core. This move could allow a distinctive mix of single-threaded performance (as dictated by both cache sizes and clock speeds) within a given power envelope.

Speaking of which, the power envelopes for the new Xeons will remain essentially the same as the old ones. That means TDPs of 40, 65, and 80W for dual-core parts and 50, 80, and 120W for quad-cores. TDP ratings at a given clock speed should be down, I believe, although we don’t have all of the details yet. We do know that Intel plans to sell a 3.16GHz version of Harpertown that will fit into the top 120W envelope, and we know that our sample Harpertowns, to be sold as the Xeon E5472, run at 3GHz and fit into an 80W thermal envelope. Additional details on the lineup and pricing will have to wait for the Harpertown Xeons’ official launch date, which isn’t yet here. That will come on November 12.

Stoakley steps up
The product that is officially arriving today is Intel’s new dual-socket platform, code-named Stoakley. This platform combines something old, Intel’s current ESB2 I/O chip (or south bridge), with something new: a memory controller hub or north bridge chip code-named Seaburg. Seaburg supplants a pair of existing products, the server-oriented Blackford MCH and the workstation-class Greencreek MCH. Seaburg is manufactured on a newer process node than its predecessors, and its clock speed is up from 333 to 400MHz within a similar power envelope.

We’ve removed the air duct to expose the CPU coolers and DIMMs in our Stoakley test rig

Of course, the Stoakley platform’s main mission in life is to support the new 45nm Xeons. Like the Bensley platform before it, Stoakley has two front-side buses, one dedicated to each socket in the system. However, while Bensley’s front-side buses topped out at 1.33GHz, Stoakley’s FSBs can run at 1.6GHz. Memory bandwidth is up, too, since Seaburg supports FB-DIMM speeds of 800MHz for its four memory channels (though 667MHz remains an option). Stoakley’s memory controller gains more capacity for memory request reordering than Bensley’s, as well. All told, Intel cites a 25% higher sustainable memory throughput for the new platform.
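
For a sense of scale, the theoretical peaks are easy to work out. The sketch below assumes a 64-bit (8-byte) front-side bus and counts only the read bandwidth of the DDR2 memory behind each FB-DIMM channel, also at 8 bytes per transfer. The helper name is made up, and these are paper numbers rather than anything we measured.

```python
def peak_gb_per_s(mega_transfers_per_s, bytes_per_transfer=8, links=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * links / 1e9


# Two front-side buses per platform, one per socket.
bensley_fsb = peak_gb_per_s(1333, links=2)
stoakley_fsb = peak_gb_per_s(1600, links=2)

# Four FB-DIMM channels at 667 and 800 MT/s.
mem_667 = peak_gb_per_s(667, links=4)
mem_800 = peak_gb_per_s(800, links=4)

print(f"FSB:    {bensley_fsb:.1f} -> {stoakley_fsb:.1f} GB/s")
print(f"Memory: {mem_667:.1f} -> {mem_800:.1f} GB/s")
```

On paper, both the bus and memory peaks rise by about 20%. Intel's 25% figure refers to sustainable throughput, so the remainder presumably comes from efficiency gains such as the improved request reordering, though that is only our guess.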

In addition to the extra throughput, Stoakley can house twice as much memory as Bensley—up to 128GB—and will support FB-DIMM fail-over for high-reliability systems. Seaburg also doubles the number of PCIe lanes and upgrades those links to second-generation PCI Express.

One of the bigger challenges in designing the Seaburg north bridge was no doubt creating the snoop filter. This logic stores coherency information for all last-level caches on both of the chipset’s front-side buses, and it reduces FSB utilization by filtering out unnecessary coherency updates rather than passing them along from one FSB to the other. A system with dual Harpertown Xeons will have four last-level caches of 6MB each, and each cache will be 24-way associative. Accordingly, Seaburg’s snoop filter has four affinity groups, provides 24MB of coverage, and is 96-way associative. Seaburg also uses an improved victim selection algorithm.

In the previous generation, only the workstation-oriented Greencreek MCH had a snoop filter; the server-targeted Blackford MCH did not, because it could hamper performance in some cases. The improvements to Stoakley’s snoop filter have mitigated that performance penalty, and so Intel will offer only one product in this generation. Technically, Stoakley is billed primarily as a workstation platform, but expect it to find its way into servers, as well. With its increased throughput, Stoakley could prove particularly popular for HPC systems.

Test notes
You can see our test system configurations and the like in the section below. Most of it is self-explanatory, but I should mention at least this. You’ll notice that the Stoakley/Xeon 45nm system came with 16GB of RAM, while the rest of the systems had 8GB of RAM. I elected to retain the eight-DIMM, 16GB configuration for the majority of our tests, especially the power tests, since the rest of the test rigs had eight DIMMs each. The presence of additional RAM in the Stoakley box shouldn’t affect the outcome of the vast majority of our tests, since they all fit comfortably into 8GB. The one potential exception is SPECjbb2005, which can use quite a bit of memory, so I tested the Stoakley/Xeon E5472 system with 8GB of RAM in SPECjbb2005.

On another note, we were unfortunately unable to include results from our Folding@Home benchmark in this review, because the bootable Linux CD’s networking stack proved somehow incompatible with our Stoakley review system. We’ll have to test that later.

Also, you’ll see that we have an Opteron 2347 HE among the results, a new addition since our initial review of the quad-core Opterons. We’re curious to see how this CPU matches up against the Xeon L5335 in performance and power use.

Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Processors	Dual Xeon L5335 2.0GHz; Dual Xeon E5345 2.33GHz; Dual Xeon X5365 3.0GHz	Dual Xeon E5472 3.0GHz	Dual Opteron 2218 HE 2.6GHz; Dual Opteron 2220 2.8GHz	Dual Opteron 2347 1.9GHz; Dual Opteron 2350 2.0GHz; Dual Opteron 2360 SE 2.5GHz
System bus	1333MHz (333MHz quad-pumped)	1600MHz (400MHz quad-pumped)	1GHz HyperTransport	1GHz HyperTransport
Motherboard	SuperMicro X7DB8+	SuperMicro X7DWA	Tyan Tiger K8SSA (S3992)	SuperMicro H8DMU+
BIOS revision	8/13/2007	8/28/2007	5/29/2007	8/15/2007
North bridge	Intel 5000P MCH	Intel Seaburg MCH	ServerWorks BCM 5780	Nvidia nForce Pro 3600
South bridge	Intel 6321 ESB ICH	Intel 6321 ESB ICH	ServerWorks BCM 5785	Nvidia nForce Pro 3600
Chipset drivers	INF Update 8.3.0.1013	INF Update 8.5.0.1005	–	SMBus driver 4.57
Memory size	8GB (8 DIMMs)	16GB (8 DIMMs)	8GB (8 DIMMs)	8GB (8 DIMMs)
Memory type	1024MB DDR2-667 FB-DIMMs at 667MHz	2048MB DDR2-800 FB-DIMMs at 800MHz	1024MB ECC reg. DDR2-667 DIMMs at 667MHz	1024MB ECC reg. DDR2-667 DIMMs at 667MHz
CAS latency (CL)	5	5	5	5
RAS to CAS delay (tRCD)	5	5	5	5
RAS precharge (tRP)	5	5	5	5
Storage controller	Intel 6321 ESB ICH with Intel Matrix Storage Manager 7.6	Intel 6321 ESB ICH with Intel Matrix Storage Manager 7.6	Broadcom RAIDCore with 1.1.7057.1 drivers	Nvidia nForce Pro 3600 with 6.87 drivers
Hard drive	WD Caviar WD1600YD 160GB
Graphics	Integrated ATI ES1000 with 6.14.10.6553 drivers
OS	Windows Server 2003 R2 Enterprise x64 Edition with Service Pack 2
Power supply	Ablecom PWS-702A-1R 700W

We used the following versions of our test applications:

SiSoft Sandra XI.SP4a 64-bit
CPU-Z 1.40
SPECjbb2005 with Sun Java 6 Update 2 Windows x64 edition
Valve VRAD map build benchmark
Cinebench R10 64-bit Edition
POV-Ray for Windows 3.7 beta 22 64-bit
CASE Lab Euler3d CFD benchmark multithreaded edition
MyriMatch proteomics benchmark
picCOLOR 4.0 build 598 64-bit
The Panorama Factory 4.5 x64 Edition
Windows Media Encoder 9 x64 Edition

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

We start with some synthetic tests of the cache and memory subsystem, and the first one shows us that the 45nm Xeon E5472 pretty much matches the Xeon X5365 in L1 and L2 cache bandwidth. The only big difference is at the 16MB block size, where the E5472’s larger 6MB L2 cache helps out some. Both of these chips run at 3GHz, so they’re a clock-for-clock match. We’ll want to watch these two to see how much, if any, the Harpertown Xeon E5472s improve per-clock performance.

Let’s take a closer look at the tail end of these results, where we’re primarily accessing main memory. I believe these results show memory bandwidth available to a single CPU core, not total system bandwidth, but they’re still enlightening.

The Stoakley platform’s faster bus and higher memory frequencies add up to a nice boost in bandwidth over the older Xeons on the Bensley platform. Again, I don’t think we’re seeing absolute peak bandwidth, especially from the Xeons, but we can see a relative boost in throughput.

Memory access latencies are essentially unchanged from the older Xeons to the newer. Let’s look at this issue in a little more detail. In the graphs below, yellow represents L1 cache, light orange is L2 cache, red is L3 cache, and dark orange is main memory.

As one might expect, the Xeon E5472’s memory access latencies are lower at larger block sizes, like 16MB and 32MB, than the X5365’s. The faster bus and memory clocks likely deserve credit for that. More impressively, we measured the E5472’s 6MB L2 cache at 15 cycles of latency, just one cycle more than the 4MB L2 cache on the Xeon X5365 at the same clock frequency. That’s quite the contrast to the high latencies we found in the quad-core Opterons’ new L3 cache.

SPECjbb2005
SPECjbb2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

SPECjbb2005 can be configured to run in many different ways, with different performance outcomes, depending on the tuning of the JVM, thread allocations, and all sorts of other things. I had no intention of producing a record score myself; I just wanted to test relative performance on equal footing. Much higher performance is available using alternative JVMs and the like, and we may explore those options in the future. For now, we’ll leave peak scores to the guys who spend their days optimizing for a single benchmark.

I used the Sun JVM for Windows x64, and I found that using two instances of the JVM produced the best scores on the Opteron-based systems. Scores with one or two instances were about the same on the Xeons, so I settled on two instances for my testing, with the following Java options:

-Xms2048m -Xmx4096m -XX:+AggressiveOpts

Those settings produced the following results:

The Xeon E5472 delivers a clock-for-clock performance increase of roughly 10% over the Xeon X5365 in this test, enough to vault it ahead of another not-yet-released product, the 2.5GHz Opteron 2360 SE, and into the top spot.

Valve VRAD map compilation
This next test processes a map from Half-Life 2 using Valve Software’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into games like Half-Life 2. This isn’t a real-time process, and it doesn’t reflect the performance one would experience while playing a game. Instead, it shows how multiple CPU cores can speed up game development.

I’ve included a quick Task Manager snapshot from the test below, and I’ll continue that on the following pages. That’s there simply to show how well the application makes use of eight CPU cores, when present. As you’ll see, some apps max out at four threads.

The new Xeon E5472s shave five seconds off of the X5365s’ time, impressively enough. This isn’t quite the ~10% gain we saw above, but it’s not bad, either. Notably, even the Opteron 2360 SEs are nearly half a minute slower than the E5472s.

Cinebench
Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores are available.

The theme of clock-for-clock performance gains continues in Cinebench, where the 45nm Xeons’ faster divider and SSE shuffle capabilities may be coming into play. The E5472s are only slightly faster than the X5365s with only a single thread in use, but the new Xeons scale better up to eight threads than the older models. Again, Intel is putting more distance between its top chip and AMD’s future Opteron 2360 SE.

POV-Ray rendering
We caved in and moved to the beta version of POV-Ray 3.7 that includes native multithreading. The latest beta 64-bit executable is still quite a bit slower than the 3.6 release, but it should give us a decent look at comparative performance, regardless.

The per-clock performance gains come to a halt in POV-Ray, where the E5472s essentially match the X5365s. That still puts them in a tie for first place, though.

By the way, this beta version of POV-Ray seems to have a problem with single-threaded tasks bouncing around from one CPU core to the next, and this causes especially acute problems on NUMA systems. Since the vast majority of the computation time for the benchmark scene involves such single-threaded work, things turn out badly for the Opteron 2300s.

MyriMatch
Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He recently offered to provide us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
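
The job-queue scheme described above can be sketched in a few lines. This is not MyriMatch's code: the function names are invented for illustration, `handle` stands in for whatever work is done per protein, and CPython's global interpreter lock means these threads show the scheduling pattern rather than real multi-core speedups.

```python
import queue
import threading


def split_into_jobs(items, num_threads, jobs_per_thread=10):
    """Split items into roughly num_threads * jobs_per_thread contiguous jobs."""
    num_jobs = max(1, min(len(items), num_threads * jobs_per_thread))
    base, extra = divmod(len(items), num_jobs)
    jobs, start = [], 0
    for i in range(num_jobs):
        size = base + (1 if i < extra else 0)  # spread the leftovers evenly
        jobs.append(items[start:start + size])
        start += size
    return jobs


def run_jobs(items, num_threads, handle):
    """Apply handle() to every item, with threads pulling jobs from a shared queue."""
    jobs = queue.Queue()
    for job in split_into_jobs(items, num_threads):
        jobs.put(job)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = jobs.get_nowait()  # grab the next job, if any remain
            except queue.Empty:
                return
            partial = [handle(item) for item in job]
            with results_lock:  # synchronize once per job, not once per item
                results.extend(partial)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


proteins = list(range(6714))  # stand-ins for the 6714 yeast protein sequences
jobs = split_into_jobs(proteins, num_threads=4)
print(len(jobs), max(len(job) for job in jobs))
```

With four threads, the 6714-entry database splits into 40 jobs of at most 168 proteins each, matching the figure quoted above. Handing out many small jobs rather than one big slice per thread keeps a fast thread from sitting idle while a slow one finishes.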

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

One of the most striking things about these results is the fact that performance on the eight-core systems seems to top out at about four to six threads and drop off from there. I asked MyriMatch’s authors about this dynamic a few months ago, and here’s how they explained it:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution. Of course, machines with insufficient memory to store both spectra and sequence database at once suffer a tremendous performance penalty, but the benchmark employs a small database with a small spectral set to avoid this problem.

As they note, memory bandwidth may become a bottleneck with this application. And right on cue, the new Xeons on the Stoakley platform produce a substantial performance gain over the Xeon X5365s. The performance boost is enough for Intel to recapture the overall lead from the Opteron 2360 SEs.

STARS Euler3d computational fluid dynamics
Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here. (I believe the score you see there at almost 3Hz comes from our eight-core Clovertown test system.)

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with increasing numbers of threads.

The Xeon E5472s chalk up another victory, and they set a new record for Euler3D throughput in the process. The performance gains over the X5365s are present from one to eight threads, but they’re most pronounced at six and eight threads, where bus and memory bandwidth limitations are most likely to become a factor. In fact, the E5472s are faster at six threads than the X5365s are at eight.

The Panorama Factory
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.

The Xeon E5472s continue to post solid performance gains in this image processing application, finishing the panorama generation process nearly two seconds quicker than the Xeon X5365s. Looking at the results from the individual operations in this process, we can see small gains from the E5472s at nearly every stage. Proportionally, some of the biggest gains come in the stitch and render operations.

picCOLOR
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded, and in this latest revision, five of those eight functions use four threads.

Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.

The new Xeons post strong per-clock performance gains in some of picCOLOR’s functions, especially in the Fourier (FFT/PWR) one, where the E5472s post a score of 17.71 versus the X5365’s 11.62. I asked Dr. Müller about this function, and he said: “The FFT/PWR function calculates the Fourier transform of the image, then displays the power spectrum, and then reconstructs the original image by inverse Fourier transform.” That makes this function a good candidate for taking advantage of Penryn’s tweaks. In fact, the inner kernel of the FFT algorithm uses a bit shuffle function, and the power part of the function includes “a few MULs, one ADD, and one SQRT.” So we should be seeing both Penryn’s fast SSE shuffle and its optimized square root logic in action.
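
To make the "few MULs, one ADD, and one SQRT" remark concrete, here is a minimal sketch of a power spectrum computation. It is not picCOLOR's code. It uses a naive one-dimensional discrete Fourier transform for brevity, where a real image tool would use a fast 2D FFT, and the function names are our own.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform; an FFT yields the same values faster."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def power_spectrum(samples):
    """Magnitude of each frequency bin: two multiplies, one add, one square root."""
    return [math.sqrt(c.real * c.real + c.imag * c.imag) for c in dft(samples)]


print(power_spectrum([1.0, 0.0, 0.0, 0.0]))
```

The square root runs once per output bin, so an image-sized transform calls it an enormous number of times. That is why a quicker square root unit could plausibly show up in a benchmark score like this one.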

Windows Media Encoder x64 Edition
Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.

The E5472s are at it again, finishing the encoding task 20 seconds before their like-clocked predecessors.

SiSoft Sandra Mandelbrot
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

We’re using the 64-bit version of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations in parallel.
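
The interlacing scheme SiSoft describes can be sketched as follows. This is only one reading of the FAQ text: it uses a fixed interleave in which thread k takes columns k, k + n, k + 2n, and so on, a much smaller image than Sandra's 640×480 so it runs quickly, and plain floating-point math with no SSE2. All of the names are ours.

```python
import threading

WIDTH, HEIGHT, MAX_ITER = 64, 48, 255  # small image so the sketch runs quickly


def iterations(cx, cy, max_iter=MAX_ITER):
    """Number of iterations before the point escapes, capped at max_iter."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


def render_column(x):
    """Iteration counts for every pixel in column x."""
    cx = -2.0 + 3.0 * x / WIDTH
    return [iterations(cx, -1.0 + 2.0 * y / HEIGHT) for y in range(HEIGHT)]


def render(num_threads):
    """Render the image with columns interleaved across num_threads threads."""
    image = [None] * WIDTH

    def worker(first_column):
        # Each thread writes only its own columns, so no locking is needed.
        for x in range(first_column, WIDTH, num_threads):
            image[x] = render_column(x)

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return image


print(render(1) == render(4))
```

Interleaving columns spreads the expensive regions of the fractal across all threads. Giving each thread one contiguous slab would leave some threads with far more deep-iteration pixels than others.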

The E5472s’ performance gains here aren’t quite what we’ve seen elsewhere, but it hardly matters. Nothing can touch the 3GHz quad-core Xeons.

POV-Ray power consumption and efficiency
Now that we’ve had a look at performance in various applications, let’s bring power efficiency into the picture. Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we asked POV-Ray to render our “chess2.pov” scene at 1024×768 resolution with antialiasing set to 0.3.

Before testing, we enabled the CPU power management features for Opterons and Xeons—PowerNow! and Demand Based Switching, respectively—via Windows Server’s “Server Balanced Processor Power and Performance” power scheme.

Incidentally, the 5300-series Xeons I’ve used here are newer G-step models that promise lower power use at idle than older ones. I used a beta BIOS for our SuperMicro X7DB8+ motherboard that supports the enhanced idle power management capabilities of G-step chips. Unfortunately, I’m unsure whether we’re seeing the full impact of those enhancements. Intel informs me that only newer revisions of its 5000-series chipset support G-step processors fully in this regard. Although this is a relatively new motherboard, I’m not certain it has the correct chipset revision.

Of course, our Stoakley platform should support the further reductions in idle power offered by the Xeon E5472s.

Anyhow, here are the results:

Without any extra help, you can easily see that the new Xeons bring big reductions in power use over the X5365s. We can slice up the data in various ways in order to better understand them, though. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

The Stoakley platform draws about the same at idle as Bensley does when coupled with low-power Xeons. The E5472s on Stoakley draw 20W less at idle than their 3GHz counterparts on the Bensley platform, but that’s still quite a bit more power draw at idle than any of the Opterons.

Next, we can look at peak power draw by taking an average from the ten-second span from 30 to 40 seconds into our test period, during which the processors were rendering.
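For readers who want to try this on their own meter logs, here is a minimal sketch of the window averaging described above. It assumes the log has been exported as (seconds, watts) pairs at one sample per second. The function name, the sample data, and the 55-to-60-second idle window are all hypothetical, not taken from our actual tooling or results.

```python
from statistics import mean

def mean_power(samples, start_s, end_s):
    """Average the wattage of all samples with start_s <= t < end_s."""
    window = [watts for t, watts in samples if start_s <= t < end_s]
    if not window:
        raise ValueError("no samples fall inside the requested window")
    return mean(window)

# Hypothetical log: (seconds since start, watts), one sample per second.
# The system renders for the first 50 seconds, then sits idle.
log = [(t, 310.0 if t < 50 else 200.0) for t in range(60)]

peak = mean_power(log, 30, 40)  # the ten-second span during the render
idle = mean_power(log, 55, 60)  # trailing edge, after the render completes
```

The same helper covers both the idle and peak figures; only the window changes.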

The Stoakley/Harpertown pairing brings a drastic drop in power draw versus the Xeon X5365s on Bensley. In fact, the Stoakley/Harpertown combo at 3GHz draws less power than the Bensley/Clovertown pairing at 2.33GHz. Notably, the Xeon E5472 system also consumes less power than the Opteron 2360 SE-based one.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
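Since a watt-second is just power multiplied by time, total energy is the area under the logged power curve. The sketch below approximates that area with the trapezoidal rule. The function name and the four-sample log are made up for illustration, and the trapezoidal rule is one reasonable choice rather than a description of how our numbers were actually produced.

```python
def energy_joules(samples):
    """Integrate power over time with the trapezoidal rule.

    samples: list of (seconds, watts) pairs sorted by time.
    Returns energy in watt-seconds (joules).
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of one trapezoid: average power times elapsed time.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

# Hypothetical four-sample log: ramp up, hold, ramp down.
log = [(0, 200.0), (1, 300.0), (2, 300.0), (3, 200.0)]
total = energy_joules(log)  # 250 + 300 + 250 = 800.0 joules
```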

When you slice things this way, the Opterons tend to excel, led by the low-power Opteron 2347 HE. However, the Stoakley/Harpertown system isn’t far behind, and it edges out the low-power Xeon L5335.

We can quantify efficiency even better by considering the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve chosen to identify the end of the render as the point where power use begins to drop from its steady peak. We’ve sometimes seen disk paging going on after that, but we don’t want to include that more variable activity in our render period.
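One way to automate that cutoff is sketched below, under assumptions of our own choosing: the render is considered over once power falls more than 5% below the highest reading and stays there for three consecutive samples. Neither threshold comes from our test procedure, and the function and data are hypothetical.

```python
def render_end(samples, tolerance=0.05, min_below=3):
    """Return the time of the last sample at the steady peak.

    samples: list of (seconds, watts) pairs sorted by time.
    tolerance and min_below are assumed values, not measured ones.
    """
    peak = max(watts for _, watts in samples)
    floor = peak * (1.0 - tolerance)
    last_high = None
    below = 0
    for t, watts in samples:
        if watts >= floor:
            last_high = t
            below = 0
        elif last_high is not None:
            # Count consecutive readings below the peak band.
            below += 1
            if below >= min_below:
                break  # sustained drop: ignore later paging activity
    return last_high

# Hypothetical log: idle, render at 310W, brief disk paging, idle again.
log = (
    [(t, 200.0) for t in range(0, 5)]
    + [(t, 310.0) for t in range(5, 20)]
    + [(t, 250.0) for t in range(20, 23)]
    + [(t, 200.0) for t in range(23, 30)]
)
end = render_end(log)  # 19, the last second at the steady peak
```

Taking the single highest reading as the peak is sensitive to momentary spikes, so a real log might call for a smoothed peak estimate instead.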

We’ve computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.

In what may be our best measure of energy-efficient performance, the Xeon E5472/Stoakley system distances itself from the pack. Even AMD’s impressive new quad-core Opterons, our previous champs, are well behind it.

Power use at partial utilization with SPECjbb2005
Before we close out our look at power efficiency, I’d like to consider another example. I’ve measured power use in SPECjbb2005 in order to show how it scales with incremental increases in load. I’ve used only a single instance of the JVM so that we can see a nice, gradual step up in load; two instances would take us to peak utilization much more quickly.

We’ve graphed the quad-core Opterons and Xeons together. Since the dual-core Opterons take much longer to finish, they get their own graph.

The E5472s look great here, as well, starting at idle power levels similar to the Xeon L5335 and peaking out right alongside the 2.33GHz Xeon E5345.

Conclusions
The combination of Intel’s 45nm Harpertown Xeons and their supporting Stoakley platform brings incremental but compelling gains in performance over current Xeons on the Bensley platform. Clock for clock, the new Xeons delivered performance gains in the majority of our tests. Those gains were especially notable in SPECjbb2005, where we saw about a 10% increase, and in memory bandwidth-limited applications like MyriMatch and Euler3D’s CFD solver, where the advances were even greater.

This higher clock-for-clock performance comes alongside a considerable drop in peak power use at 3GHz (from 403W for the Xeon X5365 system to 311W for the Xeon E5472 system) and a smaller but welcome drop in power draw at idle. The faster performance and lower power consumption together make the Stoakley/Harpertown combo an excellent “performance per watt” proposition, as our measure of energy required to render a scene demonstrated. In fact, no other solution was close in this respect. The new Xeons’ weakness on the efficiency front remains power draw at idle, a problem largely attributable to Intel’s continued use of FB-DIMM memory. For this reason, AMD’s quad-core Opterons remain competitive in terms of overall power efficiency.

Those new Opterons will certainly have their hands full with Intel’s 45nm Xeons, though. The Xeon E5472 extends Intel’s performance lead over the fastest quad-core Opteron we’ve seen yet, the 2.5GHz model 2360 SE. Of course, neither chip is available to the public as a product just yet, though both are promised for the fourth quarter of this year. Right now, if both companies make good on their plans, it looks like Intel will continue to lead in the server and workstation markets. The same may be true in other markets served by these same basic CPU designs, but only time will tell for sure.

Scott Wasson
