
AMD’s ‘Istanbul’ six-core Opteron processors

Scott Wasson

The recent advent of Intel’s “Nehalem” Xeons had a bit of an apocalyptic feeling to it, when one considered the implications for AMD. Despite strong showings from the past few generations of Xeons and some unfortunate problems for the first quad-core Opterons, Intel never really seemed to open up an insurmountable lead in the two-socket server and workstation spaces. The Opteron’s power efficiency was consistently strong, at least, and its outright performance wasn’t too far behind the curve. The Nehalem-based Xeons, though, reached dizzying new performance heights with comparatively modest power consumption. One was left to wonder how on earth AMD would respond.

Now we have an answer, and it’s an interesting one, to say the least. The newest Opteron, code-named Istanbul, packs not four but six cores on a single die, giving it a considerable boost in performance potential. Not only that, but it’s hitting the market early. AMD had originally planned to introduce this product in the October time frame, but the first spin of Istanbul silicon came back solid, so the firm pulled the launch forward into June. Even with the accelerated schedule, of course, Istanbul comes not a moment too soon, now that Nehalem Xeons are out in the wild. We’ve had a pair of Istanbul chips humming away in our labs for the past week. Let’s have a look at whether they can restore the Opteron’s competitiveness.

The hexapod cometh
In the wake of Intel’s introduction of a radically new platform, AMD is emphasizing buttoned-down continuity for its new Opterons. In fact, this continuity may be Istanbul’s defining feature. Istanbul is essentially a quad-core “Shanghai” processor with two additional cores added to the die. Istanbul is compatible with the existing Socket F infrastructure, so it’s an easy drop-in upgrade for existing servers. So long as your Socket F motherboard supports dual power planes, all that’s required for an Istanbul upgrade is a quick BIOS flash and a chip swap. (In fact, that’s exactly how we prepared our test system for this review.) To fit the six-core chips into existing power envelopes, AMD has dialed back clock frequencies slightly, which is why the company cites a general performance boost of around 30% when going from a Shanghai Opteron to an Istanbul—depending, of course, on the workload.

Although AMD expresses hope that in-place server upgrades will become a healthy portion of its business in a down economy, the more likely payoff for Istanbul is with AMD’s largest customers: system vendors, who ought to be able to refresh their Opteron-based product lineups with relatively minimal validation effort. In fact, starting today, I’d expect to see quite a few vendors unveiling Istanbul-based systems in the coming weeks, even though they’ve just introduced new Xeon-based offerings, as well.


Istanbul looks like Shanghai plus two cores. Source: AMD.

Despite all of this sleepy talk about continuity, Istanbul does have a few new tricks up its sleeve. For one thing, the north bridge and HyperTransport clocks in Istanbul are decoupled, so higher HyperTransport frequencies are possible. The Opterons introduced today all have a HyperTransport clock of 2.4GHz, resulting in a 4.8 GT/s transaction rate. The north bridge clock, which also governs the speed of the L3 cache, runs at 2.2GHz.

The most notable change, though, is probably the addition of a feature AMD calls HT Assist. HT Assist is essentially a probe filter intended to reduce the overhead required for the synchronization of cached data across CPUs in multiple sockets. HT Assist reserves space in each processor’s L3 cache, in which it stores an index of where that CPU’s cache lines are being used system-wide. The CPU then becomes “host” of the cache lines stored in its directory. If any CPU needs an update about a particular cache line, it will often know which CPU is the correct host to probe for that information. AMD says HT Assist can replace broadcast probe requests (sent to all sockets) with directed requests in 8 of 11 typical CPU-to-CPU transactions. This reduction in probe traffic can yield big gains in available system bandwidth, as we reported when we saw AMD demo a 4P system whose Stream bandwidth increased from roughly 25GB/s to 42GB/s with the addition of Istanbul processors with HT Assist.
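The directory mechanic behind HT Assist can be sketched in a few lines of Python. (The class and method names here are ours, purely for illustration; the real directory lives in reserved L3 space and tracks cache lines in hardware.)

```python
# Toy sketch of a probe-filter directory. On a directory hit, the host CPU
# can send a directed probe to just the sockets known to hold the line;
# otherwise it falls back to broadcasting to every socket.
class ProbeFilter:
    def __init__(self):
        self.directory = {}   # cache line address -> set of sockets caching it

    def record(self, line, socket):
        # Note that a socket has pulled in a copy of this cache line.
        self.directory.setdefault(line, set()).add(socket)

    def probe(self, line, all_sockets):
        holders = self.directory.get(line)
        return sorted(holders) if holders else sorted(all_sockets)

pf = ProbeFilter()
pf.record(0x1000, 2)
print(pf.probe(0x1000, {0, 1, 2, 3}))  # directed probe: [2]
print(pf.probe(0x2000, {0, 1, 2, 3}))  # broadcast: [0, 1, 2, 3]
```

The payoff is exactly what AMD describes: most coherency traffic becomes point-to-point rather than a broadcast to every socket.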

Back then, AMD talked of user-configurable HT Assist index sizes that could be set in the BIOS. Since that time, the firm has instead settled on a static index size of 1MB, which it considers the optimal tradeoff between cache size and index granularity. To keep things simple, including Istanbul validation for system vendors, the index size will not be user-configurable. AMD has also decided not to enable HT Assist by default on 2P systems, because the reduction in probe traffic on a 2P box isn’t worth the loss of 1MB of L3 cache per processor. For what it’s worth, our 2P SuperMicro H8DMU+ motherboard does expose a BIOS option to enable this feature, and we found that enabling it produced no appreciable increase in Stream bandwidth.


The Istanbul Opteron die. Source: AMD.

Like Shanghai before it, Istanbul is produced by GlobalFoundries on its 45nm SOI fabrication process. Istanbul weighs in at 904 million transistors, and its six-core die is 346 mm². Compare that to Shanghai, which is 758 million transistors and 258 mm². Istanbul isn’t 50% larger by either count, although its core count is up from four to six, because a 6MB L3 cache occupies a large portion of both chips. Intel’s Nehalem Xeons, of course, are also 45nm chips, and have dimensions very similar to Shanghai, with roughly 751 million transistors in a 263 mm² die. In other words, even if AMD does match Nehalem with Istanbul, it will be doing so with a considerably larger chip.

The comparison to Nehalem is instructive for many reasons, not least of which are the very different approaches AMD and Intel have taken with their latest CPU architectures. From a certain way of looking at things, they reach similar destinations by different paths. Istanbul, of course, has six execution cores, each of which can issue three instructions per clock. Nehalem has four cores, but they are true four-issue cores, capable of issuing, executing, and retiring four instructions per clock. Chip wide, then, Istanbul can issue 18 instructions per clock, while Nehalem can issue 16—closer than one might think, when just considering core counts. Also, thanks to simultaneous multithreading, Nehalem can track eight hardware threads, to Istanbul’s six, for greater thread-level parallelism. Perhaps most decisive for many of today’s workloads is the fact that Nehalem has three channels of DDR3 memory per socket, versus Istanbul’s two channels of DDR2. Despite its larger die size and higher core count, Istanbul isn’t necessarily far and away superior to Nehalem, even in theory.

That’s the match-up in the 2P space, but 4P and larger servers may be more hospitable ground for the time being. The Xeon 7400-series processors, better known as Dunnington, have six cores but are based on Intel’s older microarchitecture. AMD expects Istanbul to give it a clear lead in this space, at least until Nehalem-EX arrives later this year with eight native cores and four memory channels per socket.

Pricing and availability
Istanbul Opterons will populate the new Opteron 2400 and 8400 series lineups, and their introduction brings with it some price reductions on existing Shanghai Opterons.

Model         Cores  Clock speed  North bridge/L3 cache speed  HyperTransport speed  ACP  Price
Opteron 2435  6      2.6GHz       2.2GHz                       2.4GHz                75W  $989
Opteron 2431  6      2.4GHz       2.2GHz                       2.4GHz                75W  $698
Opteron 2427  6      2.2GHz       2.2GHz                       2.4GHz                75W  $455
Opteron 2389  4      2.9GHz       2.2GHz                       2.2GHz                75W  $698
Opteron 2387  4      2.8GHz       2.2GHz                       2.2GHz                75W  $523
Opteron 2384  4      2.7GHz       2.2GHz                       2.2GHz                75W  $523
Opteron 2382  4      2.6GHz       2.2GHz                       2.2GHz                75W  $316
Opteron 2380  4      2.5GHz       2.0GHz                       2.0GHz                75W  $316
Opteron 2378  4      2.4GHz       2.0GHz                       2.0GHz                75W  $174
Opteron 2376  4      2.3GHz       2.0GHz                       2.0GHz                75W  $174

The three 2P versions of Istanbul run at 2.2, 2.4, and 2.6GHz, and all fit into AMD’s mainstream 75W ACP power envelope. AMD is quick to point out that its entire product lineup shares the same basic feature set—including cache sizes, memory speeds, and virtualization support—in contrast to the breathtaking variety of the Xeon 5500 series, which can be rather daunting to keep sorted.

One can see here how AMD intends for the quad- and six-core Opterons to coexist. The top Shanghai model, the 2389 at 2.9GHz, drops from $989 to $698 to make room for the 2.6GHz Istanbul. The other Shanghais tumble in reaction. At that same $698 mark is the Opteron 2431, a 2.4GHz Istanbul. So the customer is faced with a fairly straightforward choice between four cores at 2.9GHz or six cores at 2.4GHz for the same price. The 4P-and-greater Opteron 8000 series presents the same choice, with higher stakes.

Model         Cores  Clock speed  North bridge/L3 cache speed  HyperTransport speed  ACP  Price
Opteron 8435  6      2.6GHz       2.2GHz                       2.4GHz                75W  $2,649
Opteron 8431  6      2.4GHz       2.2GHz                       2.4GHz                75W  $2,149
Opteron 8389  4      2.9GHz       2.2GHz                       2.2GHz                75W  $2,149
Opteron 8387  4      2.8GHz       2.2GHz                       2.2GHz                75W  $1,865
Opteron 8384  4      2.7GHz       2.2GHz                       2.2GHz                75W  $1,514
Opteron 8382  4      2.6GHz       2.2GHz                       2.2GHz                75W  $1,165
Opteron 8380  4      2.5GHz       2.0GHz                       2.0GHz                75W  $989
Opteron 8378  4      2.4GHz       2.0GHz                       2.0GHz                75W  $873

The first wave of Istanbuls all occupy standard power envelopes, but the six-core chips will proliferate to the other Opteron power grades this summer. We expect to see an SE model (105W ACP) at 2.8GHz, an HE (55W ACP) at 2GHz, and an EE (40W ACP) at 1.9GHz.

We have in our labs a pair of Opteron 2435 processors, and we’ve selected as their most direct competition a pair of Xeon X5550s. These Nehalem CPUs have a core clock of 2.66GHz, a 6.4 GT/s QPI link, support DDR3 1333MHz memory, and list for $958.

The X5550 has a 95W TDP rating, but there is some dispute over whether AMD’s ACP and Intel’s TDP are truly comparable. AMD has its own TDP numbers for its processors—SE chips are 137W, standard ones are 115W, HE models are 79W, and EE models are 60W—but it claims those numbers are more of an absolute peak than Intel’s. Hence the development of its ACP metric. We’ll measure power ourselves shortly, so I wouldn’t get too hung up on that issue.

After this, there’s that and the other
AMD has already outlined its plans for the next little while, including the introduction of the Socket F-compatible Fiorano platform later this year, an all-AMD effort that will bring PCIe Gen2 and HyperTransport 3 support (for the chipset link, not just CPU-to-CPU links like now), along with hardware support for I/O virtualization. After that, in early 2010, will come the bifurcation of Opteron socket types into two classes, the higher-end G34 with four memory channels and the mid-range C32 socket with dual-channel memory. These new sockets will enable some features already present in 45nm Opteron silicon, including DDR3 memory support and a fourth HyperTransport link. The two socket types will overlap in the 2P space, while only the G34 will serve 4P and beyond.


Source: AMD.

For sheer power, the more interesting of the two is the G34 socket, which will play host to Magny-Cours, a 12-core monster that essentially consists of two Istanbul dies in a single package, with an in-package HyperTransport interconnect between them. AMD’s Mike Goddard told us this on-package HT connection isn’t anything special, just a pair of HT links (one x16 and one x8) running at regular frequencies. However, without the need to traverse a longer distance over a motherboard, Goddard said AMD should be able to tune the synchronizers on the HT links to achieve much lower latencies than a socket-to-socket connection.

Beyond that, mapping out the multi-chip-per-package future of the Opteron becomes rather tricky. Magny-Cours, for instance, will be fully connected on a per-chip basis, not just per socket, in a 2P system. The routing on a 4P system becomes very daunting, very quickly, but the bottom line is that it’s fully connected per socket, not per die, with no more than two hops required in any scenario. Goddard said it was “a science experiment” getting that 4P routing topology done.


Source: AMD.

After a refresh with 32nm processors based on the next-generation “Bulldozer” microarchitecture on the G34 and C32 platforms in 2011, AMD plans to introduce a new platform again in 2012. Details about this one are sketchy, but Goddard told us that platform would include on-die PCI Express connectivity. I expect we’ll learn more about that as the time approaches.

Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Dual Xeon E5450 3.0GHz (SuperMicro X7DB8+)
System bus: 1333 MT/s (333MHz)
BIOS revision: 6/23/2008
North bridge: Intel 5000P MCH
South bridge: Intel 6321 ESB ICH
Chipset drivers: INF Update 9.0.0.1008
Memory: 16GB (8× 2048MB DDR2-800 FB-DIMMs) at 667MHz effective; CL 5, tRCD 5, tRP 5
Storage controller: Intel 6321 ESB ICH with Matrix Storage Manager 8.6
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Xeon X5492 3.4GHz (SuperMicro X7DWA)
System bus: 1600 MT/s (400MHz)
BIOS revision: 8/04/2008
North bridge: Intel 5400 MCH
South bridge: Intel 6321 ESB ICH
Chipset drivers: INF Update 9.0.0.1008
Memory: 16GB (8× 2048MB DDR2-800 FB-DIMMs) at 800MHz effective; CL 5, tRCD 5, tRP 5
Storage controller: Intel 6321 ESB ICH with Matrix Storage Manager 8.6
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Xeon L5430 2.66GHz (Asus RS160-E5)
System bus: 1333 MT/s (333MHz)
BIOS revision: 8/08/2008
North bridge: Intel 5100 MCH
South bridge: Intel ICH9R
Chipset drivers: INF Update 9.0.0.1008
Memory: 6GB (6× 1024MB registered ECC DDR2-667 DIMMs) at 667MHz effective; CL 5, tRCD 5, tRP 5
Storage controller: Intel ICH9R with Matrix Storage Manager 8.6
Power supply: FSP Group FSP460-701UG 460W
Graphics: Integrated XGI Volari Z9s with 1.09.10_ASUS drivers

Dual Xeon X5550 2.66GHz / Dual Xeon W5580 3.2GHz (SuperMicro X8DA3)
System bus: QPI 6.4 GT/s (3.2GHz)
BIOS revision: 2/20/2009
North bridge: Intel 5520 MCH
South bridge: Intel ICH10R
Chipset drivers: INF Update 8.9.0.1006
Memory: 24GB (6× 4096MB registered ECC DDR3-1333 DIMMs) at 1333MHz effective; CL 10, tRCD 9, tRP 9
Storage controller: Intel ICH10R with Matrix Storage Manager 8.6
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Nvidia GeForce 8400 GS with ForceWare 182.08 drivers

Dual Opteron 2347 HE 1.9GHz / Dual Opteron 2356 2.3GHz (SuperMicro H8DMU+)
System bus: HT 2.0 GT/s (1.0GHz)
BIOS revision: 3/25/08 (2347 HE), 10/15/08 (2356)
North/south bridge: Nvidia nForce Pro 3600
Memory: 16GB (8× 2048MB registered ECC DDR2-800 DIMMs) at 667MHz effective; CL 5, tRCD 5, tRP 5
Storage controller: Nvidia nForce Pro 3600
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Opteron 2384 2.7GHz / Dual Opteron 2389 2.9GHz / Dual Opteron 2435 2.6GHz (SuperMicro H8DMU+)
System bus: HT 4.4 GT/s (2.2GHz) for the 2384 and 2389, HT 4.8 GT/s (2.4GHz) for the 2435
BIOS revision: 10/15/08 (2384/2389), 05/18/09 (2435)
North/south bridge: Nvidia nForce Pro 3600
Memory: 16GB (8× 2048MB registered ECC DDR2-800 DIMMs) at 800MHz effective; CL 6, tRCD 5, tRP 5
Storage controller: LSI Logic Embedded MegaRAID with 8.9.518.2007 drivers
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers

All systems used a WD Caviar WD1600YD 160GB hard drive and ran Windows Server 2008 Enterprise x64 Edition with Service Pack 1.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

This bandwidth test gives a nice visual for the different levels of the cache and memory hierarchy. Because AMD’s lower-level caches don’t replicate all of the contents of the higher-level caches, Istanbul’s two additional 512KB L2 caches (associated with its two added cores) increase its total effective cache size—and bandwidth—compared to Shanghai.

One new addition we’ve made for this review is a proper Stream bandwidth test. This version of Stream is multithreaded and can be told how many threads to create. We’ve chosen the optimal number for each system. As you can see, the Nehalem Xeons have a clear lead in available bandwidth thanks primarily to their three channels of DDR3 1333MHz memory. With no real changes to the memory subsystem, Istanbul achieves no more throughput than Shanghai.
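For reference, Stream’s kernels are simple vector operations; its “triad” kernel looks roughly like this NumPy sketch. (This is not the official benchmark, and absolute numbers will differ considerably from a tuned, multithreaded C build.)

```python
import time
import numpy as np

# Approximation of Stream's "triad" kernel: a[i] = b[i] + k * c[i].
N = 10_000_000
b = np.random.rand(N)
c = np.random.rand(N)
a = np.empty(N)
k = 3.0

t0 = time.perf_counter()
np.add(b, k * c, out=a)            # array reads and one array write
elapsed = time.perf_counter() - t0

# Triad nominally moves three arrays of 8-byte doubles per pass.
gb_per_s = 3 * N * 8 / elapsed / 1e9
print(f"Triad bandwidth: {gb_per_s:.1f} GB/s")
```

The kernel is deliberately trivial so that the arithmetic never becomes the bottleneck; throughput is limited almost entirely by the memory subsystem.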

Memory access latencies haven’t really changed with Istanbul, either, even though six cores are now sharing the same two memory controllers.

We can get a closer look at access latencies throughout the memory hierarchy with the 3D graphs below. I’ve colored the block sizes that correspond to different cache levels, with yellow being L1 data cache and brown representing main memory.

The story of continuity between Istanbul and Shanghai holds here, too. The Xeon X5550 looks pretty similar as well, but it has smaller L1 and L2 caches, a larger, quicker L3 cache (8MB), and much shorter access times to main memory.

SPECjbb2005
SPECjbb 2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

As you may know, system vendors spend tremendous effort attempting to achieve peak scores in benchmarks like this one, which they then publish via SPEC. We did not intend to challenge the best published scores with our results, but we did hope to achieve reasonably optimal tuning for our test systems. We used a fast JVM—the 64-bit version of Oracle’s JRockit JRE P28.0—and picked up some tuning tweaks from recently published results. We used two JVM instances on all systems (one per socket), with the following command line options:

start /AFFINITY [FC0, 03F] java -Xms3900m -Xmx3900m -Xns3260m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:6 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k

Those options are specifically the ones used with the Istanbul Opteron system. They varied for the other two systems in a couple of ways. Notice that we used the Windows “start” command to affinitize threads on a per-socket basis. For the Xeon X5550 system with 16 threads, we used masks [FF00, 00FF], and for the Shanghai Opterons, we used [F0,0F]. We also adjusted the number of garbage collector threads (-XXgcthreads) for each JVM to match the number of hardware threads per socket. In keeping with the SPECjbb run rules, we tested at up to twice the optimal number of warehouses per system, with the optimal count being the total number of hardware threads.
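Those bracketed masks are simply hexadecimal bitmaps of logical CPUs, one bit per hardware thread. A small Python helper (our own, written purely for illustration) shows how each system’s per-socket masks are derived:

```python
# Build per-socket affinity masks for Windows' `start /AFFINITY`, where each
# set bit in the hex mask selects one logical CPU.
def socket_masks(sockets, threads_per_socket):
    masks = []
    for s in range(sockets):
        # A run of `threads_per_socket` one-bits, shifted to this socket's CPUs.
        bits = ((1 << threads_per_socket) - 1) << (s * threads_per_socket)
        masks.append(f"{bits:X}")
    return masks

print(socket_masks(2, 6))   # Istanbul:             ['3F', 'FC0']
print(socket_masks(2, 8))   # Xeon X5550 with SMT:  ['FF', 'FF00']
print(socket_masks(2, 4))   # Shanghai:             ['F', 'F0']
```

This assumes Windows numbers logical CPUs contiguously by socket, which held for our test systems.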

In all cases, Windows Server’s “lock pages in memory” setting was enabled for the benchmark user. In the X5550 system’s BIOS, we disabled the “hardware prefetch” and “adjacent cache line prefetch” options.

Since this is a new round of tests with an updated JVM, we’ve limited our scope to the three most relevant CPU types.

Even with six cores, the Opteron 2435 can’t match the Xeon X5550 in SPECjbb2005. Istanbul does bring substantial progress over Shanghai, however, closing the gap quite a bit. Things become more interesting when we bring power use into the picture, as we’re about to do.

SPECpower_ssj2008
Another new addition for this review is, at long last, SPECpower_ssj2008. Like SPECjbb2005, this benchmark is based on multithreaded Java workloads and uses similar tuning parameters, but its workloads are somewhat different. SPECpower is also distinctive in that it measures power use at different load levels, stepping up from active idle to 100% utilization in 10% increments. The benchmark then reports power-performance ratios at each load level.

SPEC’s run rules for this benchmark require the collection of ambient temperature, humidity, and altitude data, as well as power and performance, in order to prevent the gaming of the test. Per SPEC’s recommendations, we used a separate system to act as the data collector. Attached to it were a Digi WatchPort/H temperature and humidity sensor and an Extech 380803 power meter. I should note that the Extech is not officially approved by SPEC. Although it generally works well enough, the Extech occasionally produces a clearly wrong reading, which is either approximately one half or twice the prior reading—apparently a simple serial communications quirk. We’ve found that we can filter out these errors with a simple inspection of the data, and SPECpower appears to catch the errors, as well. Our results would not be accepted for publication by SPEC unless we used an approved (and much costlier) power meter. They should, however, be good enough for our purposes.

We used the same basic performance tuning and system setup parameters here that we did with SPECjbb2005, with the exception that we lowered the JVM heap size slightly to avoid a memory allocation error. Here’s an example of the Java options from our Istanbul system:

-Xms3700m -Xmx3700m -Xns3000m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:6 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k

Like I said, the heap size is the only real change. Due to this benchmark’s long run times, we only ran it once on each system.

SPECpower_ssj results are a little more complicated to interpret than your average benchmark. We’ve plotted the output in several ways in order to help us understand it.

Here’s a look at ssj_ops, the benchmark’s measure of performance, and the power consumed in watts at each load level. The Istanbul Opteron 2435-based system looks awfully good here; its power consumption is similar to the Shanghai system at each load level, but with substantially higher performance. The Xeon X5550 system is a little different; at active idle, it draws 142W, versus 150W for the two Opteron boxes. Beyond that, the Xeon X5550 system draws more power but achieves higher performance at each step than the Opteron 2435.

A look at performance-to-power ratios should help clarify things.

Now we can see just how incredibly close a race this is. The performance-power curves for the Opteron 2435 and Xeon X5550 systems almost perfectly overlap, amazingly enough. The Nehalem Xeon is slightly superior at the lower load levels, but the Istanbul box takes a lead as utilization climbs to 40% and higher.

Obviously, Istanbul’s showing here represents a solid advance over Shanghai. Multi-core processors tend to offer very strong power efficiency propositions with highly parallel workloads. Adding two more cores and dialing back clock speeds in order to fit into the same power envelopes as Shanghai proves to be a very effective strategy in this case.

Surprisingly, the Xeon X5550 system manages to out-point the Opteron 2435 in SPECpower_ssj2008’s overall performance per watt summation, although only by an eyelash. The overall result takes power draw at active idle into account, which is probably what puts the Xeon over the top. Make no mistake, though: this Istanbul system is very much a match for the Xeon in terms of power-efficient performance.

Cinebench rendering
We can take another look at power consumption and energy-efficient performance by using a test whose time to completion varies with performance. In this case, we’re using Cinebench, a 3D rendering benchmark based on Maxon’s Cinema 4D rendering engine.

In this application, Istanbul’s two additional cores bring it even closer to the Xeon X5550. As the multithreaded version of this test ran, we measured power draw at the wall socket for each of our test systems across a set time period.

A quick look at the data tells us much of what we need to know. Still, we can quantify these things with more precision. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

Idle power draw here is similar to what we saw in SPECpower_ssj, but slightly higher, especially for the Xeon X5550 system. Only one watt separates it from the Istanbul box.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.

Power draw under load here isn’t quite as high as it was in SPECpower_ssj, but the trend remains the same: the Xeon X5550 system draws considerably more power at peak than the Opteron systems.

One way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
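Given power readings logged at a fixed interval, the conversion to joules is a simple numerical integration. Here is a sketch, with invented sample data:

```python
# Integrate logged power samples (watts, taken at a fixed interval) into
# energy in watt-seconds, i.e. joules, via the trapezoidal rule.
def energy_joules(samples_w, interval_s=1.0):
    total = 0.0
    for p0, p1 in zip(samples_w, samples_w[1:]):
        total += (p0 + p1) / 2 * interval_s
    return total

# Invented readings: a load phase followed by a return to idle.
render = [310.0, 312.0, 311.0, 309.0]
idle = [150.0, 151.0, 150.0]
print(f"{energy_joules(render + idle):.1f} J")
```

A constant 100W held for two seconds integrates to 200 joules, which is a handy sanity check for the routine.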

The Istanbul system consumes less energy over the course of the test period than the Xeon X5550.

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.

The energy efficiency picture comes into sharper focus with this final metric. The Istanbul Opteron-based system requires less energy to render the scene than anything we tested but a tailored low-power system based on the Xeon L5430 and Intel’s San Clemente platform. This is a more definitive result than we saw in SPECpower_ssj, and Istanbul comes out clearly ahead of the Xeon X5550.

MyriMatch proteomics
Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
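The job-queue scheme David describes can be sketched with Python’s standard threading primitives. (The “work” below is a trivial stand-in for spectrum matching, and all names are ours.)

```python
import threading
import queue

# Split the database into many small jobs (threads x 10), and let each worker
# pull a fresh job from the shared queue whenever it finishes its current one.
def run_jobs(sequences, n_threads):
    jobs = queue.Queue()
    chunk = max(1, len(sequences) // (n_threads * 10))
    for i in range(0, len(sequences), chunk):
        jobs.put(sequences[i:i + chunk])

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return                      # queue drained; thread exits
            processed = [s.upper() for s in job]   # stand-in for matching work
            with lock:
                results.extend(processed)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(sorted(run_jobs(["seqA", "seqB", "seqC", "seqD"], 4)))
```

Because jobs are small and pulled on demand, a fast thread never sits idle waiting on a slow one, which is exactly the load-balancing property the paragraph above describes.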

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

The Opterons aren’t entirely memory bandwidth bound in this test, because the Opteron 2435 shaves 20 seconds off of the execution time compared to the 2389. That’s a healthy improvement, but it’s not sufficient to catch the Xeon X5550, which completes the test 14 seconds ahead of the Istanbul Opteron.

STARS Euler3d computational fluid dynamics
Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

I thought it might be nice to plot the performance at different thread counts, which I did in the graph above. However, we’ve seen some pretty broad variance in the results of this test at lower thread counts, which suggests that it may be stumbling over these systems’ non-uniform memory architectures. Just for kicks, I decided to try running two instances of this benchmark concurrently, with each one affinitized to a socket, and adding the results into an aggregate compute rate. Doing so proved to offer a nice performance boost.

Both the Xeons and Opterons benefited from the change. However you run this test, though, the Nehalem Xeons are simply faster, probably due to their superior memory bandwidth.

Folding@Home
Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the total number of cores (or threads, in the case of SMT) in the system in order to estimate the total number of points per day that CPU might achieve.
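That estimate boils down to simple arithmetic; a quick sketch, where the per-WU scores are made-up numbers rather than actual benchmark results:

```python
def projected_ppd(wu_scores, n_threads):
    """Average the per-WU points-per-day scores, then scale by thread count."""
    return sum(wu_scores) / len(wu_scores) * n_threads

# Hypothetical per-thread scores for the four WU types (points per day)
scores = [110.0, 240.0, 180.0, 150.0]
total = projected_ppd(scores, n_threads=8)   # e.g., a quad-core with SMT
```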

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

The Xeon X5550’s per-thread performance here, in the individual work unit types, is relatively weak because it’s running two threads per core. Once we get to the final analysis, though, its total projected points per day look much stronger. Istanbul is once again a respectable improvement on Shanghai, but not quite fast enough to catch the new Xeons.

POV-Ray rendering

We’ve been using this chess2 POV-Ray scene as an example for ages, and here, it offers some drama, with the Opteron 2435 finishing one second after the Xeon X5550. The POV-Ray benchmark scene has a large single-threaded component, so it produces very different results. The Opteron 2389 is faster in this case, giving us a peek at the other side of the core-versus-frequency tradeoff.

Valve VRAD map lighting
This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into its games.

The Nehalem Xeons simply excel in this application. No contest, really.

x264 HD video encoding
This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark. These scores come from the newer, faster version 0.59.819 of the x264 executable.
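For reference, a two-pass x264 encode works roughly like so: the first pass writes a stats file while discarding its video output, and the second pass reads that file to decide how to allocate bits. Here’s a small Python helper sketching the standard command lines, with placeholder file names and bitrate:

```python
def x264_two_pass_cmds(infile, outfile, bitrate_kbps, stats="x264.stats"):
    """Build the command lines for a typical two-pass x264 encode."""
    base = ["x264", "--bitrate", str(bitrate_kbps), "--stats", stats]
    pass1 = base + ["--pass", "1", "--output", "/dev/null", infile]
    pass2 = base + ["--pass", "2", "--output", outfile, infile]
    return pass1, pass2

p1, p2 = x264_two_pass_cmds("source.y4m", "out.264", 4000)
```

The benchmark times each pass separately, which is why the two passes can favor different CPUs.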

One can surmise by looking at these results that x264’s second pass is more widely multithreaded than the first. True to form, the Shanghai Opterons are faster in pass one, while Istanbul is faster the second time around. Due to the flexibility of its “Turbo mode” mechanism, the Xeon X5550’s performance is excellent in both cases.

Sandra Mandelbrot
We’ve included this final test largely just to satisfy our own curiosity about how the different CPU architectures handle SSE extensions and the like. SiSoft Sandra’s “multimedia” benchmark is intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

The benchmark contains many versions (ALU, MMX, (Wireless) MMX, SSE, SSE2, SSSE3) that use integers to simulate floating point numbers, as well as many versions that use floating point numbers (FPU, SSE, SSE2, SSSE3). This illustrates the difference between ALU and FPU power.

The SIMD versions compute 2/4/8 Mandelbrot point iterations at once – rather than one at a time – thus taking advantage of the SIMD instructions. Even so, 2/4/8x improvement cannot be expected (due to other overheads), generally a 2.5-3x improvement has been achieved. The ALU & FPU of 6/7 generation of processors are very advanced (e.g. 2+ execution units) thus bridging the gap as well.

We’re using the 64-bit version of the Sandra executable, as well.
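The scalar (ALU/FPU) path Sandra describes is the classic escape-time loop; here’s a minimal Python version for a single pixel, using the same 255-iteration cap. The SIMD versions essentially run 2, 4, or 8 of these loops in lockstep:

```python
def mandelbrot_iters(c, max_iter=255):
    """Escape-time iteration count for one pixel of the Mandelbrot set."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:        # diverged; the pixel's color comes from n
            return n
    return max_iter             # presumed inside the set

inside = mandelbrot_iters(0j)        # the origin never escapes
outside = mandelbrot_iters(2 + 2j)   # escapes on the first iteration
```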

In our final benchmark, the Istanbul Opteron produces a bit of a surprise, besting the Xeon X5550 in all three tests—and even outperforming the $1600-a-pop Xeon W5580 in the integer x8 test.

Conclusions
We’ve now had a look at AMD’s first response to Nehalem, and well, it’s not bad. The six-core Opteron 2435 can’t quite match the Xeon X5550 in overall performance—and although these products are nearly the same price, the Xeon X5550 isn’t the highest Nehalem speed grade. That honor would fall to the Xeon W5580 processor that appeared in some of our benchmark results. In terms of raw performance in a 2P system, Nehalem still reigns supreme.

Yet Istanbul should be a clear improvement over Shanghai for many workstation-class workloads and most server-class workloads—i.e., those that are essentially parallel and widely multithreaded. The Opteron 2435 manages to deliver this higher performance not just within the same power envelope but, empirically, with almost exactly the same measured power consumption as the Opteron 2389.

This combination yields a nice increase in power efficiency, which was enough to put our Istanbul-based test system in the same territory as our Xeon X5550 system. The competition between the two was remarkably close in SPECpower_ssj, and the Istanbul system required notably less energy to render the Cinema 4D sample scene in Cinebench. So despite the fact that Intel leads in outright performance, the Opteron 2435 is entirely competitive on the power-efficiency front, with lower peak power draw, to boot. Those who evaluate systems strictly on this basis would do well to keep Opterons in the mix.

And if you have existing, compatible Socket F servers, the Istanbul Opterons should be an excellent drop-in upgrade. They’re a no-brainer, really, when one considers energy costs and per-socket/per-server software licensing fees.

AMD has a tougher sell to make when it comes to brand-new systems. The Nehalem Xeons offer higher peak performance with a similar energy-efficiency proposition. Still, Istanbul at least keeps the Opteron in the conversation, which makes the outlook for AMD seem substantially less apocalyptic than it did several short months ago.
