
Intel’s ‘V8’ media creation platform

Scott Wasson

I HAVE DAYDREAMED ABOUT ridiculously fast computers since childhood. Many of those daydreams seem positively puny by today’s standards, but at the time, they were outrageous, if not entirely far-fetched. What if they could double the RAM on the Atari 800 all the way to 128K, or even 256K? What if the new Atari STE could show 16 colors at 640×400 resolution? Could they give the new Amiga 16-bit audio with eight channels? That was crazy talk, much of it—until I switched to the PC and started answering some of those questions with concrete results. What if I could overclock this Celeron 300A to 464MHz? And run two of them in SMP? What if they made a chip with 3D graphics abilities on it, just like an SGI workstation? Turns out having those questions answered is really quite nice.

Nowadays, we get incremental improvements in computing power so often, such questions seem almost naive or somehow inappropriate. Yet I’m sure there are some among us who have wondered about what’s coming down the road. Most of us have dual-core systems by now, and sitting there enjoying the speed of a Core 2 Duo E6600, one might engage in a little fanciful speculation: What would it be like to have, say, eight of these cores running at 3GHz on dual, independent 1333MHz buses with a torrent of memory bandwidth?

If you’re prone to such speculation, you’ll be pleased to hear Intel has concocted an answer to that very question in the form of its “V8” media creation platform. V8 is Intel’s tentative first response to AMD’s dual-socket enthusiast platform, Quad FX. Like Quad FX, V8 draws on workstation/server-class technology to take desktop PCs to new heights. Unlike AMD’s effort, though, V8 doesn’t involve an enthusiast-class mobo or any sort of processor bundle or discount. If you want to grab a slice of the future now, with eight cores of glory at your disposal, you’re going to have to pay a pretty penny for it. Happily, though, we’ve tested a V8 system against a slew of today’s best desktop processors, and we can give you a glimpse of how the future may look, free of charge. Here’s a hint: it’s ridiculously fast.

I could have had a . . . .
We should establish some things about this V8 setup right up front. First of all, this is not a new branded “platform” like Intel’s Centrino or vPro efforts, and in fact, it’s not really a new product at all. V8 is mostly just Intel flexing its muscles and showing what it can do with a dual-socket system in response to AMD’s Quad FX. (It’s probably not a coincidence that AMD’s dual-socket project code-named “4×4” was met with a project code-named V8.) This demonstration of power involves mostly off-the-shelf parts: Xeon processors and their accompanying motherboards and memory. Intel has simply packaged up these products together and shipped them out to media outlets like us with the suggestion that we test them as the “ultimate media creation workstation.”

I say “mostly” off-the-shelf parts because the Xeon X5365 processors Intel threw into this rig are not exactly common. You can buy them today, but only with a Mac Pro workstation wrapped around them. No other PC maker offers them yet, and they’re not yet available for purchase by themselves. What’s more, Intel rates these Xeons at a TDP of 150W—well above the 120W of the next speed grade down, the Xeon X5355 at 2.66GHz. Intel says it plans to make the Xeon X5365 more widely available via other PC makers and as a boxed processor later this year, but when it does so, those chips will be a new stepping that should fit into current Xeon thermal envelopes. So the version of the Xeon we’re testing is a modern-day Jezebel—hot, rare, and power-hungry.

With that said, let’s consider what sort of computing power we’re talking about here. For the uninitiated, the Xeon X5365 is very similar to the Core 2 Extreme QX6800 processor; its two dual-core chips on a single package add up to four processor cores. Yet the Xeon X5365 has a higher clock speed, a superior system topology, and can run in pairs. Regular readers may recall that we’ve already reviewed the current Xeon platform, and we also pitted quad-core Xeons against Opterons late last year. This V8 system is fundamentally the same technology, but with even faster CPUs. As a result, V8 delivers considerable torque to the rear wheels thanks to 16MB of total L2 cache, dual 1333MHz front-side buses, a staggering 21GB/s of memory bandwidth, and eight Core-microarchitecture CPU cores running at 3GHz each.
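For a sense of where figures like 21GB/s come from, here is a rough back-of-the-envelope sketch. It assumes four FB-DIMM channels running at DDR2-667 rates and 64-bit (8-byte) data paths on both the memory channels and the front-side buses; those assumptions, and the function name, are ours rather than anything Intel specified for this demo.

```python
# Back-of-the-envelope peak bandwidth figures for a dual-socket Xeon setup.
# Assumptions (not taken from the article): four memory channels,
# 8-byte-wide data paths, and decimal gigabytes (1 GB = 1e9 bytes).

BYTES_PER_TRANSFER = 8  # assumed 64-bit data path


def peak_bandwidth_gb_s(mega_transfers_per_s: float, links: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s across `links` identical links."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * links / 1e9


memory_bw = peak_bandwidth_gb_s(667, links=4)   # DDR2-667 FB-DIMM channels
fsb_bw = peak_bandwidth_gb_s(1333, links=2)     # dual 1333MHz front-side buses

print(f"Memory, four channels: {memory_bw:.1f} GB/s")
print(f"Front-side buses, two: {fsb_bw:.1f} GB/s")
```

Under these assumptions the two front-side buses and the four memory channels come out roughly balanced. These are theoretical peaks, not measured throughput.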

This gives you a Task Manager readout that looks like so:

Pant, pant.

And that 20% load is while doing a dual-threaded MP3 encode.

The motherboard that makes such feats of strength possible is Intel’s workstation-class S5000XVN, a snappy name if ever there was one.


Intel’s S5000XVN motherboard

This mobo gains many good things from its workstation-class background, including two CPU sockets, eight DIMM slots with a maximum capacity of 32GB of memory, six SATA ports with support for RAID levels 0/1/10, dual GigE ports, and High Definition Audio.

That background betrays the S5000XVN when it comes to enthusiast bona fides, though. The board has only one PCIe x16 slot, so multi-GPU support is probably out. The additional expansion slots are odd birds on the desktop: a couple of PCIe x4 slots and a pair of PCI-X slots. You know, for your home Fibre Channel adapter. The HD Audio has only two channels, and the board itself measures 13″ by 12″, much too large to fit into most common desktop PC enclosures. What’s more, the BIOS is bereft of the usual tweaking and overclocking options, although you’re in luck if you want to change the date, enable console redirection to the serial port, or use the EFI shell. Also, we found that we had to abandon our trusty OCZ GameXStream 700W PSU for this system, because the S5000XVN requires both an eight-pin auxiliary power connector and a four-pin one at the same time. We had to swap in the 850W CoolerMaster power supply that Intel shipped with the board in order to get the system running.

Say what you will about the Asus L1N64-SLI WS motherboard that anchors AMD’s Quad FX platform. We certainly have. It’s expensive and draws way too much power. But the L1N64-SLI WS is at heart a pretty solid enthusiast-class mobo, which puts it head and shoulders above the S5000XVN in terms of practicality, tweakability, and affordability.

That’s without considering what may be the V8 platform’s greatest weakness: its use of fully buffered DIMMs (or FB-DIMMs) for memory. FB-DIMM is a server-class technology that adds some memory access latency and power draw in return for better signal integrity and potentially more bandwidth. Even in the server world, it’s a controversial tradeoff, but on the desktop, FB-DIMMs just don’t make sense right now. The additional 5W per module that FB-DIMMs add over DDR2 isn’t especially welcome, but the memory latency is an even bigger drawback. I hope I’m not giving away the game too much, but have a look at these results from our synthetic memory access latency test for a sense of the problem.

This memory access handicap won’t hurt the V8 system in every application, but many desktop apps are sensitive to access latencies, including games.

Say you weren’t put off by any of these drawbacks and wanted to put together a V8 system like the one we’ve tested. What would it cost? Well, it’s tough to know since these Xeon X5365 processors are practically priceless, but we can price out the next rung down the ladder, the X5355. Those cost $1189 each at Newegg and more elsewhere. The S5000XVN motherboard will set you back about 500 bucks, and the four 1GB FB-DIMMs will run you roughly $135 a pop. That tallies up to about $3400 for the CPUs, motherboard, and memory, if you settle for the “slower” CPUs.

That’s not bad if you’re, say, one of the founders of Google.
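For anyone who wants to check the math or plug in different prices, the tally is simple enough to sketch. The prices below are the ones quoted above; the function and constant names are just illustrative.

```python
# Rough cost of the core V8 components, using the prices quoted in the text.
CPU_PRICE = 1189          # Xeon X5355, each
MOTHERBOARD_PRICE = 500   # approximate
FBDIMM_PRICE = 135        # per 1GB FB-DIMM, approximate


def v8_core_cost(cpus: int = 2, dimms: int = 4) -> int:
    """Total for CPUs, motherboard, and memory in US dollars."""
    return cpus * CPU_PRICE + MOTHERBOARD_PRICE + dimms * FBDIMM_PRICE


total = v8_core_cost()
print(f"CPUs + motherboard + memory: ${total}")
```

That lands within a few dollars of the round $3400 figure, before you add a case, power supply, graphics card, or storage.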

Then again, other solutions are just plain slower. Quad FX certainly is, which may explain why a similar config with FX-74 CPUs, motherboard, and DDR2 memory totals just under $1600. If you can settle for just four Intel cores, you can put together a system based on the Core 2 Extreme QX6700, Asus P5B Deluxe, and DDR2 memory for about the same as the Quad FX system. (The Core 2 Extreme QX6800 is about as rare as the Xeon X5365, so don’t count on buying one of those.) Neither of these options will tear through our widely multithreaded benchmarks the way the Xeon V8 system does, as we’re about to see.

 

Test notes
We’ve compared the V8 system to a broad range of desktop CPUs, drawing on the results we’ve collected for previous reviews. The only trouble with that is that the V8 system has 4GB of RAM in it and needs to use four DIMMs to get all of its potential bandwidth. The bulk of our test results came from systems with 2GB of RAM. We don’t expect the additional RAM in the Xeon X5365-based V8 rig to boost performance much, because our benchmark apps tend to fit well enough into 2GB of memory. However, in order to assess the impact of going to 4GB of RAM, we’ve tested the V8 system’s closest rival, the Athlon 64 FX-74, with 4GB of RAM, as well. You’ll see results for this system with 2GB like everything else and also with 4GB when labeled “Athlon 64 FX-74 4GB.”

Also, please note that the V8 system is simply labeled “Xeon X5365” for the CPUs in it.

Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

In some cases, getting the results meant simulating a slower chip with a faster one. For instance, our Core 2 Duo E6600 and E6700 processors are actually a Core 2 Extreme X6800 processor clocked down to the appropriate speeds. Their performance should be identical to that of the real thing. Similarly, our Athlon 64 FX-72 results come from an underclocked pair of Athlon 64 FX-74s, our Athlon 64 X2 4400+ is an underclocked X2 5000+ (both 65nm), and our Athlon 64 X2 5600+ is an underclocked Athlon 64 X2 6000+.
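Simulating a slower chip comes down to picking a lower clock multiplier on the same bus. Here is a small sketch of that arithmetic for the Core 2 parts, assuming the 1066MHz quad-pumped bus and multipliers of 11, 10, and 9 for the X6800, E6700, and E6600. The multipliers are our assumption, not something stated above.

```python
# Core clock = base bus clock x multiplier.
# Assumed values: a 1066MHz quad-pumped front-side bus and the
# multipliers listed below (not taken from the article).

BASE_CLOCK_MHZ = 1066 / 4  # quad-pumped bus, so the base clock is ~266MHz

ASSUMED_MULTIPLIERS = {
    "Core 2 Extreme X6800": 11,
    "Core 2 Duo E6700": 10,
    "Core 2 Duo E6600": 9,
}


def core_clock_ghz(multiplier: int, base_clock_mhz: float = BASE_CLOCK_MHZ) -> float:
    """Core clock in GHz for a given multiplier."""
    return multiplier * base_clock_mhz / 1000


for name, multiplier in ASSUMED_MULTIPLIERS.items():
    print(f"{name}: {multiplier} x {BASE_CLOCK_MHZ:.1f}MHz = "
          f"{core_clock_ghz(multiplier):.3f}GHz")
```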

Our test systems were configured like so:

Core 2 systems
Processors: Core 2 Duo E6300 1.83GHz, Core 2 Duo E6400 2.13GHz, Core 2 Duo E6600 2.4GHz, Core 2 Duo E6700 2.66GHz, Core 2 Extreme X6800 2.93GHz, Core 2 Quad Q6600 2.4GHz, Core 2 Extreme QX6700 2.66GHz, Core 2 Extreme QX6800 2.93GHz
System bus: 1066MHz (266MHz quad-pumped)
Motherboard: Intel D975XBX2
BIOS revision: BX97520J.86A.2618.2007.0212.0954
North bridge: 975X MCH
South bridge: ICH7R
Chipset drivers: INF Update 8.1.1.1010, Intel Matrix Storage Manager 6.21
Memory size: 2GB (2 DIMMs)
Memory type: Corsair TWIN2X2048-6400C4 DDR2 SDRAM at 800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 12
Audio: Integrated ICH7R/STAC9274D5 with Sigmatel 6.10.0.5274 drivers

Athlon 64 X2 systems
Processors: Athlon 64 X2 3600+ 1.9GHz (65nm), Athlon 64 X2 4400+ 2.3GHz (65nm), Athlon 64 X2 5000+ 2.6GHz (65nm), Athlon 64 X2 5000+ 2.6GHz (90nm), Athlon 64 X2 5600+ 2.8GHz (90nm), Athlon 64 X2 6000+ 3.0GHz (90nm)
System bus: 1GHz HyperTransport
Motherboard: Asus M2N32-SLI Deluxe
BIOS revision: 0903
North bridge: nForce 590 SLI SPP
South bridge: nForce 590 SLI MCP
Chipset drivers: ForceWare 15.00
Memory size: 2GB (2 DIMMs)
Memory type: Corsair TWIN2X2048-8500C5 DDR2 SDRAM at 800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 12
Audio: Integrated nForce 590 MCP/AD1988B with Soundmax 6.10.2.6100 drivers

Athlon 64 FX systems (2GB)
Processors: Athlon 64 FX-70 2.6GHz, Athlon 64 FX-72 2.8GHz, Athlon 64 FX-74 3.0GHz
System bus: 1GHz HyperTransport
Motherboard: Asus L1N64-SLI WS
BIOS revision: 0205
North bridge: nForce 680a SLI
South bridge: nForce 680a SLI
Chipset drivers: ForceWare 15.00
Memory size: 2GB (4 DIMMs)
Memory type: Crucial Ballistix PC6400 DDR2 SDRAM at 800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 12
Audio: Integrated nForce 680a SLI/AD1988B with Soundmax 6.10.2.6100 drivers

Athlon 64 FX-74 system (4GB)
Processor: Athlon 64 FX-74 3.0GHz
System bus: 1GHz HyperTransport
Motherboard: Asus L1N64-SLI WS
BIOS revision: 0205
North bridge: nForce 680a SLI
South bridge: nForce 680a SLI
Chipset drivers: ForceWare 15.00
Memory size: 4GB (4 DIMMs)
Memory type: Corsair TWIN2X2048-8500C5D DDR2 SDRAM at 800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 12
Audio: Integrated nForce 680a SLI/AD1988B with Soundmax 6.10.2.6100 drivers

Xeon X5365 system
Processor: Xeon X5365 3.0GHz
System bus: 1333MHz (333MHz quad-pumped)
Motherboard: Intel S5000XVN
BIOS revision: S5000.86B.06.00.0076.0409200070751
North bridge: 5000X MCH
South bridge: 6321 ESB ICH
Chipset drivers: INF Update 8.1.1.1010, Intel Matrix Storage Manager 6.21
Memory size: 4GB (4 DIMMs)
Memory type: Samsung ECC DDR2-667 FB-DIMM at 667MHz
CAS latency (CL): 5
RAS to CAS delay (tRCD): 5
RAS precharge (tRP): 5
Cycle time (tRAS): 15
Audio: Integrated 6321 ESB/ALC260 with Realtek 6.0.1.5397 drivers

All systems
Hard drive: Maxtor DiamondMax 10 250GB SATA 150
Graphics: GeForce 7900 GTX 512MB PCIe with ForceWare 100.64 drivers
OS: Windows Vista Ultimate x64 Edition
OS updates

Our Core 2 Duo E6400 processor came to us courtesy of the fine folks up north at NCIX. Those of you who are up in Canada will definitely want to check them out as a potential source of PC hardware and related goodies.

Thanks to Corsair for providing us with memory for our testing. Their products and support are far and away superior to generic, no-name memory.

Also, except where otherwise noted, our test systems were powered by OCZ GameXStream 700W power supply units. Thanks to OCZ for providing these units for our use in testing.

The test systems’ Windows desktops were set at 1280×1024 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.

We used the following versions of our test applications:

The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

 

The Elder Scrolls IV: Oblivion
We tested Oblivion by manually playing through a specific point in the game five times while recording frame rates using the FRAPS utility. Each gameplay sequence lasted 60 seconds. This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
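To make the reporting concrete, here is a minimal sketch of how five sessions might be boiled down to the two numbers we report. It assumes the per-session averages are combined with a plain mean, which is our reading of the method. The sample frame rates are made up for illustration and are not results from this review.

```python
from statistics import mean, median


def summarize_sessions(sessions):
    """Reduce (average_fps, low_fps) pairs, one per session, to two numbers.

    Returns the mean of the per-session averages and the median of the
    per-session lows, so that one unusually bad dip doesn't dominate.
    """
    averages = [avg for avg, _ in sessions]
    lows = [low for _, low in sessions]
    return mean(averages), median(lows)


# Hypothetical FRAPS results from five 60-second sessions.
sessions = [(61.2, 38), (59.8, 41), (60.5, 22), (62.1, 40), (60.9, 39)]
avg_fps, low_fps = summarize_sessions(sessions)
print(f"Average: {avg_fps:.1f} fps, median low: {low_fps} fps")
```

Note how the outlier low of 22 fps in the third made-up session has no effect on the reported median.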

For this test, we set Oblivion‘s graphical quality to “Medium” but with HDR lighting enabled and vsync disabled, at 800×600 resolution. We’ve chosen this relatively low display resolution in order to prevent the graphics card from becoming a bottleneck, so differences between the CPUs can shine through.

Notice the little green plot with four lines above the benchmark results. That’s a snapshot of the CPU utilization indicator in Windows Task Manager, which helps illustrate how much the application takes advantage of up to four CPU cores, when they’re available. I’ve included these Task Manager graphics whenever possible throughout our results, as is our usual practice. These four-way Task Manager shots won’t quite show us when all eight of the dual Xeon X5365 system’s cores are occupied, but believe me, you’ll generally know when that happens. In this case, Oblivion really only takes full advantage of a single CPU core, although Nvidia’s graphics drivers use multithreading to offload some vertex processing chores.

The Xeon X5365 system’s additional CPU cores are no help to it here. It ought to perform reasonably well even in single-threaded applications because its single cores are as fast as anything else around, but perhaps its memory access overhead slows it down here. Performance is still excellent, but it’s no faster than many dual- and quad-core Intel processors.

Rainbow Six: Vegas
Rainbow Six: Vegas is based on Unreal Engine 3 and is a port from the Xbox 360. For both of these reasons, it’s one of the first PC games that’s multithreaded, and it ought to provide an illuminating look at CPU gaming performance.

For this test, we set the game to run at 800×600 resolution with high dynamic range lighting disabled. “Hardware skinning” (via the GPU) was disabled, leaving that burden to fall on the CPU. Shadow quality was set to very low, and motion blur was enabled at medium quality. I played through a 90-second sequence of the game’s Terrorist Hunt mode on the “Dante’s” level five times, capturing frame rates with FRAPS, as we did with Oblivion.

The Xeon X5365 falls victim to a performance pitfall in this game, as well. Don’t get me wrong; this thing is going to run current games very well overall, but it won’t necessarily be the fastest in the benchmarks.

 

Supreme Commander
This game is multithreaded and can actually take advantage of more than two processor cores, making it a rare commodity indeed. We ran into some snags when we first tried to test this game with FRAPS. Getting consistent results proved difficult, and the sound didn’t want to work on our Intel D975XBX2 motherboard, whose Vista x64 audio drivers may not yet be up to snuff. I was also developing the first signs of extreme RTS addiction—a grave condition indeed. I found myself analyzing unit types and lusting after level-two engineer bots. Fortunately, we were able to overcome these problems by using Supreme Commander‘s very nice built-in benchmark, which plays back a test game and reports detailed performance results afterward. We launched the benchmark by running the game with the “/map perftest /nosound” options. (Normally, we prefer to test games with audio enabled, but we made an exception here.) We tested at 1024×768 resolution with the game’s default quality settings.

Supreme Commander’s built-in benchmark breaks down its results into several major categories: running the game’s simulation, rendering the game’s graphics, and a composite score that’s simply composed of the other two. The performance test also reports good ol’ frame rates, so we’ve included those, as well.

As you can tell, Supreme Commander’s frame rates don’t vary much from one CPU to the next. (I’ve tried testing it with more powerful graphics cards, as well, and even that doesn’t seem to tease out big differences between faster and slower processors.) That said, the Xeon X5365 only finishes mid-pack in the simulation portion of Supreme Commander’s performance test. Current games just don’t take advantage of four cores well, let alone eight.

 

Valve Source engine particle simulation
Next up are a couple of tests we picked up during a visit to Valve Software, the developers of the Half-Life games. They’ve been working to incorporate support for multi-core processors into their Source game engine, and they’ve cooked up a couple of benchmarks to demonstrate the benefits of multithreading.

The first of those tests runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.

Valve’s particle simulation is actually able to use the Xeon X5365 system’s additional cores somewhat. The result is that the V8 system finishes ahead of the Core 2 Extreme QX6800, though not by a large margin.

Valve VRAD map compilation
This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into its games. This isn’t a real-time process, and it doesn’t reflect the performance one would experience while playing a game. It does, however, show how multiple CPU cores can speed up game development.

Performance scales up to eight cores even better in Valve’s VRAD test than in its particle simulation, and we begin to see a bit of the dual Xeon X5365s’ potential. The map build finishes in only 79 seconds, while the Athlon 64 FX-74 takes 173 seconds to accomplish the same work. The Core 2 Duo E6600, our shining example of a mid-range dual-core CPU, builds the map in 290 seconds.

 

3DMark06
3DMark06 combines the results from its graphics and CPU tests in order to reach an overall score. Here’s how the processors did overall and in each of those tests.

3DMark’s graphics tests are utterly constrained by graphics card performance, as one would expect. Its CPU tests, however, scale up nicely to eight cores, handing the Xeon X5365 big wins in those sub-tests and the top overall 3DMark score by over 100 points.

 

The Panorama Factory
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.

The Xeon X5365 system finishes stitching together our panoramic photo over 10 seconds ahead of the Athlon 64 FX-74 and about five seconds sooner than the Core 2 Extreme QX6800, the quickest quad-core processor.

picCOLOR
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded, and in this latest revision, five of those eight functions use four threads.

Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.

With only four threads in play, picCOLOR doesn’t use the Xeon X5365 to its full potential. Even so, the Xeon system should be one of the fastest, and it is. It’s probably held back a little by memory access latency, though, as its low scores in the first couple of sub-tests would seem to indicate.

 

Windows Media Encoder x64 Edition
Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.

Video encoding is often cited as an example of an application where multiple cores can easily provide performance dividends, yet even Microsoft’s own 64-bit version of Windows Media Encoder maxes out at four threads. The Xeon X5365 is as fast as the top quad-core system, but no faster.

LAME MP3 encoding
LAME MT is a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. Of course, multithreading works even better on multi-core processors. You can download a paper (in Word format) describing the programming effort.

Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. That means this test won’t really use more than two CPU cores.
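The shape of that two-stage pipeline is easy to sketch. What follows is an illustration of the general technique only, not LAME MT's actual code: the analysis and encoding functions are trivial stand-ins, and all names are hypothetical. Also note that CPython's global interpreter lock keeps CPU-bound threads from truly running in parallel, so this shows the structure rather than the speedup.

```python
import queue
import threading


def psychoacoustic_analysis(frame):
    """Stand-in for the real analysis: mean absolute sample value."""
    return sum(abs(sample) for sample in frame) / len(frame)


def encode_frame(frame, analysis):
    """Stand-in for the rest of the encoder."""
    return (len(frame), round(analysis, 3))


def pipelined_encode(frames, lookahead=1):
    """Run analysis on one thread and encoding on another.

    A small bounded queue buffers analysis results, so the first stage
    can only get slightly ahead of the second.
    """
    handoff = queue.Queue(maxsize=lookahead)
    encoded = []

    def analysis_stage():
        for frame in frames:
            handoff.put((frame, psychoacoustic_analysis(frame)))
        handoff.put(None)  # sentinel: no more frames

    def encode_stage():
        while True:
            item = handoff.get()
            if item is None:
                break
            frame, analysis = item
            encoded.append(encode_frame(frame, analysis))

    stages = [threading.Thread(target=analysis_stage),
              threading.Thread(target=encode_stage)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return encoded


result = pipelined_encode([[1, -1, 2], [0, 0], [3]])
print(result)
```

With only two stages, there is never work for a third core, no matter how many are sitting idle.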

We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in many of our previous CPU reviews.

Sure, the V8 system’s using maybe one fourth of its total capacity in this dual-threaded app, but at least it stays near the top of the pack.

 

Cinebench
Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores are available.

Yow. Who can deny the rendering power of the V8 rig? Even its single-threaded score is the highest of the bunch, and the effect is multiplied when eight cores are fully engaged. Don’t hold it against the Xeons that performance doesn’t scale almost linearly with the number of cores. The makers of Cinebench need to use a higher resolution to render their sample scene, so setup doesn’t occupy such a large proportion of the benchmark time. If they did so, I’m confident the quad-core and eight-way systems would scale even better.
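The effect of a fixed setup phase on scaling is what Amdahl's law describes. Here is a quick sketch; the serial fractions used are made-up illustrations, not measurements of Cinebench.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Best-case speedup when `serial_fraction` of the work can't be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Hypothetical serial fractions: a short scene where setup dominates,
# and a longer one where it matters much less.
for serial_fraction in (0.20, 0.05):
    speedups = ", ".join(
        f"{cores} cores: {amdahl_speedup(serial_fraction, cores):.2f}x"
        for cores in (2, 4, 8)
    )
    print(f"{serial_fraction:.0%} serial -> {speedups}")
```

Shrinking the serial share, for instance by rendering a bigger image so setup is a smaller slice of the run, pushes the eight-core result much closer to the ideal 8x.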

POV-Ray rendering
We’ve finally caved in and moved to the beta version of POV-Ray 3.7 that includes native multithreading. The latest beta 64-bit executable is still quite a bit slower than the 3.6 release, but it should give us a decent look at comparative performance, regardless.

POV-Ray gives us a better look at the traditionally excellent performance scaling we expect from 3D rendering apps, and that means the Xeon X5365 system smokes everything else. Our mid-range paragon, the Core 2 Duo E6600, takes five minutes to render a scene the Xeons finish in just one.

The Xeon X5365 easily takes the lead in POV-Ray’s official benchmark scene, as well. This scene uses some features of POV-Ray that are not multithreaded, so it doesn’t scale as dramatically well as our chess scene.

 

MyriMatch
Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He recently offered to provide us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
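The job-queue scheme described above can be sketched in a few lines. This is a loose illustration of the technique, not MyriMatch's source: the function names are hypothetical, and integers stand in for protein sequences.

```python
import queue
import threading


def split_into_jobs(proteins, threads, jobs_per_thread=10):
    """Split the database into roughly threads * jobs_per_thread equal jobs."""
    job_count = threads * jobs_per_thread
    job_size = max(1, -(-len(proteins) // job_count))  # ceiling division
    return [proteins[i:i + job_size] for i in range(0, len(proteins), job_size)]


def process_database(proteins, threads, handle_protein):
    """Have each worker thread pull jobs from a shared queue until it is empty."""
    jobs = queue.Queue()
    for job in split_into_jobs(proteins, threads):
        jobs.put(job)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return  # no jobs left, so this thread is done
            local = [handle_protein(protein) for protein in job]
            with results_lock:  # synchronize once per job, not per protein
                results.extend(local)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


database = list(range(6714))  # stand-in for the 6714 yeast proteins
jobs = split_into_jobs(database, threads=4)
print(len(jobs), "jobs, first job holds", len(jobs[0]), "proteins")
```

Keeping jobs small relative to the total means a thread that finishes early just grabs another job, while locking only once per job keeps synchronization overhead low.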

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads. Also, this is a newer version of the MyriMatch code than we’ve used in the past, with a larger spectral collection, so these results aren’t comparable to those in some of our past articles.

The Xeon X5365 really suffers at lower thread counts, where it can’t keep pace with the Core 2 quad-cores, but it keeps improving as we add more threads and winds up being fastest overall.

STARS Euler3d computational fluid dynamics
Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45°. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. I understand the STARS Euler3D routines are both very floating-point intensive and oftentimes limited by memory bandwidth. Charles has updated the benchmark for us to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

Here’s another case where the Xeon X5365 trails the quad-core Core 2 systems at lower thread counts but breaks through at eight threads. The Xeons’ peak throughput of 2.5Hz more than doubles that of the Athlon 64 FX-74.

 

Folding@Home
Next, we have the slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
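For the curious, the scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the average-then-multiply step, not notfred's actual script; the function name, the placeholder label for the fourth WU type, and all of the sample scores are hypothetical.

```python
def estimate_total_ppd(ppd_by_wu_type, core_count):
    """Average the per-WU-type points per day, then scale by core count."""
    if not ppd_by_wu_type:
        raise ValueError("need at least one work unit result")
    average_ppd = sum(ppd_by_wu_type.values()) / len(ppd_by_wu_type)
    return average_ppd * core_count


# Hypothetical single-core scores, one per WU type (not measured data).
sample_scores = {
    "Tinker": 100.0,
    "Amber": 120.0,
    "Gromacs 3.3": 300.0,
    "fourth_type": 280.0,  # placeholder label for the remaining WU type
}

total_ppd = estimate_total_ppd(sample_scores, core_count=8)
print(total_ppd)
```

Because the final figure is a straight multiple of a single-core average, it assumes every core keeps pace with the others, which is part of why the method is an estimate rather than a measurement.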

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

You may have noticed that the Gromacs 3.3 scores for the Athlon 64 FX-74 with 4GB and the Xeon X5365 are lower than expected, given the performance of similar configurations. I believe that’s because Stanford has updated its Gromacs 3.3 core in a way that lowers performance with our test work unit. This update is fairly recent, and should only affect these two CPU configurations, handicapping them a little bit.

Even with the handicap, the Xeon X5365 posts an eye-popping 1564 points per day as an average of the four work unit types, miles ahead of the quad-core setups.

 

SiSoft Sandra Mandelbrot
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

We’re using the 64-bit version of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations in parallel.
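To make the interlacing idea from SiSoft's FAQ concrete, here is a minimal Python sketch of a Mandelbrot render split across workers by column. It is not Sandra's code. It assumes a static split in which worker k takes columns k, k + N, k + 2N and so on, which is one plausible reading of the FAQ; the function names, the plotted region, and the small default image size are all my own choices. It also uses ordinary scalar math, not SSE2, and Python threads won't actually speed up this kind of work, so treat it purely as an illustration of how the columns are divided.

```python
from concurrent.futures import ThreadPoolExecutor


def mandelbrot_iterations(c, max_iter=255):
    """Count iterations of z = z*z + c before |z| exceeds 2 (capped at max_iter)."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def render_columns(worker_id, worker_count, width, height, max_iter):
    """Compute every worker_count-th column, starting at column worker_id."""
    columns = {}
    for x in range(worker_id, width, worker_count):
        real = -2.0 + 3.0 * x / (width - 1)  # map x onto [-2, 1]
        columns[x] = [
            mandelbrot_iterations(
                complex(real, -1.0 + 2.0 * y / (height - 1)),  # map y onto [-1, 1]
                max_iter,
            )
            for y in range(height)
        ]
    return columns


def render(width=64, height=48, worker_count=4, max_iter=255):
    """Render the image column by column, interleaving columns across workers."""
    image = [None] * width
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        futures = [
            pool.submit(render_columns, w, worker_count, width, height, max_iter)
            for w in range(worker_count)
        ]
        for future in futures:
            for x, column in future.result().items():
                image[x] = column
    return image


image = render()
```

Interleaving columns rather than handing each worker one contiguous block tends to balance the load, since the expensive pixels inside the set are clustered together in the middle of the picture.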

This one’s just an exhibition for the Xeon X5365, with nothing like any close competition, especially from AMD. I expect that to change when AMD’s Barcelona chips arrive with four cores and single-cycle execution of 128-bit SSE instructions.

 

Power consumption and efficiency
We’re trying something a little different with power consumption. Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, video card, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor and speakers into a separate outlet, though.) We measured how each of our test systems used power during a roughly one-minute period, during which time we executed Cinebench’s multithreaded rendering test. All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests.

You’ll notice that I’ve not included the Athlon 64 FX-72 here. That’s because our “simulated” FX-72 CPUs are underclocked versions of faster processors, and we’ve not been able to get Cool’n’Quiet power-saving tech to work when CPU multiplier control is in use. I have included test results for genuine Athlon 64 X2 4400+ and 5600+ chips, as promised in our last CPU roundup.

I have included our simulated Core 2 Duo E6600 and E6700, because SpeedStep works fine on the D975XBX2 motherboard alongside underclocking. The simulated processors’ voltage may not be exactly the same as what you’d find on many retail E6600s and E6700s. However, voltage and power use can vary from one chip to the next, since Intel sets voltage individually on each chip at the factory.

One slight issue here is that we’ve tested most of these systems on OCZ GameXStream 700W power supplies, but we had to test the V8 rig with a different PSU, an 850W CoolerMaster. To keep the head-to-head comparison going, I also tested the Athlon 64 FX-74 4GB config with the CoolerMaster power supply.

Also, I’m not sure what the story is here, but the Xeon X5365 system didn’t appear to be making use of SpeedStep or the C1E halt state, even when I set the power options in Windows Vista to “Balanced” and enabled SpeedStep in the BIOS. I’m unsure whether this is a true problem or if the CPU clock speed reflected in the CPU-Z utility was somehow incorrect. Going without SpeedStep (if indeed that’s what’s happening) will raise the idle power consumption of the Xeon system, but it shouldn’t impact peak power draw.

The differences between the CPUs are immediately obvious by looking at these plots of the raw data. We can slice up the data in various ways in order to better understand them, though. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

Idle power draw on the Xeon X5365 system is very high, but this is consistent with what we’ve seen in the past out of similar Xeon systems on the Bensley platform. Even the Quad FX’s notoriously high idle power draw isn’t in league with this beast.

Next, we can look at peak power draw by taking an average from the five-second span from 10 to 15 seconds into our test period, during which the processors were rendering. In the case of the Xeon X5365, I had to use the span from five to 10 seconds, since the system finished rendering well before the 15-second mark.
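The windowed average behind the peak number is simple enough to sketch in Python. This assumes the meter's log can be read as a list of (seconds, watts) pairs at one reading per second, which is my assumption and not a documented property of the Extech's output; the function name and the sample log are hypothetical.

```python
def average_power(samples, start_s, end_s):
    """Mean of the wattage readings whose timestamp t satisfies start_s <= t < end_s."""
    window = [watts for t, watts in samples if start_s <= t < end_s]
    if not window:
        raise ValueError("no samples fall inside the requested window")
    return sum(window) / len(window)


# Hypothetical log: 300 W while rendering (2 s to 20 s), 180 W otherwise.
log = [(t, 300.0 if 2 <= t < 20 else 180.0) for t in range(60)]

peak_watts = average_power(log, 10, 15)  # the usual 10-to-15-second span
idle_watts = average_power(log, 55, 60)  # trailing edge of the test period
print(peak_watts, idle_watts)
```

For a system that finishes early, like the Xeon X5365 here, the same function would simply be called with the five-to-10-second window instead.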

The Xeon X5365 system may draw the most power at idle, but the FX-74 processors still pull more at peak. This is a quirk of the Quad FX platform—the mobo has high power draw and so do the CPUs. Opterons, we’ve found, are much more efficient. Heck, even the FX-70 is way better than the FX-74.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
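Since energy is just power summed over time, the watt-second figure can be approximated from a logged trace with a trapezoidal sum. The sketch below is a minimal Python version under the assumption that the log is a list of (seconds, watts) pairs; the function name and sample readings are hypothetical, and the article's own numbers may have been computed differently.

```python
def energy_joules(samples):
    """Trapezoidal integral of power (watts) over time (seconds), in joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Hypothetical four-reading trace: ramp up, hold, ramp down.
trace = [(0, 200.0), (1, 300.0), (2, 300.0), (3, 200.0)]
print(energy_joules(trace))

# Sanity check: a steady 100 W for 60 seconds is 6000 watt-seconds.
print(energy_joules([(0, 100.0), (60, 100.0)]))
```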

The Xeon X5365’s combination of relatively high power draw under load and high idle power use adds up to the most energy use during the span of our test period. However, things look different if you keep all eight cores busy most of the time.

We can quantify efficiency even better by considering the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve chosen to identify the end of the render as the point where power use begins to drop from its steady peak. There seems to be some disk paging going on after that, but we don’t want to include that more variable activity in our render period.

We’ve computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
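One way to automate that "end of render" call is to look for the first reading that falls noticeably below the peak and integrate only up to the reading before it. The Python sketch below does that under several assumptions of mine: the log is a list of (seconds, watts) pairs that begins when the render starts, and a drop below 95% of the peak marks the end of the render. The 95% threshold and the function names are hypothetical; the article's cutoffs may well have been picked by eye from the plots.

```python
def render_end_index(samples, drop_fraction=0.95):
    """Index of the last reading before power falls below drop_fraction of the peak."""
    watts = [w for _, w in samples]
    peak = max(watts)
    for i in range(watts.index(peak), len(watts)):
        if watts[i] < drop_fraction * peak:
            return i - 1
    return len(watts) - 1  # power never dropped within the log


def render_energy_joules(samples, drop_fraction=0.95):
    """Trapezoidal energy (joules) from the start of the log to the end of the render."""
    span = samples[: render_end_index(samples, drop_fraction) + 1]
    return sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(span, span[1:])
    )


# Hypothetical traces: a slower system at 300 W and a faster one at 400 W.
slow = [(0, 300.0), (1, 300.0), (2, 300.0), (3, 300.0), (4, 250.0), (5, 180.0)]
fast = [(0, 400.0), (1, 400.0), (2, 400.0), (3, 220.0), (4, 220.0), (5, 220.0)]

print(render_energy_joules(slow), render_energy_joules(fast))
```

In this made-up example the faster system draws more power but still uses less energy for the render, because it holds that peak for a shorter time.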

By finishing the job very quickly, the Xeon X5365 uses relatively little energy to render the scene. That’s the V8 system’s power efficiency ace in the hole. You’ve got to keep all of those cores busy in order for it to be power efficient, but it certainly can be.

 
Conclusions
So what happens if you put eight cores on the desktop right now? A few things. First and most impressively, they slice through the right applications—that rare breed capable of using eight threads at once—with ferocity. That’s what we saw in our 3D rendering, image processing, and scientific computing tests, where the Xeon X5365-based V8 system set a new standard for performance. With some of those applications, it nearly doubled the performance of the best quad-core systems. Such a show of power is a wonderful and fun thing to see.

Second, in many cases, six or seven cores sit idle and offer no real performance benefit. That’s what happened in our 3D gaming and MP3 encoding tests. In a great many applications, software developers have a long way to go to take full advantage of four cores, let alone eight.

Finally, in this particular case, the quirks of the eight-core system become immediately obvious. The current Xeon platform’s relatively high memory access latencies and power consumption—both of which we can pin at least partially on the use of FB-DIMM memory—make it less than ideal for desktop use. Higher power draw means more noise and heat expended into the room with the user, and higher memory access latencies mean the V8 system isn’t the fastest gaming rig on the block.

These problems alone could perhaps be overlooked, but Intel has more ground to cover in order to make its dual-socket solution palatable to the average—or very much above-average—PC enthusiast. 90% of the solution here will involve a better motherboard with the right expansion slot config, feature set, footprint, and price tag. Despite the vast performance lead Intel has over AMD, and despite the Quad FX platform’s obvious shortcomings, AMD has established a superior blueprint for a dual-socket enthusiast offering.

The good news is Intel and AMD both seem to be zeroing in on the right way to do dually enthusiast systems. V8 is very much a first step for Intel. The firm has already signaled its intention to follow up later this year with a true enthusiast dual-socket platform with the promising codename “Skulltrail.” We don’t yet know much about it, but it will reportedly involve a multi-GPU-capable motherboard. Let’s hope it also involves some affordable mobo and processor options.

We have been fairly hard on this first, new generation of dual-socket systems, both Quad FX and V8, and rightly so. They’re far from perfect, and we’d have a hard time recommending either for purchase. Right now, your best choice for an over-the-top, bad-ass system is a Core 2 Quad or Core 2 Extreme QX6700-based system. As PC enthusiasts, though, we very much want to see dual-socket systems flourish as a new high-end option, much like SLI and CrossFire have done in graphics. Such expensive toys aren’t for everyone, but they expand the boundaries of the PC’s capabilities. Outrageously fast and capable PCs serve as enablers and incubators for all sorts of other good things, from widely multithreaded games to huge, wide-screen displays. These are the sorts of developments that could make us start daydreaming again about what comes next, and what could be better than that? 
