
AMD’s Radeon HD 6950 and 6970 graphics processors

Scott Wasson

2.6 billion. Six. The first figure is the number of transistors in AMD’s new Cayman graphics processor. The second is the number of days we’ve had to spend with it prior to its release. Today’s GPUs are incredibly complex beasts, and the companies that produce them don’t waste any time in shoving ’em out the door once they’re ready. Consequently, our task of getting a handle on these things and relaying our sense of them to you… isn’t easy. We’re gonna have to cut some corners, leave out a few vowels and consonants, and pare back some of the lame jokes in order to get you a review before these graphics cards go on sale.

“What’s all the fuss?” you might be asking. “Isn’t this just another rehashed version of AMD’s existing GPU architecture, like the Radeon HD 6800 series?” Oh, but the answer to your question, so cynically posed, is: “Nope.”

As you may recall, TSMC, the chip fabrication firm that produces GPUs for both of the major players, upset the apple cart last year by unexpectedly canceling its 32-nanometer fabrication process. Both AMD and Nvidia had to scramble to rebuild their plans for next-generation chips, which were intended for 32-nm. At that time, AMD had a choice: to push ahead with an ambitious new graphics architecture, re-targeting the chips for 40 nanometers, or to play it safe and settle for smaller, incremental changes while waiting for TSMC to work out its production issues.

Turns out AMD chose both options. The safer, more incremental improvements were incorporated into the GPU code-named Barts, which became the Radeon HD 6850 and 6870. That chip retained the same core architectural DNA as its predecessor, but it added tailored efficiency improvements and some new display and multimedia features. Barts was also downsized to hit a nice balance of price and performance. At the same time, work quietly continued—at what had to be a breakneck pace—on another, larger chip code-named Cayman.

Many of us in the outside world had heard the name, but AMD did a surprisingly good job (as these things go) of keeping a secret, at least for a while—Cayman ain’t your daddy’s Radeon. Or even your slightly older twin brother’s, perhaps. Unlike Barts, Cayman is based on a fundamentally new GPU architecture, with improvements extending from its graphics front end through its shader core and into its render back-ends. The highlights include higher geometry throughput, more efficient shader execution, and smarter edge antialiasing. In other words, more goodness abounds throughout.

So when we say our task of cramming a review of Cayman into a few short days isn’t easy, that’s because this chip is the most distinctive member of the recent, bumper crop of new GPUs.

Cayman… Ca-aa-ay-man


A logical block diagram of the Cayman GPU architecture. Source: AMD.

Our hardware reviewer’s license stipulates that we must include a block diagram on page one of any review of a new GPU, and so you have it above. This view from high altitude gives us a sense of the architecture’s overall layout, although it has no doubt been retouched by AMD marketing to add whiter teeth and to remove any interesting wrinkles.

Cayman’s basic layout will be familiar to anyone who knows recent Radeon GPUs like Barts and Cypress. The chip has a total of 24 SIMD engines in a dual-core configuration. (Both Cypress and Barts are dual-core, too, with dual dispatch processors as in the diagram above, although AMD didn’t reveal this level of detail when it first rolled out Cypress.) Each SIMD engine has a texture unit associated with it, along with an L1 texture cache. Cayman sticks with the tried-and-true formula of four 64-bit memory interfaces, each with an L2 cache and dual ROP units attached. In short, although it’s a little larger than Cypress, Cayman remains the same basic class of GPU, with no real changes to key differentiators like memory interface width or ROP count.
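
Those headline unit counts fall straight out of the layout. As a quick sanity check, here's the arithmetic in a few lines of Python (the figures are AMD's published specs; the math is ours, and it matches the table below):

```python
# Sanity-checking Cayman's headline unit counts from its layout.
# Figures are AMD's published specs; the arithmetic is ours.
simd_engines = 24       # arranged as two blocks of 12 (the "dual-core" split)
spus_per_simd = 16      # stream processing units per SIMD engine
alus_per_spu = 4        # the new VLIW4 arrangement (see the next section)
mem_interfaces = 4      # 64-bit GDDR5 channels, each with L2 and dual ROPs

print(simd_engines * spus_per_simd * alus_per_spu)  # 1536 shader ALUs
print(mem_interfaces * 64)                          # 256-bit total memory bus
```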

| | ROP pixels/clock | Texels filtered/clock (int/fp16) | Shader ALUs | Rasterized triangles/clock | Memory interface width (bits) | Estimated transistor count (millions) | Approximate die size (mm²) | Fabrication process node |
|---|---|---|---|---|---|---|---|---|
| GF104 | 32 | 64/64 | 384 | 2 | 256 | 1950 | 331* | 40 nm |
| GF110 | 48 | 64/64 | 512 | 4 | 384 | 3000 | 529* | 40 nm |
| RV770 | 16 | 40/20 | 800 | 1 | 256 | 956 | 256 | 55 nm |
| Cypress | 32 | 80/40 | 1600 | 1 | 256 | 2150 | 334 | 40 nm |
| Barts | 32 | 56/28 | 1120 | 1 | 256 | 1700 | 255 | 40 nm |
| Cayman | 32 | 96/48 | 1536 | 2 | 256 | 2640 | 389 | 40 nm |

*Best published estimate; Nvidia doesn’t divulge die sizes

Above is a look at the Cayman chip itself, along with some key comparative specs. Cayman is a bit of a departure from recent AMD GPUs because it’s decidedly larger, but it’s not a reticle buster like some of Nvidia’s bigger creations. In terms of transistor count and die area, Cayman appears to land somewhere between Nvidia’s two closest would-be competitors, the GF104 and GF110.

A new, narrower SPU
The big adjustment in Cayman comes at such a minute level, it isn’t even visible in the big block diagram. Inside of each of the chip’s SIMD shader processing engines is an array of 16 execution units or stream processing units (SPUs). In every AMD GPU architecture dating back to the R600, the fundamental SPU layout has been essentially the same, with four arithmetic logic units (ALUs) of equal capability and a fifth “fat” ALU capable of handling special functions like transcendentals. These execution units play a key part in the larger GPU symphony. Instructions for the ALUs are grouped together into a single, very long instruction word, and then all 16 of the SPUs in a SIMD engine execute the same instructions on different data simultaneously.

Scheduling instructions in VLIW5 groups like that can be a challenge, since the real-time compiler in AMD’s graphics drivers must ensure that one operation’s output isn’t needed as input for another operation. If such dependencies are present, the compiler may not be able to schedule instructions on all five ALUs at once, and some ALUs may be left idle. The fact that only the one, “fat” ALU can handle transcendentals further complicates matters.

Thus, Cayman introduces a new, slimmer SPU block with four ALUs. Each of those four ALUs has absorbed the capabilities of the old “fat” ALU, so they can all handle special functions. Both the symmetrical nature of the ALUs and the narrower VLIW4 instruction word should simplify compiler scheduling and allow fuller utilization of the ALUs. It should also ease register management and make performance more predictable, especially for non-graphics applications. AMD claims a 10% improvement in performance per square millimeter over the prior VLIW5 design. However, AMD Graphics CTO Eric Demers, who was chief architect on Cayman back when the project started and was also deeply involved in R600, said almost wistfully that AMD would have retained the five-wide ALU if graphics workloads were the only consideration. Obviously, GPU computing performance was a big impetus behind the change.
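
To make the scheduling problem concrete, here's a toy illustration of why VLIW4's symmetric slots are easier to fill than VLIW5's single "fat" slot. This is our own, much-simplified sketch; real shader compilers work on dependency graphs of actual ISA operations:

```python
# Toy VLIW packer: greedily bundle ready, independent ops each cycle.
# Purely illustrative -- real shader compilers are far more sophisticated.

def pack(ops, width, t_slot=False):
    """ops: list of (name, needs_transcendental, dependency-or-None)."""
    bundles, done = [], set()
    while len(done) < len(ops):
        bundle, t_used = [], False
        for name, needs_t, dep in ops:
            if name in done or (dep and dep not in done) or len(bundle) == width:
                continue                 # already issued, not ready, or full
            if needs_t and t_slot and t_used:
                continue                 # VLIW5: only one transcendental slot
            if needs_t and t_slot:
                t_used = True
            bundle.append(name)
        done.update(bundle)
        bundles.append(bundle)
    return bundles

# Two transcendentals plus two adds, all independent of one another:
ops = [("sin", True, None), ("cos", True, None),
       ("add0", False, None), ("add1", False, None)]
print(len(pack(ops, width=5, t_slot=True)))   # VLIW5: 2 bundles (T-slot limit)
print(len(pack(ops, width=4, t_slot=False)))  # VLIW4: 1 bundle (any slot works)
```

Even with a slot to spare, the VLIW5 arrangement needs two cycles here, because both transcendentals contend for the one fat ALU; the symmetric VLIW4 design issues everything at once.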

In fact, some of the enhancements in Cayman apply almost exclusively to GPU computing applications and may affect AMD’s FireStream lineup more directly than its consumer Radeon graphics cards. Among them: the ratios for double-precision floating-point math have improved somewhat, since DP math operations happen at one-quarter the single-precision rate, rather than one-fifth in prior designs. Cayman has taken another step toward the data center by incorporating ECC protection for external memories, much like Nvidia’s Fermi architecture. Unfortunately, unlike Fermi, internal memories and storage aren’t protected. Of course, ECC protection won’t be used in consumer graphics cards, regardless.
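
The double-precision ratio change is easy to quantify. Assuming the 6970's shipping specs (880MHz, 1536 ALUs, one multiply-add per ALU per clock), a rough peak-FLOPS comparison looks like this; the 1/5-rate figure is hypothetical, showing what the same chip would deliver at the old ratio:

```python
# Rough peak-FLOPS math at Radeon HD 6970 clocks; our arithmetic.
alus, clock_ghz = 1536, 0.880
sp_gflops = alus * 2 * clock_ghz         # multiply-add = 2 ops/clock per ALU
print(round(sp_gflops))                  # ~2703 GFLOPS single-precision

dp_quarter = sp_gflops / 4               # Cayman: DP at 1/4 the SP rate
dp_fifth = sp_gflops / 5                 # same SP rate at the old 1/5 ratio
print(round(dp_quarter), round(dp_fifth))  # ~676 vs. ~541 GFLOPS
```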

Cayman’s support for processing multiple compute kernels simultaneously is more robust, as well. According to Demers, Cypress could execute multiple kernels, but with only one pipe into the chip, their entry into the GPU had to be serialized. Cayman now has three entry points, with the possibility for more in future GPUs. Each kernel has its own command queue and virtual address domain, so they should be truly independent from one another.

The laundry list of compute-focused changes goes on from there, encompassing dual, bidirectional DMA engines for faster communication with the host system; the coalescing of shader read operations; and the ability to fetch data directly into the local data share attached to each SIMD. Many of these capabilities may sound familiar because Nvidia added them to its Fermi architecture. Clearly, AMD is on a similar architectural trajectory, toward making its GPU into a very competent general-purpose and data-parallel processor.

More tessellation from the, uh, tessellinator?
One of the flash points in DirectX 11 GPU architecture discussion has been the question of geometry throughput. Tessellation—the ability to take a low-polygon mesh and some additional information and transform it into a much more detailed, high-poly mesh on the GPU—is one of DX11’s highest-profile features. Add the fact that Nvidia has taken a much more sweeping approach to parallelizing geometry processing, and you have the makings of a good argument or three.

The underlying issue here is that polygon throughput rates in GPUs haven’t risen at nearly the rate other forms of graphics power have. There’s more to it, but the fact that setup and rasterization rates didn’t, for ages, eclipse one triangle per clock cycle is a good indicator of the problem. Without parallel geometry processing, the limits were fairly static. GPU makers are finally pushing past those limits, with Nvidia quite clearly in the lead. The GF100 and GF110 GPUs can rasterize up to four triangles per clock cycle, for example.

AMD created some confusion on this front when it introduced Cypress by claiming the chip had dual rasterizers. In reality, Cypress was dual core “from the rasterizers down,” as a knowledgeable source put it to me recently. What Cypress had was dual scan converters—a pixel-throughput optimization for large polygons—but it lacked the setup and primitive interpolation rates to surpass one triangle per clock cycle.


Cayman’s dual graphics/vertex engines. Source: AMD.

By contrast, Cayman has the ability to set up and rasterize two triangles per clock cycle. I’m not sure it quite tracks with what you’re seeing in the simplified diagram above, but Cayman has two copies of the logic block that does triangle setup, backface culling, and geometry subdivision for tessellation. Load-balancing logic distributes DirectX tiles between these two vertex engines, and the processed tiles are then fed into one of Cayman’s two 12-SIMD shader blocks. Interestingly, neither vertex engine is tied to a single shader block, nor vice-versa. Future variants of this architecture could have a single vertex engine and dual shader blocks—or the reverse.

Of course, two triangles per clock is the max theoretical rate, but delivered performance will be a little lower. I’m told AMD has measured Cayman’s throughput at between 1.6 and 1.8 triangles per clock.

That’s a big improvement over prior Radeons, but by comparison, Nvidia’s biggest chip, the GF110, has four raster engines; 16 “PolyMorph engines” for setup, transform, and geometry expansion; and a four-triangle-per-clock theoretical peak.

On the edge: better antialiasing
The render back-ends haven’t been overlooked in Cayman’s wide-ranging overhaul. Several new capabilities should raise performance and image quality.

Among those is native support in the ROP units for some additional color formats, including 16-bit integer (snorm/unorm) and 32-bit floating-point. AMD claims antialiasing with these color formats should be 2-4X faster than before, largely because those formats were previously handled in software—that is, in the shader core rather than in the ROPs.

The biggest news, though, is the introduction of a new antialiasing capability known as EQAA (which I believe stands for “enhanced quality antialiasing”). The intriguing thing here is that EQAA is more or less a clone of the coverage sampled AA (CSAA) feature Nvidia first introduced in the G80, its first-gen DX10 GPU. At that time, AMD was touting its custom-filtered antialiasing (CFAA) modes as an alternative to CSAA. Now, CFAA has all but disappeared, with both the wide and narrow tent filters from prior generations having been excised from the 6800/6900-series drivers. Only the edge-detect filter remains, although it is an interesting option.

Sadly, we don’t have time to explain multisampled antialiasing (or quantum physics, for that matter) in this space, but for those who are familiar, EQAA simply stores fewer color samples than it does coverage samples, thereby increasing accuracy (and thus image quality) with a minimal increase in the memory footprint or performance cost. We’ve found Nvidia’s corresponding feature, CSAA, to deliver visibly superior edge AA quality without slowing frame rates much at all. Cayman’s ROPs can be programmed to store a different number of color and coverage samples, so many things are possible, but AMD has largely replicated Nvidia’s CSAA modes, with one notable addition. Also, AMD’s naming scheme for the different EQAA modes is a little more modest, since it’s based on the number of color samples rather than coverage samples. I’ve mapped the names and sample sizes to clear up any confusion. Included are the traditional multisampled AA modes for reference.

| Radeon mode | Texture/shader samples | Color samples | Coverage samples | GeForce mode |
|---|---|---|---|---|
| 2X MSAA | 1 | 2 | 2 | 2X MSAA |
| 2X EQAA | 1 | 2 | 4 | |
| 4X MSAA | 1 | 4 | 4 | 4X MSAA |
| 4X EQAA | 1 | 4 | 8 | 8X CSAA |
| 8X MSAA | 1 | 8 | 8 | 8xQ CSAA |
| | 1 | 4 | 16 | 16X CSAA |
| 8X EQAA | 1 | 8 | 16 | 16xQ CSAA |
| | 1 | 8 | 32 | 32X CSAA |

AMD’s new mode, as you can see, is 2X EQAA, which captures only two color samples but four coverage samples. This mode could be a nice choice, especially in situations where performance is marginal—perhaps less likely to be an issue in Cayman than in a smaller derivative, but you get the picture.
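
You can get a feel for why the extra coverage samples are so cheap by sketching the raw framebuffer math. This is our own illustration with assumed storage sizes; AMD hasn't detailed the actual encoding, and framebuffer compression changes the real numbers considerably:

```python
# Uncompressed per-pixel AA storage, ignoring framebuffer compression.
# Assumes 32-bit color + 32-bit Z per color sample and a few bits per
# coverage sample for the fragment mask -- illustrative, not AMD's figures.

def aa_bytes_per_pixel(color_samples, coverage_samples, mask_bits=4):
    color_z = color_samples * (4 + 4)            # 4B color + 4B depth each
    coverage = coverage_samples * mask_bits / 8  # small per-sample mask
    return color_z + coverage

print(aa_bytes_per_pixel(4, 4))   # 4X MSAA: 34.0 bytes/pixel
print(aa_bytes_per_pixel(4, 8))   # 4X EQAA: 36.0 -- 8X-like coverage, ~6% more
print(aa_bytes_per_pixel(8, 8))   # 8X MSAA: 68.0 -- double the color storage
```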


Purported EQAA sample patterns. Source: AMD.

The EQAA sample patterns from the AMD presentation above are apparently only for illustrative purposes. We’ve captured the texture/shader (green dots), color (gray dots), and coverage (small red dots) sample patterns from Cayman and the GF110 with some simple tools, and they don’t really correspond with AMD’s presentation.

[Captured sample patterns: Radeon 4X EQAA alongside GeForce 8X CSAA, and Radeon 8X EQAA alongside GeForce 16xQ CSAA]

In reality, AMD’s sample patterns are quite a bit funkier. In 8X EQAA, one color and coverage sample is taken from the very top left corner of the pixel space. In the bottom right corner, you can see that same color sample point intruding from the pixel below.

[Test pattern captures: MSAA versus EQAA at 2X, 4X, and 8X]

EQAA’s effects are very evident in this simple test pattern. You have to like that 2X EQAA mode, which looks nearly as good as 4X multisampling.

I had hoped to include a lot more information on EQAA, including robust image quality comparisons with Nvidia’s CSAA and some performance data, but we’ll have to circle back and do that at a later date. We’re quite pleased to see AMD adding this feature, because it offers the possibility of direct performance comparisons between GeForces and Radeons in high-quality AA modes like 4X EQAA/8X CSAA. In fact, since we tend to prefer the image quality and performance of these AA methods, they may soon become our new de facto standard for testing, supplanting 4X multisampling.

Cayman does retain one other interesting antialiasing option, the morphological AA capability introduced with the Radeon HD 6800 series. MLAA is a post-process filter that lacks sub-pixel accuracy, so it’s a decidedly lower quality option than multisampling or EQAA—especially in motion, where its deficiencies are more evident than in static screen captures—but it has the great virtue of working properly with a wide range of games, including those that use deferred shading methods that don’t play well with MSAA and its derivatives. Again, this feature deserves more attention than we can give it presently, but we have it on our hit list for later.

PowerTune, somehow, isn’t for electric guitars
Speaking of features that deserve more attention than we can give them, Cayman introduces a novel power containment scheme known as PowerTune, whose stated goal is to keep the GPU from exceeding its maximum power rating (or TDP) in “outlier” applications that are much more power-intensive than the typical game. Nvidia added a similar feature in its GeForce GTX 580 and 570 graphics cards just recently, but AMD claims its approach is better on several fronts. For one, Cayman contains an integrated power control processor that monitors power draw constantly. This processor then algorithmically adjusts clock speeds for various logic blocks on the GPU in order to enforce the product’s stated TDP limit.

Any such mechanism that reduces clock speeds has the potential to impact performance. The picture becomes more complicated from there very quickly, though. PowerTune is, in a sense, the inverse of the Turbo Boost capability built into the latest Intel CPUs. Turbo Boost will opportunistically raise clock speeds in order to grab more performance when available, whereas PowerTune limits clock frequencies when the chip draws too much power. AMD tells us PowerTune generally shouldn’t kick in during normal use, especially with antialiasing in the mix. Of course, antialiasing isn’t always in use, and PowerTune will reduce performance in some measurable ways—and not just in FurMark or the like. Even 3DMark Vantage’s Perlin Noise test, which has lots of shader arithmetic, will cause PowerTune to kick in.
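
AMD hasn't published the algorithm's internals, but conceptually this is a feedback loop: estimate power, compare against the TDP, and trim clocks. A minimal sketch with invented constants and power readings, not AMD's actual algorithm:

```python
# Conceptual PowerTune-style clock governor. The constants and readings
# are made up for illustration; AMD hasn't disclosed its real algorithm.

TDP_W = 250.0      # board power limit (Radeon HD 6970)
BASE_MHZ = 880     # default engine clock
STEP_MHZ = 10      # clock adjustment granularity

def govern(estimated_power_w, clock_mhz):
    """One control iteration: throttle above TDP, recover below it."""
    if estimated_power_w > TDP_W:
        return max(clock_mhz - STEP_MHZ, BASE_MHZ // 2)  # clamp down
    return min(clock_mhz + STEP_MHZ, BASE_MHZ)           # never exceed default

clock = BASE_MHZ
for power in [230, 245, 262, 270, 255, 240]:  # hypothetical per-tick readings
    clock = govern(power, clock)
    print(power, "W ->", clock, "MHz")
```

Note the asymmetry with Turbo Boost: the governor recovers toward the default clock but never goes above it, which is exactly the "inverse" relationship described above.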

AMD is very open about the implications of this feature, even going so far as to point out that default GPU clocks for its products will no longer have to be constrained by “outlier” applications. Taken another way, that’s a straightforward admission that GPU clock frequencies will be set higher and allowed to bump up against the TDP limits. That’s a departure from the usual approach, say in the CPU world, in which buying a certain product generally guarantees the user a certain level of performance, and the invocation of throttling generally means a cooling problem has occurred. Intel has struck a very different compromise by offering its users some extra, non-guaranteed performance in the form of Turbo Boost. The question, we suppose, is how far AMD will push on binning and power capping its products over time—and whether users will decide to push back.

AMD tells us its PowerTune algorithm for each video card model will be tuned for the worst-case scenario, to accommodate the leakiest, most power-hungry chips that fall into that particular product bin. As a result, performance should not vary substantially from one, say, Radeon HD 6950 to the next, even if ASIC quality does. AMD claims this steadiness from chip to chip is a contrast to Nvidia’s power-limiting scheme, which is based directly on power draw at the 12V rail. Since Nvidia claims its cards shouldn’t clamp power during normal use, though, we’re unsure whether (or how much) that distinction matters.

The presence of the PowerTune controller opens up some tweaking options, which AMD has decided to expose to the end user. A slider in the Catalyst Control Center will allow users to raise or lower their video cards’ TDP limits by up to 20% in either direction. The possibilities here are several. The user could raise the TDP limit alone to get less frequency clamping and higher performance in some cases. He could overclock his GPU but leave the TDP clamp in place, capturing additional performance where possible while ensuring his video card’s power consumption doesn’t exceed its limits. He might choose to raise both clock speeds and power limits to achieve maximum performance. Or he might decide to lower the TDP limit in, say, a home-theater PC to ensure modest noise levels and power draw.

I suppose one could also overclock the snot out of the thing and plunge the PowerTune slider to negative 20% just to create confusion about how the card will perform in any given situation. Whee!

With that said, we’re about ready to close the book on Cayman’s architectural enhancements and move on to the specifics of the new Radeon cards. Before we do so, though, we should point out that Cayman inherits all of the display and multimedia goodness already familiar from the Radeon HD 6800 series, including DisplayPort 1.2, a considerable array of display outputs compatible with the Eyefinity multi-monitor gaming scheme, and AMD’s UVD3 video processing block.

You’re totally getting carded

| | GPU clock (MHz) | Shader ALUs | Textures filtered/clock | ROP pixels/clock | Memory transfer rate | Memory interface width (bits) | Idle/peak power draw | Suggested e-tail price |
|---|---|---|---|---|---|---|---|---|
| Radeon HD 6850 | 775 | 960 | 48 | 32 | 4.0 Gbps | 256 | 19W/127W | $179.99 |
| Radeon HD 6870 | 900 | 1120 | 56 | 32 | 4.2 Gbps | 256 | 19W/151W | $239.99 |
| Radeon HD 6950 | 800 | 1408 | 88 | 32 | 5.0 Gbps | 256 | 20W/200W | $299.99 |
| Radeon HD 6970 | 880 | 1536 | 96 | 32 | 5.5 Gbps | 256 | 20W/250W | $369.99 |

The table above shows the key clock rates and specifications for the two new Cayman-based graphics cards, alongside their younger cousins in the Radeon HD 6800 series. We have a couple of bombshells in the memory department, one of which is the sheer-panic clock frequencies AMD has achieved for Cayman’s GDDR5 interface and external memories. Nvidia’s GeForce GTX 580 has a wider memory interface, but it tops out at just 4 Gbps. The other surprise on the memory front: both the 6950 and 6970 are packing 2GB of RAM by default. Even the GTX 580, a $500 video card, has only 1536MB. This higher RAM amount should allow the 6950 and 6970 to drive some very high resolutions via Eyefinity and multiple displays without running out of space.

AMD says these two new cards should be available for sale today at online retailers. At $369.99, the 6970 is priced just above the GeForce GTX 570, whose suggested price (and current street price) is $349.99. Meanwhile, the 6950 has very little direct competition at the $300 mark, since Nvidia doesn’t currently have a similar offering. AMD tells us it expects its partners to introduce 1GB variants of the Cayman cards that will sell for less, too. We think a 1GB version of the 6950 could be a very attractive offering at around $279. Here’s hoping it happens.


The dark side of the 6970

From outside, the 6950 and 6970 are difficult to distinguish

The one obvious difference between the two: the 6970 has one eight-pin aux power input

From left to right: Radeon HD 6970, 6950, 6870, 6850

A naked Radeon HD 6950 card

The cooler includes a vapor chamber-based heatsink with a copper base—similar to GTX 500-series GeForces

That final picture above deserves some comment. First, notice that the Cayman cards have dual CrossFireX connectors, unlike the 6800 series. That means three- and four-way CrossFireX configurations should be possible. Second, check out that minuscule switch on the right. The 6900-series cards come with dual video BIOSes, so that the user can switch to a protected, backup BIOS should a bad flash scramble the main one. The switch allows the user to select which video BIOS to use. That’s a nifty little safety provision, and it should pay off for AMD’s partners in the form of lower RMA rates.

Our testing methods
Many of our performance tests are scripted and repeatable, but for some of the games, including Battlefield: Bad Company 2, we used the Fraps utility to record frame rates while playing a 60-second sequence from the game. Although capturing frame rates while playing isn’t precisely repeatable, we tried to make each run as similar as possible to all of the others. We raised our sample size, testing each Fraps sequence five times per video card, in order to counteract any variability. We’ve included second-by-second frame rate results from Fraps for those games, and in that case, you’re seeing the results from a single, representative pass through the test sequence.

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and we’ve reported the median result.
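
For the Fraps-based games, that aggregation boils down to something like the following sketch (hypothetical numbers, not our actual tooling):

```python
# Aggregate repeated benchmark passes the way we report them: the median
# average-FPS figure across runs. The run data here is invented.
from statistics import median

runs_fps = [58.2, 61.0, 59.7, 60.4, 58.9]  # five Fraps passes, avg FPS each
print(median(runs_fps))                    # 59.7 -- the figure we'd report
```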

Our test systems were configured like so:

| Component | Details |
|---|---|
| Processor | Core i7-965 Extreme 3.2GHz |
| Motherboard | Gigabyte EX58-UD5 |
| North bridge | X58 IOH |
| South bridge | ICH10R |
| Memory size | 12GB (6 DIMMs) |
| Memory type | Corsair Dominator CMD12GX3M6A1600C8 DDR3 SDRAM at 1600MHz |
| Memory timings | 8-8-8-24 2T |
| Chipset drivers | INF update 9.1.1.1025, Rapid Storage Technology 9.6.0.1014 |
| Audio | Integrated ICH10R/ALC889A with Realtek R2.51 drivers |
| Graphics | Radeon HD 4870 1GB with Catalyst 10.10c drivers |
| | Asus Radeon HD 5870 1GB with Catalyst 10.10c drivers |
| | Asus Radeon HD 5870 1GB + Radeon HD 5870 1GB with Catalyst 10.10c drivers |
| | Asus ROG Matrix Radeon HD 5870 2GB with Catalyst 10.10c drivers |
| | Radeon HD 5970 2GB with Catalyst 10.10c drivers |
| | Asus Radeon HD 6850 1GB with Catalyst 10.10c drivers |
| | Dual Asus Radeon HD 6850 1GB with Catalyst 10.10c drivers |
| | XFX Radeon HD 6870 1GB with Catalyst 10.10c drivers |
| | Sapphire Radeon HD 6870 1GB + XFX Radeon HD 6870 1GB with Catalyst 10.10c drivers |
| | Radeon HD 6950 2GB with Catalyst 8.79.6-101206a drivers |
| | Dual Radeon HD 6950 2GB with Catalyst 8.79.6-101206a drivers |
| | Radeon HD 6970 2GB with Catalyst 8.79.6-101206a drivers |
| | Dual Radeon HD 6970 2GB with Catalyst 8.79.6-101206a drivers |
| | GeForce 8800 GTX 768MB with ForceWare 260.99 drivers |
| | XFX GeForce GTX 280 1GB with ForceWare 260.99 drivers |
| | Asus GeForce GTX 460 768MB with ForceWare 260.99 drivers |
| | Dual Asus GeForce GTX 460 768MB with ForceWare 260.99 drivers |
| | MSI Hawk Talon Attack GeForce GTX 460 1GB 810MHz with ForceWare 260.99 drivers |
| | MSI Hawk Talon Attack GeForce GTX 460 1GB 810MHz + EVGA GeForce GTX 460 FTW 1GB 850MHz with ForceWare 260.99 drivers |
| | Galaxy GeForce GTX 470 1280MB GC with ForceWare 260.99 drivers |
| | GeForce GTX 480 1536MB with ForceWare 260.99 drivers |
| | GeForce GTX 570 1280MB with ForceWare 263.09 drivers |
| | Zotac GeForce GTX 570 1280MB + GeForce GTX 570 1280MB with ForceWare 263.09 drivers |
| | GeForce GTX 580 1536MB with ForceWare 262.99 drivers |
| | Zotac GeForce GTX 580 1536MB + Asus GeForce GTX 580 1536MB with ForceWare 262.99 drivers |
| Hard drive | WD RE3 WD1002FBYS 1TB SATA |
| Power supply | PC Power & Cooling Silencer 750 Watt |
| OS | Windows 7 Ultimate x64 Edition |
| DirectX runtime | June 2010 update |

Thanks to Intel, Corsair, Western Digital, Gigabyte, and PC Power & Cooling for helping to outfit our test rigs with some of the finest hardware available. AMD, Nvidia, and the makers of the various products supplied the graphics cards for testing, as well.

Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults. Vertical refresh sync (vsync) was disabled for all tests.

We used the following test applications:

Some further notes on our methods:

  • We measured total system power consumption at the wall socket using a Yokogawa WT210 digital power meter. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement. The cards were plugged into a motherboard on an open test bench.

    The idle measurements were taken at the Windows desktop with the Aero theme enabled. The cards were tested under load running Left 4 Dead 2 at a 1920×1080 resolution with 4X AA and 16X anisotropic filtering. We test power with Left 4 Dead 2 because we’ve found that the Source engine’s fairly simple shaders tend to cause GPUs to draw quite a bit of power, so we think it’s a solidly representative peak gaming workload.

  • We measured noise levels on our test system, sitting on an open test bench, using an Extech 407738 digital sound level meter. The meter was mounted on a tripod approximately 10″ from the test system at a height even with the top of the video card.

You can think of these noise level measurements much like our system power consumption tests, because the entire system’s noise levels were measured. Of course, noise levels will vary greatly in the real world along with the acoustic properties of the PC enclosure used, whether the enclosure provides adequate cooling to avoid a card’s highest fan speeds, placement of the enclosure in the room, and a whole range of other variables. These results should give a reasonably good picture of comparative fan noise, though.

  • We used GPU-Z to log GPU temperatures during our load testing.

The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Pixel fill and texturing performance

| | Peak pixel fill rate (Gpixels/s) | Peak bilinear integer texel filtering rate (Gtexels/s) | Peak bilinear FP16 texel filtering rate (Gtexels/s) | Peak memory bandwidth (GB/s) |
|---|---|---|---|---|
| GeForce GTX 460 768MB | 16.8 | 39.2 | 39.2 | 88.3 |
| GeForce GTX 460 1GB 810MHz | 25.9 | 47.6 | 47.6 | 124.8 |
| GeForce GTX 470 GC | 25.0 | 35.0 | 17.5 | 133.9 |
| GeForce GTX 480 | 33.6 | 42.0 | 21.0 | 177.4 |
| GeForce GTX 570 | 29.3 | 43.9 | 43.9 | 152.0 |
| GeForce GTX 580 | 37.1 | 49.4 | 49.4 | 192.0 |
| Radeon HD 6850 | 25.3 | 37.9 | 19.0 | 128.0 |
| Radeon HD 6870 | 28.8 | 50.4 | 25.2 | 134.4 |
| Radeon HD 5870 | 27.2 | 68.0 | 34.0 | 153.6 |
| Radeon HD 6950 | 25.6 | 70.4 | 35.2 | 160.0 |
| Radeon HD 6970 | 28.2 | 84.5 | 42.2 | 176.0 |
| Radeon HD 5970 | 46.4 | 116.0 | 58.0 | 256.0 |

The theoretical peak numbers in the table above will serve as a bit of a guide to what comes next. Different GPU architectures achieve more or less of their peak rates in real-world use, depending on many factors, but these numbers give us a sense of how the various video cards compare.
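
These peaks are straightforward products of unit counts and clock speed. For instance, the 6970's entries can be reproduced as follows, using the specs from the card table earlier (our arithmetic, not AMD's):

```python
# Reproducing the Radeon HD 6970's theoretical peaks from its specs.
clock_ghz = 0.880    # engine clock
rops = 32            # ROP pixels per clock
tex_int = 96         # integer texels filtered per clock
mem_gbps = 5.5       # GDDR5 transfer rate per pin
bus_bits = 256       # memory interface width

print(rops * clock_ghz)         # 28.2 Gpixels/s pixel fill
print(tex_int * clock_ghz)      # 84.5 Gtexels/s integer filtering
print(tex_int / 2 * clock_ghz)  # 42.2 Gtexels/s FP16 filtering (half rate)
print(mem_gbps * bus_bits / 8)  # 176.0 GB/s memory bandwidth
```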

Versus its most direct rival, the GeForce GTX 570, the Radeon HD 6970 has comparable rates all around. Although the GTX 570 has a wider 320-bit memory interface, the 6970’s amazing GDDR5 clock speeds more than make up the deficit. The fact that the GTX 570 can filter FP16 textures at its full rate, rather than half, is no obstacle for the 6970, either, since Cayman’s higher unit count and clock frequency allow it to reach similar FP16 filtering rates, at least in theory.

The closest “competitor” to the Radeon HD 6950 is last year’s model, the Radeon HD 5870. The 6950 is only a little faster than the 5870 across the board—and that’s the stock model. We’ve also tested a slightly overclocked version of the 5870 with 2GB of RAM, which should provide us with an interesting and very direct comparison between the Cayman and Cypress architectures in which key rates are nearly equal and efficiency becomes the question.

This color fill rate test tends to be limited primarily by memory bandwidth rather than by ROP rates. True to form, the 6970 and 6950 outperform the GeForce GTX 570 here.

Notice, also, that I’ve tested a trio of older cards for historical interest, including the Radeon HD 4870, the GeForce GTX 280, and the oldest DX10 chip on the planet, the GeForce 8800 GTX. They can only participate in a subset of our tests since they’re not DX11-capable, but they should be fun to watch and compare.

3DMark’s texture fill test doesn’t involve any sort of texture filtering. That’s unfortunate, since texture filtering rates are almost certainly more important than sampling rates in the grand scheme of things. Still, this is a decent test of FP16 texture sampling rates, so we’ll use it to consider that aspect of GPU performance. Texture storage is, after all, essentially the way GPUs access memory, and unfiltered access speeds will matter to routines that store data and retrieve it without filtering.

AMD’s raw sampling rates were already quite a bit faster than Nvidia’s, and Cayman’s higher unit count puts some additional distance between the two.

Cayman’s much higher theoretical texture filtering rates work out to somewhat higher measured throughput in RightMark, but nothing like the 2X advantage the 6970 has over the GTX 570 on paper. Then, in our FP16 filtering test, the 6970 doesn’t deliver on nearly as much of its promise as the GTX 570 does—and the GTX 580 is faster still.

Shader and geometry processing performance

| | Peak shader arithmetic (GFLOPS) | Peak rasterization rate (Mtris/s) | Peak memory bandwidth (GB/s) |
|---|---|---|---|
| GeForce GTX 460 768MB | 941 | 1400 | 88.3 |
| GeForce GTX 460 1GB 810MHz | 1089 | 1620 | 124.8 |
| GeForce GTX 470 GC | 1120 | 2500 | 133.9 |
| GeForce GTX 480 | 1345 | 2800 | 177.4 |
| GeForce GTX 570 | 1405 | 2928 | 152.0 |
| GeForce GTX 580 | 1581 | 3088 | 192.0 |
| Radeon HD 6850 | 1517 | 790 | 128.0 |
| Radeon HD 6870 | 2016 | 900 | 134.4 |
| Radeon HD 5870 | 2720 | 850 | 153.6 |
| Radeon HD 6950 | 2253 | 1600 | 160.0 |
| Radeon HD 6970 | 2703 | 1760 | 176.0 |
| Radeon HD 5970 | 4640 | 1450 | 256.0 |

Theoretical shader performance is an even trickier subject than the graphics rates we covered on the last page, for reasons we discussed when considering Cayman’s VLIW4 SPU design. Scheduling efficiency and utilization will count for a lot, as will other quirks of the individual architectures. In theory, the 6970’s peak FLOPS rates are nearly double the GeForce GTX 570’s, but Nvidia has a very different approach to shader design involving fewer units, doubled clock frequencies (versus the GPU core clock), and very efficient sequential, scalar scheduling. Also, Cayman’s dual vertex engines give it a nice boost in peak rasterization rate over the 5870, but the 6970’s theoretical peak rate is still less than two-thirds of the GTX 570’s.
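
The Radeon entries in this table follow the same spec-sheet arithmetic as before: ALU count times two ops per clock (a multiply-add) for FLOPS, and triangles per clock times the engine clock for rasterization. A quick sketch of our math:

```python
# Spec-sheet peaks: ALUs x 2 ops/clock for FLOPS, tris/clock x clock for
# rasterization. Our arithmetic, applied to the Radeon entries above.
def peaks(alus, tris_per_clock, clock_mhz):
    gflops = alus * 2 * clock_mhz / 1000
    mtris = tris_per_clock * clock_mhz
    return round(gflops), mtris

print(peaks(1536, 2, 880))  # Radeon HD 6970: (2703, 1760)
print(peaks(1600, 1, 850))  # Radeon HD 5870: (2720, 850)
# GeForce numbers don't fit this formula directly, since Fermi's shader
# ALUs run at twice the core clock used for rasterization. Also, AMD's
# measured 1.6-1.8 tris/clock on Cayman works out to roughly
# 1410-1580 Mtris/s delivered at the 6970's 880MHz clock.
```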

The first tool we can use to measure delivered pixel shader performance is ShaderToyMark, a pixel shader test based on six different effects taken from the nifty ShaderToy utility. The pixel shaders used are fascinating abstract effects created by demoscene participants, all of whom are credited on the ShaderToyMark homepage. Running all six of these pixel shaders simultaneously easily stresses today’s fastest GPUs, even at the benchmark’s relatively low 960×540 default resolution.

Yep, Nvidia’s GPUs are faster here, despite their much lower theoretical peak FLOPS counts. Go past that and focus on the question of Cypress’ VLIW5 shaders versus Cayman’s VLIW4 design for a second, though. In theory, the Radeon HD 5870 can deliver 2.72 TFLOPS to the 6970’s 2.7 TFLOPS. In practice, though, the 6970 is over 10% faster, even in this all-graphics workload. That’s progress, even if it’s not revolutionary.

Up next is a compute shader benchmark built into Civilization V. This test measures the GPU’s ability to decompress textures used for the graphically detailed leader characters depicted in the game. The decompression routine is based on a DirectX 11 compute shader. The benchmark reports individual results for a long list of leaders; we’ve averaged those scores to give you the results you see below.

It’s not awful, but Cayman performs relatively poorly in this test, all things considered. The 6950 falls behind the Barts-based Radeon HD 6870, which has no advantage on paper that would predict this outcome. One possible reason for this result is that AMD’s driver-based real-time compiler for Cayman may still be fairly immature. There’s another possibility, too, which we’ll explore in a sec.

Finally, we have the shader tests from 3DMark Vantage.


Clockwise from top left: Parallax occlusion mapping, Perlin noise,
GPU cloth, and GPU particles

The 6900-series cards generally perform as expected in three of these tests, offering minor incremental improvements over the Radeon HD 5870. In a fourth, the Perlin noise test, the 5870 is markedly faster. Why? I’m pretty sure we’re seeing Cayman’s PowerTune power cap taking effect. AMD specifically mentioned 3DMark’s Perlin noise as an application that bumps up against the limits, and the performance would seem to indicate that clock speeds are being lowered.

Even so, notice that the 6970 remains quite a bit faster than the GTX 570 in this benchmark, just as it is in the parallax occlusion mapping test. Both of those are pixel shader-intensive tests, and as we’ve mentioned, Perlin noise is very arithmetic-heavy. The final two 3DMark tests, however, emphasize vertex shader performance, and the Fermi architecture’s distributed geometry processing capabilities give it a clear win. Note that Nvidia’s pre-Fermi G80 and GT200 chips (in the 8800 GTX and GTX 280, respectively) don’t fare nearly as well, relatively speaking, against the Radeon HD 4870.

Geometry processing throughput
We can measure geometry processing speeds pretty straightforwardly with a couple of tools. The first is the Unigine Heaven demo. This demo doesn’t really make good use of additional polygons to increase image quality at its highest tessellation levels, but it does push enough polys to serve as a decent synthetic benchmark.

The Radeon HD 6970 performs as well here as two Cypress chips aboard the Radeon HD 5970, so that’s progress. Still, Cayman is no match for the GF110’s quad rasterizers and 16 vertex engines.

We can push into even higher degrees of tessellation using TessMark’s multiple detail levels.

Hmm. TessMark uses OpenGL rather than Direct3D to access the GPU, and apparently AMD’s OpenGL drivers aren’t yet fully aware of Cayman’s expanded geometry processing capabilities. Frustrating.

HAWX 2
As we transition from synthetic benchmarks that measure geometry processing throughput to real-world gaming tests, we’ll make a stop at the curious case of HAWX 2.

We already commented pretty extensively on the controversy surrounding tessellation and polygon use in HAWX 2, so we won’t go into that again. I’d encourage you to read what we wrote earlier, if you haven’t yet, in order to better understand the issues. Suffice to say that this game pushes through an awful lot of polygons, but it doesn’t necessarily do so in as efficient a way as one would hope. The result is probably something closer to a synthetic test of geometry processing performance than a typical deployment of DX11 tessellation.

The question is: can Cayman’s revamped tessellation capabilities make the Radeons more competitive in this strange case?

Well, again, this is progress, but the Radeon HD 6970 still trails the much cheaper GeForce GTX 460 1GB. Suffice to say that four or so years ago, when AMD and Nvidia architects began envisioning these GPU architectures, they had very different visions about what sort of polygon throughput should be required. Then again, in defense of the Radeons and of HAWX 2‘s developers, the Cayman cards are achieving easily playable frame rates at this four-megapixel resolution, so the point really is academic.

Lost Planet 2
Our next stop is another game with a built-in benchmark that makes extensive use of tessellation, believe it or not. We figured this and HAWX 2 would make a nice bridge from our synthetic tessellation benchmark and the rest of our game tests. This one isn’t quite so controversial, thank goodness.

This benchmark emphasizes the game’s DX11 effects, as the camera spends nearly all of its time locked onto the tessellated giant slug. We tested at two different tessellation levels to see whether it made any notable difference in performance. The difference in image quality between the two is, well, subtle.

This contest is a little closer, but the GTX 570 still has the upper hand on the 6970 here. The 6970 and 6950 are faster than the 5870, but not by a lot.

Civilization V
In addition to the compute shader test we’ve already covered, Civ V has several other built-in benchmarking modes, including two we think are useful for testing video cards. One of them concentrates on the world leaders presented in the game, which is interesting because the game’s developers have spent quite a bit of effort on generating very high quality images in those scenes, complete with some rather convincing material shaders to accent the hair, clothes, and skin of the characters. This benchmark isn’t necessarily representative of Civ V‘s core gameplay, but it does measure performance in one of the most graphically striking parts of the game. As with the earlier compute shader test, we chose to average the results from the individual leaders.

The Radeons dominate in this test of pixel shading prowess, and Cayman even improves on Cypress’ performance somewhat.

Another benchmark in Civ V focuses, rightly, on the most taxing part of the core gameplay, when you’re deep into a map and have hundreds of units and structures populating the space. This is when an underpowered GPU can slow down and cause the game to run poorly. This test outputs a generic score that can be a little hard to interpret, so we’ve converted the results into frames per second to make them more readable.

The tables turn here, as the GTX 570 outduels the 6970. One bright spot for the Radeon camp is multi-GPU performance, where the Nvidia cards seem to struggle.

StarCraft II
Up next is a little game you may have heard of called StarCraft II. We tested SC2 by playing back a match from a recent tournament using the game’s replay feature. This particular match was about 10 minutes in duration, and we captured frame rates over that time using the Fraps utility. Thanks to the relatively long time window involved, we decided not to repeat this test multiple times, like we usually do when testing games with Fraps.

We tested at the settings shown above, with the notable exception that we also enabled 4X antialiasing via these cards’ respective driver control panels. SC2 doesn’t support AA natively, but we think this class of card can produce playable frame rates with AA enabled—and the game looks better that way.

The cheaper Radeon HD 6950 essentially ties the GeForce GTX 570, while the 6970 is a few FPS ahead of them both. The dual 6970 CrossFireX config takes top honors overall, with the highest average frame rate and an FPS minimum over our eight-minute test period that’s above the 60Hz refresh rate common to most LCDs. Impressive.

Battlefield: Bad Company 2
BC2 uses DirectX 11, but according to this interview, DX11 is mainly used to speed up soft shadow filtering. The DirectX 10 rendering path produces the same images.

We turned up nearly all of the image quality settings in the game. Our test sessions took place in the first 60 seconds of the “Heart of Darkness” level.

The Radeon HD 5870 has long performed relatively well in this game at these settings, and I had hoped to see Cayman improve on that tradition. I’m not sure two FPS qualifies as an improvement, though. Two FPS is also the difference between the 6970 and the GTX 570, our marquee matchup. Again, not much to write home about. I shouldn’t complain, though. With frame rate minimums in the mid-30s, even the 6950 is more than fast enough to handle this one.

Metro 2033
We decided to test Metro 2033 at multiple image quality levels rather than multiple resolutions, because there’s quite a bit of opportunity to burden these GPUs simply using this game’s more complex shader effects. We used three different quality presets built into the game’s benchmark utility, with the performance-destroying advanced depth-of-field shader disabled and tessellation enabled in each case.

At the lower quality settings, the GeForces’ higher geometry throughput with tessellation puts them on top of the older Radeons, and the situation evens out once higher-quality pixel shaders become the bottleneck. The Cayman cards, though, hand it to the GeForces even at the lower quality levels; you can feel that extra tessellation goodness at work.

And even though the older DX10 cards can’t do tessellation at all (an admittedly unfair comparison), they’re still far slower.

Aliens vs. Predator
AvP uses several DirectX 11 features to improve image quality and performance, including tessellation, advanced shadow sampling, and DX11-enhanced multisampled anti-aliasing. Naturally, we were pleased when the game’s developers put together an easily scriptable benchmark tool. This benchmark cycles through a range of scenes in the game, including one spot where a horde of tessellated aliens comes crawling down the floor, ceiling, and walls of a corridor.

For these tests, we turned up all of the image quality options to the max, with two exceptions. We held the line at 2X antialiasing and 8X anisotropic filtering simply to keep frame rates in a playable range with most of these graphics cards.

Here’s another case where Cayman is incrementally faster than Cypress, but that proves to be enough to put the 6970 ahead of the GeForce GTX 570.

DiRT 2: DX9
This excellent racer packs a scriptable performance test. We tested at DiRT 2‘s “ultra” quality presets in both DirectX 9 and DirectX 11. The big difference between the two is that the DX11 mode includes tessellation on the crowd and water. Otherwise, they’re hardly distinguishable.

DiRT 2: DX11

This final game test doesn’t do much to decide the contest between the 6970 and GTX 570: at the highest quality settings, only one FPS separates the two.

Power consumption
Now for some power and noise testing. Notice that the cards marked with asterisks in the results below have custom cooling solutions that may perform differently than the GPU maker’s reference solution.

The 6950 and 6970 draw a few watts less at idle than the GTX 570, fitting for a smaller chip. When running Left 4 Dead 2, however, the 6970 actually pulls a little more juice than the GTX 570. (We expect one might find different results with a different sort of graphics workload, but we think L4D2 is a good, representative game with relatively high power draw.) The 6950 is kind of in a class of its own, but its power draw is relatively low under load, only slightly more than the GeForce GTX 460 1GB’s.

Noise levels and GPU temperatures

Nearly all of the single-GPU solutions are pretty quiet at idle, and most are perilously close to the noise floor for the rest of our system’s components. Still, the 6950 and 6970 prove to be exceptional citizens, among the quietest solutions we tested. Dropping in a second 6970 does raise the noise levels at idle a bit, likely due to the obstruction of airflow into the primary card’s blower.

Under load, Nvidia’s stock coolers simply outperform AMD’s. The GTX 570’s GPU temperature exactly matches the 6970’s, and the two cards’ power draw is only 5W apart, yet the GTX 570 is 2.5 dB quieter.

The value proposition
Now that we’ve stuffed you full of benchmark results, we’ll try to help you make some sense of the bigger picture. We’ll start by compiling an overall average performance index, based on the highest quality settings and resolutions tested for each of our games, with the notable exception of the disputed HAWX 2. We’ve excluded directed performance tests from this index, and for Civ V, we included only the “late game view” results.

Holy moly, we have a tie. The GTX 570 and 6970 are evenly matched overall in terms of raw performance. With the results this close, we should acknowledge that the addition or subtraction of a single game could sway the results in either direction.

With this performance index established, we can consider overall performance per dollar by factoring price into the mix. Rather than relying on list prices all around, we grabbed our prices off of Newegg where possible. The exception: out of necessity, we’re trusting AMD that its suggested prices for the 6900 cards will translate into similar street prices.

Generally, for graphics cards with reference clock speeds, we simply picked the lowest priced variant of a particular card available. For instance, that’s what we did for the GTX 580. For the cards with custom speeds, such as the Asus GTX 460 768MB and 6850, we used the price of that exact model as our reference.
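
Conceptually, the value math is nothing exotic; it boils down to something like the sketch below. The index values here are invented placeholders to show the shape of the calculation, not our actual dataset:

```python
# Performance-per-dollar, sketched with hypothetical index numbers.
# Our real index averages normalized results across the game suite.

cards = {
    # name: (performance index, price in USD)
    "GeForce GTX 570": (100.0, 349.99),
    "Radeon HD 6970":  (100.0, 369.00),  # effectively tied overall
    "Radeon HD 6950":  (88.0, 299.00),   # hypothetical index value
}

for name, (perf, price) in cards.items():
    print(f"{name}: {perf / price * 100:.1f} performance per $100")
```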

| AMD card | Price | Nvidia card |
|---|---|---|
| | $169.99 | GeForce GTX 460 768MB |
| Radeon HD 6850 | $179.99 | |
| | $214.99 | GeForce GTX 460 1GB 810MHz |
| Radeon HD 6870 | $239.99 | |
| | $259.99 | GeForce GTX 470 |
| Radeon HD 5870 | $289.99 | |
| Radeon HD 6950 | $299.00 | |
| | $349.99 | GeForce GTX 570 |
| Radeon HD 6970 | $369.00 | |
| | $429.99 | GeForce GTX 480 |
| Radeon HD 5870 2GB | $499.99 | |
| Radeon HD 5970 | $499.99 | |
| | $509.99 | GeForce GTX 580 |

A simple mash-up of price and performance produces these results:

The lower-priced solutions tend to bubble to the top whenever you look at raw price and performance like that.

We can get a better sense of the overall picture by plotting price and performance on a scatter plot. On this plot, the better values will be closer to the top left corner, where performance is high and price is low. Worse values will gravitate toward the bottom right, where low frame rates meet high prices.

Either way you slice it, the GTX 570 looks to be a better value than the Radeon HD 6970 for a simple reason: equivalent performance and a $20 price gap in the GTX 570’s favor. Happily for AMD, the Radeon HD 6950 looks to be a better value than either of them, albeit at a lower performance level.

Another way we can consider GPU value is in the context of a larger system purchase, which may shed a different light on what it makes sense to buy. The 6900-series Radeons are definitely enthusiast-type parts, so we’ve paired them with a proposed system config that’s similar to the hardware in our testbed system but a little more economical.

| Component | Product | Price |
|---|---|---|
| CPU | Intel Core i7-950 | $294.99 |
| Cooler | Thermaltake V1 | $51.99 |
| Motherboard | Gigabyte GA-X58A-UD3R | $194.99 |
| Memory | 6GB Corsair XMS3 DDR3-1333 | $74.99 |
| Storage | Western Digital Caviar Black 1TB | $89.99 |
| | Asus DRW-24B1ST | $19.99 |
| Audio | Asus Xonar DG | $29.99 |
| PSU | PC Power & Cooling Silencer Mk II 750W | $119.99 |
| Enclosure | Corsair Graphite Series 600T | $159.99 |
| Total | | $1,036.91 |

That system price will be our base. We’ve added the cost of the video cards to the total, factored in performance, and voila:

Factor in the price of a complete system, and guess what? That $20 gap between the 6970 and GTX 570 pretty much melts into irrelevance. In fact, the 6970’s ever-so-teeny performance advantage is arguably enough to justify the additional 20 bucks.

Remember that these results would look very different with a more or less expensive system, so your mileage may vary.
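
Folding in the system price dilutes the card-price gap, as a quick extension of the same sketch shows. The index values are the same hypothetical placeholders as before; the $1,036.91 base comes from the build above:

```python
# Same value math against full system cost rather than the card alone.
BASE_SYSTEM = 1036.91  # proposed build total from the table above
cards = {"GeForce GTX 570": (100.0, 349.99),  # hypothetical perf indexes
         "Radeon HD 6970":  (100.0, 369.00)}

for name, (perf, price) in cards.items():
    total = BASE_SYSTEM + price
    print(f"{name}: {perf / total * 1000:.2f} performance per $1000 of system")
```

With these placeholder numbers, a $20 gap that amounts to nearly 6% of the card price shrinks to under 1.5% of the system price, which is the whole point of the exercise.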

Conclusions
I’ll let you in on a little secret. Those of you who just skip to the conclusions of these articles truly aren’t seeing our best work. We write the conclusions after, you know, everything else. Right now, I’ve barely moved from this spot in six days, my blood-caffeine level must be five times any sane legal limit, and I can’t feel my legs.

What’s more, I have almost no idea how you choose between our two marquee contestants, the Radeon HD 6970 and the GeForce GTX 570. Overall, their performance is equivalent. They both end in -70. Could be a wash!

Let’s make the case for both and see where we land.

In the GTX 570’s favor are a host of traditional winning attributes for a graphics card. It’s 20 bucks cheaper, draws a little less power under load, and generates less noise. What’s more, although most games aren’t yet taking advantage of it, the GTX 570 has measurably and markedly superior geometry processing throughput. That may be a forward-looking architectural feature, and the question of whether and how much that matters is far from settled. Given the performance parity elsewhere, though, it’s hard to ignore.

Cayman’s main advances in antialiasing capabilities, geometry processing, and shader scheduling efficiency move AMD closer to what Nvidia has offered in its Fermi architecture for the better part of 2010. That doesn’t really grant Cayman a minty fresh scent of newness. Cayman is an incremental change—an improvement, no doubt—that makes these new Radeons more, not less, like the competition.

On the other hand, the Radeon HD 6970 has 2GB of RAM onboard, can support three or more displays from a single card, and will allow you to play games across them via Eyefinity. You’ll need two GTX 570 cards in order to partake of Nvidia’s competing Surround Gaming feature, and even then, GTX 570 cards have 1280MB of memory, which could be limiting at six or more megapixels. In this sense, the Radeon HD 6970 outclasses the GTX 570. We can see paying the extra 20 bucks for that, if you aspire to multi-display gaming—or even if you think you might someday.

With no direct competitors and a nice price of $300, the Radeon HD 6950 gives us no sense of conflict about its merits. It would be an ideal step up from a cheaper offering like the Radeon HD 6870. Indeed, because of Cayman’s many improvements, we’d be very tempted to make the leap if we were deciding between the two. The fact that the 6950 has the same set of display outputs and 2GB of memory makes it an intriguing candidate for an Eyefinity setup, too. There really is nothing else in its class.
