Texture Mappers (1995-1999)
Fixed T&L, Early Shaders (2000-2002)
Shader Model 2.0/3.0 (2003-2007)
Unified Shaders (2008+)
Featureset determines where a card will go, not its year of introduction
Unified Shader? GPGPU?
Around 2004, when SM3.0 was new and cool, we were noticing from framerate profiling that some frames, roughly one or two a second, took far more rendering time than all the others. This caused "micro-stutter", an uneven framerate. The shader profile showed that these frames were much more vertex shader intensive than the others, and GPUs had much less vertex shader hardware than pixel shader hardware, because most frames needed about a four to one mix of pixel to vertex work.
As shader units were getting more and more complex, it made sense for each shader to be able to handle both vertex and pixel programs. ATI introduced this with the Radeon HD2xxx series, though due to other "features" on the GPU, they were generally slower than the previous generation X1xxx series. Nvidia jumped on board with the Geforce 8 series, with 500 MADD GFLOPS on the top end Geforce 8800 parts. The teraflop GPU was in sight.
The first single GPU with enough shaders running fast enough to manage one trillion operations per second was the AMD Radeon HD4850. The previous generation 3870 X2 had already surpassed this, but it was two GPUs on one card. Nvidia struggled during this generation: while the Geforce GTX285 did manage to hit one TFLOPS, it was expensive, noisy, power hungry and rare.
The most complex operation an FPU historically did was the "multiply-add" (MADD) or "fused-multiply-add" (FMA). It's doing a multiply and an add at the same time. The operation is effectively an accumulate on value a, where a = a + (b x c). The distinction is that in a MADD the intermediate value b x c is rounded to final precision before it is added to a, while in an FMA the intermediate is kept at a higher precision and only the final value is rounded. The real-world distinction is minor and usually ignorable, except that an FMA is performed in a single step, so it is typically twice as fast. AMD introduced FMA support in TeraScale 2 (Evergreen, 2009) and Nvidia in Fermi (2010). A common driver optimisation is to promote MADD operations to FMA.
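The rounding difference is easy to see with a small sketch. This is only an illustration under stated assumptions: it uses Python's 64 bit floats rather than a GPU's FP32, and it emulates the fused operation with exact rational arithmetic instead of calling any hardware FMA. The function names are made up for this example.

```python
from fractions import Fraction


def madd(a, b, c):
    """Unfused multiply-add: the product is rounded, then the sum is rounded."""
    product = b * c      # first rounding happens here
    return a + product   # second rounding happens here


def fma_emulated(a, b, c):
    """Emulated fused multiply-add: exact arithmetic, one rounding at the end.

    This is an emulation for illustration, not a call to a hardware FMA.
    """
    exact = Fraction(a) + Fraction(b) * Fraction(c)
    return float(exact)  # the only rounding step


# Values chosen so that b * c sits just below 1.0 and rounds up to it.
a = -1.0
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30

print("MADD:", madd(a, b, c))
print("FMA :", fma_emulated(a, b, c))
```

With these inputs the unfused version loses the tiny remainder entirely, while the fused version keeps it. In rendering workloads a difference this small rarely matters, which is why drivers can usually promote one to the other.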
For consistency's sake, where we list "MADD GFLOPS" or "FMA GFLOPS", we may mean either, whichever one is faster on the given hardware. An FMA, despite being treated as one operation, is actually two, and therefore if a GPU can do one billion FMAs a second, its GFLOPS rating (giga floating point operations per second; 1 GFLOPS is one billion operations per second) is 2.
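As a rough sketch of how the "MADD GFLOPS" figures in this article can be reproduced: multiply the number of shader ALUs by the shader clock, then by two because each MADD or FMA counts as two operations. The function name is invented for this example, and the ALU counts and clocks are the ones quoted in the spec lists on this page.

```python
def peak_gflops(alus, shader_clock_mhz, flops_per_alu_per_clock=2):
    """Theoretical peak GFLOPS, counting each MADD/FMA as two operations."""
    return alus * shader_clock_mhz * flops_per_alu_per_clock / 1000.0


# (name, ALUs, shader clock in MHz), taken from this article's spec lists.
cards = [
    ("GeForce 8800 GTS 512 MB", 128, 1625),
    ("Radeon HD 3450", 40, 600),
    ("Radeon HD 5750", 720, 700),
    ("Radeon HD 5770", 800, 850),
]

for name, alus, clock in cards:
    print(f"{name}: {peak_gflops(alus, clock):.0f} GFLOPS")
```

This is a theoretical ceiling only. Real shader code rarely keeps every ALU busy on every clock.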
PNY GeForce 8800 GTS 512 MB - 2007
It's fitting that we lead this article with the GeForce 8800, which was the first DirectX 10 class GPU, the first use of unified shaders, the first implementation of Nvidia's new "Tesla" architecture, which was still called G80 back then. What we understand a modern GPU to be, this was the first.
As a popular card, there were a lot of variants, and PNY's lot usually used its "XLR8" branding, but this one used Nvidia's reference branding.
What G80 did was make very simple ALUs which could be double-pumped. The G80 architecture did SIMD over eight "stream processors" (later to be called CUDA cores) which were natively FP32. With 16 compute units, G80 had 128 stream processors. On this card the ALUs ran from an 812.5 MHz clock at a 1625 "MHz-equivalent" rate, while the rest of the GPU clocked at a more sensible 650 MHz. This double pumped ALU stack was using clock doubling, so it triggered on the rising and falling edges of the clock.
The shader domain on G80/Tesla/Fermi ran the ALUs at the double-pumped clock, but the supporting structures, such as cache, register file, flow control, special functions, dispatch and retire, only ran at the base clock.
This was not an original GeForce 8800 GTS. The originals were the GTS 320 MB (12 units) and GTS 640 MB (14 units), both of which clocked much lower than the GTS 512 MB and had fewer cores. They used the G80 GPU on TSMC's 90 nm process.
When 65 nm came along, Nvidia soon transitioned over to it, and the G92 GPU was the result. It had the same 16x8 streaming processor configuration, but was a lot smaller, 324 mm^2 instead of G80's 480. This 8800 GTS 512 MB was a G92 and was configured almost identically to the GeForce 9800 GTX.
G92 was not a direct die shrink of G80, it was a little more capable in its CUDA configuration (Compute Capability 1.1, not 1.0), but the 55 nm G92b was a direct die shrink of G92. G92 improved the PureVideo IP block, meaning VC-1 and H.264 could be GPU assisted.
G92 also had an alarming incompatibility with PCIe 1.0a motherboards, which led to the video card not initialising during boot. This caused a very high return rate, as Nvidia's backwards compatibility (something the PCIe specification demands) only went as far back as PCIe 1.1.
Nvidia's spec for the memory was 820 MHz GDDR3, but most vendors used faster RAM. The 8800 and 9800 series were just before Nvidia started to lose the plot a little. Tesla's extreme shader clocks served it well for small pixel shader programs of the sort DX9 would use. G80 was up against ATI's RV670, which had 320 cores, but in a VLIW-5 design. While VLIW-5 was harder to get peak performance from, it used fewer transistors, less die area and so resulted in a smaller, cheaper GPU - It also used much less power.
A product line as popular as the 8800 series was bound to be influential. Games even as late as 2015, eight years later, would often list an 8800 GT or GTS as the minimum requirement.
While ATI's RV670 in, say, Radeon HD 3870 (8800 GTS's most direct competitor) had a board power of just over 100 watts, 8800 GTS was rated at 150 watts. So far, so good. By the time we reach 2009, two years later, with Radeon HD 5870's Cypress GPU against Tesla's swansong, GT200B, we have ATI's 180 watt board power against the 238 watts of Nvidia's GeForce GTX 285.
Ultimately it was not possible for Nvidia to keep throwing more and more power at the problem.
Core: 16 ROPs, 32 TMUs, 650MHz (20.8 billion texels per second, 10.4 billion pixels per second)
RAM: 256 bit GDDR3, 970MHz, 62,080MB/s
Shader: 16x Unified shader 4.0 (16 SIMD of 8, 128 total)
MADD GFLOPS: 416
Most recent driver is 342.01 from 2016, when G80/G9x class Tesla support was dropped.
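The Core and RAM lines in these spec lists follow simple arithmetic, sketched here for the 8800 GTS 512 MB above. The function names are invented for this example, and the factor of two on the memory clock assumes double data rate memory, as GDDR3 is.

```python
def texel_rate_gtexels(tmus, core_mhz):
    """Peak texture fill rate: one texel per TMU per core clock."""
    return tmus * core_mhz / 1000.0


def pixel_rate_gpixels(rops, core_mhz):
    """Peak pixel fill rate: one pixel per ROP per core clock."""
    return rops * core_mhz / 1000.0


def bandwidth_mb_s(bus_bits, mem_clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth: bus width in bytes x clock x transfers per clock."""
    return bus_bits // 8 * mem_clock_mhz * transfers_per_clock


# Figures from the spec list above: 16 ROPs, 32 TMUs, 650 MHz core,
# 256 bit GDDR3 at 970 MHz.
print(texel_rate_gtexels(32, 650), "billion texels per second")
print(pixel_rate_gpixels(16, 650), "billion pixels per second")
print(bandwidth_mb_s(256, 970), "MB/s")
```

The same three functions reproduce most of the Core and RAM lines further down the page.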
MSI Radeon HD3450 (V118 R3450-TD256H Hewlett Packard OEM variant) - 2007
All I had for this was an Underwriters Laboratories certification number, E96016 which identified it as nothing beyond the basic PCB manufactured by Topsearch of Hong Kong and the model number, V118. That's not a lot to be going on.
It came out of a HP Pavilion desktop, pretty typical consumer junk and certainly nothing intended for gaming, so it's not going to be anything spectacular (the size of the PCB says that too). All hardware tells a story, it's just a matter of listening. The silkscreening on the PCB tells us that it's got guidelines for quite a few different coolers, so this PCB is meant for a family of video cards, perhaps all based around the same GPU or similar GPUs. The VGA connector can be omitted easily and isn't part of the PCB - It's aimed at OEMs who know what they want and need no flexibility.
It's a Radeon HD3450 with 256MB manufactured by MSI using a Topsearch manufactured PCB. It's amazing what the right search terms in Google can let you infer, isn't it?
MSI's standard retail part comes with a large passive heatsink and S-video out, but OEMs can get whatever variant of a design they like if it'll seal a deal for thousands of units. These things, the V118 model, were retailing at about $20.
Under the heatsink is a 600MHz RV620 GPU (a near-direct die shrink to 55nm of the HD2400's 65nm RV610) feeding 1000MHz DDR2 memory, but the good ends there. The memory is 64 bit, giving a meagre 8GB/s bandwidth and the core has only four ROPs - It's a single quad. Unified shader 4.1 is present, 40 units, but they're pretty slow.
Core: RV620 with 4 ROPs, 1 TMU per ROP, 600MHz (2.4 billion texels per second, 2.4 billion pixels per second)
RAM: 64 bit DDR2, 1000MHz, 8000MB/s
Shader: 1x Unified shader 4.1
MADD GFLOPS: 48
ATI Radeon HD 3450 256 MB (AMD B629 Dell OEM variant as 'W337G') - 2007
ATI's OEM parts were never easy to identify. This one carried its FCC identification as ADVANCED MICRO DEVICES MODEL: B629, dating it to after AMD's acquisition of ATI. The PCB carries a date code showing it was made in week 19, 2010.
The rear connector carries a DMS-59 output and the standard-for-the-time S-Video output. Looking at the size and layout of the PCB shows us that it is clearly a derivative of the one above with many components sharing identical placement. Notably, this one has a half-height bracket for the Dell Optiplex SFF 360/380/755/760/780.
The performance of the Radeon HD 3450's RV620 GPU was, in a word, lacking. It carried DDR2 memory as standard, running it at 500 MHz for just 8 GB/s memory bandwidth. The memory was provided by SK Hynix as four BGA packages, two on each side. At 600 MHz the GPU itself was never going to set any records, especially with just one R600-architecture shader package (40 individual shaders).
Getting more performance out of this was an exercise in how highly the RAM would clock. Even with the down-clocked GPU, 8 GB/s plain was not enough. The GPU itself would normally hit 700-800 MHz (and in the HD 3470, the same GPU was running at 800 MHz on the same PCB, with the same cooler!) but this didn't help when the RAM was slow 64 bit DDR2.
When it arrived here, in 2019, it had been sat in a corporate stock room for years and was completely unused, not a single grain of dust on the fan.
On power on, as was common for GPUs of the day, the fan spins up to maximum before winding back. This is really quite noisy for such a small fan!
We'll talk about the GPU architecture on this one. R600 used a ring bus to connect the ROPs to the shader cores, a rather unusual architecture, and took a VLIW-5 instruction set later back-named TeraScale 1. This means each shader block of 40 was eight individual cores, which could handle a single VLIW-5 instruction to its five execution units. TeraScale 2 would increase this to sixteen cores (and 80 shaders per block). R600's ring bus also, as noted, decoupled the ROPs from the execution units, but the texture units were "off to one side" of the ring bus, so each group of execution units did not have its own samplers and combining samples with pixels to do MSAA meant first colour data exiting the shaders into the ring bus, then into the texture units, then back to the ring bus to head to the ROPs to be finally MSAA sampled. This was inefficient and, in the original R600, plain didn't work. Even in the revised silicon, which RV620 here is based on, it was best to avoid antialiasing. The RV620 also improved the UVD video block from "1.0" to "2.0", adding better video decoding to the GPU.
Physically, the GPU was 67 mm^2 and contained 181 million transistors. At 67 mm^2, it barely cost anything to make. It would be AMD's smallest GPU until Cedar/RV810, three years later in 2010, at 59 mm^2. A drop in the price of bulk silicon from TSMC after 28 nm and the general increase in very low end GPU cost meant that no subsequent GPU has been this small, though Nvidia's 77 mm^2 GP108 came close. Powerful IGPs from AMD and Intel embedded in the CPUs have put paid to the very small entry level GPU segment.
Core: RV620 with 4 ROPs, 1 TMU per ROP, 600MHz (2.4 billion texels per second, 2.4 billion pixels per second)
RAM: 64 bit DDR2, 1000MHz, 8000MB/s
Shader: 1x Unified shader 4.1
MADD GFLOPS: 48
PowerColor Radeon HD 3850 AGP 512MB - 2007
If one were to chart all the GPUs made since the early DX7 era, one would notice a near-perfect rule applying to the industry. One architecture is used for two product lines. For example, the NV40 architecture was introduced with Geforce 6 and then refined and extended for Geforce 7. There's an entire list covering the whole history of the industry. Observe:
|This||Was released as||Then changed a bit into||Which was released as|
|NV20||Geforce 3||NV25||Geforce 4|
|R300||Radeon 9700||R420||Radeon X800|
|NV40||Geforce 6||G70||Geforce 7|
|G80||Geforce 8||G92||Geforce 9|
|R600||Radeon HD 2xxx||RV670||Radeon HD 3xxx|
What's happening is that a new GPU architecture is made on an existing silicon process, one well characterised and understood. ATI's R600 was on the known 80nm half-node process, but TSMC quickly made 65 nm and 55 nm processes available. The lower end of ATI's Radeon HD 2000 series were all 65 nm, then the 3000 series was 55 nm. The die shrink gives better clocking, lower cost and better performance, even if all else is equal. It's also an opportunity for bugfixing and fitting in those last few features which weren't quite ready first time around.
In ATI's case, the Radeon HD 2 and 3 series are the same GPUs. Here's a chart of what uses what:
|R600||Radeon HD 2900 series|
|RV610||Radeon HD 2400|
|RV615||Radeon HD 4230, 4250|
|RV620||Radeon HD 3450, 3470|
|RV630||Radeon HD 2600|
|RV635||Radeon HD 3650, 45x0, 4730|
|RV670||Radeon HD 3850, 3870, 3870 X2|
Whew! ATI used the R600 architecture across three product lines, making it almost as venerable as the R300, which was used across three lines and in numerous mobile parts and IGPs. ATI used R300 so much because it was powerful and a very good baseline. So what of R600?
In a word, it sucked. The top end 2900 XT was beaten by the previous generation X1900 XT, let alone the X1950s and the top of the line X1950 XTX. It suffered a catastrophic performance penalty from enabling antialiasing and Geforce 7 could beat it pretty much across the board. Yet here was a product meant to be taking on Geforce 8!
While this AGP version was released in 2008 (announced in January, though it was in the channel by December the previous year), the PCIe version came out in 2007, so that is the year listed here. ATI did, however, throw the price really low. You could pick them up for £110 when new and even today (August 2009) they're still in stock for around £75.
There'd not been a really decent mid-range card since the Geforce 6600GT (though the Geforce 7900GT was fairly mid-range, it was overpriced), and while the ageing Geforce 7 series was still available, it was not at the DirectX 10 level. The ATI Radeon HD 3850 pretty much owned the mid-range of the market. Better still, it was available in this AGP version which wasn't a cut down or in any way crippled model as most AGP versions were. Indeed, Sapphire's AGP version was overclocked a little and both Sapphire and PowerColor's AGP cards had 512 MB of memory, up from the standard 256 MB. The PowerColor card used this very large heatsink but Sapphire used a slimline single-slot cooler.
In early 2008, these were a damned good buy, especially for the ageing Athlon64 X2 (or Opteron) or older Pentium4 system which was still on AGP.
Core: RV670 with 16 ROPs, 1 TMU per ROP, 670MHz (10.7 billion texels per second, 10.7 billion pixels per second)
RAM: 256 bit GDDR3, 1660MHz, 53,100 MB/s
Shader: 4x Unified shader 4.1
MADD GFLOPS: 429
Supplied by Doomlord
AMD Radeon HD 4870 - 2008
After losing their way somewhat with the Radeon HD 2xxx (R600) debacle, AMD were determined to set matters straight with the 3xxx and 4xxx series. The 3xxx were little more than bugfixed and refined 2xxx parts, comprising RV620, 635 and 670 (though the RV630 was actually the Radeon HD 2600).
ATI had postponed the R400/Loki project when it turned out to be far too ambitious and had simply bolted together two R360s to produce their R420, which powered the Radeon X800XT and most of the rest of that generation, sharing great commonality with the Radeon 9700 in which R300 had debuted. Indeed, a single quad of R300, evolved a little over the years, powers AMD's RS690 chipset onboard graphics as a version of RV370.
By 2006, the R400 project still wasn't ready for release (ATI had been sidetracked by building chipsets and by being bought by AMD) and R500, which R400's Loki project was now being labelled as, was still not ready. Instead, R520 was produced. It was quite innovative and the RV570, as the X1950, became one of the fastest things ever to go in an AGP slot. It also provided the XBox 360's graphics.
It took until 2007 for what was the R400 to finally hit the streets as R600. All the hacks, tweaks and changes made to the chipset meant that it was barely working at all. Things as basic as antialiasing (which should have been handled by the ROPs) had to be done in shaders because the ROPs were rumoured to be broken. This crippled the R600's shader throughput when antialiasing was in use and led to pathetically low benchmark scores.
R700 corrected everything. It sported unified shaders at 4.1 level (beyond DirectX 10), and eight hundred of them at that. These are raw ALUs: the RV770 contains ten shader pipelines, each with 16 cores, each core being 5 ALUs, so it's more correct to say that RV770 contains 10 shaders and 160 shader elements, the 800 raw ALUs being what AMD call 'stream processors'. It also corrected the ring memory controller's latency issues and fixed the ROPs. After two generations of being uncompetitive, AMD were back in the ring with the RV710 (4450, 4470), RV730 (4650, 4670) and RV770 (4850, 4870).
RV770's ROPs are again arranged in quads, each quad being a 4 pipeline design and having a 64 bit bus to the memory crossbar. Each pipeline has the equivalent of 2.5 texture mappers (it can apply five textures in two passes, but only two in one pass).
Worth comparing is a Radeon HD 2900XT: 742MHz core, 16 ROPs, 105.6GB/s of raw bandwidth, 320 shaders, but about a third of the performance of the 4870 - Even when shaders aren't being extensively utilized.
Core: 16 ROPs, 40 TMUs, 750MHz (30.0 billion texels per second, 12.0 billion pixels per second)
RAM: 256 bit GDDR5, 1800MHz, 115,200MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 10x Unified shader 4.1 (80 units each)
MADD GFLOPS: 1,200
It is apparently quite difficult to coax full performance out of the shaders; most synthetic benchmarks measure between 200 and 800 GFLOPS.
Thanks to Filoni for providing the part
Nvidia Quadro FX 380 256 MB - 2008
The Quadro FX 380 used Nvidia's G96 GPU, and ran it at 450 MHz. It had 256 MB of GDDR3 RAM, running at 700 MHz. It used the Tesla architecture, had 16 shaders, 8 TMUs and 8 ROPs. It was rated for a very low 34 watt TDP. This meant it compared against the embarrassingly slow GeForce 9400 GT:
|Name||GPU||GPU Clock||CUDA Cores||RAM Bandwidth|
|GeForce 9400 GT||G96||700 MHz||16||25.6 GB/s|
|Quadro FX 380||G96||450 MHz||16||22.4 GB/s|
The significantly more powerful 9400 GT was also about a quarter of the price. The G96 GPU had four execution cores, each with 8 CUDA cores, but the Quadro FX 380 had half the entire GPU disabled. It, in a word, stank.
Core: 4 ROPs, 8 TMUs, 450MHz (3.6 billion texels per second, 1.8 billion pixels per second)
RAM: 128 bit GDDR3, 700MHz, 22,400MB/s
Shader: 2x Unified shader 4.0 (16 units)
MADD GFLOPS: 28.8
AMD Radeon HD 5450 512 MB - 2009
Here's a beauty, the slowest thing that could even be a TeraScale 2 GPU and still work. If you just wanted to add a DVI port to an older PC or a build without any onboard video, it was worth the £40 or so you'd pay. Like all of AMD's TeraScale 2 parts, it could take a DVI-HDMI adapter and actually output HDMI signals, so the audio would work.
It was up against Nvidia's entirely unattractive GeForce 210 and GT220, which it resoundingly annihilated, but nobody was buying either of these for performance.
Some of them were supplied with a low profile bracket: you could remove the VGA port (on the cable) and the rear backplane, and replace it with the low profile backplane. This XFX model was supplied with such a bracket.
The 'Cedar' GPU was also featured in the FirePro 2250, FirePro 2460 MV, Radeon HD 6350, Radeon HD 7350, R5 210 and R5 220. It was pretty much awful in all of these, but it was meant to be. Nobody bought one of these things expecting a powerful games machine.
I dropped it in a secondary machine with a 2.5 GHz Athlon X2, and ran an old benchmark (Aquamark 3) on it. It scored roughly the same as a Radeon 9700 from 2003: the shader array is three times faster, but the textured pixel rate is very similar and the memory bandwidth is much less. This particular unit uses RAM far below AMD's specification of 800 MHz, instead using common DDR3-1333 chips, in this case Nanya Elixir N2CB51H80AN-CG parts rated for 667 MHz operation, which it clocked at 533 MHz (1067 MHz DDR). I got about 700 MHz (11.2 GB/s) out of them; 715 MHz was too far, and my first attempts at 800 MHz (before I knew exactly which RAM parts were in use) gave instant display corruption, a Windows 10 TDR loop and an eventual crash. Being passive, the GPU couldn't go very far. I had it up to 670 MHz without any issues, but I doubt it'd make 700 MHz.
I'm sure XFX would say the RAM was below spec and even underclocked for power reasons, but DDR3 uses practically no power and the RAM chips aren't touching the heatsinks anyway.
AMD's official (if confused) specs gave RAM on this card as "400 MHz DDR2 or 800 MHz DDR3". AMD also specifies the bandwidth for DDR3 as 12.8 GB/s, which is matched by 64 bit DDR3-1600. The roughly 8.5 GB/s of this card is far below that spec. Even the memory's rated DDR3-1333 is still only 10.6 GB/s. Memory hijinks aside, all 5450s ran at 650 MHz, seemingly without exception.
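To put numbers on the gap, here is a small sketch comparing 64 bit DDR3 at AMD's specified speed, at the chips' rated speed, and at the speed this card actually ran. The function name is invented for this example, and the 1066 MT/s figure assumes the 533 MHz clock quoted above.

```python
def ddr_bandwidth_gb_s(bus_bits, mega_transfers_per_s):
    """Peak bandwidth in GB/s from bus width and transfer rate (MT/s)."""
    return bus_bits / 8 * mega_transfers_per_s / 1000.0


BUS_BITS = 64
spec = ddr_bandwidth_gb_s(BUS_BITS, 1600)  # AMD's DDR3 spec for the 5450

# Transfer rates are twice the memory clock, since DDR3 is double data rate.
for label, rate in [
    ("AMD spec (DDR3-1600)", 1600),
    ("Chips' rating (DDR3-1333)", 1333),
    ("As shipped (533 MHz)", 1066),
    ("My overclock (700 MHz)", 1400),
]:
    bw = ddr_bandwidth_gb_s(BUS_BITS, rate)
    print(f"{label}: {bw:.1f} GB/s, {bw / spec:.0%} of spec")
```

On these figures the card shipped with about two thirds of the bandwidth AMD's own spec sheet called for.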
What rubs is that Cedar spends 292 million transistors on being around as fast as the 107 million in the R300. It does do more with less RAM, and has Shader Model 5.0, and three to four times the raw shader throughput, but pixel throughput is around the same, and so performance is comparable.
Core: 4 ROPs, 4 TMUs, 650MHz (2.6 billion texels per second, 2.6 billion pixels per second)
RAM: 64 bit DDR3, 533MHz, 8,528MB/s
Shader: 1x Unified shader 5.0 (80 units)
MADD GFLOPS: 104
AMD Radeon HD 5750 - 2009
AMD's Radeon HD 5700 series rapidly became the mid-range GPUs to rule them all. Built around the Juniper GPU, they sported a teraflop of shader power and over 70GB/s of raw bandwidth. The 5750 promised to be the king of overclockability, being the same Juniper GPU as the 5770 with one shader pipeline disabled and clocked 150 MHz lower, but supplied with an identical heatsink on largely identical PCBs. So it'd draw less power per clock and would overclock further, right?
Wrong. AMD artificially limited overclocking on Juniper-PRO (as the 5750 was known) to 850 MHz and even then most cards just wouldn't reach it. 5770s would hit 900 MHz, sometimes even 1 GHz from a stock clock of 850, yet the very same GPUs on the 5750 would barely pass 800 from stock of 700. This one, for example, runs into trouble at 820 MHz.
Why? AMD reduced the core voltage for 5750s. Less clock means less voltage required, meaning lower power use but also lower overclocking headroom. 5750s ran very cool and very reliably, but paid the price in their headroom.
Juniper was so successful that AMD rather cheekily renamed them from 5750 and 5770 to 6750 and 6770. No, really, just a pure rename. A slight firmware upgrade enabled BluRay 3D support and HDMI1.4, but any moron could flash a 5750 with a 6750 BIOS and enjoy the "upgrade". Unfortunately there was no way of unlocking the disabled 4 TMUs and shader pipeline on the 5750 to turn it into a 5770, it seems they were physically "fused" off.
The Stream Processors were arranged very much like the previous generation, 80 stream processors per pipeline (or "compute engine"), ten pipelines (one disabled in the 5750). Each pipeline has 16 cores, and each "core" is 5 ALUs, so our 5750 has 144 VLIW-5 processor elements. With a slightly downgraded, but more efficient and slightly more highly clocked GPU and slightly more memory bandwidth, the 5750 was that touch faster than a 4850. In places it could trade blows with a 4870 (see above). The 5xxx series really was just a fairly minor update to the earlier GPUs.
This card is the Powercolor version, and the PCB is quite flexible. It is able to be configured as R84FH (Radeon HD 5770), R84FM (This 5750) and R83FM (Radeon HD 5670) - Redwood shared the same pinout as Juniper, so was compatible with the same PCBs. It also could be configured with 512 MB or 1 GB video RAM. The 512 MB versions were a touch cheaper, but much less capable.
Core: 16 ROPs, 36 TMUs, 700MHz (25.2 billion texels per second, 11.2 billion pixels per second)
RAM: 128 bit GDDR5, 1150MHz, 73,600MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 9x Unified shader 5.0 (80 units each)
MADD GFLOPS: 1,008
AMD Radeon HD 5770 - 2009
The Radeon HD 5770 was about the champion of price effectiveness in 2009 on release, and for much of 2010. A cheaper video card was usually much slower, a faster one was much more expensive. The closely related 5750 was about 5-10% cheaper and 5-10% slower.
AMD had planned a "Juniper-LE", to complete the "PRO" (5750) and "XT" (5770) line up, but the smaller, slower and much cheaper Redwood GPU overlapped, so the probable HD 5730 was either never released or very rare. A "Mobility Radeon HD 5730" was released, which was a Redwood equipped version of the Mobility 5770, which used GDDR3 memory instead of GDDR5. Redwood, in its full incarnation, was exactly half a Juniper. Observe:
|Name||GPU||Die Area (mm^2)||Shader ALUs (Pipelines)||TMUs||ROPs||Typical Clock (MHz)|
|HD 5450||Cedar PRO||59||80 (1)||8||4||650|
|HD 5550||Redwood LE||104||320 (4)||16||8||550|
|HD 5570||Redwood PRO||104||400 (5)||20||8||650|
|HD 5670||Redwood XT||104||400 (5)||20||8||775|
|HD 5750||Juniper PRO||166||720 (9)||36||16||700|
|HD 5770||Juniper XT||166||800 (10)||40||16||850|
|HD 5830||Cypress LE||334||1120 (14)||56||16||800|
|HD 5850||Cypress PRO||334||1440 (18)||72||32||725|
|HD 5870||Cypress XT||334||1600 (20)||80||32||850|
It's quite clear what AMD was up to: "PRO" and "LE" had parts disabled, while the "XT" was fully enabled. Furthermore, Redwood was half of Juniper, which was half of Cypress. Cedar was the odd one: it was far below the others and the "PRO" moniker hinted it was at least partly disabled, but no 2-CU Cedar was ever released. From the die size relative to the others, it does appear to only have one compute unit.
AMD's VLIW-5 architecture clustered its stream processors in groups of five (this allows it to do a dot-product 3 in one cycle), there are 16 such groups in a "SIMD Engine" or shader pipeline. Juniper has ten such engines. Each engine has four texture mappers attached.
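That layout can be turned into a quick sketch which derives the ALU and TMU counts from the number of enabled SIMD engines. The function and constant names are invented for this example. Cedar is left out, because the table lists it with 8 TMUs for a single engine, which does not follow the four-per-engine rule described above.

```python
ALUS_PER_GROUP = 5      # VLIW-5: five stream processors per group
GROUPS_PER_ENGINE = 16  # sixteen groups per SIMD engine
TMUS_PER_ENGINE = 4     # four texture mappers attached to each engine


def evergreen_config(engines):
    """Return (shader ALUs, TMUs) for a given number of enabled SIMD engines."""
    alus = engines * GROUPS_PER_ENGINE * ALUS_PER_GROUP
    tmus = engines * TMUS_PER_ENGINE
    return alus, tmus


# Engine counts as listed in the table for each part.
parts = [
    ("HD 5550 (Redwood LE)", 4),
    ("HD 5670 (Redwood XT)", 5),
    ("HD 5750 (Juniper PRO)", 9),
    ("HD 5770 (Juniper XT)", 10),
    ("HD 5830 (Cypress LE)", 14),
    ("HD 5850 (Cypress PRO)", 18),
    ("HD 5870 (Cypress XT)", 20),
]

for name, engines in parts:
    alus, tmus = evergreen_config(engines)
    print(f"{name}: {alus} ALUs, {tmus} TMUs")
```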
Back to the 5770 at hand: when new it was about £130 (January 2011) and by far the apex of the price/performance curve, joined by its 5750 brother which was a tiny bit slower and a tiny bit cheaper.
Core: 16 ROPs, 40 TMUs, 850MHz (34 billion texels per second, 13.6 billion pixels per second)
RAM: 128 bit GDDR5, 1200MHz, 76,800MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 10x Unified shader 5.0 (10 VLIW-5 of 80, 800 total)
MADD GFLOPS: 1,360
The world came to a crashing halt when AMD introduced GCN. AMD's previous designs, all the way back to the R600 it inherited from ATi, were VLIW-5 (or, in Cayman, VLIW-4). "VLIW" is "Very Long Instruction Word", meaning one instruction word packages several operations to be issued together. VLIW in this case is five units wide, so a VLIW-5 operation on, say, TeraScale 2, can do five FP32 operations, but it has to do them all at once, and there are restrictions as to which instructions can be used together.
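Why that matters can be shown with a toy packer. This is a deliberately simplified sketch, not how AMD's shader compiler actually worked: it assumes the only restriction is that an operation cannot share a bundle with something it depends on, and all names are invented for this example.

```python
def pack_vliw(ops, deps, width=5):
    """Greedily pack operations into bundles of up to `width` slots.

    ops  - operation names in program order
    deps - maps an operation to the set of operations it must wait for
    An operation may only join a bundle once everything it depends on
    has completed in an earlier bundle.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == width:
                break
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("dependency cycle in input")
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
        bundles.append(bundle)
    return bundles


def slot_utilisation(bundles, width=5):
    """Fraction of issue slots that carried real work."""
    used = sum(len(bundle) for bundle in bundles)
    return used / (len(bundles) * width)


ops = ["op0", "op1", "op2", "op3", "op4"]

# Five independent operations fill one bundle completely.
independent = pack_vliw(ops, {})

# Five operations that each need the previous result take five bundles.
chain = {"op1": {"op0"}, "op2": {"op1"}, "op3": {"op2"}, "op4": {"op3"}}
dependent = pack_vliw(ops, chain)

print(len(independent), slot_utilisation(independent))
print(len(dependent), slot_utilisation(dependent))
```

Independent work fills every slot, while a dependent chain leaves four of the five units idle on each issue. Real shader code falls somewhere in between, which is why peak VLIW-5 throughput was hard to reach.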
GCN is very different, it is a SIMD-16 design at core. Each GCN Compute Engine contains up to sixteen Compute Units, the diagram of a Compute Unit is shown below.
So these are four SIMD-16 units in parallel, meaning they can do four instructions at once, each instruction on sixteen FP32 values. A Wave64 wavefront presented to a CU can be viewed as a 4x64 array, where wavefronts 0 to 3 are being executed at any one time. It's best to view it clock by clock:
0. Wavefronts are loaded
1. Instruction 1 loads 16 values each from wavefronts 0 to 3 into all four SIMD-16s
2. Instruction 1 is completed, Instruction 2 loads 16 values from each wavefront...
3. Instruction 2 is completed, Instruction 3 loads 16 values from each wavefront...
4. Instruction 3 is completed, Instruction 4 loads 16 values from each wavefront...
5. Repeat - Load new wavefronts
A GCN compute unit therefore works on four wavefronts at once, and can issue new wavefronts every four cycles. Unfortunately, if we only have one wavefront, we still block the entire four as we can only load a new one every four cycles. This means in games GCN tended to be under-utilised. Over time developers did learn to work better with GCN - They had to: GCN won the console.
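The under-utilisation can be sketched with a toy occupancy model. It is a simplification under loudly stated assumptions: each wavefront is pinned to one SIMD-16, a 64 wide wavefront takes four clocks per instruction on it, wavefronts are handed out round-robin, and memory latency is ignored entirely. The names are invented for this example.

```python
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16
WAVE_SIZE = 64
# A 64 wide wavefront needs four passes through a 16 lane SIMD.
CYCLES_PER_INSTRUCTION = WAVE_SIZE // LANES_PER_SIMD


def simulate_cu(wavefronts, instructions_per_wave):
    """Return (cycles taken, lane utilisation) for one compute unit.

    Assumption: wavefronts are assigned to SIMDs round-robin and never
    migrate, so a lone wavefront leaves three SIMDs with nothing to do.
    """
    queue_lengths = [0] * SIMDS_PER_CU
    for w in range(wavefronts):
        queue_lengths[w % SIMDS_PER_CU] += 1

    # The CU is finished when its busiest SIMD is finished.
    cycles = max(queue_lengths) * instructions_per_wave * CYCLES_PER_INSTRUCTION
    busy_lane_cycles = wavefronts * instructions_per_wave * WAVE_SIZE
    total_lane_cycles = cycles * SIMDS_PER_CU * LANES_PER_SIMD
    utilisation = busy_lane_cycles / total_lane_cycles if total_lane_cycles else 0.0
    return cycles, utilisation


for n in (1, 2, 4, 5):
    cycles, util = simulate_cu(n, instructions_per_wave=4)
    print(f"{n} wavefront(s): {cycles} cycles, {util:.1%} of lanes busy")
```

In this model a single wavefront keeps only a quarter of the lanes busy, and a fifth wavefront doubles the run time because it has to queue behind another on the same SIMD.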
It was the GPU architecture of the Xbox One and PlayStation 4. If your game engine ran badly on GCN, it was a has-been engine. So on release, a Radeon HD 7970 was as much as 20% slower than a GeForce GTX 680 in unfavourable cases, and in general around competitive across the board. Four years later, the GCN-based 7970 was as much as 40% faster than the GTX 680 in contemporary workloads. Games better learned how to handle GCN's one-in-four wavefront dispatch.
GCN also had a really cool feature, not exposed to begin with, called asynchronous compute. This allowed compute tasks to be "slotted into" spare wavefronts. GCN had a lot of spare wavefronts! This meant AMD was aware of GCN's wavefront issue problem from the get-go, and had hardware to alleviate it. The asynchronous compute engine was finally exposed to generic software in DirectX 12. Nvidia also supported asynchronous compute, but emulated it in software. There was some benefit on the Green Team, but not as much. Nvidia primarily has less wasted capacity in a wavefront anyway.
GCN initially had poor tessellation throughput. Games barely used it, so one tessellator was fine for first generation GCN (GFX6). Because this was an AMD weakness, Nvidia pushed tessellation heavily in its GameWorks middleware! In some cases, tessellation was so extreme that 32 polygons were under each pixel, dramatically slowing performance for no additional image accuracy at all. GCN's tessellation issues were mostly worked out in GFX7, Hawaii and Bonaire. Hawaii doubled the tessellators.
The next enhancement to GCN was GFX8, Fiji and Tonga. Polaris (Ellesmere, Baffin and Lexa) was also GFX8 level. AMD claimed a "color compression" boost, but this is GPU engineering parlance for anything which causes more effective utilisation of memory bandwidth. Larger L2s and CU prefetching added most of GFX8's advantage, where it had advantage at all.
The peak of GCN was "NCU", used in Vega. Vega shipped quite badly broken: the Primitive Shader (combines vertex and pixel data for dispatch) was either broken or didn't add any performance. Vega's CU occupancy in even highly optimised games was around 50-70%, almost criminally bad, and this was the best of GCN. The Drawstream Binning Rasteriser also appeared to be non-functional, but it almost certainly did work. Vega managed a substantial improvement over Fiji, even clock corrected, despite having much less raw memory bandwidth. Like Maxwell, Vega made better use of the on-chip storage and, like Maxwell, the tile based dynamic rendering shows most benefit in bandwidth constrained situations.
AMD Radeon HD 7970 - 2011
AMD's Graphics Core Next was originally a codename for what was coming after VLIW-4 (Cayman, seen in the HD 6970), the instruction set was to change from VLIW to SIMD.
Each GCN "block" consists of four vector ALUs (SIMD-16) and a simple scalar unit. Each SIMD-16 unit can do 16 MADDs or FMAs per clock, so 128 operations per clock for the whole thing. The texture fetch, Z and sample units are unchanged from TeraScale 2/Evergreen: there are 16 texture fetch, load/store units and four texture filter units per GCN "compute unit".
AMD's Radeon HD 6000 generation had been disappointing, with rehashes of previous 5000 series GPUs in Juniper (HD 5750 and 5770 were directly renamed to 6750 and 6770) while the replacement for Redwood, Evergreen's 5 CU part, was Turks, a 6 CU part. It seemed a bit pointless. The high-end was Barts, which was actually smaller and slower than Cypress. Only the very high end, Cayman, which was a different architecture (VLIW-4 vs VLIW-5), was any highlight.
On release, the HD 7970 was as much as 40% faster than Cayman. Such a generational improvement was almost unheard of, with 10-20% being more normal. Tahiti, the GPU in the 7970, was phenomenally powerful. Even Pitcairn, the mainstream performance GPU, was faster than everything but the very highest end of the previous generation.
Over time, as games and drivers matured, Tahiti gained more and more performance. On release it was of similar performance to the GeForce GTX 680, but a few years later it was running much faster. It kept pace so well that, eventually, its 3 GB RAM became the limiting factor!
Tahiti was one of those rare GPUs which takes everything else and plain beats it. It was big in every way, fast in every way, and extremely performant in every way. Notably, its double-precision floating point rate was 1/4 of its single-precision rate, meaning it hit almost 1 TFLOPS of DP performance. That was still at the high end of things in 2016.
The Radeon HD 7970 was the first full implementation of the "Tahiti" GPU, which had 32 GCN compute units, organised in four clusters, clocking in at 925 MHz. This put it well ahead of Nvidia's competing Kepler architecture most of the time. An enhanced "GHz Edition" was released briefly with a 1000 MHz GPU clock (not that most 7970s wouldn't hit that), which was then renamed to R9 280X. At that point, only the R9 290 and R9 290X, which arrived a year later and used the 44 units of AMD's "Hawaii", were any faster.
This card eventually died an undignified death, beginning with hangs when under stress, then failing completely. As it was on a flaky motherboard (RAM issues), I assumed the motherboard had died, and replaced it with a spare Dell I got from the junk pile at work (Dell Optiplex 790). This video card couldn't fit that motherboard due to SATA port placement; only when a PCIe SATA controller arrived did the GPU's failure become apparent.
It was likely an issue on the video card's power converters and the Tahiti GPU remains fully working on a PCB unable to properly power it.
Core: 32 ROPs, 128 TMUs, 925 MHz
RAM: 384 bit GDDR5, 1375 MHz, 264,000MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 32x Unified shader 5.1 (32 GCN blocks - 2048 individual shaders)
MADD GFLOPS: 3,790
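The RAM line in these spec blocks follows from bus width and memory clock. Here is a minimal sketch of that calculation, assuming four transfers per clock for GDDR5 (the "form of QDR" noted above) and two for DDR3; the function and table names are illustrative only.

```python
# Transfers per memory clock. The text treats GDDR5 as a form of QDR;
# two transfers per clock for DDR3 is the usual double data rate.
TRANSFERS_PER_CLOCK = {"DDR3": 2, "GDDR5": 4}


def bandwidth_mb_s(bus_bits: int, clock_mhz: int, memory_type: str) -> int:
    """Peak memory bandwidth in MB/s from bus width and memory clock."""
    bytes_per_transfer = bus_bits // 8
    return bytes_per_transfer * clock_mhz * TRANSFERS_PER_CLOCK[memory_type]


if __name__ == "__main__":
    # Radeon HD 7970: 384 bit GDDR5 at 1375 MHz
    print(bandwidth_mb_s(384, 1375, "GDDR5"))  # 264000
```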
Zotac GeForce GT620 1GB DDR3 - 2012
This bears a sticker on the rear telling us it was manufactured in 2014, at which point the GF119 GPU was three years old. It had debuted as a low end part of the GeForce 500 series, in GeForce 510 and GeForce 520. The retail GeForce 620 normally used the GF108 GPU (even older, and first appeared in the GeForce 430), but OEM parts were GF119. This was a relabel of the GeForce 520 and used to meet OEM lifetime requirements.
The provenance of this particular card is not well known: It arrived non-functional in an off-the-shelf PC which had onboard video (GeForce 7100) as part of its nForce 630i chipset, so clearly was not part of that PC when it shipped.
It used the old Fermi architecture and only one compute unit of it, giving it just 48 CUDA cores. The GPU clock ran at 810 MHz (the CUDA cores in Tesla and Fermi were double-pumped) and DDR3 ran at 900 MHz over a 64 bit bus, all reference specs, as this card doesn't show up on the bus.
In a system which could keep the IGP running with a video card present, the GT620 actually appeared on the bus and could be queried. It turned out to be a Zotac card with an 810 MHz GPU clock and 700 MHz DDR3 clock. Neither the HDMI nor the DVI output was functional. The header for a VGA output was fitted, but the actual port was not.
In later testing, the GT620 was found to be fully functional. Most likely some manner of incompatibility with BIOS or bad BIOS IGP settings caused it. The system it was in had lost its CMOS config due to a failed motherboard battery.
The GeForce GT620 was very cheap, very low end, and very slow. It would handle basic games at low resolutions, such as 1280x720, but details had to be kept in check. In tests, it was about as fast as a Core i5 3570's IGP and around 30% better than the Core i5 3470's lesser IGP. Given they were contemporaries, one wondered exactly who Nvidia was selling the GeForce GT620 to. The GeForce GT520's life did not end with the GT620. It had one more outing as the GeForce GT705, with clocks raised a little to 873 MHz.
Its contemporary in the bargain basement was AMD's Radeon HD 5450 and its many relabels (6350, 7350, R5 220), which it was more or less equal to.
Core: 4 ROPs, 8 TMUs, 810 MHz
RAM: 64 bit DDR3, 700 MHz, 11.2GB/s
Shader: 1x Unified shader 5.0 (1 Fermi block - 48 individual shaders)
MADD GFLOPS: 155.5
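The 155.5 figure only works out once the double-pumped shader clock is taken into account. A rough sketch, with a made-up function name, using the 48 CUDA cores and 810 MHz GPU clock quoted above:

```python
def fermi_madd_gflops(cuda_cores: int, gpu_clock_mhz: float) -> float:
    """Peak MADD GFLOPS when the shaders run at twice the GPU clock."""
    shader_clock_mhz = gpu_clock_mhz * 2   # double-pumped CUDA cores
    flops_per_core_per_clock = 2           # one multiply plus one add
    return cuda_cores * flops_per_core_per_clock * shader_clock_mhz / 1000.0


if __name__ == "__main__":
    print(round(fermi_madd_gflops(48, 810), 1))  # 155.5
```

Without the doubled shader clock the same card would come out at half that, which is why Tesla and Fermi numbers cannot be compared clock for clock with Kepler or GCN.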
Asus GT640 2 GB DDR3 - 2012
This large, imposing thing is actually Asus' GeForce GT640. What could possibly need such a large cooler? Not the GT640, that's for sure. The DDR3 version used Nvidia's GK107 GPU with two Kepler units, for 384 cores, but also 16 ROPs. The DDR3 held it back substantially: with the RAM clock at 854 MHz, the 128 bit bus could only deliver 27.3 GB/s. The GPU itself ran at 901 MHz on this card. Asus ran the RAM a little slower than usual, which was 892 MHz for most GT640s.
GK107 was also used in the GTX 650, which ran the GPU at 1058 MHz, 20% faster, and used GDDR5 memory to give 3.5x the memory performance of the GT640. It was around 30% faster in the real world. It, along with the surprisingly effective (but limited availability) GTX645, was the highlight of Nvidia's mid-range. The GT640, however, was not.
GT640 was among the fastest of Nvidia's entry level "GT" series and did a perfectly passable job. Rear connectors were HDMI, 2x DVI and VGA. It could use all four at once. At the entry level, performance slides off much quicker than retail price does, and while GT640 was near the top of it, it was still much less cost-effective than GTX 650 was. The very top, GTX 680, was also very cost-ineffective.
The low end and high end of any generation typically have similar cost to performance ratios, the low end because performance tanks for little savings, and the high end because performance inches up for a large extra cost.
RAM: 128 bit DDR3, 854 MHz, 27,328 MB/s
Shader: 2x Unified shader 5.1 (2 Kepler blocks - 384 individual shaders)
MADD GFLOPS: 691.2
EVGA GeForce GTX 680 SuperClocked 2 GB - 2012
The GTX 680 was the Kepler architecture's introductory flagship. It used the GK104 GPU, which had eight Kepler SMX units (each unit had 192 ALUs, or "CUDA cores"), each SMX having 16 TMUs attached and the whole thing having 32 ROPs. Memory controllers were tied to ROPs: each cluster of four ROPs had a 32 bit link to a crossbar shared between two clusters, so each crossbar memory controller, which served 8 ROPs, had a 64 bit memory channel to RAM. With 32 ROPs, GTX 680's GK104 had 256 bit wide memory.
Kepler appeared to have been taken by surprise by AMD's GCN, but just about managed to keep up. As games progressed, however, GK104's performance against the neck-and-neck Radeon HD 7970 began to suffer. In more modern titles, the Tahiti GPU can be between 15 and 30% faster.
Nvidia's Kepler line-up was less rational than AMD's GCN or TeraScale 2, but still covered most of the market:
GK107 had 2 units
GK106 had 5 units
GK104 had 8 units
Nvidia disabled units to make the single unit GK107 in GT630 DDR3 and the three unit GK106 in GTX645. The second generation of Kepler, in (some) GeForce 700s, added GK110 with 15 units. Nvidia pulled out all the stops to take on GCN, and more or less succeeded.
We're getting ahead of ourselves. GTX 680 was released into a world where AMD's Tahiti, as Radeon HD 7970, was owning everything, in everything. How did GTX 680 fare? Surprisingly well. Kepler was designed as the ultimate DirectX 11 machine and it lived up to this. These days, however, it mostly shows how badly it has aged. While the 7970 kept up with modern games, the GTX 680 tended not to maintain its place in the lineup. The newer the game, the bigger the 7970's lead over the GTX 680.
Core: 32 ROPs, 128 TMUs, 1150 MHz
RAM: 256 bit GDDR5, 1552 MHz, 198,656MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 8x Unified shader 5.1 (8 Kepler blocks - 1536 individual shaders)
MADD GFLOPS: 3,532
Nvidia Quadro K2000 2 GB GDDR5 - 2013
The Quadro K2000 was Nvidia's "mainstream" professional video card at $599 on launch in 2013. It is based around the Nvidia GK107 GPU, which has two Kepler blocks on it, each of which contains 192 CUDA cores (256 functional units including the very limited-use special function unit), giving 384 shader units on this GPU.
Practically, it's a GeForce GTX 650 with double the memory and lower clocks. The memory on this professional card is not ECC protected and provided by commodity SK Hynix, part H5GQ2GH24AFR-R0C.
Most GTX 650s (and the GT740 / GT745 based on it) used double slot coolers, while this uses a single slot design. The GTX 645 used almost the exact same PCB layout as the Quadro K2000, but the more capable GK106 GPU. The GTX 650 used a slightly different PCB, but also needed an additional power cable. Nvidia put Kepler's enhanced power management to good use on the K2000, and, in testing, it was found to throttle back quite rapidly when running tight CUDA code, the kind of thing a Quadro is intended to do. When processing 1.2 GBs worth of data through a CUDA FFT algorithm, the card had clocked back as far as 840 MHz, losing over 10% of its performance. It stayed within its somewhat anaemic 51 watt power budget and reached only 74C temperature.
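The "over 10%" figure comes straight from the clocks. A quick sketch, assuming performance scales linearly with GPU clock (it rarely does exactly) and using the card's 954 MHz rated clock; the function name is illustrative.

```python
def throttle_loss_percent(rated_mhz: float, observed_mhz: float) -> float:
    """Clock lost to throttling, as a percentage of the rated clock."""
    return (rated_mhz - observed_mhz) / rated_mhz * 100.0


if __name__ == "__main__":
    # Quadro K2000: rated 954 MHz, observed 840 MHz under a CUDA FFT load
    print(round(throttle_loss_percent(954, 840), 1))  # 11.9
```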
Professionals wanting more performance than a $150 gaming GPU should have probably bought a GTX 680 a few months earlier with the money, and had enough left over to get some pizzas in for the office. Professionals wanting certified drivers for Bentley or Autodesk products should note that both AMD and Nvidia's mainstream cards and drivers are certified.
This came out of a Dell Precision T5600 workstation, where video was handled by two Quadro K2000s (GTX 650 alike, $599) to give similar performance to a single Quadro K4000 (sub-GTX 660 $1,269). By the time it arrived here, one of the K2000s was missing. The K4000 was probably the better choice, but that's not what we're here for.
Core: 16 ROPs, 32 TMUs, 954 MHz
RAM: 128 bit GDDR5, 1000 MHz, 64,000 MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 2x Unified Shader 5.1 (2 Kepler blocks - 384 individual shaders)
MADD GFLOPS: 732.7
XFX RX 570 RS XXX Edition 8 GB - 2017
The RX 570 was announced as a rebrand of the RX 470 with slightly improved GPUs. Advances in the 14 nm FinFET process at GlobalFoundries meant the GPU controlling firmware could run them faster for longer, or at the same clocks with less power.
With the Polaris 20 GPU having 36 compute/vector units (each having 4 GCN SIMD-16 cores), the 32 unit variant was essential for yield harvesting. If a single unit was faulty, the fault could be mapped into the disabled area. In GCN, dispatch units were four wide, so the compute units were logically arranged as four columns. Polaris 20 implemented 9 compute units per column (compare with Tahiti and Tonga, which used 8). The columns had to be symmetrical, so units could be disabled only so long as every column kept the same "depth" of compute units.
RX 570 disabled one unit in each column and was die fused, so they couldn't be re-enabled. This left it with 32 units active, the same as Tonga and Tahiti, being fed by its 256 bit GDDR5 bus, the same as Tonga (R9 285, R9 380, R9 380X). It behaved very much like a die-shrunk Tonga as the RX 570 because, well, that's what it was.
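The column arrangement makes the unit counts simple multiplication. This is a sketch under the assumptions stated above (four dispatch columns, the same number of units disabled in each); the class and its names are invented for illustration and do not come from any AMD tool.

```python
from dataclasses import dataclass

SHADERS_PER_UNIT = 64  # four SIMD-16 vector ALUs per compute unit


@dataclass(frozen=True)
class GcnLayout:
    columns: int           # dispatch width, four on the GPUs discussed here
    units_per_column: int  # physical "depth" of each column

    def active_units(self, disabled_per_column: int = 0) -> int:
        """Compute units left when each column loses the same number."""
        if not 0 <= disabled_per_column <= self.units_per_column:
            raise ValueError("cannot disable more units than a column has")
        return self.columns * (self.units_per_column - disabled_per_column)


if __name__ == "__main__":
    polaris20 = GcnLayout(columns=4, units_per_column=9)
    print(polaris20.active_units())                       # 36, as RX 580
    print(polaris20.active_units(1))                      # 32, as RX 570
    print(polaris20.active_units(1) * SHADERS_PER_UNIT)   # 2048 shaders
```

Disabling per column rather than per unit is why the harvested part drops by four units at a time rather than one.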
|Tahiti (R9 280X)||Tonga (R9 380X)||Polaris 10 (RX 470)||Polaris 20 (RX 570)
A lot about Polaris was "the same as Tonga" because the GPU was, in fact, the same as Tonga. It was the same design, resynthesised for 14 nm FinFET, and given 9 compute units per dispatch block (there were four such dispatch blocks) instead of 8. Being on the smaller 14 nm process, Polaris 10/20 was only 232 square millimetres. On the older 28 nm process, Tonga was 359 square mm. Polaris had all the same technology, all the same instructions, features and capabilities as Tonga.
This is not to say Tonga was bad. It was AMD's "GFX8" technology level (which actually describes compute capability), with DX12 compliance (at the 12_0 featureset), SIMD-32 instruction set and hardware asynchronous compute. It was, however, released as a minor update to GFX7 in 2014! By 2017, GFX9 (Vega) had been released, and 2019 saw GFX10, RDNA, released. As there isn't a good list of AMD's architecture levels anywhere, here it is:
|Rage 6/7||GFX1||R100-R200||Radeon VE, Radeon 8500|
|Rage 8||GFX2||R300-R520||Radeon 9700 - Radeon X1950|
|Terascale||GFX3||R600-RV770||Radeon HD 2900, Radeon HD 4870|
|Terascale 2||GFX4||Juniper, Cypress, Barts, Turks||Radeon HD 6870|
|Terascale 3||GFX5||Cayman, Trinity, Richland||Radeon HD 6970|
|GCN Gen1||GFX6||Tahiti, Cape Verde, Pitcairn||Radeon HD 7970|
|GCN Gen2||GFX7||Bonaire, Hawaii||R9 290X|
|GCN Gen3||GFX8||Tonga, Fiji, Polaris||R9 285, R9 Fury, RX 590|
|NCE||GFX9||Vega, Raven Ridge, Arcturus||Radeon Vega 64|
|RDNA||GFX10||Navi||Radeon RX 5700 XT|
Polaris 10 being the biggest of the 14 nm Polaris generation was seen as somewhat baffling. It was significantly less powerful than Hawaii of the R9 290/390 series. Its 36 units at 1.25 GHz couldn't quite keep up with Hawaii's 44 units at 1 GHz, and its 1.75 GHz 256 bit GDDR5 memory had no hope of keeping up with Hawaii's 1.5 GHz 512 bit array. RX 580, Polaris 20 in its most optimal configuration, was a wash against R9 390X. It was also around half the price!
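The size of the memory gap is worth putting a number on. A small sketch using only the bus widths and clocks quoted above, and counting four transfers per clock for GDDR5; the function name is illustrative.

```python
def gddr5_bandwidth_gb_s(bus_bits: int, clock_mhz: int) -> float:
    """Peak GDDR5 bandwidth in GB/s, counting four transfers per clock."""
    return bus_bits / 8 * clock_mhz * 4 / 1000.0


if __name__ == "__main__":
    polaris = gddr5_bandwidth_gb_s(256, 1750)  # 224.0
    hawaii = gddr5_bandwidth_gb_s(512, 1500)   # 384.0
    print(polaris, hawaii, round(polaris / hawaii, 2))  # 224.0 384.0 0.58
```

On those figures Polaris had a little under 60% of Hawaii's raw bandwidth, so it leaned on GFX8's compression and caching to stay level.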
RX 570 (and RX 470) was positioned to replace R9 380 in the market, which it had around a 30-40% advantage over. It was priced next to Nvidia's GTX 1060 3GB, which it usually slightly beat until quality settings got raised, then the 3 GB of Nvidia's card got comprehensively dismantled by AMD's 8 GB provision. The 6 GB GTX 1060 again traded blows with the RX 570 but was significantly more expensive. Who pays more for less?
AMD's position was quite obvious: Pascal, typically as GTX 1050 Ti and GTX 1060, was much more power efficient and got way, way higher performance per watt. AMD's older GCN design couldn't match that for perf/watt, but it could match it for perf/dollar. AMD pushed the RX 570 and RX 580 to the apex of the bang for buck curve.
Later in its life it was up against Nvidia's GTX 1650, which it typically won against quite substantially. Nvidia's smaller Turings offered slightly more in the way of features (DirectX 12 feature level 12_1), but video cards have rarely sold on anything but performance.
This particular card is based around the same PCB and cooler design as XFX's RX 580 GTS XXX Edition, but one heatpipe is missing so it doesn't quite have the same thermal capacity. XFX saw fit to give it switchable "performance" and "stealth" BIOSes which actually enabled quite a cool little trick. Performance clocked GPU/RAM at 1286/1750, while stealth clocked at 1100/2050. Ordinarily, of course, the RAM clocks for stealth were not available for performance, but if we set the GPU to stealth, reset the video driver to defaults, make a GPU performance profile, then change the GPU to performance BIOS, we can clock the GPU at normal levels while retaining the 2050 MHz RAM. This would knock the card a little higher than its board power should be.
This one would clock at 1351/2050 with around 95% stability. It was possible to find workloads which would go wrong at those clocks and we'd be thermally or power limited in games. A quick bash of The Witcher 3 had it clocking at 1275 MHz with power limit +20%. The GPU power seemed to be set at 125 W, for a 150 watt board power. Raising the power limit only applies to the GPU, so +20% should give 25 W extra to play with, more than the cut back cooler on the card can handle.
Hynix, Samsung and Micron GDDR5 were common on these cards. Typically the Samsung was the fastest in terms of timings, but there wasn't a lot in it. This card used Micron GDDR5 rated for 2 GHz with tRCDW-tRCDWA-tRCDR-tRCDRA-tRC-tCL-tRFC of 23-23-29-29-79-22-219. At stock 1750 MHz, it would run 21-21-26-26-70-20-192.
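The power limit sum is simple enough to write down. A sketch only, assuming the 125 W GPU figure observed above and that the slider scales GPU power alone; the function name is made up.

```python
def gpu_power_budget(gpu_watts: float, limit_percent: float) -> float:
    """GPU power allowance after applying a power limit offset in percent."""
    return gpu_watts * (100 + limit_percent) / 100


if __name__ == "__main__":
    base = 125                            # observed GPU power setting, watts
    raised = gpu_power_budget(base, 20)   # power limit +20%
    print(raised, raised - base)          # 150.0 25.0
```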
AMD guaranteed a minimum clock of 1168 MHz, which was its "base clock". This is what you'd get if you met the bare minimum, put it in a cramped case with nearly no airflow, and gave it an intensive workload. This is up from the RX 470's base clock of 926 MHz, but that had to fit into a board power of 120 watts (so only 100 watts for the GPU). In nearly all cases, including every case I tested this card on, it did not drop as low as base clock, even in The Witcher 3 at "ultra" (no Hairworks) and the GPU running into thermal limits at 90C. In fact, the AMD boost clock of 1244 MHz was about the lowest observed.
In reality, nearly all RX 570s ran a boost clock of 1286 (some factory OCs ran as high as 1340), and "boost clock" really meant "the clock it will run at unless it reaches power or thermal limits".
By 2020, when I bought it for £125, it was showing its age and was only good for 1080p at medium/high in most demanding games of that year. The 8 GB RAM did help: cards with more RAM tend to age better than ones with less, even if they don't have a lot of benefit when new. In late 2020, COVID-19 shortages hit other GPUs, meaning RX 570s and RX 580s rose a little in price, then in early 2021 there was a cryptocoin mining boom again... This RX 570 hit £350. Had I wished to resell it, it would have fetched £200-250 on eBay, as the 8 GB model was suitable for Ethereum mining where the 4 GB model was not.
Reference Benchmarks at 1351/2050 (literature scores used for other GPUs)
|Benchmark||RX 570||RX 580||RX 470||R9 380|
|Luxmark 3.1 "Hotel"||2645||2894||2370||1419|
|The Witcher 3||39.4||40.7||35.3||26.8|
Core: 32 ROPs, 128 TMUs, 1286 MHz
RAM: 256 bit GDDR5, 1750 MHz, 224,000 MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 32x Unified shader 5.1 (32 GCN blocks - 2048 individual shaders)
MADD GFLOPS: 5,267
What Is A Video Card?
In theoretical and historical terms, computer video is an evolving proof of the Wheel of Reincarnation (see Sound Cards too) wherein:
1. A new type of accelerator is developed, which is faster and more efficient than the CPU alone
2. They eventually take on so much processing power that they are at least as complex as the host CPU
3. Functions done on the GPU are essentially done in software, as the GPU is so flexible
4. The main CPU becomes faster by taking in the functions that dedicated hardware used to do
5. A new type of accelerator is developed, which is faster and more efficient than the CPU alone
Modern video cards are at stage 4; Ivy Bridge and AMD's APUs represent the first generation. They cannot match a GPU's memory bandwidth, so future accelerators will be about reducing this need or supplying it, but the GPU shader core is inexorably bound for the CPU die.
The Current Cycle
In the early days, the video hardware was a framebuffer and a RAMDAC. A video frame was placed in the buffer and, sixty (or so) times a second, the RAMDAC would sequentially read the bitmap in the framebuffer and output it as an analog VGA signal. The RAMDAC itself was an evolution of the three-channel DAC which, in turn, replaced direct CPU control (e.g. the BBC/Acorn Micro's direct CPU driven digital video) and even by the time of the 80286, the dedicated framebuffer was a relatively new concept (introduced with EGA).
This would remain the layout of the video card (standardised by VESA) until quite late in VGA's life (early to middle 486 era) where it incorporated a device known as a blitter, something which could move blocks of memory around very quickly, ideal for scrolling a screen or moving a window with minimal CPU intervention. At this stage the RAMDAC was usually still external (the Tseng ET4000 on the first video page is an example) with the accelerator functions in a discrete video processor.
The next development, after more "2D" GDI functions were added, was the addition of more video memory (4MB is enough for even very high resolutions) and a texture mapping unit (TMU). Early generations, such as the S3 ViRGE on this page, were rather simple and didn't really offer much beyond what software rendering was already doing. These eventually culminated in the TNT2, Voodoo3 and Rage 128 processors.
While increasing texture mapping power was important, the newly released Geforce and Radeon parts were concentrating on early versions of what the engineers called 'shaders', small ALUs or register combiners able to programmatically modify specific values during drawing. These were used to offload the driver API code which set up triangles and rotated them to fit the viewing angle, and then the code which applied lighting.
These became known as "Transformation and Lighting" engines (T&L) and contained many simple ALUs for lighting (which is only 8 bit pixel colour values) and several, not as many, more complex ALUs for vertex positions, which can be 16 bits.
As it became obvious that GPUs had extreme levels of performance available to them (a simple Geforce 256 could do five billion floating point operations per second!), it was natural to try to expose this raw power to programmers.
The lighting part of the engine became a pixel shader, the vertex part, a vertex shader. Eventually they became standardised into Shader Models. SM2.0 and above have described a Turing Complete architecture (i.e. able to perform any computation) while SM4.0 arguably describes a parallel array of tiny CPUs.
Current SM4.0/SM5.0 (DX10 or better) shaders are very, very fast at performing small, simple operations repetitively on small amounts of data, perfect for processing vertex or pixel data. However, SIMD and multi-core CPUs are also becoming fast at performing these operations and are much more flexible. This has led many to believe that the lifespan of the GPU is nearing an end. By saving the expense of a video card and using the extra budget to tie much faster memory to a CPU and add more cores to it, future PCs could likely be more powerful machines even when not playing games.
A modern GPU's array of shader power is the extreme of one end of a scale which goes both ways. The far other end is a CPU, which has few but complex cores. A CPU is much faster on mixed, general instructions than a GPU is; a GPU is much faster on small repetitive workloads. The two extremes are converging: CPUs are evolving simpler cores while GPUs are evolving more complex ones. Intel's future Larrabee GPU is actually an array of 16-48 modified Pentium-MMX cores which run x86 instructions, truly emulating a hardware GPU. Larrabee has no specific video hardware and could, with some modifications, be used as a CPU, though it would be incredibly slow for the poorly threaded workloads most CPUs handle.