A little over seven years ago, in March 2010 and after delays, NVIDIA launched its first 40nm architecture built for DX11 from the ground up, named after the Italian-American physicist Enrico Fermi. The ‘Fermi’ architecture’s first implementation came in the form of the GF100 GPU at the heart of the flagship GeForce GTX 480 and its cut-down sibling, the GeForce GTX 470. At the time, the GTX 480 often beat AMD’s Radeon HD 5870 for the title of the fastest single-GPU graphics card in the world, but it did so with considerably higher power consumption and heat output, on a substantially larger die, and with a higher price tag to go with it.

In 2012 we saw NVIDIA ditch the “big die first” practice for consumers. Ever since 28nm Kepler and the GTX 680, NVIDIA’s strategy has been to launch a smaller, ‘mid-sized’ die as the flagship, then refresh the line-up later with the ‘big die’ GPU. 2013 even saw the use of the “Ti” moniker for a flagship card – the GTX 780 Ti representing the fullest implementation of the Kepler architecture, the GK110 chip. Fast forward to late 2017 and NVIDIA’s 16nm Pascal architecture has been fully realised across an entire stack of GPUs, from the entry-level GT 1030 up to the second most powerful single-GPU graphics card in the world, the TITAN Xp.

GTX 480

Magnificent Oldie(ish) Mashup

Today I’m putting 2010’s fastest single-GPU graphics card, the venerable GTX 480, up against NVIDIA’s $450 fifth-fastest single-GPU graphics card, the GeForce GTX 1070 Ti. I will also bench it against a modern low-power, low-cost card from AMD, the RX 560 2GB, which retails for just $99. Tracing the lineage of the mid-sized GP104 chip back to the GeForce 400 series, one actually finds the GTX 460 to be the ‘true’ 2010 equivalent of this card. But since the playing field has changed considerably, NVIDIA is able to sell smaller chips at higher price points. The GeForce GTX 1070 Ti retails today for around $450 – fifty dollars less than the GTX 480 cost at launch. Just how much more performance does fifty dollars less buy you in 2017?

The GPUs

Let’s start by taking a closer look at the GPUs powering these three graphics cards.

GeForce GTX 480

Key Facts

  • 529mm² die size
  • 3.2b transistors
  • 40nm manufacturing process
  • 480 CUDA cores
  • 60 TMUs
  • 48 ROPs
  • 384-bit memory interface
  • 768KB L2 cache
  • 1.5GB GDDR5
  • 3.7 Gbps Memory data rate
  • 700 MHz core clock
  • 1400 MHz shader clock

A simplified block diagram of the GF100 GPU in the GTX 480. Note: Each “memory controller” represents a 64-bit channel.

The GeForce GTX 480 is powered by NVIDIA’s first 40nm, DX11-capable GPU based on the ‘Fermi’ architecture – the GF100. The chip measures 529mm² and contains approximately 3.2 billion transistors. The full chip features 16 processor clusters, termed ‘Streaming Multiprocessors’ by NVIDIA. Each SM provides 32 CUDA cores, 4 texture mapping units and a PolyMorph engine for work such as tessellation. This gives the full GPU 512 CUDA cores and 64 TMUs. Its implementation in the GTX 480 isn’t fully featured, however. Likely due to yields on the then-new 40nm process, the GF100 in the GTX 480 has an SM disabled, leaving only 480 CUDA cores and 60 TMUs active. The GPU has 48 ROP units backing out onto a 384-bit memory interface wired to 1.5 GB of GDDR5 running at 3.7 Gbps – which produces roughly 177 GB/s of raw memory bandwidth. The GTX 480 has a TDP of 250W.
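That bandwidth figure follows directly from the bus width and the per-pin data rate. Here is a minimal Python sketch of the arithmetic (the helper function is my own and purely illustrative; the inputs are the figures from the spec list above):

    # Peak memory bandwidth: (bus width in bits / 8 bits per byte) * per-pin data rate
    def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
        return bus_width_bits / 8 * data_rate_gbps

    print(peak_bandwidth_gb_s(384, 3.7))  # GTX 480: 177.6 GB/s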

Up until Kepler in 2012, NVIDIA GPUs actually operated their processor cores at a separate frequency from the rest of the GPU – the Fermi architecture has a ‘double-pumped’ shader clock, meaning the CUDA cores run at twice the core frequency. With the GTX 480’s core clock held at 700 MHz, the CUDA cores run at 1400 MHz. As a result, the GF100 in the GTX 480 produces 1.34 TFLOPS of single-precision floating-point performance.
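That figure can be reproduced from the specs above, assuming each CUDA core issues one fused multiply-add (two floating-point operations) per shader clock. A quick sketch, again with an illustrative helper of my own:

    # Peak FP32 throughput: cores * 2 ops per clock (one FMA) * shader clock
    def peak_sp_tflops(cuda_cores, shader_clock_mhz):
        return cuda_cores * 2 * shader_clock_mhz / 1e6

    print(peak_sp_tflops(480, 1400))  # GTX 480: ~1.34 TFLOPS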

GeForce GTX 1070 Ti

Key Facts

  • 314mm² die size
  • 7.2b transistors
  • 16nm manufacturing process
  • 2432 CUDA cores
  • 152 TMUs
  • 64 ROPs
  • 2MB L2 cache
  • 256-bit memory interface
  • 8GB GDDR5
  • 8 Gbps Memory data rate
  • 1607 MHz core clock
  • 1683 MHz Boost clock
  • actual clock under load: ~1800 MHz

A simplified block diagram for the GP104 chip in the GTX 1070 Ti. Note: Each “memory controller” represents a 32-bit channel.

The GTX 1070 Ti is the latest implementation of the mid-sized 16nm GP104 silicon. The first card based on this chip was 2016’s flagship GTX 1080, followed by the cut-down GTX 1070. With the launch of AMD’s highly competitive RX Vega 56 graphics card, NVIDIA saw the need for a new model to slot in between the slower 1070 and the faster 1080. The result is a card that performs somewhere between the two; it is usually faster than the RX Vega 56, but it also carries a higher price tag.

The Pascal architecture for consumer GeForce cards is very similar to the Maxwell architecture from a top-level view. The full GP104 chip features 20 SMs, each containing 128 CUDA cores, 8 TMUs and a PolyMorph engine. This results in a total of 2560 CUDA cores and 160 TMUs. The chip’s render back-end can process 64 pixels per clock (64 ROP units) and backs out onto a 256-bit memory interface. The implementation of the GP104 in the GTX 1070 Ti isn’t fully featured and has a single SM disabled, likely to improve yields. Aside from the disabled SM, the biggest difference between this card and the fully featured GTX 1080 is the use of slower 8Gbps GDDR5 rather than the 10Gbps GDDR5X found on the more expensive model. As a result, the 8GB framebuffer on the GTX 1070 Ti produces 256 GB/s of bandwidth versus 320 GB/s on the 1080. The GTX 1070 Ti operates with a nominal TDP of 180W.
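The same bus-width arithmetic used in the GTX 480 section gives the bandwidth gap between the two GP104 cards, and the same core-count arithmetic gives a rough on-paper compute figure at the rated boost clock (real clocks under load sit higher, as noted in the spec list, so treat this as a floor rather than a measurement):

    # Both GP104 cards share a 256-bit bus; only the per-pin data rate differs.
    def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
        return bus_width_bits / 8 * data_rate_gbps

    def peak_sp_tflops(cuda_cores, clock_mhz):
        return cuda_cores * 2 * clock_mhz / 1e6

    print(peak_bandwidth_gb_s(256, 8))   # GTX 1070 Ti (GDDR5):  256.0 GB/s
    print(peak_bandwidth_gb_s(256, 10))  # GTX 1080 (GDDR5X):    320.0 GB/s
    print(peak_sp_tflops(2432, 1683))    # GTX 1070 Ti at boost: ~8.19 TFLOPS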

Aside from the across-the-board increase in major GPU resources, the Pascal architecture, like the Maxwell architecture before it, increases performance per CUDA core compared to pre-Maxwell architectures while also raising clock rates substantially. Significant changes were made to the layout of the Streaming Multiprocessor to improve performance by simplifying scheduling and allowing for higher utilisation.

Radeon RX 560 2GB

Key Facts

  • 123mm² die size
  • 3b transistors
  • 14nm manufacturing process
  • 1024 Stream processors
  • 64 TMUs
  • 16 ROPs
  • 1MB L2 cache
  • 128-bit memory interface
  • 2 or 4GB GDDR5
  • 7 Gbps Memory data rate
  • 1175 MHz core clock
  • 1275 MHz Boost clock

Simplified block diagram of the Polaris 21 GPU used in the RX 560

AMD’s Radeon RX 560 is based on the Polaris 21 GPU. This chip is essentially a re-badge of the older Polaris 11 chip that powered the RX 460. Improvements in yields have allowed AMD to enable the chip’s full Compute Unit count in the RX 560 and to improve clock speeds to boot. Compared to the Radeon RX 460, the RX 560 has 128 more Stream Processors (two CUs of 64 SPs each). Other specifications of the chip include a 16-ROP render back-end, a 128-bit memory interface and either 2 or 4GB of GDDR5 running at 7Gbps. Today I will be testing the 2GB model. The Radeon RX 560 has a Typical Board Power of between 60 and 80W depending on the model. The card I am using features a 6-pin power connector, but some models lack this and can be fed entirely by the PCIe slot (75W max).
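For completeness, the same back-of-the-envelope arithmetic used for the two GeForce cards puts the RX 560’s on-paper numbers in context; GCN stream processors, like CUDA cores, are counted at two FP32 operations per clock for a fused multiply-add. A short illustrative sketch using the spec-list figures and the boost clock:

    # Same illustrative helpers as in the GeForce sections.
    def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
        return bus_width_bits / 8 * data_rate_gbps

    def peak_sp_tflops(stream_processors, clock_mhz):
        return stream_processors * 2 * clock_mhz / 1e6

    print(peak_bandwidth_gb_s(128, 7))   # RX 560: 112.0 GB/s
    print(peak_sp_tflops(1024, 1275))    # RX 560 at boost: ~2.61 TFLOPS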