Where Is The Performance?

To understand why APUs lag so far behind even cheap dedicated cards, we need to look at several factors, all of which limit what APUs can currently do.

First, we look at the size of the chips: 245mm^2 for the 7890K and 133mm^2 for the 5775C. Size matters because it dictates how many transistors can be crammed into the chip. And since graphical workloads are inherently highly parallel, more transistors means more cores, and more cores means more performance. The 5775C is built on Intel’s 14nm process, while the 7890K is built on GlobalFoundries’ 28nm SHP process, so every square millimeter of Intel’s die can fit more transistors than AMD’s, assuming equal designs. Now, if we compare against pure GPUs, 245mm^2 might seem like a respectable size, given that the HD 7770 GHz Edition is 123mm^2 on a similar 28nm process. The same can be said for the 133mm^2 of the 5775C, which closely matches the RX 460’s 123mm^2 on a similar 14nm process. However, APUs need to fit a CPU inside them too, so a large part of the die is unavailable to the GPU.

AMD Die Shot

Captioning added by Guru3D

This is why, despite a die nearly double the size of the HD 7770 GHz Edition’s, the A10 7890K’s GPU has only 80% of the Compute Units: 8 against the 7770’s 10. This is a very important factor to consider, as die size dictates the difficulty, and therefore the cost, of manufacturing. The APU delivers less graphics performance yet costs far more to manufacture, which makes pricing the chip difficult. It doesn’t help that AMD’s Bulldozer-based CPUs, specifically Steamroller in the 7890K, are very poor performers relegated to the bottom tier of the market.
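
To make the die-size argument concrete, here is a rough sketch that divides each chip's Compute Unit count by its total die area. It uses only the figures quoted above; the names `chips` and `cus_per_100mm2` are made up for this illustration, and the metric is a crude one, since the APU's area includes its CPU cores (which is exactly the point).

```python
# Figures quoted in the article: total die area (mm^2) and GPU Compute Units.
# Note: the APU's die area includes the CPU cores, not just the GPU.
chips = {
    "A10-7890K": {"die_mm2": 245, "compute_units": 8},
    "HD 7770 GHz Edition": {"die_mm2": 123, "compute_units": 10},
}


def cus_per_100mm2(chip):
    """Compute Units per 100 mm^2 of total die area (illustrative metric)."""
    return 100 * chip["compute_units"] / chip["die_mm2"]


density = {name: cus_per_100mm2(spec) for name, spec in chips.items()}

apu = chips["A10-7890K"]
dgpu = chips["HD 7770 GHz Edition"]
cu_ratio = apu["compute_units"] / dgpu["compute_units"]  # share of the dGPU's CUs
area_ratio = apu["die_mm2"] / dgpu["die_mm2"]            # how much larger the APU die is

for name, value in density.items():
    print(f"{name}: {value:.2f} CUs per 100 mm^2")
print(f"APU has {cu_ratio:.0%} of the CUs on {area_ratio:.2f}x the die area")
```

By this measure the APU gets roughly 3.3 Compute Units per 100mm^2 of silicon against roughly 8.1 for the dedicated card, which is the cost problem in a nutshell.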

Another important factor to consider is power. Looking at the 7770’s specifications, you would spot an 80W TDP. The 7890K’s TDP is 95W. The problem is quickly apparent: a high clock speed CPU has to share a power envelope with the GPU, while the whole package gets a power envelope similar to that of a lone GPU. This means the iGPU inside the 7890K cannot operate at the same level as its full-blown dGPU brother, even with a similar hardware configuration. The same goes for the 5775C. Although it is built on a more efficient process than the 7770, compared to the RX 460 on a similar process it runs into the same limitation, at 65W TDP for the 5775C against 75W for the RX 460. Granted, a dGPU’s TDP includes the power drawn by the memory and the PCB components, but that still leaves far more power for the dGPU than for the graphics inside an APU.

But perhaps the biggest limitation that APUs have to deal with is memory bandwidth. System memory tends to be optimized for low access times, to reduce CPU stalls as much as possible. Graphical workloads, however, can easily mask memory latency by design, and call for higher bandwidth instead. This is why there are different memory standards for the two use cases, with DDR4 used for system memory, while GDDR5/X or, more recently, HBM2 is used as VRAM. Putting a CPU and a GPU in the same package therefore creates a massive issue. How do you feed the GPU with the relatively minuscule bandwidth that CPUs operate with?

To put things into perspective, the 7890K uses up to 2133MT/s DDR3 memory in a dual channel configuration, meaning a maximum theoretical bandwidth of 34GB/s, shared between the CPU and GPU. Meanwhile, the HD 7770 GHz Edition is equipped with a 128-bit interface and 4.5GT/s GDDR5, amounting to 72GB/s of theoretical bandwidth, used only by the GPU on the card. Here we can see the massive disparity in bandwidth: a GPU with 80% of the theoretical performance of its big dGPU brother has to live with less than half of the dGPU’s bandwidth, and has to share even that with the CPU.
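
The bandwidth figures above follow from a simple formula: transfer rate multiplied by bus width in bytes. The sketch below reproduces them, assuming the standard 64-bit width per DDR3 channel (so 128 bits in dual channel) and taking 1GB as 10^9 bytes; the function name `peak_bandwidth_gbs` is just an illustrative helper.

```python
def peak_bandwidth_gbs(transfer_rate_mts, bus_width_bits):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    transfer_rate_mts: memory transfer rate in megatransfers per second
    bus_width_bits:    total width of the memory interface in bits
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


# A10-7890K: dual-channel DDR3-2133, assuming two 64-bit channels.
apu_bw = peak_bandwidth_gbs(2133, 2 * 64)

# HD 7770 GHz Edition: 128-bit GDDR5 at 4.5 GT/s (4500 MT/s).
dgpu_bw = peak_bandwidth_gbs(4500, 128)

print(f"APU (shared with CPU): {apu_bw:.1f} GB/s")
print(f"dGPU (GPU only):       {dgpu_bw:.1f} GB/s")
print(f"APU has {apu_bw / dgpu_bw:.0%} of the dGPU's bandwidth")
```

These are theoretical peaks only. Real-world throughput is lower on both sides, and the APU's share shrinks further whenever the CPU is busy with memory traffic of its own.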

If you recall, I mentioned the i7 5775C has 128MB of eDRAM on its package. This eDRAM acts as a sort of buffer, mitigating the bandwidth limitation. However, if you recall our discussion of die sizes and manufacturing cost, you will realize the problem with this solution. These 128MB of eDRAM are expensive to manufacture, and the eDRAM has to sit on a substrate connecting it to the processor die, which adds to the cost yet again. This is not in any way, shape or form economical, so it does not solve much. In addition, it only mitigates the issue rather than solving it. 128MB of eDRAM is nothing compared to the full gigabytes found on dGPUs, which forces a lot more data movement from system memory, so bandwidth is still a very big limitation.

Intel APU

Intel Crystal Well

The final issue to mention here is an Intel-specific one. Intel entered the graphics market far later than either AMD (which acquired ATI) or NVIDIA did. Intel lacks the sheer hardware and software engineering expertise for GPUs that those two companies have accumulated over decades, and it shows. The company needs die sizes as big as or even bigger than its competitors’, even excluding the CPU, for equal performance. And that’s despite a node advantage and the use of expensive solutions like eDRAM. With the Intel 14nm process and eDRAM on board, an AMD-built APU could have double or triple its current performance, and trounce anything Intel can currently put out. Intel makes up for its lack of GPU know-how by throwing money at the problem (eDRAM, larger die sizes), but there’s no getting around it. The company cannot make APUs economical enough compared to dGPUs with its current level of GPU expertise.