Better bring jaw supports for this one
Completely in character, but in the most welcome way possible, NVIDIA’s Jen-Hsun Huang uses GTC to announce the company’s next-generation datacenter GPU – Tesla GV100, based on the new Volta architecture, half a decade in the making. It is an absolutely monstrous GPU, clocking in at 815mm² (!!!!), which is at the reticle limit for TSMC. In other words, according to NVIDIA, this is the largest GPU one can build. But what do those 815mm² of silicon bring? Quite an incredible GPU, for starters.
Specifications of GV100
The new Volta GPU contains 80 SMs (cut down from 84 SMs on die), housing 5120 FP32 cores, 2560 FP64 cores, and, new with Volta, 640 Tensor Cores. We’ll come back to those later, but they might just be the most exciting innovation within the Volta architecture for deep learning. The clock speed is 1455MHz, pointing to Pascal-range clock speeds (which is by no means a bad thing).
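As a quick sanity check, the published totals fall straight out of the per-SM counts – a back-of-the-envelope sketch, nothing more:

```python
# GV100 totals from per-SM figures (80 enabled SMs out of 84 on die).
SMS = 80
FP32_PER_SM = 64    # FP32 cores per SM
FP64_PER_SM = 32    # FP64 cores per SM
TENSOR_PER_SM = 8   # Tensor Cores per SM (2 per partition x 4 partitions)

fp32_cores = SMS * FP32_PER_SM      # 5120
fp64_cores = SMS * FP64_PER_SM      # 2560
tensor_cores = SMS * TENSOR_PER_SM  # 640
print(fp32_cores, fp64_cores, tensor_cores)
```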
Moving on to memory, the GPU has a 4096-bit HBM2 interface and 16GB of HBM2, just as GP100 before it. The difference is that the HBM2 used here is far faster, providing up to 900GB/s of peak bandwidth. The exact specifications are not yet available, but at minimum it would have to use 1.8Gbps stacks to achieve this mind-boggling bandwidth.
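Where does the 1.8Gbps minimum come from? Multiplying an assumed per-pin data rate across the 4096-bit bus shows it just clears the quoted figure:

```python
# Peak HBM2 bandwidth = bus width (bits) x per-pin rate (Gb/s) / 8.
BUS_WIDTH_BITS = 4096
PIN_RATE_GBPS = 1.8  # assumed per-pin rate; exact specs not yet public

peak_gb_s = BUS_WIDTH_BITS * PIN_RATE_GBPS / 8  # bits -> bytes
print(peak_gb_s)  # 921.6 GB/s, comfortably above the quoted 900 GB/s
```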
The GPU packs an astonishing 21.1 billion transistors and is built on TSMC’s new 12nm FFN process, providing higher density and performance than the 16nm FF+ used previously with Pascal. Interestingly enough, the N in FFN stands for NVIDIA, as TSMC created a customized process for them.
The Volta Streaming Multiprocessor
As with GP100 before it, each of GV100’s 84 SMs contains 64 FP32 cores and 32 FP64 cores, though new with Volta is a quad-partitioning scheme. In Pascal’s GP100, the shader cores are partitioned into two blocks within an SM, each with 32 FP32 cores, 16 FP64 cores, an instruction buffer, one warp scheduler, two dispatch units, and a 128 KB register file. With Volta’s GV100, the SM is partitioned into four blocks, each with 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two of the new mixed-precision Tensor Cores for deep learning matrix arithmetic, a new L0 instruction cache, one warp scheduler, one dispatch unit, and a 64 KB register file. This finer partitioning is done to improve SM utilization and therefore performance.
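Note that the repartitioning changes granularity, not capacity – the per-SM core counts and register file total come out identical either way, as a quick tally shows:

```python
# Per-SM totals under the two partitioning schemes.
gp100_rf_kb = 2 * 128  # GP100: two blocks x 128 KB register file
gv100_rf_kb = 4 * 64   # GV100: four blocks x 64 KB register file

gv100_fp32_per_sm = 4 * 16  # four blocks x 16 FP32 cores = 64
gv100_fp64_per_sm = 4 * 8   # four blocks x 8 FP64 cores = 32
print(gp100_rf_kb, gv100_rf_kb)  # 256 256 -> same capacity, finer slices
```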
In GP100, INT32 operations were executed on the same datapath as FP32 operations, so the two could not issue simultaneously. Not so in GV100: with separate units for each, both can now execute at the same time at full throughput.
Tensor Cores – Deeply integrated Deep Learning
The Tensor Cores introduced in GV100 are the ace in the hole of an already impressive GPU. Aside from 15 TFLOPS of FP32, 30 TFLOPS of FP16, and 7.5 TFLOPS of FP64, new with Volta are Tensor Cores delivering 120 (!!!) TFLOPS for deep learning tensor operations. The purpose is quite simple: to cement NVIDIA’s dominance in deep learning, an emerging high-margin market. These new Tensor Cores provide gigantic performance increases for the very specific operations deep learning uses, thereby powering the massive neural networks that hunger for ever more computational grunt.
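All four headline numbers can be rederived from the core counts and the 1455MHz clock, counting each fused multiply-add as two FLOPs – a rough sketch, with the marketed figures being rounded up slightly:

```python
# Peak throughput = cores x ops/clock x clock; an FMA counts as 2 FLOPs.
CLOCK_GHZ = 1.455

fp32_tflops = 5120 * 2 * CLOCK_GHZ / 1000        # ~14.9, marketed as 15
fp64_tflops = 2560 * 2 * CLOCK_GHZ / 1000        # ~7.45, marketed as 7.5
fp16_tflops = fp32_tflops * 2                     # FP16 runs at 2x FP32, ~30
tensor_tflops = 640 * 64 * 2 * CLOCK_GHZ / 1000  # 640 cores x 64 FMA, ~119.2
print(fp32_tflops, fp64_tflops, fp16_tflops, tensor_tflops)
```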
Each Tensor Core performs 64 floating-point fused multiply-add (FMA) operations per clock, resulting in 1024 operations per clock in an SM, compared to 128 FP32 operations per SM – an 8X increase in throughput! They operate on 4×4 matrices, computing D = A×B + C, where A and B are FP16 matrices being multiplied, C is an FP32 accumulator added to the product, and D is the FP32 result. These operations are exposed in the CUDA C++ API, and libraries and frameworks from both NVIDIA and many third parties have been updated to take advantage of the new cores.
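To make the D = A×B + C operation concrete, here is a minimal pure-Python emulation of what a single Tensor Core computes per clock – just an illustration of the math, not how the hardware or the CUDA API actually expresses it:

```python
def tensor_core_fma(A, B, C):
    """Return D = A*B + C for 4x4 matrices (lists of lists of floats).

    The 4x4x4 = 64 multiply-adds here correspond to the 64 FMA
    operations a Tensor Core performs each clock. In hardware, A and B
    would be FP16 and C an FP32 accumulator.
    """
    n = 4
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

# Example: identity * identity plus a constant accumulator.
I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = [[2.0] * 4 for _ in range(4)]
D = tensor_core_fma(I, I, C)  # diagonal -> 3.0, off-diagonal -> 2.0
```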
And Much More…
Volta as an architecture, and GV100 as a GPU, are both incredibly impressive feats of engineering. It is now clear why it has been in the oven for so long, appearing on roadmaps dating back to 2013 and surely in early development before that. This is a groundbreaking architecture for NVIDIA and the ecosystem in general. This article only covered the most major features of the architecture, as there’s an absolute ton to discuss. Over the next few weeks, more articles will come along discussing Volta in greater depth, covering the L1 data cache and shared memory subsystem, the new SIMT model, and speculation on what all of this means for the gaming market – but for now, I think you’ve had enough to chew on.
Now, if you’ll excuse me, I’ll go find my jaw again.