Deep Learning Hardware

Deep Learning uses multi-layered neural networks, loosely modelled on the brain, to solve tasks that have previously eluded scientists. Because neural networks consist of many layers performing the same operations in parallel, they are best run on highly parallel processors. For this reason you'll be able to train your network much faster on GPUs than on CPUs, as the latter are better suited to serial tasks.

Half-precision (FP16) performance:

- DGX-1 with V100: 960 TFLOPS
- DGX-1 with P100: 170 TFLOPS
- Dual CPU Server: 3 TFLOPS

The figures above show the dramatic difference in half-precision teraFLOPS between the DGX-1 Deep Learning servers and a traditional server with two Intel Xeon E5-2697 v3 CPUs. You can, of course, train Deep Learning networks without a GPU, but the task is so computationally expensive that it is almost exclusively done on GPUs.

The figures below show the real-world benefit, in time saved, when training on a second-generation DGX-1 versus a server with eight GPUs and a traditional server with two Intel Xeon E5-2699 v4 CPUs.

Time to train ResNet-50 (90 epochs to solution):

- DGX-1 with V100: 7.4 hours
- 8x GPU Server: 18 hours
- Dual CPU Server: 711 hours
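A quick calculation of the speed-ups implied by these training times (figures are taken from the text; the ratios are illustrative):

```python
# Rough speed-up arithmetic for the ResNet-50 training times quoted above:
# 7.4 h for the DGX-1 with V100, 18 h for the 8x GPU server, and 711 h for
# the dual-CPU server.

dgx1_v100_hours = 7.4
gpu_server_hours = 18.0
cpu_server_hours = 711.0

speedup_vs_cpu = cpu_server_hours / dgx1_v100_hours  # vs. dual-CPU server
speedup_vs_gpu = gpu_server_hours / dgx1_v100_hours  # vs. 8x GPU PCI-E server

print(f"{speedup_vs_cpu:.0f}x faster than the dual-CPU server")   # ~96x
print(f"{speedup_vs_gpu:.1f}x faster than the 8x GPU server")     # ~2.4x
```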

Which GPU(s) should I use in my Deep Learning Server?

Most Deep Learning frameworks make use of a specific library called cuDNN (the CUDA Deep Neural Network library), which is specific to NVIDIA GPUs. So how do you decide which GPUs to get?

There are many NVIDIA GPU options, ranging from the cheaper workstation-oriented TITAN X cards, to the more powerful compute-oriented Tesla cards, and beyond to the even more capable DGX-1 server. All of these can run Deep Learning applications, but there are some important differences to consider.

NVIDIA Tesla & Titan X Cards

The differences, and how they affect your Deep Learning system

How many CUDA cores does the GPU have?
GPUs with more cores have more raw compute performance.
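As a rough rule of thumb, peak throughput scales with core count times clock speed, with each fused multiply-add (FMA) counting as two FLOPS. A quick sketch (the 1530 MHz V100 boost clock is an assumption, not a figure from this article):

```python
# Theoretical peak FP32 throughput:
#   FLOPS = CUDA cores x clock (Hz) x 2   (each FMA = 1 multiply + 1 add)
# The 1530 MHz boost clock below is an assumed value for the Tesla V100.

def peak_fp32_tflops(cuda_cores: int, boost_clock_mhz: float) -> float:
    """Theoretical peak single-precision teraFLOPS for a GPU."""
    return cuda_cores * boost_clock_mhz * 1e6 * 2 / 1e12

v100 = peak_fp32_tflops(5120, 1530)
print(f"Tesla V100 peak FP32: {v100:.1f} teraFLOPS")  # ~15.7, close to the ~15 quoted
```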

How many Tensor cores does the GPU have?
Tensor cores are a new type of programmable core, exclusive to GPUs based on the Volta architecture, that run alongside standard CUDA cores. Each Tensor core performs a 4×4 matrix multiply-accumulate in a single operation, significantly boosting mixed-precision performance (FP16 multiplication with FP32 accumulation). A single Tensor core delivers the equivalent of 64 FMA operations per clock, and with eight Tensor cores per SM that amounts to 1024 FLOPS per SM, compared to just 256 FLOPS per SM for standard CUDA cores.
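The per-SM figures above can be checked with some quick arithmetic (this assumes Volta's layout of eight Tensor cores per SM, a detail not spelled out in this article):

```python
# Verifying the Tensor core FLOPS-per-SM arithmetic for Volta.
fma_per_tensor_core = 64   # one 4x4x4 matrix multiply-accumulate per clock
flops_per_fma = 2          # each FMA = 1 multiply + 1 add
tensor_cores_per_sm = 8    # assumed Volta SM layout

flops_per_sm = tensor_cores_per_sm * fma_per_tensor_core * flops_per_fma
print(flops_per_sm)   # 1024 FLOPS per SM per clock
```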

How fast is the memory?
The latest Tesla V100 GPUs are based on the Volta architecture and use High Bandwidth Memory (HBM2), which provides up to 900GB/sec of memory bandwidth. In contrast, the Pascal-based TITAN Xp uses slower GDDR5X memory with up to 548GB/sec of bandwidth, a difference that really shows in Deep Learning tasks, which are frequently memory-bound.
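To see why bandwidth matters, consider the time taken to stream a working set of weights and activations over the memory bus. A rough sketch (the 4GB working-set size is an arbitrary assumption for illustration):

```python
# Time to stream one pass over a working set, given the bandwidths above.
working_set_gb = 4.0       # assumed size, for illustration only

v100_hbm2_gbps = 900.0     # Tesla V100, HBM2
titan_xp_gbps = 548.0      # TITAN Xp, GDDR5X

t_v100 = working_set_gb / v100_hbm2_gbps * 1000   # milliseconds per pass
t_xp = working_set_gb / titan_xp_gbps * 1000

print(f"V100: {t_v100:.2f} ms, TITAN Xp: {t_xp:.2f} ms "
      f"({t_xp / t_v100:.2f}x slower per pass)")
```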

What about floating point capability?
Most Deep Learning only requires half-precision (FP16) calculations, so make sure you choose a GPU that has been optimised for this type of workload. For instance, while most GeForce gaming cards are optimised for single precision (FP32), they do not run FP16 significantly faster. Similarly, many older Tesla cards, such as those based on the Kepler architecture, were optimised for single (FP32) and double (FP64) precision and so are not such a good choice for Deep Learning. In contrast, Tesla GPUs based on the Pascal architecture can process two half-precision (FP16) calculations in one operation, effectively halving the memory load and delivering a big speed-up in Deep Learning. However, this is not true of all Pascal GPUs, which is why we don't recommend GeForce cards in our Deep Learning systems. The latest Tesla GPUs are based on the Volta architecture and, in addition to CUDA cores, also have Tensor cores dedicated to deep learning, massively reducing training time.
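Python's struct module can illustrate the FP16 trade-off: half the storage of FP32, but far less precision (a CPU-side sketch for illustration; real Deep Learning FP16 arithmetic runs on the GPU):

```python
import struct

# FP16 uses 2 bytes per value versus 4 for FP32, at the cost of precision:
# only a 10-bit significand, roughly 3 decimal digits.
def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

print(struct.calcsize("e"), struct.calcsize("f"))  # 2 4 (bytes per value)
print(to_fp16(0.1))      # 0.0999755859375 -- 0.1 is not exact in FP16
print(to_fp16(2049.0))   # 2048.0 -- integers above 2048 are no longer exact
```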

What about NVLink?
NVLink is a high-bandwidth interconnect developed by NVIDIA to link GPUs together, allowing them to work in parallel much faster than over the PCI-E bus. NVLink is currently only available in the DGX-1 server, where it helps the first-generation DGX-1's eight Tesla P100 cards communicate 5x faster than over PCI-E, and it is the primary reason for the huge performance difference between the DGX-1 and standard GPU servers. The second-generation DGX-1 with eight Tesla V100 cards features an improved version of NVLink, boosting communication between the GPUs by up to 10x compared to a standard GPU server. This is achieved by increasing the per-link bandwidth from 20 to 25GB/sec in each direction, and by increasing the number of links per GPU from 4 to 6. The increased throughput enables more advanced modelling and data-parallel techniques for stronger scaling and faster training.
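Multiplying out the per-link figures gives the aggregate per-GPU bandwidths for each DGX-1 generation (this assumes NVIDIA's published link counts of four for the Tesla P100 and six for the Tesla V100):

```python
# Aggregate NVLink bandwidth per GPU: each link carries data in both
# directions simultaneously, so total bidirectional bandwidth is
# links x per-direction bandwidth x 2.
def aggregate_gbps(links: int, per_direction_gbps: float) -> float:
    return links * per_direction_gbps * 2

p100 = aggregate_gbps(4, 20.0)   # first-generation DGX-1 (Tesla P100)
v100 = aggregate_gbps(6, 25.0)   # second-generation DGX-1 (Tesla V100)
print(p100, v100)                # 160.0 300.0 GB/sec
```

These totals match the 160GB/sec and 300GB/sec interconnect figures in the comparison table below.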

NVIDIA NVLink

In summary, while gaming GPUs are adequate for Deep Learning, the larger and faster memory of Tesla cards provides much better performance. In addition, make sure you choose cards based on the latest Volta architecture so you can enjoy full FP16 performance. Finally, in multi-GPU environments, choose a server that supports NVLink where possible, as it provides much higher inter-GPU bandwidth than PCI-E.

Deep Learning GPUs Compared

The table below highlights the key features and performance characteristics of the most popular GPUs for Deep Learning.

| | Quadro GP100 | TITAN Xp | Tesla K40 | Tesla K80 | Tesla M40 | Tesla P100 (PCI-E) | Tesla P100 (NVLink) | Tesla V100 (PCI-E) | Tesla V100 (NVLink) |
|---|---|---|---|---|---|---|---|---|---|
| Architecture | Pascal | Pascal | Kepler | Kepler | Maxwell | Pascal | Pascal | Volta | Volta |
| Tensor Cores | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 640 | 640 |
| CUDA Cores | 3584 | 3840 | 2880 | 2496 per GPU | 3072 | 3584 | 3584 | 5120 | 5120 |
| Memory | 16GB | 12GB | 12GB | 12GB per GPU | 24GB | 12GB or 16GB | 16GB | 16GB | 16GB |
| Memory Bandwidth | 717GB/sec | 548GB/sec | 288GB/sec | 240GB/sec per GPU | 288GB/sec | 540 or 720GB/sec | 720GB/sec | 900GB/sec | 900GB/sec |
| Memory Type | HBM2 | GDDR5X | GDDR5 | GDDR5 | GDDR5 | HBM2 | HBM2 | HBM2 | HBM2 |
| ECC Support | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Interconnect Bandwidth | 32GB/sec | 32GB/sec | 32GB/sec | 32GB/sec | 32GB/sec | 32GB/sec | 160GB/sec | 32GB/sec | 300GB/sec |
| Double-Precision (FP64), teraFLOPS | 5.2 | X | 1.43 | 2.91 | 0.21 | 4.70 | 5.30 | 7.50 | 7.50 |
| Single-Precision (FP32), teraFLOPS | 10.3 | X | 4.29 | 8.74 | 7.00 | 9.30 | 10.60 | 15 | 15 |
| Half-Precision (FP16), teraFLOPS | 20.7 | X | N/A | N/A | N/A | 18.70 | 21.20 | 120 | 120 |