Deep Learning Hardware

Deep Learning uses multi-layered deep neural networks, loosely modelled on the way the brain works, to solve tasks that have previously eluded scientists. Training these networks consists largely of matrix operations that can be split into many independent calculations, so they are best run on highly parallel processors. For this reason, you’ll be able to train your network much faster on GPUs than on CPUs, as the latter are better suited to serial tasks.




Performance in teraFLOPS

The graph above shows the dramatic difference in performance, measured in teraFLOPS, between the DGX-1 Deep Learning server and a traditional server with two Intel Xeon E5 2697 v3 CPUs. You can of course train Deep Learning networks without a GPU; however, the task is so computationally expensive that it’s almost exclusively done using GPUs.

The graph below shows the real-world time saved when training on a server with four Tesla M40 cards versus a server with two Intel Xeon E5 2699 v3 CPUs.

Training Time in Days

GPU Server with 4x Tesla M40: 0.4 days
Dual CPU: 5.9 days
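
To put the two training times in perspective, the short sketch below works out the speed-up and the time saved per training run. It uses only the two figures from the chart; the variable names are our own.

```python
# Training times read from the chart, in days.
gpu_days = 0.4  # server with four Tesla M40 cards
cpu_days = 5.9  # server with two Intel Xeon CPUs

# How many times faster the GPU server completes the same training run.
speedup = cpu_days / gpu_days

# Wall-clock time saved on every training run.
days_saved = cpu_days - gpu_days

print(f"Speed-up: {speedup:.2f}x")
print(f"Time saved per training run: {days_saved:.1f} days")
```

In other words, the GPU server finishes the same job roughly fifteen times faster.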

Which GPU(s) should I use in my Deep Learning Server?

Most Deep Learning frameworks make use of a library called cuDNN (the CUDA Deep Neural Network library), which only runs on NVIDIA GPUs. So how do you decide which GPUs to get?

There are many NVIDIA GPU options, ranging from the cheaper, gaming-oriented GeForce GTX cards, through the more expensive, compute-oriented Tesla cards, to the even more expensive DGX-1 server. All of these can run Deep Learning applications, but there are some important differences to consider.

NVIDIA Tesla & Titan X Cards

What are the differences and how do they affect my Deep Learning system?

How many CUDA cores does the GPU have?
GPUs with more cores have more raw compute performance.
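
As a rough illustration of why core count matters, theoretical peak throughput is commonly estimated as cores x clock speed x operations per core per cycle. The sketch below assumes each CUDA core completes one fused multiply-add (counted as two floating point operations) per cycle. The function name and the 1.48 GHz boost clock are assumptions made for this example and are not figures from this article; only the core count comes from the comparison table.

```python
def peak_tflops(cuda_cores: int, clock_ghz: float,
                flops_per_core_per_cycle: int = 2) -> float:
    """Estimate theoretical peak throughput in teraFLOPs.

    Assumes every core retires one fused multiply-add per cycle,
    which counts as 2 floating point operations.
    """
    # cores * GHz * ops gives gigaFLOPs; divide by 1000 for teraFLOPs.
    return cuda_cores * clock_ghz * flops_per_core_per_cycle / 1000.0


# 3584 cores is the Tesla P100 figure from the comparison table.
# The 1.48 GHz clock is an assumed value used for illustration only.
estimate = peak_tflops(3584, 1.48)
print(f"Estimated peak FP32 throughput: {estimate:.2f} teraFLOPs")
```

Real workloads rarely reach this theoretical peak, so treat the result as an upper bound rather than a benchmark.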

How fast is the memory?
The latest Tesla P100 GPUs are based on the Pascal architecture and use High Bandwidth Memory (HBM2), which provides up to 720GB/sec of memory bandwidth. In contrast, the Pascal-based TITAN X uses slower GDDR5X memory, which provides up to 480GB/sec. This makes a big difference in Deep Learning tasks.
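
To get a feel for what these bandwidth figures mean, the sketch below estimates the minimum time needed to read a block of data from GPU memory once at each peak rate. The 8 GB working set and the helper function are hypothetical, chosen purely for illustration; only the two bandwidth figures come from the text above.

```python
def min_read_time_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound, in milliseconds, on reading size_gb from memory once."""
    return size_gb / bandwidth_gb_per_s * 1000.0


# Peak memory bandwidths quoted above, in GB/sec.
BANDWIDTH = {
    "Tesla P100 (HBM2)": 720.0,
    "TITAN X (GDDR5X)": 480.0,
}

# Hypothetical working set of weights and activations read per pass.
WORKING_SET_GB = 8.0

for name, bandwidth in BANDWIDTH.items():
    time_ms = min_read_time_ms(WORKING_SET_GB, bandwidth)
    print(f"{name}: at least {time_ms:.2f} ms per pass")

ratio = BANDWIDTH["Tesla P100 (HBM2)"] / BANDWIDTH["TITAN X (GDDR5X)"]
print(f"Bandwidth advantage: {ratio:.2f}x")
```

These are best-case figures at peak bandwidth. For memory-bound layers, the 1.5x bandwidth advantage translates fairly directly into shorter run times.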

What about floating point compatibility?
Most Deep Learning workloads only require half precision (FP16) calculations, so make sure you choose a GPU that has been optimised for this type of workload. For instance, most GeForce gaming cards are optimised for single precision (FP32) and do not run FP16 significantly faster. Similarly, many older Tesla cards, such as those based on the Kepler architecture, were optimised for single (FP32) and double (FP64) precision and so are not such a good choice for Deep Learning. In contrast, the latest Tesla GPUs based on the Pascal architecture can process two half precision (FP16) calculations in one operation. FP16 values also take up half the memory of FP32 values, which halves the memory load and leads to a big speed-up in Deep Learning. However, this is not true for all Pascal GPUs, which is why we don’t recommend GeForce cards in our Deep Learning systems.
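
The memory saving from half precision is easy to see for yourself. The sketch below uses Python's standard struct module to store the same numbers as FP32 and as FP16, then reads the FP16 values back to show the precision that is given up. The sample values are arbitrary and chosen only for illustration.

```python
import struct

# Arbitrary sample values standing in for network weights.
values = [0.1, 1.0, 3.14159, 1024.5]
count = len(values)

# "f" is IEEE 754 single precision (4 bytes per value),
# "e" is IEEE 754 half precision (2 bytes per value).
fp32_bytes = struct.pack(f"<{count}f", *values)
fp16_bytes = struct.pack(f"<{count}e", *values)

print(f"FP32 storage: {len(fp32_bytes)} bytes")
print(f"FP16 storage: {len(fp16_bytes)} bytes")

# Read the half precision values back to see the rounding that occurred.
roundtrip = struct.unpack(f"<{count}e", fp16_bytes)
for original, stored in zip(values, roundtrip):
    print(f"{original} stored in FP16 becomes {stored}")
```

The FP16 copy takes exactly half the space, at the cost of some rounding. Deep Learning workloads can often tolerate this loss of precision, which is why FP16 support matters when choosing a GPU.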

What about NVLink?
NVLink is a high bandwidth interconnect developed by NVIDIA to link GPUs together, allowing them to work in parallel much faster than over the PCI-E bus. NVLink is currently only available in the DGX-1 server and is a big reason why the DGX-1 is faster for Deep Learning than a standard GPU server with eight PCI-E Tesla P100 cards.
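
As a back-of-the-envelope illustration, the sketch below compares how long it would take to move a batch of gradients between GPUs over each interconnect. The bandwidth figures are the interconnect values from the comparison table later in this article; the 0.5 GB payload and the helper function are hypothetical, and the calculation ignores latency and protocol overhead.

```python
def transfer_time_ms(payload_gb: float, link_gb_per_s: float) -> float:
    """Best-case time, in milliseconds, to move payload_gb over a link."""
    return payload_gb / link_gb_per_s * 1000.0


# Interconnect bandwidths from the comparison table, in GB/sec.
PCIE_GB_PER_S = 32.0
NVLINK_GB_PER_S = 160.0

# Hypothetical amount of gradient data exchanged per training step.
GRADIENTS_GB = 0.5

pcie_ms = transfer_time_ms(GRADIENTS_GB, PCIE_GB_PER_S)
nvlink_ms = transfer_time_ms(GRADIENTS_GB, NVLINK_GB_PER_S)

print(f"PCI-E:  {pcie_ms:.3f} ms per exchange")
print(f"NVLink: {nvlink_ms:.3f} ms per exchange")
print(f"NVLink advantage: {pcie_ms / nvlink_ms:.1f}x")
```

Because this exchange happens on every training step, the saving adds up quickly over a long multi-GPU training run.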


In summary, while gaming GPUs are adequate for Deep Learning, the larger and faster memory available on Tesla cards provides much better performance. In addition, make sure you are using cards based on the latest Pascal architecture so you can enjoy full FP16 performance. Finally, in multi-GPU environments, choose a server that supports NVLink where possible, as this will provide much more performance than PCI-E.

Deep Learning GPUs Compared

The table below highlights the key features and performance characteristics of the most popular GPUs for Deep Learning.

| Specification | TITAN X (2016 Edition) | Tesla K40 | Tesla K80 | Tesla M40 | Tesla P100 (PCI-E) | Tesla P100 (NVLink) |
| --- | --- | --- | --- | --- | --- | --- |
| Architecture | Pascal | Kepler | Kepler | Maxwell | Pascal | Pascal |
| CUDA Cores | 3584 | 2880 | 2496 per GPU | 3072 | 3584 | 3584 |
| Memory | 12GB | 12GB | 12GB per GPU | 24GB | 12GB or 16GB | 16GB |
| Memory Bandwidth | 480GB/sec | 288GB/sec | 240GB/sec per GPU | 288GB/sec | 540 or 720GB/sec | 720GB/sec |
| ECC Support | No | Yes | Yes | Yes | Yes | Yes |
| Interconnect Bandwidth | 32GB/sec | 32GB/sec | 32GB/sec | 32GB/sec | 32GB/sec | 160GB/sec |
| Double-Precision (FP64) Performance | 0.34 teraFLOPs | 1.43 teraFLOPs | 2.91 teraFLOPs | 0.21 teraFLOPs | 4.70 teraFLOPs | 5.30 teraFLOPs |
| Single-Precision (FP32) Performance | 4.29 teraFLOPs | 4.29 teraFLOPs | 8.74 teraFLOPs | 7.00 teraFLOPs | 9.30 teraFLOPs | 10.60 teraFLOPs |
| Half-Precision (FP16) Performance | 4.29 teraFLOPs | 4.29 teraFLOPs | 8.74 teraFLOPs | 7.00 teraFLOPs | 18.70 teraFLOPs | 21.20 teraFLOPs |
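
One quick way to read the table is to compare each card's FP16 figure with its FP32 figure. The sketch below does this using the performance numbers listed above; the 1.5x threshold used to flag cards with accelerated FP16 is an arbitrary choice for this example.

```python
# (FP32, FP16) peak performance in teraFLOPs, taken from the table above.
PEAK_TFLOPS = {
    "TITAN X (2016 Edition)": (4.29, 4.29),
    "Tesla K40": (4.29, 4.29),
    "Tesla K80": (8.74, 8.74),
    "Tesla M40": (7.00, 7.00),
    "Tesla P100 (PCI-E)": (9.30, 18.70),
    "Tesla P100 (NVLink)": (10.60, 21.20),
}


def fp16_speedup(fp32_tflops: float, fp16_tflops: float) -> float:
    """How much faster a card runs FP16 than FP32."""
    return fp16_tflops / fp32_tflops


for gpu, (fp32, fp16) in PEAK_TFLOPS.items():
    print(f"{gpu}: FP16 is {fp16_speedup(fp32, fp16):.2f}x FP32")

# Cards whose FP16 rate is well above their FP32 rate (arbitrary 1.5x cut-off).
fast_fp16 = [
    gpu for gpu, (fp32, fp16) in PEAK_TFLOPS.items()
    if fp16_speedup(fp32, fp16) > 1.5
]
print("Accelerated FP16:", ", ".join(fast_fp16))
```

Only the two Tesla P100 variants roughly double their throughput at half precision, which matches the recommendation above to choose Pascal-based Tesla cards for Deep Learning.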