High performance training solutions
Using a custom training system for deep learning and AI workloads gives you the ultimate control: you can choose the ideal specification for your projects and build in flexibility as required. A system can be configured so that no resources are underutilised, or a larger chassis can be partially populated at purchase, leaving space for scaling at a later date. The choice is yours.
Every 3XS custom training system is almost infinitely configurable, from accelerator cards to CPU, memory to storage, right through to connectivity, power, cooling and software - all from the market-leading component brands listed below.
NVIDIA GPU Accelerators
The NVIDIA Ampere family of GPU accelerator cards represents the cutting edge in performance for all AI workloads, offering unprecedented compute density, performance, and flexibility to deliver up to 5 petaFLOPS AI performance in a single system. The high-end NVIDIA A100 accelerator is available in either standard PCIe or high-density SXM4 formats featuring HBM2 memory, with the mid-range A30 accelerator available as a PCIe card. All of these passively cooled Ampere GPUs can be installed in a wide variety of both air and liquid cooled server chassis.
NVIDIA HGX Server
8x A100 SXM4
4x A100 SXM4
NVIDIA EGX Server
A100 PCIe Gen 4
A30 PCIe Gen 4
The Ampere Advantage
The NVIDIA Ampere GPU architecture, designed for the age of elastic computing, delivers the next giant leap by providing unmatched acceleration at every scale. The A100 and A30 significantly outperform previous generations of server GPUs thanks to an increased CUDA core count, additional memory and the next-generation PCIe bus.
New Sparsity Acceleration
Modern AI networks are big and getting bigger, with millions and in some cases billions of parameters. Not all of these parameters are needed for accurate predictions and inference, and some can be converted to zeros to make the models sparse without compromising accuracy. Tensor Cores in A100 can provide up to 2x higher performance for sparse models. While the sparsity feature more readily benefits AI inference, it can also be used to improve the performance of model training.
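As a rough illustration of what making a model sparse means, the sketch below zeroes the two smallest-magnitude weights in every group of four, which is the 2:4 structured pattern that the A100's sparse Tensor Cores are designed to accelerate. It is plain Python on a made-up list of weights; the function name prune_2_of_4 is hypothetical, and a real workflow would use framework tooling and normally fine-tune the model afterwards to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Made-up example weights, purely for illustration.
example = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_of_4(example)
print(sparse)
```

Half of the values in each group end up as zeros, which is what allows the hardware to skip them.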
Multi-Instance GPU (MIG) expands the performance and value of each NVIDIA A100 GPU. MIG can partition the A100 GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. Now administrators can support every workload, from the smallest to the largest, offering a right-sized GPU with guaranteed quality of service (QoS) for every job, optimising utilisation and extending the reach of accelerated computing resources to every user.
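For capacity planning, the seven-instance limit can be turned into a quick estimate of how many A100 cards a set of small jobs would need. This is a minimal sketch that assumes every job fits in the smallest instance size and ignores larger instance profiles; the function name a100s_needed is hypothetical.

```python
import math

MAX_MIG_INSTANCES_PER_A100 = 7  # "as many as seven instances" per GPU


def a100s_needed(num_small_jobs, instances_per_gpu=MAX_MIG_INSTANCES_PER_A100):
    """Estimate how many A100s give each small job its own isolated instance."""
    if not 1 <= instances_per_gpu <= MAX_MIG_INSTANCES_PER_A100:
        raise ValueError("an A100 supports between 1 and 7 MIG instances")
    return math.ceil(num_small_jobs / instances_per_gpu)


# Example: 20 small jobs, each given its own instance.
print(a100s_needed(20))
```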
3rd Gen NVLink and NVSwitch
Scaling applications across multiple GPUs requires extremely fast movement of data. The third generation of NVIDIA NVLink in A100 doubles the GPU-to-GPU direct bandwidth to 600GB/s, almost 10x more than PCIe 4.0. When paired with the latest generation of NVIDIA NVSwitch, all GPUs in the server can communicate with each other at full NVLink speed for incredibly fast training.
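How large the gap over PCIe 4.0 looks depends on whether a x16 slot is counted in one direction or both. The sketch below works out both figures from the PCIe 4.0 signalling rate (16 GT/s per lane with 128b/130b encoding) and compares them with the 600GB/s NVLink figure, which is a total across both directions. The function name is illustrative and the result is theoretical peak bandwidth, not a measured one.

```python
NVLINK_A100_GB_S = 600.0  # total NVLink bandwidth per A100, as quoted in the text


def pcie_x16_gb_s(gt_per_s=16.0, lanes=16, encoding=128 / 130):
    """Theoretical PCIe bandwidth in GB/s for one direction of a x16 slot."""
    return gt_per_s * lanes * encoding / 8  # divide by 8 to convert bits to bytes


one_way = pcie_x16_gb_s()
both_ways = 2 * one_way

print(f"PCIe 4.0 x16: {one_way:.1f} GB/s per direction, {both_ways:.1f} GB/s both ways")
print(f"NVLink vs one direction:   {NVLINK_A100_GB_S / one_way:.1f}x")
print(f"NVLink vs both directions: {NVLINK_A100_GB_S / both_ways:.1f}x")
```

Comparing like with like (both directions on each side) gives a ratio a little under 10x; comparing against a single PCIe direction gives a little under 20x.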
In addition to enterprise-class GPUs there are numerous other acceleration devices that can aid deep learning and AI training workloads. These cards may be for specific tasks, allow programmability or meet a tighter budget requirement.
Alveo PCIe Gen4
NVIDIA RTX Accelerators
NVIDIA offers a very wide range of RTX GPU accelerators capable of giving great performance for workloads of single, double or half precision. With lower costs than the Ampere enterprise-class cards, they can deliver a cost-optimised training solution when absolute top-of-the-range performance isn’t required.
The Xilinx Alveo range of accelerator cards delivers compute, networking, and storage acceleration in an efficient small form factor, and is available with 100GbE networking, PCIe 4, and HBM2 memory. Designed to deploy in any server, these cards offer a flexible way to increase performance for a wide range of datacentre workloads.
Intel FPGA-based accelerator cards provide hardware programmability on production-qualified platforms, so data scientists can design and deploy models quickly, while allowing flexibility in a rapidly changing environment. They come complete with a robust collection of software, firmware, and tools designed to make it easier to develop and deploy FPGA accelerators for workload optimisation in datacentre servers.
Micron's Deep Learning Accelerator platform is a solution comprising a modular FPGA-based architecture, powered by Micron memory, running FWDNXT’s high performance engine tuned for a variety of neural networks. Featuring broad deep learning framework support combined with an easy-to-use toolset and software programmability, these accelerators can run multiple neural networks simultaneously.
Either AMD EPYC or Intel Xeon Scalable processors can be chosen when designing your server. Both, now in their 3rd generation, offer expansive ranges of models delivering performance for every budget, all supporting PCIe 4.0 with 64 or more lanes per processor. Additionally, EPYC P-series processors allow for single-socket configurations where GPU acceleration will be the primary server use, making a server as cost-effective as possible.
Depending on the type of workload, system memory capacity may matter more or less than GPU memory, but with a custom training server memory capacity can be tailored to your needs. A bespoke server also allows for simple future memory expansion if required. NVIDIA recommends at least double the amount of system RAM as GPU RAM, so high-end systems may scale into the TBs. Additionally, Intel Xeon based servers can combine traditional DIMMs with Intel Optane Persistent Memory DIMMs, allowing a flexible solution addressing performance, fast caching and extra storage capacity.
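The 2x guideline is easy to turn into a sizing check. The sketch below is illustrative only: the function name is hypothetical, and the GPU count and per-GPU memory used in the example are assumed inputs rather than a recommended configuration.

```python
def recommended_system_ram_gb(num_gpus, gpu_memory_gb, ratio=2):
    """Minimum system RAM (GB) using the 'at least double GPU RAM' guideline."""
    total_gpu_memory_gb = num_gpus * gpu_memory_gb
    return total_gpu_memory_gb * ratio


# Assumed example: 8 GPUs with 80 GB of memory each.
ram_gb = recommended_system_ram_gb(num_gpus=8, gpu_memory_gb=80)
print(f"Suggested minimum system RAM: {ram_gb} GB")
```

With these example inputs the guideline already lands above 1TB, which is why high-end systems scale into the TBs.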
Storage within a training server is also a very personal choice. A few TB of SSD capacity may be enough for the datasets of financial organisations, where even a large volume of files is still relatively small. Alternatively, image-based datasets may be so vast that internal storage is never a realistic option and a separate fast flash storage array is the way to go. If this is the case, internal SSD cost can be minimised and the remaining budget used elsewhere. Flexibility and performance can also be gained by choosing M.2 formats, NVMe connectivity or Optane options as required.
Depending on whether connectivity is needed to a wider network or to an external flash storage array, networking interfaces and speeds can be customised to suit. Ethernet or InfiniBand options are available at speeds of up to 400Gb/s, both providing powerful CPU offloading to maximise performance and minimise latency.
Additionally, advanced NVIDIA BlueField Data Processing Unit (DPU) NICs can be specified where the highest performance is required, as these cards not only include networking functionality but also accelerate software management, security and storage services by offloading these tasks from the CPU.
From 2U compact servers up to 4U expandable systems, chassis choice depends on whether space saving or scalability is the key factor. As a custom server can be partially populated, a larger chassis can be chosen with a view to expandability in the future. Additionally, both air cooled and liquid cooled server systems are available.
NVIDIA Virtual Compute Server (vCS) enables the benefits of hypervisor-based server virtualisation for GPU-accelerated servers. Datacentre admins are now able to power any compute-intensive workload with GPUs in a virtual machine (VM). vCS software virtualises NVIDIA GPUs to accelerate large workloads, including more than 600 GPU-accelerated applications for AI and deep learning.
With GPU sharing, multiple VMs can be powered by a single GPU, maximising utilisation and affordability, or a single VM can be powered by multiple virtual GPUs, making even the most intensive workloads possible.
It may be that over time, rather than a single bespoke training server, you end up with several systems as technologies advance and workloads increase. Although servers with different CPUs, GPUs and storage will communicate effectively when using common networking interfaces, it may be that you aren’t getting the maximum utilisation from the various GPUs you have. In this case Run:AI GPU virtualisation software may be able to help.
Run:AI works by pooling different GPU resources into a virtual pool and allowing workloads to be scheduled by user or project across the available resource, ensuring that neither hardware nor data scientists suffer a dip in productivity.
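The general idea of pooling and scheduling can be pictured with a toy example. The sketch below is not Run:AI's API or algorithm: it is a simple best-fit placement written for illustration, and the server names, project names and GPU counts are all made up.

```python
def schedule(pool, jobs):
    """Place jobs onto a shared GPU pool using a simple best-fit rule.

    pool: dict mapping server name -> number of free GPUs (hypothetical inventory)
    jobs: list of (project, gpus_needed) tuples, handled in order
    Returns (placements, waiting).
    """
    free = dict(pool)  # copy so the caller's inventory is not modified
    placements, waiting = [], []
    for project, needed in jobs:
        candidates = [server for server, n in free.items() if n >= needed]
        if not candidates:
            waiting.append((project, needed))  # no server can host it right now
            continue
        # Best fit: the server that leaves the fewest GPUs idle.
        server = min(candidates, key=lambda s: free[s])
        free[server] -= needed
        placements.append((project, server, needed))
    return placements, waiting


# Made-up pool of three differently sized servers and a queue of jobs.
pool = {"server-a": 8, "server-b": 4, "server-c": 2}
jobs = [("vision", 4), ("nlp", 8), ("fraud", 2), ("vision", 4), ("nlp", 2)]
placements, waiting = schedule(pool, jobs)
print(placements)
print(waiting)
```

In this example the first three jobs fill all 14 GPUs across the three servers, and the remaining two jobs wait until capacity is released.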
Our intuitive online configurators provide complete peace of mind when building your training server.
Alternatively, speak directly to one of our friendly system architects.