NVIDIA DGX A100
The Universal System for AI Infrastructure
A100 introduces double-precision Tensor Cores, the biggest milestone since the introduction of double-precision GPU computing for HPC. This enables researchers to reduce a 10-hour, double-precision simulation running on NVIDIA V100 Tensor Core GPUs to just four hours on A100. HPC applications can also leverage TF32 precision in A100’s Tensor Cores to achieve up to 10x higher throughput for single-precision dense matrix multiply operations.
The world's first AI system built on NVIDIA A100
NVIDIA DGX™ A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world’s first 5 petaFLOPS AI system. NVIDIA DGX A100 features the world’s most advanced accelerator, the NVIDIA A100 Tensor Core GPU, enabling enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts.
8X NVIDIA A100 GPUS WITH 320 GB TOTAL GPU MEMORY
12 NVLinks/GPU, 600 GB/s GPU-to-GPU Bi-directional Bandwidth
6X NVIDIA NVSWITCHES
4.8 TB/s Bi-directional Bandwidth, 2X More than Previous Generation NVSwitch
9X MELLANOX CONNECTX-6 200Gb/s NETWORK INTERFACES
450 GB/s Peak Bi-directional Bandwidth
DUAL 64-CORE AMD CPUs AND 1 TB SYSTEM MEMORY
3.2X More Cores to Power the Most Intensive AI Jobs
15 TB GEN4 NVMe SSD
25GB/s Peak Bandwidth, 2X Faster than Gen3 NVMe SSDs
The NVIDIA Ampere architecture, designed for the age of elastic computing, delivers the next giant leap by providing unmatched acceleration at every scale. The A100 GPU brings massive amounts of compute to datacentres. To keep those compute engines fully utilised, it has a class-leading 1.6TB/sec of memory bandwidth, a 67 per cent increase over the previous generation. In addition, the A100 has significantly more on-chip memory, including a 40MB Level 2 cache (7x larger than the previous generation) to maximise compute performance.
TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. Combining TF32 with structured sparsity on the A100 enables performance gains over Volta of up to 20x. Applications using NVIDIA libraries enable users to harness the benefits of TF32 with no code change required. TF32 Tensor Cores operate on FP32 inputs and produce results in FP32. Non-matrix operations continue to use FP32.
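As an illustration of the TF32 number format itself (not of the hardware path), the sketch below truncates an FP32 value to TF32's 10-bit mantissa in pure Python. The helper name `to_tf32` is hypothetical, and real Tensor Cores round to nearest rather than truncate:

```python
import struct

def to_tf32(x: float) -> float:
    """Approximate TF32 quantisation: keep FP32's 8-bit exponent but
    only the top 10 of its 23 mantissa bits (truncation here is a
    simplification; hardware rounds to nearest)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # zero the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Exactly representable values pass through unchanged...
print(to_tf32(1.0))         # 1.0
# ...while others keep roughly three decimal digits of precision.
print(to_tf32(3.14159265))  # 3.140625
```

This is why TF32 works as a drop-in for FP32 in deep learning: the dynamic range (exponent) is identical, and only low-order mantissa bits, which matter little for training convergence, are discarded.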
Modern AI networks are big and getting bigger, with millions and in some cases billions of parameters. Not all of these parameters are needed for accurate predictions and inference, and some can be converted to zeros to make the models "sparse" without compromising accuracy. Tensor Cores in A100 can provide up to 2x higher performance for sparse models. While the sparsity feature more readily benefits AI inference, it can also be used to improve the performance of model training.
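To make the 2:4 structured-sparsity pattern concrete, here is a minimal pure-Python sketch: in every group of four consecutive weights, the two smallest-magnitude values are zeroed, which is the pattern A100's sparse Tensor Cores accelerate. The function name `prune_2_to_4` and the magnitude-based selection are illustrative assumptions, not NVIDIA's pruning recipe:

```python
def prune_2_to_4(row):
    """Zero the 2 smallest-magnitude values in each group of 4,
    producing the 2:4 structured-sparsity pattern (illustrative;
    real pruning workflows also fine-tune to recover accuracy)."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # Indices of the two largest-magnitude values in this group.
        keep = sorted(range(len(group)),
                      key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

print(prune_2_to_4([0.1, -3.0, 0.2, 5.0]))  # [0.0, -3.0, 0.0, 5.0]
```

Because exactly two of every four values are zero, the hardware can skip half the multiply-accumulates, which is where the up-to-2x figure comes from.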
Multi-Instance GPU (MIG) expands the performance and value of each NVIDIA A100 GPU. MIG can partition the A100 GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. Now administrators can support every workload, from the smallest to the largest, offering a right-sized GPU with guaranteed quality of service (QoS) for every job, optimising utilisation and extending the reach of accelerated computing resources to every user.
Expand GPU access to more users
With MIG, you can achieve up to 7X more GPU resources on a single A100 GPU. MIG gives researchers and developers more resources and flexibility than ever before.
Optimise GPU utilisation
MIG provides the flexibility to choose many different instance sizes, which allows provisioning of a right-sized GPU instance for each workload, ultimately delivering optimal utilisation and maximising datacentre investment.
Run simultaneous mixed workloads
MIG enables inference, training, and high-performance computing (HPC) workloads to run at the same time on a single GPU with deterministic latency and throughput.
Up to 7 GPU instances in a single A100
Dedicated SM, Memory, L2 Cache, Bandwidth for hardware QoS & isolation
Simultaneous workload execution with guaranteed quality of service
All MIG instances run in parallel with predictable throughput & latency
Right-sized GPU allocation
Different sized MIG instances based on target workloads
Run any type of workload on a MIG instance
Diverse deployment environment
Supported on bare metal, Docker, Kubernetes and virtualised environments
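The points above can be sketched as a simple capacity check. The profile names and compute-slice counts below follow NVIDIA's published MIG profiles for the A100 40GB, but the fit test is deliberately simplified (real MIG placement also constrains memory slices and instance positions), and `fits_on_a100` is a hypothetical helper:

```python
# Compute slices consumed by common A100 40GB MIG profiles
# (per NVIDIA's MIG documentation; an A100 has 7 compute slices).
MIG_PROFILES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3,
                "4g.20gb": 4, "7g.40gb": 7}

def fits_on_a100(profiles):
    """Return True if the requested instances fit within the seven
    compute slices of one A100. Simplified: ignores memory-slice
    and placement constraints that real MIG configuration enforces."""
    used = sum(MIG_PROFILES[p] for p in profiles)
    return used <= 7

print(fits_on_a100(["1g.5gb"] * 7))         # True  - seven small instances
print(fits_on_a100(["7g.40gb", "1g.5gb"]))  # False - the full GPU leaves no room
```

A mix such as two `2g.10gb` instances plus one `3g.20gb` also sums to seven slices, which is how a single A100 serves several differently sized jobs at once.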
Scaling applications across multiple GPUs requires extremely fast movement of data. The third generation of NVIDIA NVLink in A100 doubles the GPU-to-GPU direct bandwidth to 600GB/s, almost 10x higher than PCIe Gen 4. When paired with the latest generation of NVIDIA NVSwitch, all GPUs in the server can communicate with each other at full NVLink speed for incredibly fast training.
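The 600GB/s figure follows directly from the per-GPU link count listed earlier, assuming third-generation NVLink's 25GB/s per direction per link:

```python
# Third-generation NVLink on A100: 12 links per GPU,
# 25 GB/s in each direction per link.
links_per_gpu = 12
gb_s_per_link_bidir = 2 * 25  # both directions combined

total_bidir_gb_s = links_per_gpu * gb_s_per_link_bidir
print(total_bidir_gb_s)  # 600, matching the quoted GPU-to-GPU bandwidth
```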
NVLink and NVSwitch are essential building blocks of the complete NVIDIA datacentre solution that incorporates hardware, networking, software, libraries, and optimised AI models and applications from NVIDIA GPU Cloud (NGC).
The NVIDIA GPU Cloud
The NGC provides researchers and data scientists with simple access to a comprehensive catalogue of GPU-optimised software tools for deep learning and high performance computing (HPC) that take full advantage of NVIDIA GPUs. The NGC container registry features NVIDIA A100 tuned, tested, certified, and maintained containers for the top deep learning frameworks. It also offers third-party managed HPC application containers, NVIDIA HPC visualisation containers, and partner applications.
The DGX A100 Pod
As an end-to-end AI solution provider, Scan can provide complete DGX A100 architectures paired with Mellanox InfiniBand low-latency high-throughput network fabric and a choice of all-flash storage solutions. These reference architectures ensure maximum GPU performance and utilisation and are supported by complete software and services packages.
• DGX POD more attainable than ever with DGX A100
• Experience a faster start in building flexible AI infrastructure
• Proven architectures, with leading storage partners
• Up to 40 PFLOPS computing power in just 2 racks
• 700 PFLOPS of power to train the previously impossible
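The 40 PFLOPS figure above is consistent with the 5 petaFLOPS quoted per DGX A100, assuming eight systems across the two racks (the per-rack split is an assumption here, not stated in the text):

```python
pflops_per_dgx_a100 = 5    # AI performance per system, as quoted above
systems_in_two_racks = 8   # assumption: four DGX A100 systems per rack

pod_pflops = pflops_per_dgx_a100 * systems_in_two_racks
print(pod_pflops)  # 40
```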
AI Optimised Storage
Deep learning appliances such as the DGX A100 only work as intended if the GPU accelerators are fed data consistently and rapidly enough to deliver maximum utilisation. Scan offers a wide range of AI-optimised storage appliances suitable for deployment with the DGX A100.
Proof of Concept
Sign up to try one of the AI & Deep Learning solutions available from Scan Computers.
| DGX with NVIDIA A100 | |
|---|---|
| GPUs | 8x NVIDIA A100 Tensor Core GPUs |
| GPU Specifications | 6912 CUDA cores & 432 TF32 Tensor Cores per GPU |
| GPU Memory | 40GB per GPU, 320GB total |
| GPU Interconnect | 6x NVSwitch |
| Host CPUs | 2x AMD EPYC 7742, total 128 cores / 256 threads |
| System Memory | 1TB ECC Reg DDR4 |
| System Drives | 2x 1.92TB NVMe SSDs |
| Storage Drives | 4x 3.84TB NVMe SSDs |
| Networking | 8x single-port Mellanox ConnectX-6 VPI 200Gb/s HDR InfiniBand; 1x dual-port Mellanox ConnectX-6 VPI 200Gb/s Ethernet |
| Operating System | Ubuntu Linux |
| Operating Temperature Range | 5°C to 30°C (41°F to 86°F) |