As the types and uses of AI have evolved, two distinct types of inferencing have emerged. The first type is large scale, where significant GPU resource is still required to deliver the requested outcome. For example, LLMs such as ChatGPT route requests back to a datacentre full of servers processing many actions simultaneously. This type of inferencing relies very much on the same hardware used to train the model - so to make the best choice, we advise reading our AI Training Hardware Buyers Guide, where model parameters and memory sizes are discussed in greater detail.

The second type of inferencing is much more focused, where low-power embedded GPU modules are sufficient to deliver the desired outcome. Examples include the brain of an autonomous vehicle or a robot, where a scaled down version of the AI model is installed within the device in order to control how it behaves and reacts to external inputs and information.

This guide is focused on this type of inferencing, its major use cases, and the embedded GPU systems that support it.

AI edge inferencing hardware

Use Cases

Explore use cases that require rapid real-time inferencing at the edge by clicking the tabs below.

Robotics

Robotics is undergoing a revolution, moving beyond the era of specialist machines to generalist robots. This shift replaces single-purpose, fixed-function robots with adaptable machines trained to perform diverse tasks across varied environments. Inspired by human cognition, these adaptable robots combine fast, reactive responses with high-level reasoning and planning, enabling more efficient learning and adaptation.

Humanoid robot hardware and software architecture

Building a typical humanoid robot requires four essential layers:


Hardware Abstraction

Integrates all key sensing and actuation modalities, enabling the robot to perceive its environment and interact physically with the world.


Real-Time Control Framework

Manages precise, low-latency control of the robot's movement, where minimising latency is absolutely critical for safe and responsive operation.
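As an illustration of why low latency matters here, the sketch below (plain Python, not any NVIDIA framework) shows the fixed-rate loop pattern such a control layer implements - each control step must finish within its period, and overruns count as missed deadlines:

```python
import time

def run_control_loop(step_fn, period_s=0.002, iterations=100):
    """Run step_fn at a fixed period (here 500 Hz) and count missed deadlines."""
    missed = 0
    next_deadline = time.perf_counter() + period_s
    for _ in range(iterations):
        step_fn()                         # read sensors, compute commands, actuate
        now = time.perf_counter()
        if now > next_deadline:           # the step overran its time budget
            missed += 1
            next_deadline = now + period_s
        else:                             # wait out the remainder of the period
            time.sleep(next_deadline - now)
            next_deadline += period_s
    return missed

# A trivial step comfortably meets a 2 ms budget on a typical machine.
misses = run_control_loop(lambda: None)
```

A real controller would replace the lambda with sensor reads and torque commands, and would treat any missed-deadline count above zero as a fault condition.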


Perception and Planning

Equips the robot with environmental understanding, grasp and motion planning, locomotion, object recognition, and localisation—allowing effective interaction with the surrounding world.


High-Level Reasoning

Powers advanced functions such as scene understanding, complex task planning, and natural language interaction, where longer processing times are acceptable to support deeper reasoning and adaptability.

NVIDIA AI Software Stack

To deliver a seamless cloud-to-edge experience, embedded GPUs run the NVIDIA AI software stack for physical AI applications, including NVIDIA Isaac for robotics, NVIDIA Metropolis for visual agentic AI, and NVIDIA Holoscan for sensor processing. The resulting model is then deployed for inference on the embedded GPU installed within the robot.
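As a rough sketch of what running a model on the embedded GPU involves, the example below shows the generic preprocess / infer / postprocess shape of an edge inference pipeline. The "model" here is a deliberately trivial stand-in - on real hardware this step would be a TensorRT or ONNX engine executing on the Jetson module:

```python
def preprocess(raw_frame):
    # e.g. normalise 8-bit pixel values into the [0, 1] range the model expects
    return [value / 255.0 for value in raw_frame]

def postprocess(scores, labels):
    # pick the highest-scoring class
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

def stub_model(inputs):
    # illustrative stand-in: scores the frame by its mean intensity
    mean = sum(inputs) / len(inputs)
    return [1.0 - mean, mean]             # [score for "dark", score for "bright"]

frame = [250, 240, 255, 245]              # a bright 2x2 "image"
label = postprocess(stub_model(preprocess(frame)), ["dark", "bright"])
# label == "bright"
```

The same three-stage shape holds whether the model classifies camera frames, segments organs, or plans a grasp; only the engine in the middle changes.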

NVIDIA Isaac GR00T components

Smart Cities

Cities around the world are using AI and digital twins to reimagine how their most valuable physical assets and spaces are managed. NVIDIA embedded GPUs at the edge process real-time data from a host of cameras and sensors, feeding deep learning-powered video analytics. Combined, these help to increase operational efficiency and safety across a broad range of spaces - from city streets and airports to event centres, shops and factory floors.


Smart Cities

AI brings innovative ways to build sustainable cities, keep infrastructure in top shape, and enhance public spaces like roadways for residents and communities. The transformation begins by turning data from countless sensors and IoT devices into crucial decisions with vision AI.


Smart Airports

Handling millions of passengers annually, airports need to quickly and accurately manage incidents to minimise disruptions. AI-powered video analytics turn surveillance cameras into a source of actionable insights, ensuring smooth operations and a better passenger experience.


Smart Campus

Corporate buildings and educational campuses can benefit from vision AI, which offers proactive safety through continuous monitoring without dedicated staff, and rapid response to issues that could otherwise go unnoticed and unreported.


Smart Venues

Entertainment venues are designed for enjoyment, whether for sports, concerts, or other events, but they also need to ensure safety and efficiency. IoT sensors and cameras provide real-time responses to any concerns, maintaining a secure and enjoyable environment.


Smart Retail

Use AI to reduce losses, speed up checkout processes, prevent stockouts, and gain insights into customer behaviour for better merchandising. Camera and sensor data provide analytics that enhance decision-making, streamline operations, and boost efficiency.


Smart Manufacturing

In manufacturing, the automation and monitoring of assets, systems, and environments are crucial. Companies are leveraging AI and IoT sensors to gain real-time insights, leading to a safer and more efficient workplace.

The NVIDIA Omniverse Blueprint for smart city AI provides the complete software stack needed to accelerate the development and testing of AI agents in physically accurate digital twins of cities. It includes:


NVIDIA Omniverse

Builds physically accurate digital twins and runs simulations at city scale.


NVIDIA Cosmos

Generates synthetic data at scale for post-training AI models.


NVIDIA NeMo

Curates high-quality data and uses that data to train and fine-tune vision language models (VLMs) and LLMs.


NVIDIA Metropolis

Builds and deploys video analytics AI agents for video search and summarisation (VSS).

The blueprint workflow comprises three key steps. First, developers create a SimReady digital twin of locations and facilities using aerial, satellite or map data with Omniverse and Cosmos. Second, they can train and fine-tune AI models, such as computer vision models and VLMs, using NVIDIA TAO to improve accuracy for vision AI use cases. Finally, real-time AI agents powered by these customised models are deployed to alert, summarise and query camera and sensor data using the Metropolis VSS. Embedded GPUs feature widely through the physical edge AI systems needed to gather critical data.

Healthcare

The healthcare industry is deploying AI at the edge, providing high-performance computing in small, power-efficient devices for real-time medical image analysis, robotic surgery guidance, digital pathology, patient monitoring, and accelerated genomic sequencing. These platforms enable devices to process data locally, enhancing speed, reliability and security, and enabling faster clinical decisions by bringing AI power directly to the point of care.

AI powered medical imaging

Embedded GPUs are able to deliver a range of healthcare technologies, including:


Medical Image Analysis

NVIDIA MONAI helps to process medical images such as X-rays, CT scans, and ultrasounds in record time to detect anomalies, improve image quality, and track changes over time, supporting faster and more accurate diagnoses.


Digital Surgery

Embedded systems and NVIDIA Holoscan can guide surgeons in operating rooms, providing real-time AI assistance for tasks such as tool tracking, organ segmentation, and enhanced visual clarity on streaming video.


Digital Pathology

AI-powered microscopes can help pathologists identify subtle abnormalities, improve diagnostic accuracy, and streamline workflows for analysing tissue samples.


Patient Monitoring & Safety

Bedside patient monitoring systems built with Clara Guardian help process sensor data, detect falls, track patient movement, and generate alerts for early intervention - improving safety and providing real-time data analytics.
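As a deliberately simplified illustration of the kind of signal processing such monitoring involves, the heuristic below flags a fall when an acceleration spike is followed by near-stillness. Real systems built with Clara Guardian use trained models, not fixed thresholds like these:

```python
import math

def detect_fall(samples, spike_g=2.5, still_g=0.3):
    """samples: list of (ax, ay, az) accelerometer readings in g."""
    magnitudes = [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in samples]
    for i, m in enumerate(magnitudes[:-1]):
        # an impact spike, after which all readings sit near 1 g (gravity only)
        if m > spike_g and all(abs(later - 1.0) < still_g for later in magnitudes[i + 1:]):
            return True
    return False

fall_trace = [(0, 0, 1.0), (2.1, 1.5, 1.4), (0, 0, 1.02), (0, 0, 0.98)]   # impact, then still
walking_trace = [(0.1, 0.2, 1.0), (0.2, 0.1, 1.1), (0.1, 0.3, 0.9)]       # no spike

# detect_fall(fall_trace) -> True; detect_fall(walking_trace) -> False
```

In production, the alert generated from such a detection would feed the early-intervention workflow described above.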

Embedded GPU systems connected to cloud-based GPUs drive scalability and AI deployment, with NVIDIA Triton Inference Server helping to scale AI models for medical imaging and ensuring rapid, consistent diagnostics across multiple locations.

Medical devices with AI

Introducing NVIDIA Jetson

NVIDIA is the leading provider of edge AI and robotics platforms, offering powerful, compact Jetson GPU-accelerated modules and the JetPack software development kit (SDK). Jetson hardware is available as developer kits or modules for system integration. JetPack provides pre-built software services to fast-track sophisticated edge AI applications, including robotics, generative AI and computer vision.

JetPack supports all Jetson modules, delivering real-time sensor processing, visual AI, and advanced robotics features in a unified ecosystem that is seamlessly compatible with NVIDIA DGX hardware and software stacks, and the NVIDIA Omniverse platform for simulation and development of digital twins.

The latest Jetson modules bring more power to NVIDIA's three-computer solution for building AI-powered robots: first, a DGX appliance to train the AI model that will be deployed on the robot; second, the Omniverse platform to simulate how the robot will move and react in the real world; and third, a Jetson module running the model on the robot.

NVIDIA Jetson software stack

The NVIDIA Jetson ecosystem offers a comprehensive range of products and services, including AI software, development tools, and hardware solutions such as servers, edge appliances, and industrial PCs from certified partners such as Scan. These solutions support industries ranging from robotics and manufacturing to retail, transportation, and healthcare, with commercial and ruggedised options also available.

Embedded AI Hardware

NVIDIA Jetson modules span a wide range of performance levels and price points, making them suitable for a broad variety of autonomous applications. The two main series are Thor and Orin, although the older Xavier, TX2 and Nano models are still available for legacy projects.

NVIDIA Jetson Thor


NVIDIA Jetson Thor is available as either an AGX developer kit or a choice of two GPU modules - the T5000 and the T4000 - which require system integration. Thor is NVIDIA's flagship platform for physical AI, providing powerful compute for generative reasoning and multimodal, multi-sensor processing, so robots no longer need to be reprogrammed for each new job. Thor can be integrated into next-generation robots to accelerate foundation models, offering the flexibility to handle challenges such as object manipulation, navigation and following complex instructions.

Architecture

Jetson Thor is a SoC (System on Chip), comprising a Blackwell GPU with 5th gen Tensor cores and an Arm CPU, each sharing a unified memory pool. The AGX Thor Developer Kit and T5000 module have the same specs, with the T4000 module consuming less power at the cost of lower performance. However, the power consumption of all three variants can be configured to meet your project requirements.

NVIDIA Jetson Thor
| | Jetson AGX Thor Developer Kit / Jetson T5000 | Jetson T4000 |
| --- | --- | --- |
| AI Performance (FP4) | 2,070 TOPS | 1,200 TOPS |
| GPU | NVIDIA Blackwell, 2,560 CUDA cores, 96 5th gen Tensor cores | NVIDIA Blackwell, 1,536 CUDA cores, 64 5th gen Tensor cores |
| GPU Max Frequency | 1.57GHz | 1.57GHz |
| CPU | 14-core Arm Neoverse-V3AE | 12-core Arm Neoverse-V3AE |
| Memory | 128GB LPDDR5X | 64GB LPDDR5X |
| Networking | 4x 25GbE | 3x 25GbE |

With its Multi-Instance GPU (MIG) technology and suite of accelerators, Thor can handle real-time video data streaming and AI inference, making it ideal for building AI agents performing video search and summarisation (VSS) tasks at the edge. Thor modules also support a wide range of generative AI models - including VLA (Vision Language Action), LLMs (Large Language Models) and VLMs (Vision-Language Models), delivering seamless cloud-to-edge integration.

Relative Performance & Capability

Jetson Thor is the most powerful inferencing module. It delivers over 7.5x higher AI compute than Jetson Orin, with 3.5x better energy efficiency.

| Module | AI Performance (FP4) | Memory | Power | Dimensions |
| --- | --- | --- | --- | --- |
| Jetson Thor T5000 | 2,070 TOPS | 128GB | 40-130W | 243x112mm |
| Jetson Thor T4000 | 1,200 TOPS | 64GB | 40-70W | 100x87mm |
| Jetson AGX Orin 64GB | 275 TOPS | 64GB | 15-60W | 100x87mm |
| Jetson AGX Orin 64GB Industrial | 248 TOPS | 64GB | 15-75W | 100x87mm |
| Jetson AGX Orin 32GB | 200 TOPS | 32GB | 15-40W | 100x87mm |
| Jetson Orin NX 16GB | 157 TOPS | 16GB | 10-40W | 69.6x45mm |
| Jetson Orin NX 8GB | 117 TOPS | 8GB | 10-40W | 69.6x45mm |
| Jetson Orin Nano 8GB | 67 TOPS | 8GB | 7-25W | 69.6x45mm |
| Jetson Orin Nano 4GB | 34 TOPS | 4GB | 7-25W | 69.6x45mm |
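Those headline ratios can be checked against the figures in the table above, comparing the Thor T5000 with the Jetson AGX Orin 64GB at their maximum rated power (real-world efficiency depends on workload and the configured power mode):

```python
thor_tops, thor_max_w = 2070, 130      # Jetson Thor T5000, from the table above
orin_tops, orin_max_w = 275, 60        # Jetson AGX Orin 64GB, from the table above

compute_ratio = thor_tops / orin_tops                                   # ~7.5x
efficiency_ratio = (thor_tops / thor_max_w) / (orin_tops / orin_max_w)  # ~3.5x
```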

NVIDIA Jetson Orin


NVIDIA Jetson Orin is available either as AGX developer kits or as a choice of seven GPU modules that require system integration. The compact Jetson AGX Orin Developer Kit offers maximum performance but can also emulate any of the Jetson Orin modules, while the Jetson Orin Nano Super Developer Kit is smaller and includes a reference carrier board compatible with all Orin NX and Orin Nano modules. All Orin modules are provided with a powerful software stack featuring pre-trained AI models, reference AI workflows and vertical application frameworks, accelerating end-to-end development for generative AI, as well as edge AI and robotics applications.

Architecture

Jetson Orin is a SoC (System on Chip), comprising an Ampere GPU with 3rd gen Tensor cores and an Arm CPU, each sharing a unified memory pool. With nine models to choose from, Orin is available in a wide variety of performance, power consumption and budget levels.

NVIDIA Jetson Orin
| Model | AI Performance | GPU | GPU Max Frequency | CPU | Memory | Networking |
| --- | --- | --- | --- | --- | --- | --- |
| Jetson AGX Orin Developer Kit | 275 TOPS | NVIDIA Ampere, 2,048 CUDA cores, 64 3rd gen Tensor cores | 1.3GHz | 12-core Arm Cortex A78AE v8.2 | 64GB LPDDR5 | 10GbE |
| Jetson AGX Orin 64GB | 275 TOPS | NVIDIA Ampere, 2,048 CUDA cores, 64 3rd gen Tensor cores | 1.3GHz | 12-core Arm Cortex A78AE v8.2 | 64GB LPDDR5 | 10GbE |
| Jetson AGX Orin Industrial | 248 TOPS | NVIDIA Ampere, 2,048 CUDA cores, 64 3rd gen Tensor cores | 1.2GHz | 12-core Arm Cortex A78AE v8.2 | 64GB LPDDR5 | 10GbE |
| Jetson AGX Orin 32GB | 200 TOPS | NVIDIA Ampere, 1,792 CUDA cores, 56 3rd gen Tensor cores | 0.9GHz | 8-core Arm Cortex A78AE v8.2 | 32GB LPDDR5 | 10GbE |
| Jetson Orin NX 16GB | 157 TOPS | NVIDIA Ampere, 1,024 CUDA cores, 32 3rd gen Tensor cores | 1.17GHz | 8-core Arm Cortex A78AE v8.2 | 16GB LPDDR5 | 1GbE |
| Jetson Orin NX 8GB | 117 TOPS | NVIDIA Ampere, 1,024 CUDA cores, 32 3rd gen Tensor cores | 1.17GHz | 6-core Arm Cortex A78AE v8.2 | 8GB LPDDR5 | 1GbE |
| Jetson Orin Nano Super Developer Kit | 67 TOPS | NVIDIA Ampere, 1,024 CUDA cores, 32 3rd gen Tensor cores | 1GHz | 6-core Arm Cortex A78AE v8.2 | 8GB LPDDR5 | 1GbE |
| Jetson Orin Nano 8GB | 67 TOPS | NVIDIA Ampere, 1,024 CUDA cores, 32 3rd gen Tensor cores | 1GHz | 6-core Arm Cortex A78AE v8.2 | 8GB LPDDR5 | 1GbE |
| Jetson Orin Nano 4GB | 34 TOPS | NVIDIA Ampere, 512 CUDA cores, 16 3rd gen Tensor cores | 1GHz | 6-core Arm Cortex A78AE v8.2 | 4GB LPDDR5 | 1GbE |

Relative Performance & Capability

Jetson Orin modules occupy the high-end to mid-range, offering great flexibility and versatility, especially with the developer kits.

| Module | AI Performance (FP4) | Memory | Power | Dimensions |
| --- | --- | --- | --- | --- |
| Jetson Thor T5000 | 2,070 TOPS | 128GB | 40-130W | 243x112mm |
| Jetson Thor T4000 | 1,200 TOPS | 64GB | 40-70W | 100x87mm |
| Jetson AGX Orin 64GB | 275 TOPS | 64GB | 15-60W | 100x87mm |
| Jetson AGX Orin 64GB Industrial | 248 TOPS | 64GB | 15-75W | 100x87mm |
| Jetson AGX Orin 32GB | 200 TOPS | 32GB | 15-40W | 100x87mm |
| Jetson Orin NX 16GB | 157 TOPS | 16GB | 10-40W | 69.6x45mm |
| Jetson Orin NX 8GB | 117 TOPS | 8GB | 10-40W | 69.6x45mm |
| Jetson Orin Nano 8GB | 67 TOPS | 8GB | 7-25W | 69.6x45mm |
| Jetson Orin Nano 4GB | 34 TOPS | 4GB | 7-25W | 69.6x45mm |

Ready to Buy?

Click the links below to view the range of AI inferencing solutions. If you still have questions on how to select the perfect configuration, don't hesitate to contact one of our friendly advisors on 01204 474210 or at [email protected].

Shop NVIDIA Jetson Solutions
View NVIDIA Jetson Developer Kits & Modules
NVIDIA Jetson hardware

Need AI Training Hardware?

Looking to train AI models for deployment on edge devices? Check out our AI Training Hardware Buyers Guide to find the right datacentre GPUs and systems for your AI development needs.

View AI Training Guide
NVIDIA AI software stack

Frequently Asked Questions (FAQ)

Here are some common questions and answers to help you find the information you need.

What is inferencing?

Inferencing is the final stage of an AI project, where a trained model is presented with unseen data or prompts to complete the task it was designed for.

What is an inference in AI?

An inference in AI is the output or end result of an AI model, the nature of which is determined by the data the model was trained on.

What is an inference system?

An inference system is where the trained model is presented with unseen data and asked to respond. Common examples include chatbots, which infer their responses based on previously analysed conversations, or image generators, which create new images based on previously analysed art.

What are the two types of inferencing?

The first type of inferencing is large scale, where significant GPU resource is required to deliver the requested outcome. For example, LLMs such as ChatGPT route requests back to datacentres full of servers processing many actions simultaneously.

The second type of inferencing is much more focused, where low-power embedded GPU modules are sufficient to deliver the desired outcome. Examples include the brain of an autonomous vehicle or a robot, where a scaled down version of the AI model is installed within the device in order to control how it behaves and reacts to external inputs and information.