BLOG POST By James Gorbold 01/09/2025

SCAN AI: AI PROJECT PLANNING PART 5 OF 7 - MODEL TRAINING

Last time, in part four of this guide, we looked at model development - using GPU-accelerated resources to test and optimise models prior to selecting the best one to train and scale.

Model training is where your chosen development model is run through many epochs and iterations to produce the final production model. It is the most expensive phase of an AI project, so it is vital to have the most suitable hardware and software infrastructure for your goals and expected outcomes.

Invest in infrastructure

Once a development pipeline has been established and an AI model is ready for production, the next step is to train the model against your full training dataset. The training phase requires significant GPU and storage resources, as many iterations will be needed - it is therefore unsurprisingly the most expensive part of any AI project. Training demands far more hardware than development - at minimum a multi-GPU server, supported by fast storage and connected via high-throughput, low-latency networking. If these three components of your AI infrastructure aren’t matched and optimised, productivity and efficiency will suffer - the fastest GPU-accelerated supercomputer is a waste of money if it is connected to slow storage that cannot keep its GPUs 100% utilised.
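
To illustrate why this balance matters, here is a rough back-of-envelope sketch - using entirely hypothetical throughput figures, not measurements from any specific system - of the storage bandwidth needed to keep a multi-GPU training server fed with data.

```python
# Back-of-envelope check: can the storage keep the GPUs busy?
# All figures below are hypothetical placeholders - substitute
# benchmarked numbers for your own model, dataset and hardware.

num_gpus = 8                     # GPUs in the training server
samples_per_sec_per_gpu = 500    # training throughput of one GPU
sample_size_mb = 0.5             # average size of one training sample

# Aggregate read bandwidth the data pipeline must sustain
required_mb_per_sec = num_gpus * samples_per_sec_per_gpu * sample_size_mb
storage_mb_per_sec = 1_500       # sustained read speed of the storage tier

utilisation = min(1.0, storage_mb_per_sec / required_mb_per_sec)
print(f"Required read bandwidth: {required_mb_per_sec:,.0f} MB/s")
print(f"Estimated GPU utilisation ceiling: {utilisation:.0%}")
```

With these made-up figures the storage tier caps the GPUs at roughly 75% utilisation - a quarter of an expensive server’s compute would be wasted waiting for data.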

AI SERVERS

Harness the power of NVIDIA GPUs in custom-designed 3XS Systems EGX, MGX or HGX servers. Our in-house system build division produces fully configurable server solutions, built from the ground up with AI workloads in mind and tailored to every size of AI project. For larger projects, NVIDIA DGX appliances may be a better alternative.

AI-optimised storage

PEAK:AIO has developed a software-defined storage (SDS) platform built from the ground up for AI workloads, complementing the NVIDIA EGX, HGX and DGX range of servers. It delivers ultra-low latency and enormous bandwidth at a price that lets you invest more in GPU resources and less in storage.

AI-ready networking

NVIDIA Spectrum Ethernet and Quantum InfiniBand switches, matched to your servers, provide the throughput and latency required for AI workloads. Offering speeds of up to 800Gb/s and outstanding resiliency, they ensure maximum GPU utilisation across your entire infrastructure.

Our team of Scan AI experts can design, install and manage your AI infrastructure - either on your premises or hosted with one of our datacentre partners - ensuring optimal performance at all times, delivering maximum ROI for your business or organisation.

To buy or rent?

The GPU-accelerated systems needed for the training phase represent the largest single cost of an AI project, so if you intend to purchase in-house infrastructure you need to consider the GPU optimisation and utilisation points above. You’ll also need to consider the complexity of connecting and configuring servers and storage via high-throughput, low-latency networking, and where to house it all - on-premises or hosted in a datacentre. Purchasing and owning this complete infrastructure is one approach - using a cloud service provider (CSP) is another.

32% of enterprises use only a public cloud approach, 32% use only a private cloud approach, and 36% use both, based on use case

Enterprise Technology Research, 2023

Each option comes with pros and cons; however, hybrid environments, which offer the best of both worlds, are increasingly becoming the norm. As AI projects require different resources at the development, training and inferencing stages, a hybrid deployment combines the cost control and data integrity of owning hardware with the ability to burst into public GPU compute farms in the cloud when extra capacity is needed fast.

There are many horror stories of cloud costs spiralling wildly, but retaining some on-premises hardware reduces this risk, as does optimising models prior to training so that any CSP GPU instances are correctly sized, monitored and controlled. Look for a CSP with a pedigree in AI that offers support from engineers and data scientists, rather than a ‘hands-off’ hyperscaler. Applying an 80/20 rule to this combination is key to delivering projects within budget as they scale, without relying so heavily on either side that costs get out of hand and overspend erodes the project’s benefits.
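
As a rough illustration of the buy-versus-rent decision, the sketch below - using entirely hypothetical prices and lifetimes, not quotes from Scan or any CSP - estimates the utilisation level at which owning a GPU server becomes cheaper than renting equivalent cloud instances.

```python
# Hypothetical buy-vs-rent break-even estimate.
# All prices and lifetimes are placeholder assumptions -
# replace them with real quotes for your own project.

server_capex = 250_000        # purchase price of a multi-GPU server
annual_opex = 30_000          # power, cooling, hosting, support per year
lifetime_years = 3            # depreciation period

cloud_rate_per_hour = 30.0    # equivalent multi-GPU cloud instance

hours_per_year = 8_760
total_owned_cost = server_capex + annual_opex * lifetime_years

# Hours of cloud usage that would cost the same as owning outright
break_even_hours = total_owned_cost / cloud_rate_per_hour
utilisation = break_even_hours / (hours_per_year * lifetime_years)

print(f"Break-even at {break_even_hours:,.0f} GPU-server hours "
      f"({utilisation:.0%} average utilisation over {lifetime_years} years)")
```

Under these made-up numbers, owning wins once the server is busy more than roughly 43% of the time; below that, renting - or a hybrid model that bursts to the cloud - is the cheaper route.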

The Scan Cloud Difference

Using the cloud for AI that rapidly scales can be daunting. Scan makes it simple.

Our cloud platforms are specifically designed to accelerate GPU-demanding applications, rather than adapted from general purpose systems. Additionally, workload specialists will guide you through every stage, from initial proof of concept right through to deployed solution.

The power of an NVIDIA GPU, anywhere.

Get Cloud Computing in 3 Simple Steps
1. Choose your GPU

Browse our available options or connect with a specialist to discuss a bespoke solution.

2. Rapid Provisioning

We’ll provision your environment and take you through a guided onboarding process.

3. Enjoy Scan Cloud

You’re online! Set up, get to work, and access support anytime.

RAG and fine-tuning your model

If your model is based on a foundation model (FM) or on previous work of your own, it will require project-specific training to improve its accuracy. Retrieval Augmented Generation (RAG) is a technique that retrieves additional data and combines it with the original query, providing greater context for language models. Fine-tuning is a technique that alters some or all of the model weights, using a new dataset, to better fit a specific task. A backpropagation algorithm passes examples from the dataset through the model and collects its outputs, calculating the gradient of the loss between the model’s actual and expected outputs. The model’s parameters are then updated to reduce the loss, using gradient descent or an adaptive learning rate algorithm. This is repeated for multiple epochs until the model converges.
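
To make that fine-tuning loop concrete, here is a minimal sketch in PyTorch - the tiny model, random dataset and hyperparameters are hypothetical stand-ins, not part of this guide - showing the backpropagation and parameter-update cycle described above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins: a tiny model and random task data.
# In a real project these would be your foundation model and
# your project-specific fine-tuning dataset.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
inputs = torch.randn(1024, 128)
labels = torch.randint(0, 10, (1024,))
loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)

loss_fn = nn.CrossEntropyLoss()
# Adam is one example of the adaptive learning rate algorithms mentioned above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):                           # repeat for multiple epochs
    for batch_inputs, batch_labels in loader:
        outputs = model(batch_inputs)            # collect the model's outputs
        loss = loss_fn(outputs, batch_labels)    # actual vs expected outputs
        optimizer.zero_grad()
        loss.backward()                          # backpropagate the loss gradient
        optimizer.step()                         # update weights to reduce the loss
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

At production scale the same loop runs across many GPUs and far larger datasets, which is exactly why the balanced compute, storage and networking described earlier matter so much.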

Read our 7-part AI Project Planning Guide