AI Project Planning Part 5 of 7 - Model Training
Last time, in part four of this guide, we looked at model development - using GPU-accelerated resources to test and optimise models prior to selecting the best one to train and scale.
Model training is where your chosen development model is run through many epochs and iterations to produce the final production model. It is the most expensive phase of an AI project, so it is vital to have the most suitable hardware and software infrastructure for your goals and expected outcomes.
Invest in infrastructure
Once a development pipeline has been established and an AI model is ready for production, the next step is to train the model against your full training dataset. Training requires far more hardware resource than development - at minimum a multi-GPU server, supported by fast storage and connected via high-throughput, low-latency networking - and because many iterations are needed, it is unsurprisingly the most expensive part of any AI project. If these three components of your AI infrastructure aren't matched and optimised, productivity and efficiency will suffer: the fastest GPU-accelerated supercomputer is a waste of money if it is connected to slow storage that cannot keep its GPUs 100% utilised.
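A quick way to sanity-check that last point is to sample GPU utilisation while a training job runs. Below is a minimal sketch using the nvidia-ml-py (pynvml) bindings; it assumes an NVIDIA driver is installed and simply prints compute and memory utilisation for each GPU - sustained low compute utilisation during training often points to a storage or network bottleneck.

```python
import time
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
for _ in range(10):  # sample once per second for ten seconds
    for i, handle in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {i}: compute {util.gpu}%  memory {util.memory}%")
    time.sleep(1)
pynvml.nvmlShutdown()
```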
AI servers
Harness the power of NVIDIA GPUs in custom-designed 3XS Systems EGX, MGX or HGX servers. Built from the ground up with AI workloads in mind, these fully configurable server solutions are produced by our in-house system build division and tailored to every size of AI project. For larger projects, NVIDIA DGX appliances may be a better alternative.
Learn More
AI-optimised storage
PEAK:AIO has developed a software-defined storage (SDS) platform built from the ground up for AI workloads, to complement the NVIDIA EGX, HGX and DGX range of servers. It delivers ultra-low latency and tremendous bandwidth at a price that allows more investment to be made in GPU resource and less in storage.
Learn More
AI-ready networking
NVIDIA Spectrum Ethernet and Quantum InfiniBand switches, matched to your servers, provide the throughput and latency required for AI workloads. Offering speeds of up to 800Gb/s and outstanding resiliency, they ensure maximum GPU utilisation across your entire infrastructure.
Learn More
Our team of Scan AI experts can design, install and manage your AI infrastructure - either on your premises or hosted with one of our datacentre partners - ensuring optimal performance at all times and delivering maximum ROI for your business or organisation.
To buy or rent?
The GPU-accelerated systems needed for the training phase of AI projects represent the largest single cost, so if you intend to purchase an in-house infrastructure you need to consider the GPU optimisation and utilisation points above. You'll also need to consider the complexities of connecting and configuring servers and storage via high-throughput, low-latency networking, and where to house it all - on-premise or hosted in a datacentre. Purchasing and owning this complete infrastructure is one approach - using a cloud service provider (CSP) is another.
32% of enterprises use only a public cloud approach, 32% use only a private cloud approach, and 36% use both, based on use case
Enterprise Technology Research, 2023
Either option comes with pros and cons; however, deploying hybrid environments, which allow for the best of both worlds, is increasingly becoming the norm. As AI projects require differing resources at the development, training and inferencing stages, a hybrid deployment allows for the cost control and sensitive-data integrity associated with owning hardware, alongside the ability to burst into public GPU compute farms in the cloud when extra capacity is needed fast.
There are many horror stories of cloud costs spiralling wildly, but retaining some element of on-premise hardware reduces this risk, as does optimising models prior to training so that any CSP GPU instances are correctly sized, monitored and controlled. Look for a CSP with a pedigree in AI that offers support from engineers and data scientists, rather than a 'hands-off' hyperscaler. Applying an 80/20 rule to this on-premise/cloud combination is key to delivering projects within budget as they scale, without relying on either to the degree that costs get out of hand and overspend overshadows the project's benefits.
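Correct sizing starts with estimating spend before committing to a run. A back-of-envelope sketch - all figures below are illustrative assumptions, not quoted rates:

```python
def run_cost(gpus: int, hours: float, rate_per_gpu_hour: float) -> float:
    """Estimated cloud cost for a single training run."""
    return gpus * hours * rate_per_gpu_hour

# Illustrative only: an 8-GPU instance for a 72-hour run at $2.50/GPU-hour.
print(f"Estimated run cost: ${run_cost(8, 72, 2.50):,.2f}")  # $1,440.00
```

Multiplying this by the number of iterations you expect during training quickly shows whether a run belongs on owned hardware or in the cloud.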
The Scan Cloud Difference
Using the cloud for AI that rapidly scales can be daunting. Scan makes it simple.
Our cloud platforms are specifically designed to accelerate GPU-demanding applications, rather than adapted from general purpose systems. Additionally, workload specialists will guide you through every stage, from initial proof of concept right through to deployed solution.
The power of an NVIDIA GPU, anywhere.
Get Cloud Computing in 3 Simple Steps
1. Browse our available options or connect with a specialist to discuss a bespoke solution.
2. We'll provision your environment and take you through a guided onboarding process.
3. You're online! Set up, get to work, and access support anytime.
RAG and fine-tuning your model
If your model is based on a foundation model (FM) or on previous work of your own, it will require project-specific training to improve its accuracy. Retrieval Augmented Generation (RAG) is a technique for retrieving additional data and combining it with the original query, to provide greater context for language models. Fine-tuning is a technique that alters some or all of the model weights, using a new dataset, to better fit a specific task. During fine-tuning, examples from the dataset are passed through the model and its outputs collected; backpropagation then calculates the gradient of the loss between the model's actual and expected outputs, and the model's parameters are updated to reduce that loss using gradient descent or an adaptive learning-rate algorithm. This is repeated for multiple epochs until the model converges.
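To make the RAG idea concrete, here is a minimal, self-contained sketch: it retrieves the stored documents most similar to a query and prepends them to the prompt. The embed() function is a deliberately crude stand-in (a hashed bag-of-words) for a real embedding model, and in practice the final prompt would be passed to your language model of choice.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy hashed bag-of-words embedding - swap in a real embedding model."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents with highest cosine similarity to the query."""
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in docs])
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

docs = [
    "Model training is the most expensive phase of an AI project.",
    "RAG combines retrieved context with the original query.",
    "Fine-tuning updates model weights using a new dataset.",
]
query = "How does RAG provide context to a language model?"
context = "\n".join(retrieve(query, docs))
# The augmented prompt gives the language model extra context to draw on.
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```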
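And a minimal sketch of the fine-tuning loop just described, written against PyTorch. It assumes model is a pretrained torch.nn.Module for a classification-style task and loader is a DataLoader yielding (inputs, targets) batches from your project dataset; AdamW stands in here for the adaptive learning-rate algorithm.

```python
import torch
import torch.nn as nn

def fine_tune(model: nn.Module, loader, epochs: int = 3, lr: float = 1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # adaptive learning rate
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            outputs = model(inputs)           # forward pass: collect outputs
            loss = loss_fn(outputs, targets)  # loss between actual and expected
            loss.backward()                   # backpropagation: gradient of the loss
            optimizer.step()                  # update parameters to reduce the loss
        print(f"epoch {epoch + 1}: last batch loss {loss.item():.4f}")
    return model
```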
Read our 7-part AI Project Planning Guide
- Part 1 - Where do I start?
- Part 2 - Setting Expectations
- Part 3 - Data Preparation
- Part 4 - Model Development
- Part 5 - Model Training
- Part 6 - Model Integration
- Part 7 - Governance