SCAN AI AI PROJECT PLANNING PART 3 OF 7 - DATA PREPARATION
The last part of this guide dealt with setting expectations within your organisation. We covered the main stages of an AI project and how you should expect to allocate time and costs between them. This part will focus on the first ‘hands-on’ stage - data preparation.
Data is key to any AI project, and the scope of the project has a direct impact on the data required. The larger the project, the more data it needs and the more preparation that data demands. Quality matters most of all: garbage in results in garbage out.
Why good data is important
AI is a field of computer science that aims to create intelligent agents capable of reasoning, learning and autonomous action. The quality of the data used to train and test AI models determines how reliable, accurate and ethical those agents are, and lays the foundation for effective learning, bias mitigation, efficiency, generalisability and improved performance. The volume of data required is task dependent: for example, a predictive maintenance model requires much less data than an LLM designed to generate new content.
The more data you are working with, the longer it takes to structure, clean and engineer it sufficiently to develop a model that could be deployed internally or commercially. Every generation of GPU promises significant performance increases over the last, shortening time to results with growing numbers of parallel processing cores and more advanced memory configurations, but it all counts for nothing if the data is of poor quality.
Poor data quality covers a number of issues. Incompleteness jeopardises prediction models, for example running credit checks with income data missing. Inconsistencies, such as an age that does not match a birthdate, necessitate data cleaning before analysis. Outliers, which often result from input errors, can skew results and distort patterns. Duplicates can lead to overrepresentation of data points, introducing bias. Insufficient volume means there is simply not enough data to train the model effectively. When starting any AI project, the technical team need to keep their leaders apprised of the state of the data and the realistic timeframes needed to prepare it for effective use. Similarly, leaders need to understand that time taken here will benefit the outcomes: it is estimated that around 60% of the time spent developing AI models is spent at this data preparation stage. Additionally, input from subject matter experts is key to data preparation, as technical teams may not necessarily understand the impact and weight of any given attribute, or its importance in the final model.
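To make these issues concrete, the sketch below audits a handful of records for each of them. It is a minimal illustration rather than a recommended tool: the records, the field names, the `audit` function, the `min_rows` threshold and the outlier cut-off are all assumptions made for the example, and it uses only the Python standard library.

```python
from collections import Counter
from datetime import date
from statistics import median

FIELDS = ("id", "birthdate", "age", "income")

# Hypothetical customer records, invented purely for illustration.
records = [
    {"id": 1, "birthdate": date(1990, 5, 1), "age": 34, "income": 42000},
    {"id": 2, "birthdate": date(1985, 3, 12), "age": 25, "income": 38000},    # age contradicts birthdate
    {"id": 3, "birthdate": date(1978, 11, 30), "age": 45, "income": None},    # income missing
    {"id": 4, "birthdate": date(1995, 7, 19), "age": 29, "income": 4100000},  # probable input error
    {"id": 1, "birthdate": date(1990, 5, 1), "age": 34, "income": 42000},     # duplicate of the first row
    {"id": 5, "birthdate": date(1982, 1, 15), "age": 42, "income": 40000},
    {"id": 6, "birthdate": date(1999, 10, 10), "age": 24, "income": 45000},
]


def age_on(birthdate, as_of):
    """Whole years between birthdate and as_of."""
    before_birthday = (as_of.month, as_of.day) < (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - before_birthday


def audit(records, as_of, min_rows=1000, outlier_cutoff=10.0):
    """Report the ids affected by each data quality issue.

    min_rows and outlier_cutoff are arbitrary assumptions; sensible values
    depend on the task and should be agreed with subject matter experts.
    Assumes at least one record has an income.
    """
    incomes = [r["income"] for r in records if r["income"] is not None]
    mid = median(incomes)
    # Median absolute deviation: a spread measure that is not itself
    # distorted by the outliers we are trying to find.
    mad = median(abs(x - mid) for x in incomes)
    counts = Counter(tuple(r[f] for f in FIELDS) for r in records)
    return {
        "missing_income": [r["id"] for r in records if r["income"] is None],
        "age_mismatch": [
            r["id"] for r in records if r["age"] != age_on(r["birthdate"], as_of)
        ],
        "income_outliers": [
            r["id"]
            for r in records
            if r["income"] is not None
            and mad > 0
            and abs(r["income"] - mid) / mad > outlier_cutoff
        ],
        "duplicated_rows": [key[0] for key, n in counts.items() if n > 1],
        "too_few_rows": len(records) < min_rows,
    }


report = audit(records, as_of=date(2024, 9, 1))
print(report)
```

On this sample the report flags record 3 for missing income, record 2 for an age that does not match its birthdate, record 4 as an income outlier and record 1 as duplicated, and it marks the dataset as too small. A report of this kind is also a simple way for the technical team to show leaders the state of the data.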
What does data preparation involve?
Data preparation is often seen as a boring task, but don’t fall into the trap of seeing this as ‘menial work’ not worthy of investing time and resource into, as this stage will form the bedrock of future success. Data preparation is a multi-step process that involves data collection, structuring, cleaning, feature engineering, and finally labelling.
[Figure: Collection, Structure, Cleaning, Labelling, Engineering, Training]
Once you have collected the data from relevant and varied sources, you begin preparation by structuring it. This involves defining the relationships between different data elements, for instance by using a customer ID to link account and order information, which enables efficient storage and retrieval of data. Next, data cleaning identifies and corrects errors, inaccuracies and inconsistencies by infilling missing data, addressing inconsistent formatting, using statistical techniques to adjust outliers and removing duplicates. While structuring and cleaning are ways of refining data, the next process, feature engineering, creates new attributes to enhance it. Feature engineering allows you to define the most important information in your dataset and use domain expertise to get the most out of it. This might involve scaling, one-hot encoding, binning and time series features that capture temporal patterns. Finally, data labelling signposts the differing types of data in your dataset, specifying which parts of the data the AI model will learn from, for example labelling noise in podcasts so that the model can ignore it. Though improvements in unsupervised learning have resulted in AI projects that do not require labelled data, many systems still rely on labelled data to learn and perform their given tasks.
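The structuring, cleaning and feature engineering steps described above can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the `accounts` and `orders` tables, the field names, the bin edges and the choice of median infilling are hypothetical, and outlier adjustment and labelling are left out for brevity.

```python
from datetime import date
from statistics import median

# Hypothetical source tables, invented for illustration.
accounts = {
    "C1": {"region": "north"},
    "C2": {"region": "South "},  # inconsistent formatting
}
orders = [
    {"customer_id": "C1", "placed": date(2024, 3, 4), "value": 120.0},
    {"customer_id": "C1", "placed": date(2024, 3, 4), "value": 120.0},  # duplicate
    {"customer_id": "C2", "placed": date(2024, 3, 9), "value": None},   # missing value
    {"customer_id": "C2", "placed": date(2024, 3, 10), "value": 80.0},
    {"customer_id": "C1", "placed": date(2024, 3, 11), "value": 200.0},
]


def structure(accounts, orders):
    """Link each order to its account through the shared customer ID."""
    return [{**order, **accounts[order["customer_id"]]} for order in orders]


def clean(rows):
    """Normalise text formatting, drop exact duplicates, infill missing values."""
    seen, unique = set(), []
    for row in rows:
        row = {**row, "region": row["region"].strip().lower()}
        key = (row["customer_id"], row["placed"], row["value"], row["region"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # Median infilling is one simple choice among many (an assumption here).
    fill = median(r["value"] for r in unique if r["value"] is not None)
    return [{**r, "value": fill if r["value"] is None else r["value"]} for r in unique]


def engineer(rows, bin_edges=(100.0, 150.0)):
    """Derive model-ready attributes from the cleaned rows."""
    values = [r["value"] for r in rows]
    lo, hi = min(values), max(values)
    regions = sorted({r["region"] for r in rows})
    engineered = []
    for r in rows:
        out = {
            # Min-max scaling to the range 0..1.
            "value_scaled": (r["value"] - lo) / (hi - lo) if hi > lo else 0.0,
            # Binning: 0 = low, 1 = medium, 2 = high, given the assumed edges.
            "value_bin": sum(r["value"] >= edge for edge in bin_edges),
            # A simple temporal feature: day of week, Monday = 0.
            "placed_weekday": r["placed"].weekday(),
        }
        # One-hot encoding of the categorical region attribute.
        for region in regions:
            out[f"region_{region}"] = int(r["region"] == region)
        engineered.append(out)
    return engineered


features = engineer(clean(structure(accounts, orders)))
for row in features:
    print(row)
```

Keeping each stage as its own function means a subject matter expert can review one decision at a time, such as where the bin edges fall or how missing values are filled, without reading the whole pipeline.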
The data preparation tasks carried out at this stage, and the investments made in toolkits or applications, will help ensure you don’t end up in a development loop where no outcome is good enough to proceed to training, effectively ending the project. Conversely, mistakes at this stage can poison future results and may compromise the project.
How long does it take?
As we indicated in part two of this guide, data preparation is a significant stage of an entire AI project, but the time it takes will depend on the resource you apply to it. A common mistake is to assume that data preparation is not worthy of the time of your best data scientists, who should instead be concentrating on the ‘real AI work’. However, your most skilled people will be invaluable at sanity checking prep work already carried out, spotting mistakes or improving attributes, and increasingly at deploying AI to curate data for other AI, including the production of synthetic data. The involvement of top engineers will have a positive impact down the line and save time, and potentially costs, later.
Is help available?
There are numerous tools and services that can aid with data structuring, cleaning, and labelling, whilst feature engineering applications help ensure you are using the right data in the first place, spotting trends and patterns that may not be immediately obvious.
However, choosing the best tools to suit your data may not be obvious, so partnering with an expert AI organisation offering data scientist consultancy may be a key investment at this stage. Scan has a comprehensive suite of professional services designed for every stage of an AI project.
Read our 7-part AI Project Planning Guide
- Part 1 - Where do I start?
- Part 2 - Setting Expectations
- Part 3 - Data Preparation
- Part 4 - Model Development
- Part 5 - Model Training
- Part 6 - Model Integration
- Part 7 - Governance