Blog · 04 Mar 2021

The essential foundations for robust machine learning and AI

Effective Artificial Intelligence (AI) involves tackling challenges around data, infrastructure and models. Here’s how to build capability to deal with these factors.

Lorenzo Bavasso
Director, Data Analytics and AI

Today’s AI is sometimes breathtaking in its capability, and yet, in the blink of an eye, can exhibit behaviour ranging somewhere from infuriating to downright dangerous.

We all want to be able to extract value from AI in our businesses, but how do we do this safely with such an apparently fickle technology?

AI needs strong foundations

The attraction of the end goal is often overwhelming. In the race to deploy AI and deliver benefit, it’s easy to overlook the foundational, but often ‘unglamorous’ work that needs to take place if we want to arrive at solutions that simultaneously do what we want, don’t create uncontrolled risks and don’t have hidden side effects or weaknesses.

Data is the fuel of machine learning (which constitutes the bulk of today’s AI), but it doesn’t get the recognition and attention it deserves. Many organisations depend on their ability to harness data for operational purposes and to conduct near real-time and offline analytics. So, they might be forgiven for thinking they’d be in good shape when it comes to exploiting that data for AI. Unless they’re very lucky, though, they may well be wrong. Why? Because AI really is different – it makes new demands and brings new challenges.

Data is critical to effective AI

One of those challenges is that today’s AI is very hungry – specifically, data hungry. And not just for any old data. Building useful, robust and safe models requires the creation of substantial quantities of labelled training data. Because AI models learn by example, training data must contain the examples together with the ground truth. The examples are often readily available or easy to get hold of, such as text from chat logs or word-for-word agent notes for a Natural Language Processing task. Getting the labels is more difficult. For an object detection model, every image has to be examined and a labelled polygon drawn around each object that belongs to a class of interest. Typically, this requires manual labour and can be costly and time-consuming. And it shouldn’t be done in a rush. The quality of the labelling will be critical in determining the quality of any model derived from it.
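To make the labelling task concrete, here is a minimal sketch of what one labelled image might look like as a record, together with a basic sanity check. The class names, field names and the `find_label_problems` helper are illustrative assumptions, not a standard format; real labelling tools define their own schemas.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class LabelledPolygon:
    label: str                         # class of interest, e.g. "car" (hypothetical)
    points: List[Tuple[float, float]]  # polygon vertices in pixel coordinates


@dataclass
class LabelledImage:
    image_id: str
    width: int
    height: int
    polygons: List[LabelledPolygon] = field(default_factory=list)


def find_label_problems(image: LabelledImage, allowed_labels: Set[str]) -> List[str]:
    """Return human-readable problems found in one labelled image."""
    problems = []
    for i, poly in enumerate(image.polygons):
        if poly.label not in allowed_labels:
            problems.append(f"polygon {i}: unknown label '{poly.label}'")
        if len(poly.points) < 3:
            problems.append(f"polygon {i}: fewer than 3 vertices")
        for x, y in poly.points:
            if not (0 <= x <= image.width and 0 <= y <= image.height):
                problems.append(f"polygon {i}: vertex ({x}, {y}) outside image")
                break  # report at most one out-of-bounds vertex per polygon
    return problems
```

Automated checks like this only catch mechanical errors. Whether a polygon actually surrounds the right object still needs human review.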

All of this requires infrastructure. Training data needs to be assembled, curated and made discoverable. You need to understand where it came from, how it was collected, and which AI systems have consumed it. The answers to these questions will determine an organisation’s ability to assure the quality of its AI training data; to understand the potential biases and other hazards that it incorporates; and to trace its usage. If problems are discovered in training data, it’s vital that you can identify affected AI systems, and fix any issues.
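As an illustration of the traceability described above, the following sketch records where each dataset came from and which models were trained on it, so that affected systems can be listed when a problem is found. The `ProvenanceRegistry` class and its method names are hypothetical; a production system would more likely use a database or a dedicated data catalogue.

```python
from collections import defaultdict


class ProvenanceRegistry:
    """In-memory record of dataset origins and the models that consumed them."""

    def __init__(self):
        self._datasets = {}                  # dataset_id -> origin metadata
        self._consumers = defaultdict(set)   # dataset_id -> names of models trained on it

    def register_dataset(self, dataset_id, source, collection_method):
        self._datasets[dataset_id] = {
            "source": source,
            "collection_method": collection_method,
        }

    def record_usage(self, dataset_id, model_name):
        # Refuse to link a model to data of unknown origin.
        if dataset_id not in self._datasets:
            raise KeyError(f"unregistered dataset: {dataset_id}")
        self._consumers[dataset_id].add(model_name)

    def origin(self, dataset_id):
        return self._datasets[dataset_id]

    def affected_models(self, dataset_id):
        """Models to review if a problem is discovered in this dataset."""
        return sorted(self._consumers.get(dataset_id, set()))
```

For example, after registering a dataset and recording two models against it, `affected_models` returns both names, which gives a starting point for retraining or review.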

Moving from cottage industry to factory

It also needs forethought and planning. Today, we have cottage industries – everyone collecting their own training data, often in a rush; no standard processes for labelling and quality checking; little management of provenance or traceability between data and the AI that is derived from it; limited re-use of an expensively created asset. What we need is a factory – an organised capability to build and curate data for training AI; to ensure it’s of known origin and quality; and to maximise safe re-use. In an ideal world, the ‘factory’ would exist before undertaking an AI development. This requires organisations to think upfront about the applications of value to them and start collecting the data that will be necessary to support them.

Is all this work really necessary? Let’s take a simple example from the field of machine vision. State-of-the-art object detection models are often trained on data drawn from ImageNet – a data store of around 14 million hand-labelled images. A typical training set might consist of around one million images, together with separate validation and test sets totalling 150,000 images. Suppose you want to customise a model, pre-trained on ImageNet, to support detection of some new classes of object. You retrain with a new, but much smaller, set of images – perhaps five to ten thousand. Now some questions…

  • What guarantees would you be willing to make about a model built, in part, on data whose labelling quality you know nothing about?
  • If you’ve customised a pre-trained model, what can you say – if anything – about the provenance of the upstream training data?
  • Can you be certain that your custom test images have never been ‘seen’ by your model before?

If you don’t have satisfactory answers to these questions, it’s likely you’ll end up with behaviour from your AI systems you can’t account for, and performance you don’t understand.
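The last question, whether test images have been seen before, can be partly checked when you hold both sets of data. The sketch below compares content hashes of test items against training items. The function names are hypothetical, and the approach only finds exact byte-for-byte duplicates: a resized or re-encoded copy of the same picture would not be detected, and it cannot say anything about upstream training data you do not have access to.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()


def find_overlap(train_items, test_items):
    """Return sorted names of test items whose bytes also appear in the training set.

    Both arguments map an item name to its raw bytes.
    """
    train_hashes = {content_hash(data) for data in train_items.values()}
    return sorted(
        name for name, data in test_items.items()
        if content_hash(data) in train_hashes
    )
```

An empty result is reassuring but not a guarantee, for the reasons given above.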

A successful future depends on building capability

The reality is that modern AI is complex, with many moving parts. Understanding where data came from, what’s happened to it and how it’s being used is not trivial. Maintaining the links between data, derived models and deployed AI systems is a genuine challenge, and one that many organisations aren’t set up to handle.

Being successful will need a deliberate effort to create the capability to meet these challenges. An ad hoc approach might work for isolated examples, but it will quickly unravel at any scale. If you want to take advantage of AI, you need to start building this capability now.

To find out more about how we can help you prepare for successful AI, please get in touch with your account manager. We’re here to help.
