The secret life of algorithms

September 03, 2019 Jon Evans, Big Data

Whatever machine learning model is used for AI, it must be trained with data.

The way you train your AI varies, but it usually comes down to vast amounts of repetition until the AI begins to identify the most appropriate response – except there’s a snag.

AI is only as good as the information it receives.

Supervised learning means teaching AI by using huge quantities of data that have already been sorted and labeled by humans, often recruited through online platforms like Mechanical Turk.

Once the data is sorted and labeled, it becomes possible to furnish AI models with millions of images (Google’s Open Images Dataset has nine million of them, ImageNet holds 14 million), some of which contain a picture of a cat to teach it how to recognize when a cat is in the picture. Once your model begins to recognize images that contain a cat, it will become able to recognize felines in new pictures humans haven’t labeled yet.
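As a rough illustration of learning from labeled examples, here is a minimal sketch. It assumes each image has already been reduced to two made-up numeric features; real image models learn from pixels and are far larger. The function names, feature values and the nearest-centroid approach are all hypothetical choices for this example, not the method any of the datasets above prescribes.

```python
from collections import defaultdict
from math import dist


def train_centroids(examples):
    """Average the feature vectors seen for each human-supplied label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: tuple(total / counts[label] for total in totals)
        for label, totals in sums.items()
    }


def predict(centroids, features):
    """Pick the label whose average example is closest to the new item."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical features per image: (ear pointiness, whisker score).
labeled = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.7, 0.7), "cat"),
    ((0.2, 0.1), "no cat"),
    ((0.1, 0.3), "no cat"),
    ((0.3, 0.2), "no cat"),
]

model = train_centroids(labeled)
# A new picture that no human has labeled yet.
print(predict(model, (0.85, 0.75)))
```

The model never sees a label for the final picture; it relies entirely on what the labeled examples taught it.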

Unsupervised learning looks for patterns in data that hasn’t been labeled, such as groups of similar items (Google News uses this to curate groups of similar stories).
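One common way to find similar groups without labels is k-means clustering. The sketch below is a simplified one-dimensional version; it assumes each story has been reduced to a single made-up "topic score" and that the starting centers are chosen by hand. It is not a description of how Google News works.

```python
def k_means_1d(values, centers, rounds=10):
    """Group numbers around the nearest center, then move each center
    to the average of its group, and repeat."""
    groups = [[] for _ in centers]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Keep the old center if a group ends up empty.
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]
    return centers, groups


# Hypothetical topic scores for six stories; no labels are supplied.
scores = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, groups = k_means_1d(scores, centers=[0.0, 6.0])
print(groups)
```

The algorithm is never told what the groups mean; it only discovers that the scores fall into two bunches.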

There’s a third way, Reinforcement Learning, in which the system attempts to find the best solution to a problem by trial and error – this is the kind of tech used by Google DeepMind’s Deep Q network.
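To make trial and error concrete, here is a minimal sketch of tabular Q-learning, a much simpler relative of a Deep Q network (which replaces the table with a neural network). The corridor environment, the reward of 1 at the goal and the parameter values are assumptions made up for this example.

```python
import random


def q_learning_corridor(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values for a corridor where only the last cell pays a reward."""
    rng = random.Random(seed)
    goal = n_states - 1
    moves = (-1, +1)  # action 0 steps left, action 1 steps right
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            action = rng.randrange(2)  # try something at random
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            best_next = 0.0 if next_state == goal else max(q[next_state])
            # Nudge the estimate toward the reward plus discounted future value.
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q


q_table = q_learning_corridor()
# Read off the best-looking action for every non-goal cell.
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)
```

Nobody tells the agent that walking right is the answer; the preference emerges from the rewards it stumbles on while exploring.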

Insight on algorithms

Every machine learning model is based on algorithms. Some of the most widely used algorithms at this time include:

Linear regression: This predicts a number by fitting a straight-line relationship to relatively consistent information. If you know you walk five miles an hour and you just walked for two hours, you’ve probably traveled 10 miles
Logistic regression: This estimates the probability that something falls into one of two categories. A scientist may have figured out enough of the differences between tumors to estimate which ones are malignant based on the general data available about them. These estimates aren’t completely accurate; they merely provide a means of defining the likelihood of an event
Linear discriminant analysis: This is like logistic regression but is capable of handling multiple classes of data in order to predict the probability of events
Decision trees: These are widely used algorithms that employ logical progression to reach decisions. Each node represents an attribute, each link a decision and each leaf an outcome
Naive Bayes: This is a probabilistic model that weighs the evidence from multiple predictors, treating each one as independent of the others, to estimate the most likely outcome. It’s the kind of algorithm used in spam filters or recommendation systems. For example, the algorithm may notice you like certain songs or styles of music and will then attempt to predict others you may enjoy based on characteristics it sees in those songs
K-nearest neighbors: This supervised machine learning algorithm classifies a new item by looking at the labeled examples that sit closest to it. For example, it may notice that you like certain songs and will then figure out which other songs you will also like based on what it has been taught about those songs. The challenge with this is that the algorithm keeps the entire data set, so as the quantity of data grows it needs more memory and its predictions slow down
Learning vector quantization: This algorithm could be seen as attempting to harness the power of KNN (above) in a smaller engine. Rather than hanging onto the entire data set, this artificial neural network algorithm lets you decide how many representative examples to keep as it evolves and then makes predictions based on any given value’s proximity to the examples it holds
Support vector machines: These supervised learning models try to divide any set of results into two sets along a notional hyperplane, a decision boundary that helps classify the information. Support vectors are those data points that sit closest to the hyperplane the model identifies. The AI then attempts to define and act on data in accordance with its position vis-à-vis the hyperplane. It may also attempt to classify new data within the model by working out where it sits in relation to those support vectors
Bagging and random forest: These are ensemble methods that combine the predictions of multiple machine learning models. Bootstrap aggregation (bagging) trains each model on a different random sample of the data and averages the results, which reduces the variance of the predictions. The idea is that by doing so the AI can use multiple models to help it deliver more accurate results. Random forest attempts to improve on standard bagging by combining many slightly different decision trees. Each decision tree reflects its own random subset of the data and its attributes. The idea is that this creates a more dependable average result
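Two of the entries above are small enough to sketch in a few lines. The first function fits a straight line by ordinary least squares, using the walking example; the second is a bare-bones k-nearest neighbors vote. The song features, labels and function names are invented for illustration, and production libraries handle many details this sketch ignores.

```python
from collections import Counter
from math import dist


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predict(labeled_points, query, k=3):
    """Label a new point by majority vote among its k closest labeled points."""
    nearest = sorted(labeled_points, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Linear regression: hours walked versus miles covered at five miles an hour.
hours = [1, 2, 3, 4]
miles = [5, 10, 15, 20]
slope, intercept = fit_line(hours, miles)
print(slope * 2 + intercept)  # estimated miles after two hours

# KNN: hypothetical (tempo, energy) features for songs a listener rated.
songs = [
    ((0.9, 0.8), "like"),
    ((0.8, 0.9), "like"),
    ((0.85, 0.7), "like"),
    ((0.2, 0.3), "skip"),
    ((0.1, 0.2), "skip"),
    ((0.3, 0.1), "skip"),
]
print(knn_predict(songs, (0.8, 0.8)))
```

Note that `knn_predict` sorts the whole data set for every query, which is exactly why the approach slows down as the data grows.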

Algorithms are yet another part of this rapidly evolving industry that is on an accelerated innovation path. New algorithms are appearing fast. At every part of the process, there’s a need for data.

Where does the data come from?

“We’re now in the golden age of AI, where advancements come from voluminous sets of data, new algorithms being created, computing power and the ability to do this in the cloud at scale,” Mike Quindazzi, Managing Director at PricewaterhouseCoopers, told Forbes in a discussion on the impact of AI on procurement.

Within the process, there are two forms of data: training data and in-use data. The first tends to come from vast stacks of (sometimes) public information, private information and corporate data gathered during normal business.

This is the kind of information that may be used to train machines in the first instance – and the quantity of the data required for that training is vast. Microsoft used five years of continuous speech data to teach computers to talk, while Tesla is using 1.3 billion miles of driving data to teach cars how to drive themselves.

The second data stack reflects decisions and learning made by AI once it is in use – in some cases this information ends up in the cloud, in others (such as the Apple model) the data is heavily encrypted and anonymized, and little of it is stored anywhere except on the end device. And, in most cases, this kind of data is subject to data protection of some kind.

“Garbage in, garbage out”

Bad information leads to bad results. Flawed data means bad decisions. That’s bad when it comes to YouTube recommendation engines, but potentially fatal when applied to automated AI-controlled mass transit or energy supply systems.

Speaking at the GeekWire CloudTech Summit in 2018, Apple’s former Director of Machine Learning and AI (and Siri co-inventor), Carlos Guestrin, warned of the danger of poor data. He cited the Shirley cards that photo processing companies used until the 1990s to get the exposure right when printing images. These cards all depicted light-skinned women, which meant companies did a poor job printing images of darker-skinned people. “The choice of data implicitly defines the user experience,” he explained.

“Many studies have shown that if you just train an ML system from data that is randomly selected from the Web, you end up with a system that is racist, misogynist and sexist, and that’s just a mirror to our society – it’s not enough just to think about the data that we use but also how that data reflects our culture and values that we aspire to,” he said.

He's not alone.

“Algorithms make decisions we teach them to make, even deep-learning algorithms,” said Libby Hemphill, Computer Science Professor at the University of Michigan. “They should pick different winners on purpose.”

The bottom line?

If you feed bad information into your AI, you’ll end up with bad results. The industry even has a phrase for it: “Garbage in, garbage out.”
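The effect is easy to reproduce on a toy scale. In the sketch below, the same simple learner is trained twice: once on correctly labeled numbers and once on a copy in which two labels are wrong. The data, the labels and the midpoint-threshold learner are all made up for this illustration.

```python
def learn_threshold(examples):
    """Place the decision boundary halfway between the two class averages."""
    low = [value for value, label in examples if label == "low"]
    high = [value for value, label in examples if label == "high"]
    return (sum(low) / len(low) + sum(high) / len(high)) / 2


def classify(threshold, value):
    return "high" if value > threshold else "low"


clean = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
# Same measurements, but 7 and 8 were mislabeled as "low".
garbage = [(1, "low"), (2, "low"), (3, "low"), (7, "low"), (8, "low"), (9, "high")]

clean_threshold = learn_threshold(clean)
garbage_threshold = learn_threshold(garbage)

print(classify(clean_threshold, 6))    # learned from good labels
print(classify(garbage_threshold, 6))  # learned from bad labels
```

The algorithm is identical in both runs; only the labels differ, and the bad labels drag the boundary far enough to flip the decision for a value of 6.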

The limitations of AI

There are big differences between the analytical and theoretically unbiased intelligence you’ll find in AI systems and the way humans think.

While machines are good at making decisions, humans are better at understanding the wider consequences of decisions and at putting them through a moral/ethical framework. In addition, when it comes to interacting with intelligent machines, people will still prefer to interact with people.

This is leading the AI industry to recognize the need for soft skills, such as human empathy and creative problem solving, as well as the base technology and engineering skills most enterprises need – even as demand for those skills grows. “We're going to be dependent on our instincts and our individual gifts,” former Apple Retail VP Angela Ahrendts told RBC.

The skills shortage means some of the world’s biggest companies are becoming heavily involved in providing relevant education – many (Google, Apple, Microsoft) offer or sponsor free coding lessons under the “Hour of Code” umbrella.

The ethics of AI

As AI enters wider use, we’re encountering unexpected ethical problems, such as: “Who is to blame if an autonomous vehicle has an accident – the owner of the vehicle, the manufacturer, the software developer, or the government that let such machines onto the roads?”

There is also growing recognition that while intelligent machines may help make humans more productive, they may also create new problems.

What happens to workers displaced by AI systems? Who should pay for their re-education for new roles? Employers who benefit from the efficiencies of such automation, or the wider society that feels the impact of employment reduction? And who pays for the infrastructure that products are distributed on?

Another AI challenge is the lack of a decision trail. Decisions that seem evident to AI systems may not be at all evident to humans, and until there’s a clear record to show how a decision was reached, it will remain hard or impossible to assess where errors crept into the AI decision-making process. “When we can look into the models and understand how they work, we can use the tool for real world problems. We simply want to know why the algorithms suggest a solution,” said Ulf Schönenberg, Head of Data Science at The unbelievable Machine Company (*UM), part of Basefarm.
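Some models make such a record easy to keep. A decision tree, for instance, can log every test it applies on the way to an outcome. The sketch below assumes a tiny hand-written tree with invented attributes and thresholds; it is not taken from any real vehicle system, and many modern models (deep neural networks in particular) offer no trail this simple.

```python
def decide(node, sample, trail=None):
    """Walk a decision tree and record each test applied along the way."""
    trail = [] if trail is None else trail
    if "outcome" in node:  # a leaf holds the final outcome
        return node["outcome"], trail
    attribute, threshold = node["attribute"], node["threshold"]
    went_low = sample[attribute] <= threshold
    comparison = "<=" if went_low else ">"
    trail.append(f"{attribute}={sample[attribute]} {comparison} {threshold}")
    return decide(node["low"] if went_low else node["high"], sample, trail)


# Hypothetical tree: each node tests an attribute, each branch is a decision.
tree = {
    "attribute": "distance_m",
    "threshold": 20,
    "low": {
        "attribute": "speed_kmh",
        "threshold": 10,
        "low": {"outcome": "coast"},
        "high": {"outcome": "brake"},
    },
    "high": {"outcome": "continue"},
}

outcome, trail = decide(tree, {"distance_m": 12, "speed_kmh": 40})
print(outcome)
for step in trail:
    print(" because", step)
```

With a trail like this, a reviewer can see which test led to the outcome and check whether the threshold behind it was sensible.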

Sinister implications are also emerging, such as photorealistic images that purport to show people in compromising positions or machine-made “deepfake” videos that seem to show politicians making statements they may not have made.

This is the third blog in a four-part series about how AI works, what data it needs and what happens when AI goes wrong. The other articles are: Everything you always wanted to know about AI (but were afraid to ask), Food for thought: why AI needs good data, and From supercomputers to smartphones: where is AI today?

Jon Evans

Jon Evans is a highly experienced technology journalist and editor. He has been writing for a living since 1994. These days you might read his regular Computerworld AppleHolic and opinion columns. Jon is also technology editor for men's interest magazine Calibre Quarterly, and news editor for MacFormat magazine, which is the biggest UK Mac title. He's really interested in the impact of technology on the creative spark at the heart of the human experience. In 2010 he won an American Society of Business Publication Editors (Azbee) Award for his work at Computerworld.