Software is still eating the world. And now it has sharper teeth
It has been a decade since Marc Andreessen boldly declared that software is eating the world. True to that declaration, software has been gaining ground in more and more industries as the preferred tool for problem-solving. A number of factors make software such a powerful problem-solving tool. To start with, modern computers are universal Turing machines - which means you can run any algorithm on them. Add to that the fact that you can, in effect, have unlimited storage for your data, and you have a tool that can precisely and repeatably carry out a pre-decided sequence of steps. As long as you can figure out - and code up - the exact sequence of steps needed to solve a problem, you can build a software solution for it. The increased adoption of cloud-based software has accelerated this trend by lowering the barriers to both developing and using software solutions.
However, as the market runs out of the proverbial low-hanging fruit, we start running into the limitations of software. In particular, we run into problems that are really difficult to "algorithmize" - so difficult that typical software development methods prove inadequate for them. Sometimes the algorithm is too complicated, with too many moving parts to be specified in advance. Sometimes the input is too complex, and some sort of intelligent pattern detection is needed in order to define meaningful algorithms on top of the data. Often it is both. Basically, software engineers were expected to encode more and more of the real world in their virtual world - and it was becoming more and more cumbersome to do so.
This was a situation tailor-made for Machine Learning. Aided by large volumes of digitized data, advances in hardware speeds, and some cool innovations in Machine Learning, we are now arriving at a new recipe for building software - one that enables us to solve problems that were beyond our reach only a decade or so back. We are going to look into this recipe in some detail, along with the tradeoffs one has to contend with while employing it. In this post, we will start with the basics of how learning algorithms work and the guarantees they provide. In subsequent posts, we will see how we can use this knowledge to build ML solutions in different scenarios - starting with a simplified version of the requirements and working our way up to more complicated ones.
The learning algorithm as a problem-solving tool
Before we proceed, let's clarify some things. One is that there are a number of different algorithms that come under the umbrella of machine learning. For the purpose of building software solutions that solve specific problems, we will mostly be interested in what is called supervised machine learning. This is when we have access to labeled data - which can be thought of as pairs of inputs and outputs of an unknown function that we wish to learn. This is not to say that unsupervised and even reinforcement learning algorithms cannot be used for problem-solving [1]. But supervised learning is by far the most common and also the most direct way to model a large class of real-life ML problems.
Another assumption we shall make is that we are training and using our ML models in a batch setting. In this setting, the ML model has two distinct modes of operation - training and inference - and, conceptually, a model can only be in one mode at a time. We train/retrain the model on training datasets in training mode and then deploy it in inference mode to be used for prediction. This can be contrasted with online learning, where a model is trained as it is being used. Most real-world deployments are batch deployments - primarily because a number of problems can be reasonably cast into the batch learning setting, but also because we have a better understanding of the error bounds in that setting.
With these out of the way, let's take a look under the hood to see what we have got to work with here. What does it mean to train an ML model? And what, if any, guarantees do supervised learning algorithms give us?
Ok, let us make this a little concrete. Say the task we want to solve is categorizing tweets into two categories - those that talk about a disaster and those that do not. Also, let us assume we have access to a training dataset consisting of a significant number of tweets, each labeled 1 if it talks about a disaster and 0 otherwise. What we are interested in is a function that takes any tweet as input - whether or not it is in the dataset we have - and returns its label as output. The supervised learning approach to this problem is as follows (a short code sketch follows the list):
a) Start with a hypothesis class of functions that is big enough that it is likely to contain the actual function we are interested in finding. (What is a hypothesis class, you say? Well, for our purposes, it is a compact representation of an entire family of functions, each distinguished from the others by a set of numbers called parameters. For example, ax + by = 0 defines a family of functions over the variables x and y, where the behavior of each member depends on the values of a and b it corresponds to.)
b) Choose a loss function that computes how close the predicted output of the current guess for the best function is to the expected output. The closer the predicted output is to the actual output, the lower the loss.
c) Run an optimization algorithm that iteratively tries to minimize the above-defined loss function over the training dataset till some stopping criterion is reached.
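Here is a minimal, purely illustrative sketch of this recipe for the tweet example, using scikit-learn: the hypothesis class is linear classifiers over TF-IDF features, the loss is the logistic loss, and the stopping criterion is the optimizer's own convergence tolerance. The `tweets` and `labels` variables are hypothetical stand-ins for a real labeled dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled data: tweet texts with 1 = disaster, 0 = not.
tweets = ["Forest fire spreading near the highway", "I am on fire at work this week"]
labels = [1, 0]

# a) Hypothesis class: linear classifiers over TF-IDF features (the weights are the parameters).
# b) Loss function: logistic loss, which LogisticRegression minimizes internally.
# c) Optimization + stopping: the solver iterates until its tolerance or max_iter is reached.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tweets, labels)                                   # training mode

print(model.predict(["Earthquake reported downtown"]))      # inference mode
```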
Different choices of hypothesis class, loss function, optimization algorithm, stopping criterion and even the data points on which the loss is evaluated lead to variations in the learning algorithm. For instance, if the stopping criterion is that the loss function on the training dataset goes below a certain threshold, then this becomes the Empirical Risk Minimization (ERM) algorithm. This may not sound like a great algorithm (especially if you have heard of overfitting), but the good thing is that we have mathematical bounds which show it will converge to the desired function as the training data grows in size. Indeed, if your training data consists of all possible data (e.g. all possible tweets in our example), then the best you can hope to do is achieve perfect accuracy on your training set. In this case, the problem of learning reduces purely to the problem of optimization.
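In the usual textbook notation, empirical risk minimization simply picks the hypothesis that minimizes the average loss over the n training pairs:

```latex
\hat{h} \;=\; \arg\min_{h \in \mathcal{H}} \hat{R}(h),
\qquad
\hat{R}(h) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big)
```

where \(\mathcal{H}\) is the hypothesis class, \(\ell\) is the loss function, and \((x_i, y_i)\) are the labeled training examples.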
Of course, in the real world, we can never hope to have a training set that covers all possible inputs. Hence, instead of blindly optimizing accuracy on the training set, we choose different loss functions and/or stopping criteria. For deep learning algorithms, for instance, the hypothesis class is often multi-layer ReLU networks, the loss function for classification is categorical cross-entropy, the optimization algorithm is usually Adam, while the stopping criterion is early stopping on validation performance.
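As a rough sketch of what those four choices look like in code (here using Keras, with hypothetical, already-vectorized inputs `X_train` and one-hot labels `y_train` standing in for real features extracted from tweets):

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: 1000 examples, 512 features each, 2 one-hot classes.
X_train = np.random.rand(1000, 512).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 2, size=1000), 2)

# Hypothesis class: a multi-layer ReLU network.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Loss function: categorical cross-entropy; optimization algorithm: Adam.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Stopping criterion: early stopping on validation performance.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.2, epochs=100,
          batch_size=32, callbacks=[early_stop])
```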
Whichever flavor of the supervised learning algorithm we choose, the supervised learning procedure gives us guarantees of the following kind: IF you were to sample new data points from the same data distribution as the data you trained the model on (i.i.d. sampling in data science speak), then you are likely to get an accuracy similar to what you got on the training/validation set. How tight this guarantee is depends on the number of training data points and the complexity of the hypothesis class you are searching in. In general, if your training data size is big enough, this bound is tight enough to be useful. [2] [3]
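To give a flavor of what such a guarantee looks like, here is one standard textbook form, stated for a finite hypothesis class with a loss bounded in [0, 1] (the bounds for richer classes discussed in [2] are more involved): with probability at least \(1 - \delta\) over the draw of \(n\) i.i.d. training examples, every \(h\) in the class \(\mathcal{H}\) satisfies

```latex
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2n}}
```

where \(R(h)\) is the error on the true data distribution and \(\hat{R}(h)\) the error on the training set. Notice how the gap shrinks as the amount of training data \(n\) grows and widens as the hypothesis class gets larger.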
The perceptive reader may have noticed the bold and capitalized IF in the guarantee statement above. In fact, it is almost never the case that the i.i.d. assumption holds exactly for a real problem. However, in the true scientific tradition of assuming cows to be spherical, we shall assume in the next post that it does, and then try to break down what building an ML solution would entail. We shall see that even in this simplified case, we can run into issues from the product point of view unless we are careful in designing our dataset creation and model validation processes.
[1] Indeed, the modern paradigm of problem-solving using ML makes use of a model pre-trained on unlabeled data before using supervised data to fine-tune it on a specific task. This has been found to be empirically more effective than starting from a randomly initialized model.
[2] The dependence on complexity is a little complicated. In general, the bound is tighter when the complexity of the hypothesis class is low. But prima facie this is not the case for neural networks, and hence these bounds become vacuous for neural nets. This has triggered research into alternative characterizations of neural net generalization bounds, as well as suggestions that the effective complexity of neural networks trained by standard optimization algorithms may be much lower than a naive count of their parameters would suggest.
[3] The emphasis on likely and similar is due to the fact that the kind of mathematical bounds these algorithms admit are called "probably approximately correct" (PAC) bounds. They take the form of statements like: with confidence at least 90% (the "probably" part), the true label is no more than 0.2 away from the predicted label (the "approximately" part).