Here we'll take a quick look at the basics of Machine Learning, so we can understand in more detail what exactly Neurons (and Keras) are doing behind the scenes.
Machine Learning is the field that studies mathematical techniques that let a computer learn things from data, without being explicitly programmed. There are three main ways for a computer to learn in Machine Learning:
Supervised Learning (the one we're using in Neurons): this is the type of learning where the data we feed our program already includes the true outputs we want our model to replicate. Think of it as trying to learn a subject by doing many multiple-choice tests, and then comparing your answers to the answer sheet.
Unsupervised Learning: the program tries to figure out patterns in the data by itself. In this type of learning, it's hard to know if the model is performing well or not, since it isn't given any comparison data.
Reinforcement Learning: very popular in AI research, this is when the model performs a sequence of actions and is rewarded (or punished) depending on whether those actions are desirable (or not). It's widely used for training AIs to play videogames, since we can use things like the game's score to point the model in the right direction.
To perform these different types of learning, the models are usually based on Neural Networks.
A Neural Network is a set of interconnected computational units (like the neurons in our brain) designed to recognize patterns in numerical data.
The main components of a Neural Network are the nodes (or neurons), the units of computation in our network. Each node receives inputs from our data or from previous nodes, processes them, and outputs a signal. Every node has a set of weights attached to it that amplify or suppress each input, based on that input's significance to the model's task. So, just like neurons in the brain, a node emits a signal (is activated) whenever the network 'thinks' that specific node is needed.
How does the model know when to activate a node? By computing the weighted sum of the node's inputs and passing it through the Activation Function. Therefore, a correct choice of this function is integral to the model's performance.
Here's a scheme of how nodes work in a Neural Network:
Scheme of a node in a Neural Network.
Example of an Activation Function: the Rectified Linear Unit (ReLU), defined as f(x) = max(0, x). If the weighted sum of the inputs is negative, the neuron isn't activated (it outputs zero).
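To make this concrete, here's a minimal sketch in plain NumPy (with made-up numbers, not anything Neurons actually runs) of what a single node computes: the weighted sum of its inputs plus a bias, passed through ReLU:

```python
import numpy as np

def relu(x):
    # ReLU: pass positive values through, clamp negative ones to zero
    return np.maximum(0.0, x)

def node_output(inputs, weights, bias):
    # Weighted sum of the inputs (plus a bias term),
    # passed through the Activation Function
    return relu(np.dot(weights, inputs) + bias)

# Made-up numbers, just to watch a node 'fire'
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.8, 0.2, 0.1])  # one weight per input
bias = -0.3

print(node_output(inputs, weights, bias))  # 0.1 -> the node activates
```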
A Layer is a row of nodes. When a network has layers between the input and output, it's called a Deep Neural Network, and we're in the realm of Deep Learning. Those middle layers are called Hidden Layers. In the simplest networks, all nodes in a given layer are connected to all nodes in the previous and next layers. Here are some examples:
Both networks shown above are Sequential (or Feedforward), because information moves in only one direction: from input to output. Many advanced applications of Machine Learning use networks with loops or recurrences, where information can be fed back to previous layers. One type of network that does this is the Recurrent Neural Network (RNN):
In Neurons, we'll be able to work with Sequential models only.
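For reference, this is roughly what a small Sequential network looks like when built directly in Keras (the layer sizes here are arbitrary, just for illustration):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small feedforward network: 3 inputs -> 2 hidden layers -> 1 output
model = keras.Sequential([
    keras.Input(shape=(3,)),                # 3 input attributes
    layers.Dense(8, activation='relu'),     # hidden layer 1
    layers.Dense(8, activation='relu'),     # hidden layer 2
    layers.Dense(1, activation='sigmoid'),  # output layer (a value from 0 to 1)
])

model.summary()  # prints the layers and the number of weights
```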
This is the type of learning Neurons uses. Therefore, the training data we give our models has to include:
Inputs (attributes): these are the values our model will use to generate its outputs.
Outputs: this is what we want our model to predict. Since this is Supervised Learning, the model compares its predicted outputs to the dataset's true outputs, to evaluate and correct its performance.
How does the model know if it's doing a good or a bad job? Using a Loss Function: the function that measures how good (or, more correctly, how bad) the predicted outputs are, compared to the true ones. We want to minimize this function, meaning our predictions end up as close to the true values as possible.
If you've ever done a linear regression, well, congrats! You've already used (very basic) Machine Learning. In a linear regression, the Loss Function is the sum of the squared vertical distances (the residuals) from our points to the regression line. But there are many more Loss Functions that work for different contexts. For example, Binary Cross-Entropy is a function used for problems where the outputs are binary (0 or 1):
$$\text{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i))\,\right]$$

Binary Cross-Entropy formula. y_i is the true binary output, p(y_i) is the predicted probability of the output y_i occurring, and N is the size of the dataset.
Sidenote: if you're familiar with Thermodynamics, Statistical Physics or Information Theory, you'll notice the resemblance of the expression above to the expression for the (Gibbs/Shannon) Entropy of a system. Therefore, minimizing this function is (kind of) analogous to minimizing the entropy of a physical system where there are N possible outcomes y_i that occur with probability p(y_i).
In over-simplified terms: we're trying to de-randomize our system, for it to be as orderly as possible and output the correct values instead of random ones.
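To make the formula concrete, here's a minimal NumPy sketch (with made-up predictions) that computes Binary Cross-Entropy by hand. Notice how confident, correct predictions give a small loss, while hesitant or wrong ones give a large one:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # Average of -[y*log(p) + (1-y)*log(1-p)] over the N samples
    eps = 1e-12  # avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])          # true binary outputs
good = np.array([0.9, 0.1, 0.8, 0.95])   # confident, mostly correct predictions
bad = np.array([0.4, 0.6, 0.3, 0.5])     # hesitant / wrong predictions

print(binary_cross_entropy(y_true, good))  # ~0.12 (small loss)
print(binary_cross_entropy(y_true, bad))   # ~0.93 (large loss)
```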
Well... how exactly does the model minimize the Loss Function? Answer: that is the job of the Optimization Algorithm (or Optimizer). This is the algorithm the model uses to iteratively find the node weights that correspond to the minimum of the Loss Function.
A common Optimization Algorithm is Gradient Descent. As the name suggests, it uses the gradient (or slope) of the function to find its minimum:
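Here's a toy sketch of the idea (not Neurons' actual code): we minimize the simple function f(w) = (w - 3)², standing in for a real Loss Function, by repeatedly stepping against its slope:

```python
def loss(w):
    return (w - 3) ** 2  # toy Loss Function, minimum at w = 3

def gradient(w):
    return 2 * (w - 3)   # derivative (slope) of the loss

w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # how big a step we take each iteration

for step in range(50):
    w -= learning_rate * gradient(w)  # step downhill, against the slope

print(w)  # ~3.0: the weight that minimizes the loss
```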
A more advanced and commonly used Optimization Algorithm is Adam (Adaptive Moment Estimation). It usually converges faster and achieves better results than plain Gradient Descent. This is the one we'll use in the Examples.
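In Keras, picking the Optimizer and the Loss Function takes a single line. A minimal sketch, assuming the model built earlier and a binary output:

```python
# Attach the Optimizer and the Loss Function to the model from before
model.compile(optimizer='adam', loss='binary_crossentropy')
```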
When we give our model a dataset for training, there are two extra training parameters we can specify: Epochs and Batch Size.
Epochs: this is the number of times the Optimization Algorithm works through the entire dataset. Usually, going through a dataset only once isn't enough to minimize the Loss Function, so we go through it many times (100, for example), giving the Optimizer more tries to complete its job.
Batch Size: a batch is a subset of samples (rows) from our dataset; at the end of each batch, the Optimizer updates the node weights of our network. The batch size is the number of samples in each batch.
For example, take a dataset with N = 1000 samples. If we set epochs = 1 and batch size = 1000, the Optimizer will run through the dataset once and update the internal weights once, because only 1 batch of this size fits in our dataset.
Now let's change epochs to 100 and batch size to 500. In this case, the Optimizer runs through the data 100 times and, since each epoch now contains 2 batches, updates the weights 200 times in the process. This will probably give us better results, but with a much longer training time.
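In Keras, both parameters are simply arguments to fit. A sketch mirroring the example above, assuming X and y are NumPy arrays holding the 1000 samples:

```python
# X: 1000 rows of inputs, y: 1000 true outputs (hypothetical data)
# 1000 samples / batch size of 500 = 2 batches per epoch
# 2 batches x 100 epochs = 200 weight updates in total
model.fit(X, y, epochs=100, batch_size=500)
```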
Let's do a quick recap:
A node or neuron is the processing unit of a Neural Network. It has a set of weights associated with its inputs, and an Activation Function that tells the node when to activate, based on the weighted sum of those inputs.
Deep Neural Networks are made of several layers of nodes. Neurons uses Sequential (or Feedforward) networks.
In Supervised Learning, the model is trained by minimizing a Loss Function through an Optimization Algorithm.
An epoch is one cycle of the Optimizer (or Optimization Algorithm) through the dataset. An epoch is divided into batches. At the end of each batch the Optimizer updates the weights of the nodes.
I've taken you through the very basics of Machine Learning. This is an extremely complex subject, and every day numerous scientific papers are published in the field. I hope I've made it somewhat digestible for someone who is completely new to the subject. Happy Learning!