Foundation Neural Network Topics

The dot product:

The dot product is another name for a weighted sum: each element of an input vector (a list of numbers) is multiplied by a corresponding weight and the products are added together.

Geometric Properties:

The dot product of two vectors a and b equals |a||b|cos(θ), where θ is the angle between them. It measures alignment: it is largest when the vectors point the same way, zero when they are orthogonal, and negative when they point in opposite directions.

Algebraic Properties:

The dot product is commutative (a·b = b·a), distributes over addition (a·(b + c) = a·b + a·c), and is linear in each argument, so scaling an input scales the result.


Statistical Properties:

The variance equation for linear combinations (weighted sums) of random variables lets you calculate the variance of a sum whose terms are random variables weighted by constants. Here's the formula and a breakdown of its components:

Variance Equation:

Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)

where:

a and b are constant weights, X and Y are random variables, Var(X) and Var(Y) are their variances, and Cov(X, Y) is the covariance between X and Y.
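As a quick numerical check of the formula, here is a small NumPy sketch; the constants a, b and the correlated samples below are arbitrary illustration choices:

```python
import numpy as np

# Monte Carlo check of Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
# The constants and distributions below are arbitrary illustration choices.
rng = np.random.default_rng(0)
a, b = 2.0, -3.0
n = 1_000_000

X = rng.normal(0.0, 1.0, n)
Y = 0.5 * X + rng.normal(0.0, 1.0, n)   # Y is deliberately correlated with X

lhs = np.var(a * X + b * Y)
rhs = a**2 * np.var(X) + b**2 * np.var(Y) + 2 * a * b * np.cov(X, Y)[0, 1]
print(lhs, rhs)   # the two values agree up to sampling error
```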

Key Points:

If X and Y are independent, Cov(X, Y) = 0 and the variance reduces to a^2 Var(X) + b^2 Var(Y). The equation extends to any number of terms: each weight is squared when it passes through the variance, and each pair of terms contributes its own covariance term.

Central Limit Theorem (CLT):

The CLT states that under certain conditions, the probability distribution of a sum of a large number of independent identically distributed (i.i.d.) random variables will tend towards a normal distribution (also called Gaussian distribution) regardless of the original distributions of the individual variables.

How it applies to weighted sums:

The CLT applies to weighted sums as long as the variables are independent, each has a finite variance, and no single weighted term dominates the sum (the weights are not concentrated on just a few terms).

Why it happens:

The formal proof of the CLT is technical, but the intuition is straightforward: when many random variables with some variability are summed, the positive and negative deviations from the mean tend to cancel each other out, resulting in a bell-shaped normal distribution where most values fall near the average and fewer fall further out in the tails.
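A small simulation sketch of this (the number of terms, the uniform inputs, and the weights are arbitrary choices) shows the bell shape emerging from a weighted sum of decidedly non-normal variables:

```python
import numpy as np

# A weighted sum of many i.i.d. uniform variables is approximately normal.
# The weights are chosen so that no single term dominates the sum.
rng = np.random.default_rng(1)
n_terms, n_samples = 200, 100_000

weights = rng.uniform(0.5, 1.5, n_terms)          # no dominant weight
X = rng.uniform(-1.0, 1.0, (n_samples, n_terms))  # i.i.d., non-normal inputs
S = X @ weights                                   # one weighted sum per sample

# For a normal distribution, skewness is 0 and excess kurtosis is 0.
z = (S - S.mean()) / S.std()
print("skew:", (z**3).mean(), "excess kurtosis:", (z**4).mean() - 3)
```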

Weighted Sums and the Normal Distribution:

Putting the two results together: a weighted sum of many inputs has a variance given by the variance equation, and by the CLT its distribution tends towards a normal (Gaussian) distribution, even when the individual inputs are not normally distributed.

Information Storage in the Weighted Sum:

With a suitable training algorithm you can store <vector, scalar> associations in a weighted sum. For example, you could train it to recall the number 5 when the input to a weighted sum of two numbers is [3,1], i.e. <[3,1], 5>. You can directly solve for a number of associations using the math for systems of linear equations. Generally though, gradient descent is used, because there are under-capacity cases (fewer associations than weights, with many exact solutions) and over-capacity cases (more associations than weights, where only approximate solutions exist).
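A minimal sketch of the direct-solve approach, using the <[3,1], 5> association from above plus one extra made-up association:

```python
import numpy as np

# Storing <vector, scalar> associations in a single weighted sum by solving
# a linear system. The second association is an arbitrary extra example.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # each row is an input vector
t = np.array([5.0, 4.0])        # target scalar for each input

w, *_ = np.linalg.lstsq(A, t, rcond=None)   # exact here; least-squares when over capacity
print(w, A @ w)                 # A @ w recalls the stored targets [5, 4]
```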

Linear Classifiers and Decision Boundaries:

A linear classifier decides between two classes by computing a weighted sum (a dot product) of the input and comparing it to a threshold. The decision boundary, the set of inputs where the weighted sum exactly equals the threshold, is a straight line in two dimensions, a plane in three, and a hyperplane in general.

The Power of the Dot Product:

A single dot product followed by a threshold is enough to implement such a classifier, and it is also the core operation of an artificial neuron.

So, how does the dot product help?

The weight vector defines a direction in the input space, and the dot product measures how far the input lies along that direction. Comparing that score to a threshold splits the input space with a flat, hyperplane boundary.


Addition and Subtraction:

Addition and subtraction are dot products: y = a + b + c can be viewed as y = (1)a + (1)b + (1)c, that is, the dot product of [a,b,c] with [1,1,1]. Likewise, [a,b,c] dot [1,-1,-1] = a - b - c. This is a useful viewpoint when fast transforms are used in neural networks.
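A tiny NumPy illustration of the same point:

```python
import numpy as np

# Addition and subtraction expressed as dot products with fixed +1/-1 weight vectors.
a, b, c = 2.0, 3.0, 5.0
print(np.dot([a, b, c], [1, 1, 1]))    # a + b + c = 10.0
print(np.dot([a, b, c], [1, -1, -1]))  # a - b - c = -6.0
```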

The ReLU activation function.

ReLU as a Switch

ReLU (Rectified Linear Unit) is a widely used activation function due to its simplicity and effectiveness. Mathematically, it can be expressed as:

f(x) = max(0, x)

This function behaves like a threshold unit: when the input x is positive it is passed through unchanged (f(x) = x), and when the input is zero or negative the output is switched to zero.

In essence, the ReLU function introduces non-linearity into the neural network, allowing it to model complex relationships in the data. This is crucial because stacked linear layers would only produce linear outputs.

A simple example of stacked linear layers is:

Layer 1 with x and y as inputs.

u=2x+3y

v=3x+1y

Layer 2:

w=3u+5v

This can be simplified by basic linear algebra.

w=3(2x+3y)+5(3x+1y)

w=(3)(2)x+(3)(3)y+(5)(3)x+(5)(1)y

Stopping here for a moment, you can note that while the x and y terms have remained completely linear, the weight terms have become nonlinear, involving products such as (3)(2). The deeper the stacking, the longer the product terms. However, that does not matter as long as the composition of the layers remains unchanged.

Finally w=21x+14y.
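The same collapse can be shown numerically; here is a small sketch using just the weights from the worked example:

```python
import numpy as np

# Two stacked linear layers collapse into a single linear layer.
W1 = np.array([[2.0, 3.0],     # u = 2x + 3y
               [3.0, 1.0]])    # v = 3x + 1y
W2 = np.array([[3.0, 5.0]])    # w = 3u + 5v

combined = W2 @ W1             # composition of the two layers
print(combined)                # [[21. 14.]]  ->  w = 21x + 14y

x, y = 1.0, 2.0
print(W2 @ (W1 @ np.array([x, y])), combined @ np.array([x, y]))  # identical outputs
```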


If, say, the composition of the layers does change, for example when u is switched off by a ReLU function:

u=0

v=3x+1y

w=3u+5v

w=(3)0+5(3x+1y)

w=(5)(3)x+(5)(1)y

w=15x+5y
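A small sketch of the switching behaviour, with a ReLU on u only, as in the example above:

```python
# With ReLU, the switch state of u selects between two linear maps:
# w = 21x + 14y (u on) and w = 15x + 5y (u off), as in the worked example.
def forward(x, y):
    u = max(0.0, 2 * x + 3 * y)   # ReLU on u
    v = 3 * x + 1 * y             # v left linear for simplicity, as in the text
    return 3 * u + 5 * v

print(forward(1.0, 1.0))    # u on:  21*1 + 14*1 = 35.0
print(forward(1.0, -1.0))   # u off (2 - 3 < 0): 15*1 + 5*(-1) = 10.0
```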

Since switching happens at zero with ReLU, the effect of u on the output w transitions continuously to zero as u decreases to the switching point. This results in polyhedral regions in a ReLU neural network: within each region the network computes a single linear map, and the map changes as region boundaries are crossed.

The zoning of those polyhedral regions is likely related to the normal (Gaussian) distribution, with more region boundaries near the center.

The output of neuron X1 is connected to N weights in the next layer, and the pattern embedded in those weights is projected with intensity x1 (the output value of X1) into the next layer.

The Forward Projections in a Neural Network

The output of any particular neuron in a conventional dense neural network is connected to N weights in the next layer, one weight for each neuron in the next layer. There is no guarantee that the patterns are orthogonal and complete, that is, that they could sum to an arbitrary target pattern in the next layer, which may cause some minor information loss. More important is the interaction between the activation function and the intensity with which the forward-connected weight pattern is projected.

For example, if the neuron output is restricted to binary +1, -1 then each weight pattern is projected with intensity either +1 or -1, which will not be very expressive. This might also be a problem with sigmoid activation functions that saturate at, say, +1 and -1.

With ReLU the pattern will be projected with intensity x when the input to ReLU(x) is positive, or with intensity zero when the input is negative. It might be worth considering providing each neuron with both a positive-responding ReLU and a negative-responding ReLU, and connecting an independent collection of forward-connected weights to the +ReLU and the -ReLU, as sketched below.
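A minimal sketch of that idea; the layer size and the weight names are illustrative assumptions, not a prescribed design:

```python
import numpy as np

# Each neuron output x drives two independent forward weight patterns,
# one through a positive-responding ReLU and one through a negative-responding ReLU.
rng = np.random.default_rng(2)
n_next = 8
pos_weights = rng.normal(size=n_next)   # pattern projected when x > 0
neg_weights = rng.normal(size=n_next)   # pattern projected when x < 0

def dual_relu_projection(x):
    pos = max(0.0, x)        # +ReLU: passes positive x
    neg = max(0.0, -x)       # -ReLU: passes the magnitude of negative x
    return pos * pos_weights + neg * neg_weights

print(dual_relu_projection(1.5))    # projects pos_weights with intensity 1.5
print(dual_relu_projection(-0.5))   # projects neg_weights with intensity 0.5
```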

Consider point inputs to an inverse FFT (Fast Fourier Transform). In the output of the inverse FFT, each point input is converted to a different sine or cosine wave pattern and the patterns are summed.

Likewise, other fast transforms such as the fast Walsh-Hadamard transform act as point-to-pattern generators.
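A short sketch of the point-to-pattern behaviour using NumPy's inverse FFT (the bin positions and weights are arbitrary):

```python
import numpy as np

# A "point" input (one-hot vector) to the inverse FFT becomes a single complex
# sinusoid pattern, and a weighted set of points becomes the weighted sum of
# those patterns, because the transform is linear.
n = 8
e2, e5 = np.zeros(n), np.zeros(n)
e2[2], e5[5] = 1.0, 1.0

p2 = np.fft.ifft(e2)                   # sinusoid pattern for point 2
p5 = np.fft.ifft(e5)                   # sinusoid pattern for point 5

combined = np.fft.ifft(1.0 * e2 + 3.0 * e5)
print(np.allclose(combined, 1.0 * p2 + 3.0 * p5))   # True: the patterns just add
```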

Fast Transforms as forward-connected pattern generators:

You can replace the weight matrix providing the forward-connected weight patterns with a fast transform, the computational cost falling from O(n^2) to typically O(n log n). Of course the patterns are fixed; however, on the positive side, they are typically orthogonal and complete. It is really just a question then of modulating the intensity of each pattern in a suitable way to get a target sum of patterns in the next layer.

Using parametric activation functions is a way of achieving that, where each individual activation function has a number of adjustable variables that alter its behavior.  

If each activation function has only a few parameters, the resulting sum of patterns is of course likely to be more approximate, so a fast-transform network may require extra layers compared to a conventional net.
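A sketch of one such layer, using a fast Walsh-Hadamard transform for the fixed patterns and a simple two-slope parametric activation per element; the particular activation chosen here is an illustrative assumption, not a prescribed design:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(n log n); n must be a power of 2."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)            # normalize so the transform is orthonormal

def layer(x, pos_slope, neg_slope):
    # Parametric activation: separate adjustable slopes for positive and negative inputs.
    y = np.where(x > 0, pos_slope * x, neg_slope * x)
    return fwht(y)                   # fixed orthogonal patterns replace the weight matrix

rng = np.random.default_rng(3)
n = 16
x = rng.normal(size=n)
out = layer(x, pos_slope=rng.normal(size=n), neg_slope=rng.normal(size=n))
print(out.shape)                     # (16,): n outputs at O(n log n) cost instead of O(n^2)
```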