Foundation Neural Network Topics

The dot product:

The dot product is another name for a weighted sum: each element of an input vector (a list of numbers) is multiplied by a corresponding weight and the products are added together.

Geometric Properties:

The dot product of two vectors a and b equals |a||b|cos(θ), where θ is the angle between them. It measures alignment: it is largest when the vectors point the same way, zero when they are orthogonal, and negative when they point in opposite directions.

Algebraic Properties:

The dot product is commutative (a·b = b·a), distributes over addition (a·(b + c) = a·b + a·c), and is linear in each argument, so scaling an input scales the result.


Statistical Properties:

The variance equation for linear combinations (weighted sums) of random variables lets you calculate the variance of a sum whose terms are random variables weighted by constants. Here's the formula and a breakdown of its components:

Variance Equation:

Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)

where:

a and b are constant weights, X and Y are random variables, Var(X) and Var(Y) are their variances, and Cov(X, Y) is the covariance between X and Y.
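As a quick numerical check of the formula, here is a small NumPy sketch; the constants a, b and the correlated samples below are arbitrary illustration choices:

```python
import numpy as np

# Monte Carlo check of Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
# The constants and distributions below are arbitrary illustration choices.
rng = np.random.default_rng(0)
a, b = 2.0, -3.0
n = 1_000_000

X = rng.normal(0.0, 1.0, n)
Y = 0.5 * X + rng.normal(0.0, 1.0, n)   # Y is deliberately correlated with X

lhs = np.var(a * X + b * Y)
rhs = a**2 * np.var(X) + b**2 * np.var(Y) + 2 * a * b * np.cov(X, Y)[0, 1]
print(lhs, rhs)   # the two values agree up to sampling error
```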

Key Points:

If X and Y are independent, Cov(X, Y) = 0 and the variance reduces to a^2 Var(X) + b^2 Var(Y). The equation extends to any number of terms: each weight is squared when it passes through the variance, and each pair of terms contributes its own covariance term.

Central Limit Theorem (CLT):

The CLT states that under certain conditions, the probability distribution of a sum of a large number of independent identically distributed (i.i.d.) random variables will tend towards a normal distribution (also called Gaussian distribution) regardless of the original distributions of the individual variables.

How it applies to weighted sums:

The CLT applies to weighted sums as long as the variables are independent, each has a finite variance, and no single weighted term dominates the sum (the weights are not concentrated on just a few terms).

Why it happens:

The formal proof of the CLT is technical, but the intuition is straightforward: when many random variables with some variability are summed, the positive and negative deviations from the mean tend to cancel each other out, resulting in a bell-shaped normal distribution where most values fall near the average and fewer fall further out in the tails.
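A small simulation sketch of this (the number of terms, the uniform inputs, and the weights are arbitrary choices) shows the bell shape emerging from a weighted sum of decidedly non-normal variables:

```python
import numpy as np

# A weighted sum of many i.i.d. uniform variables is approximately normal.
# The weights are chosen so that no single term dominates the sum.
rng = np.random.default_rng(1)
n_terms, n_samples = 200, 100_000

weights = rng.uniform(0.5, 1.5, n_terms)          # no dominant weight
X = rng.uniform(-1.0, 1.0, (n_samples, n_terms))  # i.i.d., non-normal inputs
S = X @ weights                                   # one weighted sum per sample

# For a normal distribution, skewness is 0 and excess kurtosis is 0.
z = (S - S.mean()) / S.std()
print("skew:", (z**3).mean(), "excess kurtosis:", (z**4).mean() - 3)
```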

Weighted Sums and the Normal Distribution:

Putting the two results together: a weighted sum of many inputs has a variance given by the variance equation, and by the CLT its distribution tends towards a normal (Gaussian) distribution, even when the individual inputs are not normally distributed.

Information Storage in the Weighted Sum:

With a suitable training algorithm you can store <vector, scalar> associations in a weighted sum. For example, you could train it to recall the number 5 when the input to a weighted sum of two numbers is [3,1], i.e. <[3,1], 5>. You can directly solve for a number of associations using the math for systems of linear equations. Generally though, gradient descent is used, because there are under-capacity cases (fewer associations than weights, with many exact solutions) and over-capacity cases (more associations than weights, where only approximate solutions exist).
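A minimal sketch of the direct-solve approach, using the <[3,1], 5> association from above plus one extra made-up association:

```python
import numpy as np

# Storing <vector, scalar> associations in a single weighted sum by solving
# a linear system. The second association is an arbitrary extra example.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # each row is an input vector
t = np.array([5.0, 4.0])        # target scalar for each input

w, *_ = np.linalg.lstsq(A, t, rcond=None)   # exact here; least-squares when over capacity
print(w, A @ w)                 # A @ w recalls the stored targets [5, 4]
```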

Linear Classifiers and Decision Boundaries:

A linear classifier decides between two classes by computing a weighted sum (a dot product) of the input and comparing it to a threshold. The decision boundary, the set of inputs where the weighted sum exactly equals the threshold, is a straight line in two dimensions, a plane in three, and a hyperplane in general.

The Power of the Dot Product:

A single dot product followed by a threshold is enough to implement such a classifier, and it is also the core operation of an artificial neuron.

So, how does the dot product help?

The weight vector defines a direction in the input space, and the dot product measures how far the input lies along that direction. Comparing that score to a threshold splits the input space with a flat, hyperplane boundary.


Addition and Subtraction:

Addition and subtraction are dot products: y = a + b + c can be viewed as y = (1)a + (1)b + (1)c, that is, the dot product of [a,b,c] with [1,1,1]. Likewise, [a,b,c] dot [1,-1,-1] = a - b - c. This is a useful viewpoint when fast transforms are used in neural networks.
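A tiny NumPy illustration of the same point:

```python
import numpy as np

# Addition and subtraction expressed as dot products with fixed +1/-1 weight vectors.
a, b, c = 2.0, 3.0, 5.0
print(np.dot([a, b, c], [1, 1, 1]))    # a + b + c = 10.0
print(np.dot([a, b, c], [1, -1, -1]))  # a - b - c = -6.0
```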

The ReLU activation function.

ReLU as a Switch

ReLU (Rectified Linear Unit) is a widely used activation function due to its simplicity and effectiveness. Mathematically, it can be expressed as:

f(x) = max(0, x)

This function behaves like a threshold unit: when the input x is positive it is passed through unchanged (f(x) = x), and when the input is zero or negative the output is switched to zero.

In essence, the ReLU function introduces non-linearity into the neural network, allowing it to model complex relationships in the data. This is crucial because stacked linear layers would only produce linear outputs.

A simple example of stacked linear layers is:

Layer 1 with x and y as inputs.

u=2x+3y

v=3x+1y

Layer 2:

w=3u+5v

This can be simplified by basic linear algebra.

w=3(2x+3y)+5(3x+1y)

w=(3)(2)x+(3)(3)y+(5)(3)x+(5)(1)y

Stopping here for a moment, you can note that while the x and y terms have remained completely linear, the weight terms have become nonlinear, involving products such as (3)(2). The deeper the stacking, the longer the product terms. However, that does not matter as long as the composition of the layers remains unchanged.

Finally w=21x+14y.
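The same collapse can be shown numerically; here is a small sketch using just the weights from the worked example:

```python
import numpy as np

# Two stacked linear layers collapse into a single linear layer.
W1 = np.array([[2.0, 3.0],     # u = 2x + 3y
               [3.0, 1.0]])    # v = 3x + 1y
W2 = np.array([[3.0, 5.0]])    # w = 3u + 5v

combined = W2 @ W1             # composition of the two layers
print(combined)                # [[21. 14.]]  ->  w = 21x + 14y

x, y = 1.0, 2.0
print(W2 @ (W1 @ np.array([x, y])), combined @ np.array([x, y]))  # identical outputs
```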


If, say, the composition of the layers does change, for example when u is switched off by a ReLU function:

u=0

v=3x+1y

w=3u+5v

w=(3)0+5(3x+1y)

w=(5)(3)x+(5)(1)y

w=15x+5y
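A small sketch of the switching behaviour, with a ReLU on u only, as in the example above:

```python
# With ReLU, the switch state of u selects between two linear maps:
# w = 21x + 14y (u on) and w = 15x + 5y (u off), as in the worked example.
def forward(x, y):
    u = max(0.0, 2 * x + 3 * y)   # ReLU on u
    v = 3 * x + 1 * y             # v left linear for simplicity, as in the text
    return 3 * u + 5 * v

print(forward(1.0, 1.0))    # u on:  21*1 + 14*1 = 35.0
print(forward(1.0, -1.0))   # u off (2 - 3 < 0): 15*1 + 5*(-1) = 10.0
```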

Since switching happens at zero with ReLU, the effect of u on the output w transitions continuously to zero as u decreases to the switching point. This results in polyhedral regions in a ReLU neural network: within each region the network computes a single linear map, and the map changes as region boundaries are crossed.

The zoning of those polyhedral regions is likely related to the normal (Gaussian) distribution, with more region boundaries near the center.

The output of neuron X1 is connected to N weights in the next layer, and the pattern embedded in those weights is projected with intensity x1 (the output value of X1) into the next layer.

The Forward Projections in a Neural Network

The output of any particular neuron in a conventional dense neural network is connected to N weights in the next layer, one weight for each neuron in the next layer. There is no guarantee that the patterns are orthogonal and complete, that is, that they could sum to an arbitrary target pattern in the next layer, which may cause some minor information loss. More important is the interaction between the activation function and the intensity with which the forward-connected weight pattern is projected.

For example, if the neuron output is restricted to binary +1, -1 then each weight pattern is projected with intensity either +1 or -1, which will not be very expressive. This might also be a problem with sigmoid activation functions that saturate at, say, +1 and -1.

With ReLU the pattern will be projected with intensity x when the input to ReLU(x) is positive, or with intensity zero when the input is negative. It might be worth considering providing each neuron with both a positive-responding ReLU and a negative-responding ReLU, and connecting an independent collection of forward-connected weights to the +ReLU and the -ReLU, as sketched below.
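A minimal sketch of that idea; the layer size and the weight names are illustrative assumptions, not a prescribed design:

```python
import numpy as np

# Each neuron output x drives two independent forward weight patterns,
# one through a positive-responding ReLU and one through a negative-responding ReLU.
rng = np.random.default_rng(2)
n_next = 8
pos_weights = rng.normal(size=n_next)   # pattern projected when x > 0
neg_weights = rng.normal(size=n_next)   # pattern projected when x < 0

def dual_relu_projection(x):
    pos = max(0.0, x)        # +ReLU: passes positive x
    neg = max(0.0, -x)       # -ReLU: passes the magnitude of negative x
    return pos * pos_weights + neg * neg_weights

print(dual_relu_projection(1.5))    # projects pos_weights with intensity 1.5
print(dual_relu_projection(-0.5))   # projects neg_weights with intensity 0.5
```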

Consider point inputs to an inverse FFT (Fast Fourier Transform). In the output of the inverse FFT, each point input is converted to a different sine or cosine wave pattern and the patterns are summed.

Likewise, other fast transforms such as the fast Walsh-Hadamard transform act as point-to-pattern generators.
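A short sketch of the point-to-pattern behaviour using NumPy's inverse FFT (the bin positions and weights are arbitrary):

```python
import numpy as np

# A "point" input (one-hot vector) to the inverse FFT becomes a single complex
# sinusoid pattern, and a weighted set of points becomes the weighted sum of
# those patterns, because the transform is linear.
n = 8
e2, e5 = np.zeros(n), np.zeros(n)
e2[2], e5[5] = 1.0, 1.0

p2 = np.fft.ifft(e2)                   # sinusoid pattern for point 2
p5 = np.fft.ifft(e5)                   # sinusoid pattern for point 5

combined = np.fft.ifft(1.0 * e2 + 3.0 * e5)
print(np.allclose(combined, 1.0 * p2 + 3.0 * p5))   # True: the patterns just add
```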

Fast Transforms as forward-connected pattern generators:

You can replace the weight matrix providing the forward-connected weight patterns with a fast transform, the computational cost falling from O(n^2) to typically O(n log n). Of course the patterns are fixed; however, on the positive side, they are typically orthogonal and complete. It is really just a question then of modulating the intensity of each pattern in a suitable way to get a target sum of patterns in the next layer.

Using parametric activation functions is a way of achieving that, where each individual activation function has a number of adjustable variables that alter its behavior.  

If each activation function has only a few parameters, the resulting sum of patterns is of course likely to be more approximate, so a fast-transform network may require extra layers compared to a conventional net.
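A sketch of one such layer, using a fast Walsh-Hadamard transform for the fixed patterns and a simple two-slope parametric activation per element; the particular activation chosen here is an illustrative assumption, not a prescribed design:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(n log n); n must be a power of 2."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)            # normalize so the transform is orthonormal

def layer(x, pos_slope, neg_slope):
    # Parametric activation: separate adjustable slopes for positive and negative inputs.
    y = np.where(x > 0, pos_slope * x, neg_slope * x)
    return fwht(y)                   # fixed orthogonal patterns replace the weight matrix

rng = np.random.default_rng(3)
n = 16
x = rng.normal(size=n)
out = layer(x, pos_slope=rng.normal(size=n), neg_slope=rng.normal(size=n))
print(out.shape)                     # (16,): n outputs at O(n log n) cost instead of O(n^2)
```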