The derivative of a function describes how the function changes with respect to one of its inputs. For a multivariable function, a partial derivative measures the change with respect to one variable while keeping the others fixed. When we collect all partial derivatives into a vector, we call this the gradient, which points in the direction of the steepest increase of the function.
The chain rule is used when differentiating composite functions. For example, if z = f(g(x)), then the derivative of z with respect to x can be expressed as:
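dz/dx = f′(g(x)) · g′(x)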
Gradient descent is an optimization algorithm that minimizes a loss function by repeatedly moving in the opposite direction of the gradient. The update rule for the parameters θ at time step t is given by:
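θₜ₊₁ = θₜ − η ∇L(θₜ)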
where η is the learning rate and L(θ) is the loss function.
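As an illustration, here is a minimal sketch of this update loop in Python (assuming the gradient of the loss is available as a function; the quadratic loss below is a hypothetical example):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, eta=0.1, steps=100):
    """Minimize a loss by repeatedly stepping against its gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_fn(theta)  # theta <- theta - eta * grad L(theta)
    return theta

# Hypothetical example: L(theta) = ||theta||^2 has gradient 2 * theta and its minimum at the origin.
theta_min = gradient_descent(lambda theta: 2 * theta, theta0=[3.0, -4.0])
print(theta_min)  # both entries end up close to 0
```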
Integration is the inverse of differentiation. It can be seen as the accumulation of the area under a curve. In probability theory, the integral of a probability density function over the entire real line must equal one:
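∫ p(x) dx = 1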
A probability distribution describes how likely different values of a random variable are. Discrete distributions are used when outcomes are countable, such as the Bernoulli, Binomial, or Poisson distributions. Continuous distributions, like the Normal (Gaussian) and Exponential distributions, describe probabilities over a continuous range using probability density functions (PDFs).
The probability density function of the Normal distribution is:
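f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

where μ is the mean and σ² is the variance.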
Conditional probability measures the likelihood of an event A given that another event B has occurred.
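Formally, P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.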
Bayes’ theorem provides a way to update probabilities based on new evidence:
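P(A | B) = P(B | A) · P(A) / P(B)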
The expectation, or mean, of a random variable X is the weighted average of its possible values:
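E[X] = Σₓ x · P(X = x) for a discrete X, or E[X] = ∫ x · p(x) dx for a continuous X.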
The variance measures how much values deviate from the mean:
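Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

As a quick check of the expectation and variance formulas, here is a short NumPy snippet (the fair six-sided die is just a hypothetical example):

```python
import numpy as np

# Hypothetical example: a fair six-sided die.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

expectation = np.sum(values * probs)                     # E[X] = sum of x * P(X = x)
variance = np.sum((values - expectation) ** 2 * probs)   # Var(X) = E[(X - E[X])^2]

print(expectation)  # 3.5
print(variance)     # about 2.9167
```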
A vector is an ordered list of numbers, often used to represent features or weights in machine learning. A matrix is a two-dimensional array of numbers, often used to represent linear transformations or collections of vectors.
Matrix-vector multiplication is defined as:
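(Av)ᵢ = Σⱼ Aᵢⱼ vⱼ

that is, the i-th entry of Av is the dot product of the i-th row of A with v.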
The transpose of a matrix flips it over its diagonal, the inverse of a matrix satisfies A · A⁻¹ = I, and the identity matrix I leaves vectors unchanged when multiplied.
The rank of a matrix is the number of linearly independent rows or columns. A set of vectors is linearly independent if no vector can be expressed as a combination of the others. The span of a set of vectors is the collection of all possible linear combinations of those vectors.
An eigenvalue λ and eigenvector v of a matrix A satisfy the relation:
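A · v = λ · v  (for a nonzero vector v)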
Finally, singular value decomposition (SVD) allows any matrix A to be decomposed as:
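A = U Σ Vᵀ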
where U and V are orthogonal matrices and Σ is a diagonal matrix containing the singular values.
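NumPy's `linalg.svd` returns this factorization directly; the snippet below is only an illustrative check (the matrix A is a hypothetical example), not part of the derivation above:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s holds the singular values
A_reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, A_reconstructed))  # True: U diag(s) V^T recovers A
```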
The perceptron is inspired by the biological neuron. It takes multiple inputs, multiplies them by corresponding weights, adds a bias, and then passes the result through an activation function to produce an output.
The basic operation is:
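z = w · x + b
output = f(z)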
Where:
x = input vector
w = weight vector
b = bias
f(z) = activation function (e.g., step function)
For example, if x = [1, 0], w = [0.7, -0.3], and b = 0.1, then:
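z = (0.7)(1) + (−0.3)(0) + 0.1 = 0.8

With a step activation, f(0.8) = 1, so the perceptron outputs 1.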
The perceptron defines a decision boundary given by:
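w · x + b = 0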
This is a line in 2D, a plane in 3D, and a hyperplane in n dimensions. It separates the input space into positive and negative regions.
Weights are updated only when predictions are wrong:
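w ← w + η · (y_true − y_pred) · x
b ← b + η · (y_true − y_pred)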
η: learning rate
y_true: correct label (0 or 1)
y_pred: perceptron output (0 or 1)
Over time, this adjusts the decision boundary so that misclassified points are moved to the correct side.
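Putting these pieces together, here is a minimal perceptron sketch in Python (assuming NumPy, a step activation, and the update rule above; the AND data is just a hypothetical example):

```python
import numpy as np

def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def train_perceptron(X, y, eta=0.1, epochs=20):
    """Train a perceptron with the rule w <- w + eta * (y_true - y_pred) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_true in zip(X, y):
            y_pred = step(np.dot(w, x_i) + b)
            error = y_true - y_pred           # nonzero only when the prediction is wrong
            w = w + eta * error * x_i          # shift the decision boundary toward the point
            b = b + eta * error
    return w, b

# Hypothetical example: learn the logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([step(np.dot(w, x_i) + b) for x_i in X])  # expected [0, 0, 0, 1]
```

Because the AND data is linearly separable, the updates stop once every point lies on the correct side of the boundary.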