At the heart of supervised learning lies a simple yet powerful framework: the relationship between input data and desired output. This framework is akin to learning with a teacher, where the machine learns from labeled examples provided in a training dataset. The goal is to train a model that can make accurate predictions on new, unseen data.
Suppose we have a dataset of m training examples, each consisting of an input X(i) and a corresponding target or label y(i). In supervised learning, we aim to learn a function f that maps inputs to outputs:
y(i) = f(X(i)) + ϵ(i)
Here, y(i) represents the true target, f(X(i)) is the model's prediction, and ϵ(i) captures the error or noise in the prediction. The goal is to minimize this error across all training examples.
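To make the setup concrete, here is a minimal NumPy sketch of what such a dataset might look like; the underlying function true_f and the noise level are arbitrary choices for illustration, not part of any particular problem.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def true_f(x):
    """The underlying function f, which a real learner would not know."""
    return 3.0 * x + 2.0

m = 100                                   # number of training examples
X = rng.uniform(-1.0, 1.0, size=m)        # inputs X(i)
epsilon = rng.normal(0.0, 0.1, size=m)    # noise term epsilon(i)
y = true_f(X) + epsilon                   # targets y(i) = f(X(i)) + epsilon(i)
```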
Loss functions are the compass that guides supervised learning. They quantify the discrepancy between the model's predictions and the true targets. The choice of loss function depends on the type of problem being solved. For regression tasks, the standard choice is Mean Squared Error (MSE):
J = 1/(2m) · ∑(i=1 to m) (y(i) − ŷ(i))²
where ŷ(i) = f(X(i)) is the model's prediction for the i-th example.
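A direct translation of this formula into NumPy might look like the sketch below (the helper name mse_loss is just for illustration); the 1/(2m) scaling matches the equation above, and the extra factor of 2 cancels neatly when the gradient is taken.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean Squared Error: J = 1/(2m) * sum((y(i) - y_hat(i))^2)."""
    m = len(y_true)
    return np.sum((y_true - y_pred) ** 2) / (2 * m)
```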
Gradient Descent: Navigating the Optimization Landscape
Optimization is the process of fine-tuning the model's parameters to minimize the loss function. Gradient Descent is a fundamental optimization algorithm used in supervised learning. It works by iteratively updating the model's parameters in the direction of the steepest decrease in the loss function.
Gradient Descent Math
The weight update rule for Gradient Descent is:
W = W − α · ∇J
Here, W represents the model's weights, α is the learning rate (a hyperparameter), and ∇J is the gradient of the loss function with respect to the weights. The gradient points in the direction of steepest ascent, so we subtract it from the weights to descend toward a local minimum of the loss function.
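To see the update rule in action, the following sketch applies it to a single-feature linear model y ≈ w·X + b trained with the MSE loss defined earlier; the function name, learning rate, and number of iterations are illustrative assumptions rather than fixed choices.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, epochs=200):
    """Fit y ≈ w * X + b by repeatedly applying W = W - alpha * grad(J)."""
    m = len(y)
    w, b = 0.0, 0.0                        # initialize the parameters
    for _ in range(epochs):
        y_pred = w * X + b                 # current model predictions
        error = y_pred - y
        grad_w = np.sum(error * X) / m     # dJ/dw for J = 1/(2m) * sum(error^2)
        grad_b = np.sum(error) / m         # dJ/db
        w -= alpha * grad_w                # step in the direction of steepest descent
        b -= alpha * grad_b
    return w, b
```

Run on the synthetic dataset from the first sketch, this loop should recover values of w and b close to the true slope and intercept used to generate the data.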
Overfitting and Regularization
In supervised learning, one must be vigilant about overfitting, a phenomenon where the model fits the training data too closely, capturing noise and leading to poor generalization on new data. Regularization techniques, such as L1 and L2 regularization, are used to combat overfitting by adding a penalty term to the loss function that discourages overly complex models.
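As a sketch of the idea, an L2 penalty (as in ridge regression) can be folded into the MSE loss from earlier by adding the sum of squared weights, scaled by a regularization strength; the helper below and its lam parameter are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, w, lam=0.1):
    """MSE loss plus an L2 penalty that discourages large weights."""
    m = len(y_true)
    mse = np.sum((y_true - y_pred) ** 2) / (2 * m)
    l2_penalty = lam * np.sum(np.square(w)) / (2 * m)   # bias terms are usually left out
    return mse + l2_penalty
```

An L1 penalty would instead sum the absolute values of the weights, which tends to push some of them exactly to zero and yields sparser models.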