Gradient Descent
Gradient descent is an optimization algorithm used throughout machine learning to train models, including neural networks, by iteratively adjusting their parameters to reduce error on the training data.
The intuition is often described with an analogy: you are standing in mountainous terrain and want to reach the lowest point, but you can only see the ground immediately around you, so you repeatedly take a step in the direction of steepest descent.
Concretely, the gradient descent algorithm iteratively adjusts the model's weight and bias parameters to minimize the value of a loss function.
The loss function measures the error or discrepancy between the model's predictions and the actual values.
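To make this concrete, here is a minimal sketch of gradient descent fitting a one-variable linear model with a mean squared error loss (the toy data, learning rate, and step count are arbitrary illustrative choices):

```python
import numpy as np

# Toy data: y is roughly 3*x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0          # model parameters (weight and bias)
learning_rate = 0.1

for step in range(500):
    y_pred = w * x + b                 # model predictions
    error = y_pred - y                 # discrepancy with the actual values
    loss = np.mean(error ** 2)         # mean squared error loss

    # Gradients of the loss with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)

    # Step "downhill": move the parameters against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # should approach 3 and 2
```

Each iteration only uses local information (the gradient at the current parameter values), which is exactly the "you can only see the nearby terrain" part of the analogy.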
Added June 15, 2025
Models using unsupervised learning can and often do use Gradient Descent.
While Gradient Descent is famously associated with supervised learning (where it minimizes a loss function based on labeled data, like in linear regression or neural networks for classification/regression), its core purpose is optimization: finding the minimum of a differentiable function.
In unsupervised learning, even though there are no explicit "labels" to predict, many algorithms still define an objective function or loss function that they aim to minimize or maximize. This function quantifies how "good" a particular configuration of the model is at discovering patterns or representing the data.
Here are some examples of how Gradient Descent is used in unsupervised learning; short code sketches for several of them follow the list:
Autoencoders: These are neural networks trained to reconstruct their input. The objective is to minimize the reconstruction error (the difference between the input and the output). This error function is differentiable, and Gradient Descent (or its variants like Adam, RMSProp, etc.) is used to update the network's weights to minimize this error.
Generative Adversarial Networks (GANs): GANs consist of a generator and a discriminator. Both are typically neural networks trained using Gradient Descent. The generator tries to fool the discriminator by creating realistic data, while the discriminator tries to distinguish real data from generated data. This involves a minimax game where both networks are optimized via gradient-based methods.
Word Embeddings (e.g., Word2Vec): Models like Word2Vec learn dense vector representations of words based on their context. They are not strictly "unsupervised" in the sense of having no targets at all; rather, the targets are derived from the input data itself (e.g., predicting surrounding words given a central word, or vice versa). The objective function for these tasks (e.g., the negative log-likelihood) is optimized using Gradient Descent.
Dimensionality Reduction (e.g., some forms of PCA with neural networks): While traditional PCA has an analytical solution, some non-linear dimensionality reduction techniques, particularly those implemented with neural networks, use Gradient Descent to learn the optimal mapping to a lower-dimensional space while preserving certain properties of the data.
Clustering (e.g., Gaussian Mixture Models with EM): While K-Means uses an iterative assignment-and-update procedure, more complex clustering algorithms such as Gaussian Mixture Models often rely on the Expectation-Maximization (EM) algorithm. EM isn't pure gradient descent, but its "Maximization" step optimizes the parameters of probability distributions, and that inner optimization can itself be gradient-based.
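For the autoencoder case above, a minimal sketch (assuming PyTorch; the layer sizes, optimizer settings, and random "dataset" are illustrative assumptions) shows how the reconstruction error plays the role of the loss even though no labels exist:

```python
import torch
import torch.nn as nn

# Tiny autoencoder: compress 20-dim inputs to 4 dims and reconstruct them.
model = nn.Sequential(
    nn.Linear(20, 4),   # encoder
    nn.ReLU(),
    nn.Linear(4, 20),   # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()            # reconstruction error

data = torch.randn(256, 20)       # unlabeled data: the input is its own target

for epoch in range(100):
    reconstruction = model(data)
    loss = loss_fn(reconstruction, data)   # how far the output is from the input
    optimizer.zero_grad()
    loss.backward()                        # gradients of the reconstruction error
    optimizer.step()                       # gradient-based parameter update
```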
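The GAN minimax game follows the same pattern, just with two networks and two gradient-based updates per step. A sketch under the same PyTorch assumption (one-dimensional "real" data and tiny networks, purely for illustration):

```python
import torch
import torch.nn as nn

latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))           # discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, 1) * 0.5 + 3.0   # "real" samples from some unknown distribution

for step in range(200):
    # Discriminator update: push real samples toward label 1, generated toward 0.
    z = torch.randn(64, latent_dim)
    fake = G(z).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator update: try to make the discriminator output 1 on generated data.
    z = torch.randn(64, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```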
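For Word2Vec-style embeddings, the "targets" are context words taken from the corpus itself. A skip-gram-flavored sketch (the vocabulary size, embedding dimension, and the random (center, context) pairs are stand-ins for a real corpus):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50, 16
center_embed = nn.Embedding(vocab_size, embed_dim)   # word vectors being learned
context_proj = nn.Linear(embed_dim, vocab_size)      # scores over possible context words
optimizer = torch.optim.SGD(
    list(center_embed.parameters()) + list(context_proj.parameters()), lr=0.1
)
loss_fn = nn.CrossEntropyLoss()   # negative log-likelihood of the true context word

# Hypothetical (center, context) word-id pairs extracted from a corpus.
centers = torch.randint(0, vocab_size, (512,))
contexts = torch.randint(0, vocab_size, (512,))

for epoch in range(50):
    logits = context_proj(center_embed(centers))   # predict the context from the center word
    loss = loss_fn(logits, contexts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```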
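The clustering case is the most indirect, but the data log-likelihood that a Gaussian Mixture Model maximizes is itself a differentiable objective, so it can also be optimized with plain gradient steps rather than EM's closed-form updates. This toy sketch (one-dimensional data, two components, arbitrary initial values) is not EM itself; it only illustrates that the likelihood is amenable to gradient-based optimization:

```python
import torch

# Direct gradient ascent on the Gaussian-mixture log-likelihood
# (shown as an alternative to EM, to illustrate differentiability).
data = torch.cat([torch.randn(200) - 2.0, torch.randn(200) + 2.0])  # two toy clusters

means = torch.tensor([0.5, -0.5], requires_grad=True)
log_stds = torch.zeros(2, requires_grad=True)
mix_logits = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.Adam([means, log_stds, mix_logits], lr=0.05)

for step in range(300):
    # log p(x) = logsumexp_k [ log pi_k + log N(x | mu_k, sigma_k) ]
    log_pi = torch.log_softmax(mix_logits, dim=0)
    normal = torch.distributions.Normal(means, log_stds.exp())
    log_probs = normal.log_prob(data.unsqueeze(1)) + log_pi   # shape (N, 2)
    nll = -torch.logsumexp(log_probs, dim=1).mean()           # negative log-likelihood
    optimizer.zero_grad()
    nll.backward()
    optimizer.step()
```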
Key takeaway: If an unsupervised learning model can define a differentiable objective function that quantifies how well it's performing its task (e.g., reconstruction error, likelihood of data, etc.), then Gradient Descent is a powerful tool to find the parameters that optimize that function.