Extreme Learning Machines (ELMs) offer a refreshingly simple yet powerful approach to training single-hidden layer feedforward neural networks (SLFNs). Unlike traditional gradient-based methods that iteratively adjust all of the network's weights, ELMs fix the hidden layer at random and solve for the output weights in closed form, leading to significantly faster training and often comparable, sometimes superior, generalization performance.
At its core, an ELM consists of three main components:
Random Hidden Layer: This is where the "extreme" part comes in. The weights and biases connecting the input layer to the hidden layer are randomly initialized and, crucially, never updated during training. This eliminates the need for time-consuming backpropagation to adjust these initial layer parameters.
Non-linear Activation Function: Each hidden neuron applies a non-linear activation function (e.g., sigmoid, ReLU, tanh) to the weighted sum of its inputs, just like in other neural networks.
Output Weight Calculation: This is where the learning happens. The outputs of the hidden layer, after activation, form a new feature space. The ELM then calculates the weights connecting this hidden layer to the output layer using a single-shot least squares solution. This is typically achieved by solving a system of linear equations, often using the Moore-Penrose generalized inverse (pseudoinverse).
The ELM Training Process in a Nutshell:
Randomly assign input weights and biases for the hidden layer.
Calculate the hidden layer output matrix (let's call it H). Each row of H corresponds to the output of the hidden layer for a given input sample.
Calculate the output weights (β) by solving Hβ = T, where T is the target output matrix. The solution is typically β = H⁺T, where H⁺ is the Moore-Penrose pseudoinverse of H (the minimum-norm least-squares solution).
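As a concrete illustration, these three steps fit in a few lines of NumPy. This is a minimal sketch, assuming a sigmoid activation and Gaussian-initialized hidden weights; the names train_elm and predict_elm are illustrative, not from any particular library.

```python
import numpy as np

def train_elm(X, T, n_hidden=100, seed=None):
    """Train a basic ELM on inputs X (n_samples, n_features) and targets T (n_samples, n_outputs)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]

    # Step 1: randomly assign input weights and biases; they are never updated.
    W = rng.standard_normal((n_features, n_hidden))
    b = rng.standard_normal(n_hidden)

    # Step 2: hidden layer output matrix H (one row per input sample).
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))          # sigmoid activation

    # Step 3: output weights via the Moore-Penrose pseudoinverse, beta = H^+ T.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

On a regression task, for example, one would call W, b, beta = train_elm(X_train, y_train) and then predict_elm(X_test, W, b, beta).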
Why are ELMs "Extreme"?
Speed: No iterative tuning of hidden layer weights. The "learning" is a single algebraic step.
Simplicity: Minimal hyperparameter tuning compared to traditional ANNs. The number of hidden neurons is often the primary parameter to choose.
Effectiveness: Despite the random hidden layer, ELMs often generalize well. SLFNs retain their universal approximation capability even with randomly chosen hidden parameters, and the random projection tends to map the data into a feature space where it is closer to linearly separable.
The standard ELM uses a fully random, dense weight matrix for the hidden layer. Generating, storing, and multiplying by this matrix can be computationally intensive, especially for high-dimensional input data. This is where fast transform-based random projections come into play.
The core idea is to replace the dense random matrix multiplication with a series of operations that are computationally much cheaper. This is often achieved by leveraging properties of certain structured random matrices, particularly those based on:
Fast Johnson-Lindenstrauss Transform (FJLT): The FJLT combines a diagonal matrix of random ±1 signs, a fast structured transform such as the Fast Fourier Transform (FFT) or Fast Walsh-Hadamard Transform (FWHT), and a sparse random matrix. Instead of directly multiplying by a dense random matrix, the projection becomes:
y = SFDx
where, reading right to left (in the order the operations are applied to the input):
x is the input vector.
D is a diagonal matrix with random ±1 entries, which randomizes the signs of the input coordinates.
F is a Fast Fourier Transform (FFT) or Fast Walsh-Hadamard Transform (FWHT) matrix, which spreads the input's energy across all coordinates. These transforms can be applied in O(N log N) time, significantly faster than the O(N²) cost of a dense matrix multiplication.
S is a sparse random matrix (e.g., a sub-sampling matrix that keeps only a few randomly chosen coordinates, or a sparse Rademacher matrix with few non-zero entries per row) that performs the actual dimensionality reduction.
This sequence of operations effectively generates a random projection but in a much more efficient manner.
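To make this concrete, here is a minimal NumPy sketch of such a projection, using the sub-sampled variant: random sign flips for D, a Fast Walsh-Hadamard Transform for F, and random coordinate sampling for S. The helper names fwht and fjlt_project are illustrative assumptions, and the input length is assumed to be a power of two.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard Transform in O(N log N); len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def fjlt_project(x, k, seed=None):
    """Project x (length N, a power of two) down to k dimensions via y = S F D x."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    d = rng.choice([-1.0, 1.0], size=n)           # D: random sign flips
    z = fwht(d * x)                               # F: fast transform, O(N log N)
    idx = rng.choice(n, size=k, replace=False)    # S: keep k random coordinates
    return z[idx] / np.sqrt(k)                    # scaling preserves norms in expectation
```

Note that D and S are represented by a sign vector and an index array, so no N x N matrix is ever materialized.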
Circulant or Toeplitz Matrices with Random Entries: Another approach involves using structured matrices like circulant or Toeplitz matrices whose first row/column is filled with random entries. Multiplication by such matrices can be efficiently performed using FFTs due to their connection with convolution operations.
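As a sketch of this idea (circulant_project is an illustrative name, not a library function), multiplying by a random circulant matrix reduces to an FFT-based circular convolution with its randomly drawn first column:

```python
import numpy as np

def circulant_project(x, seed=None):
    """Multiply x by a random circulant matrix C in O(N log N) using the FFT.

    C is defined implicitly by its random first column c; C @ x equals the
    circular convolution of c with x, so the dense matrix is never formed.
    A random +/-1 sign flip is applied first, as is common with circulant
    projections, to decorrelate structured inputs.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    c = rng.standard_normal(n)                    # random first column of C
    d = rng.choice([-1.0, 1.0], size=n)           # random sign flips
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(d * x)).real
```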
Benefits of Fast Transform-Based Random Projections in ELMs:
Computational Efficiency: Significantly reduces the time complexity of the random projection step, especially for high-dimensional data, from O(N²) to O(N log N) (where N is the input dimension).
Reduced Memory Footprint: The random matrices are implicitly defined or sparse, meaning they don't need to be stored explicitly, leading to substantial memory savings.
Scalability: Allows ELMs to be applied to larger datasets and higher-dimensional feature spaces where traditional random projections would be infeasible.
Maintains Performance: Despite their structured nature, these fast transforms still provide effective random projections that approximately preserve pairwise distances (Johnson-Lindenstrauss-type guarantees, closely related to results from compressed sensing) and lead to good generalization performance.
In summary, ELMs offer a fast and straightforward approach to neural network training by leveraging random projections in the hidden layer. By replacing these dense random projections with fast transform-based methods, the efficiency and scalability of ELMs can be dramatically improved, making them even more practical for real-world high-dimensional data problems.
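To tie the two ideas together concretely, the sketch below (reusing the fwht helper from the FJLT example above; fast_elm_hidden is an illustrative name, and the design choices are assumptions rather than a reference implementation) builds an ELM hidden layer from a structured random projection instead of a dense weight matrix. The output weights are then computed exactly as before, via the pseudoinverse of H.

```python
import numpy as np

def fast_elm_hidden(X, seed=None):
    """Hidden-layer outputs for an ELM using an FWHT-based structured projection.

    Assumes the number of columns of X is a power of two (zero-pad otherwise).
    The hidden width here equals the (padded) input dimension; stacking several
    independent blocks of this form gives a wider hidden layer.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    d = rng.choice([-1.0, 1.0], size=n_features)   # random signs (the diagonal D)
    b = rng.standard_normal(n_features)            # random hidden biases
    # fwht: the Fast Walsh-Hadamard Transform helper sketched earlier.
    Z = np.apply_along_axis(fwht, 1, X * d) / np.sqrt(n_features)
    return 1.0 / (1.0 + np.exp(-(Z + b)))          # sigmoid activation

# Output weights as in the basic ELM:
# H = fast_elm_hidden(X, seed=0); beta = np.linalg.pinv(H) @ T
```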