Overview

Authors: T. Bouwmans, N. Vaswani, Y. Chi [Please see Publications for the references]

Principal Component Analysis (PCA)

1. Regular PCA [Without outliers]

The basic PCA problem has been studied for over a century since the first paper by Pearson in 1901. It was then mainly developed by Hotelling in 1933, and the best modern reference is the book by Jolliffe in 2002. Pearson proposed to find the subspace that minimizes the projection error of the observed data, whilst Hotelling instead sought the subspace in which the projected data has maximal variance. These derivations lead to the same idea: reducing the dimensionality of multivariate data whilst preserving as much of the relevant information as possible.

PCA was first widely used for data analysis in psychological testing (psychometrics) and chemometrics, but today it is often the first step in various types of exploratory data analysis, predictive modeling, classification and clustering problems. It finds modern applications in signal processing, biomedical imaging, computer vision, process fault detection, recommendation systems’ design and many more domains. “PCA” refers to the following problem formulation. Given a dataset (a set of data vectors, or, more generally, a set of data “tensors”) and a dimension k, find the k-dimensional subspace that “best” approximates the given dataset. There are various notions of “best”; the traditional one used for classical PCA is minimum mean squared error. Regular PCA, without constraints, and for clean data, is a solved problem (by the Eckart-Young-Mirsky theorem, computing the top k left singular vectors of the data matrix returns the PCA solution).
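
As a minimal illustration of this solved case, the sketch below computes PCA via the SVD, assuming the observations are stored as rows of the data matrix (so the principal directions are the top right singular vectors; with column-wise data they would be the left singular vectors mentioned above). The function name and the random example data are illustrative only.

```python
import numpy as np

def pca_top_k(X, k):
    """Minimal PCA sketch: top-k principal directions and scores of X.

    Rows of X are observations; the data are centered before the SVD.
    """
    Xc = X - X.mean(axis=0)                        # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # top-k principal directions
    scores = Xc @ components.T                     # projections onto the subspace
    return components, scores

# Example: best 2-dimensional approximation of 3-dimensional points
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
components, scores = pca_top_k(X, k=2)
```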

In several applications, the main limitation of PCA is its sensitivity to outliers. Indeed, regular PCA minimizes the sum of squared errors, which is sensitive to outliers because large squared errors dominate the sum. In addition to the algebraic and geometric interpretations, regular PCA can also be interpreted in a probabilistic manner with a Gaussian noise model, which is sensitive to noise of large magnitude. In the literature, other statistical noise models were employed to increase the robustness to outliers. In 2007, Zhao et al. designed a Laplace PCA which is more robust, but Laplace noise assumption based PCA methods cannot deal with dense noise effectively. In 2015, Xie and Xing employed a Cauchy distribution to model noise and derived Cauchy PCA under the Maximum Likelihood Estimation (MLE) framework with a low-rank constraint. Cauchy PCA can better estimate the low-rank matrix regardless of whether noise is large or small, dense or sparse.

Robust PCA (RPCA), which refers to the problem of PCA in the presence of outliers, is both theoretically and practically a much harder problem than regular PCA, and one for which provably correct solutions have started appearing only recently. The same is true for PCA with missing data (and the related low-rank matrix completion problem), as well as for sparse PCA, which refers to the PCA problem when the principal components are assumed to be sparse, and for probabilistic PCA, in which a probabilistic generative model is employed. In fact, even the classical PCA problem with speed or memory constraints is not a solved problem.

These issues have become practically important for modern datasets because of the following reasons:

  1. The data and noise may be correlated in several applications (need for correlated PCA)
  2. The data matrix is often so large that it cannot be directly stored in the computer’s memory (need for streaming PCA solutions)
  3. A lot of today’s data consists of missing entries and/or outlier corrupted entries (need for matrix completion, robust PCA and robust subspace recovery)
  4. A lot of data today arrives sequentially, the data subspace itself may change over time, the entire dataset cannot be stored but short buffers can be, and decisions are often needed in real-time or near real-time (need for subspace tracking and dynamic RPCA)
  5. Many types of data are better represented as a tensor dataset rather than a vector dataset or matrix (need for tensor PCA/RPCA).

2. Correlated PCA [Correlated data]

Challenges with correlation in PCA may occur when data are correlated in time (chemometrics) or when data and noise are correlated. In the first case, when regular PCA is used for statistical process monitoring, it relies on the assumption that data are time independent. However, industrial data often exhibit serial correlation. In 1995, Ku et al. proposed Dynamic PCA (DPCA) as a remedy for high-dimensional and time-dependent data. In DPCA, the input matrix is augmented by adding time-lagged values of the variables. DPCA includes time-shifted versions of the original variables in order to accommodate and tacitly model the dynamic behavior of variables within the same PCA model. For the second case, Vaswani and Guo studied in 2016 PCA when data and noise are correlated, because in several applications a part of the corrupting noise is data-dependent and, as a result, the corrupting noise and the true data are correlated.
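
To make the DPCA augmentation concrete, the following sketch builds the time-lagged input matrix on which regular PCA is then applied; the function name and the stacking order are assumptions for illustration.

```python
import numpy as np

def lagged_matrix(X, n_lags):
    """Build the time-lagged data matrix used by DPCA.

    X has shape (T, d) with rows ordered in time; the result stacks
    [x_t, x_{t-1}, ..., x_{t-n_lags}] row-wise, giving a matrix of
    shape (T - n_lags, d * (n_lags + 1)).
    """
    T = X.shape[0]
    blocks = [X[n_lags - l : T - l] for l in range(n_lags + 1)]
    return np.hstack(blocks)

# Regular PCA applied to lagged_matrix(X, n_lags) then captures serial correlation.
```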

3. Recursive PCA/Moving Window PCA [Streaming data][Chemometrics data]

A limitation of regular PCA applied to monitoring is that the PCA model (once built from the data) is time-invariant, while most real industrial processes are time-varying. The time-varying characteristics of industrial processes include changes in mean and variance. Two adaptive PCA methods were developed for monitoring applications in the literature. In 1999, Qin et al. designed a recursive PCA algorithm, whilst He and Yang proposed in 2008 an adaptive PCA algorithm based on moving window principal component analysis (Moving Window PCA).
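
As a rough sketch of the moving window idea (not the exact recursive updates of Qin et al. or He and Yang), one can simply re-estimate the PCA model on a sliding window of the most recent samples; the window length and function name below are illustrative assumptions.

```python
import numpy as np

def moving_window_pca(X, window, k):
    """Naive Moving Window PCA sketch: re-fit a k-dimensional PCA model
    on the most recent `window` samples at every time step.

    Efficient implementations update the model recursively instead of
    recomputing the SVD; this only illustrates the adaptation principle.
    """
    models = []
    for t in range(window, X.shape[0] + 1):
        W = X[t - window : t]                      # current window of samples
        Wc = W - W.mean(axis=0)
        _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
        models.append(Vt[:k])                      # current principal directions
    return models
```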

4. Streaming PCA/Subspace Tracking [Streaming data][Computer data]

Streaming PCA (also called subspace tracking) addresses the problem of estimating and tracking an unknown subspace from streaming measurements with outliers and/or missing entries. The first method was proposed by Oja in the field of neural networks. In 2010, Balzano et al. proposed the Grassmannian Rank-One Update Subspace Estimation (GROUSE) algorithm. In 2012, Chi et al. designed the Parallel Estimation and Tracking by REcursive Least Squares (PETRELS) algorithm, which identifies the underlying low-dimensional subspace via a recursive procedure for each row of the subspace matrix in parallel, and then reconstructs the missing entries via least-squares estimation if required. GROUSE and PETRELS, as well as Oja's algorithm, are popular algorithms for subspace tracking. The three differ in their update rules: Oja’s method and GROUSE employ a first-order incremental gradient descent on the Euclidean space and the Grassmannian, respectively, whereas PETRELS can be interpreted as a second-order stochastic gradient descent scheme. These algorithms have been shown to be highly effective in practice, but their performance depends on the choice of parameters such as the step size for GROUSE and Oja’s method, and the discount parameter for PETRELS.
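
The sketch below illustrates an Oja-style first-order subspace update for fully observed streaming data; it is not the exact GROUSE or PETRELS update (which also handle missing entries), and the step size and QR re-orthonormalization are illustrative choices.

```python
import numpy as np

def oja_subspace_tracking(stream, k, step=0.01, seed=0):
    """Oja-style subspace tracking sketch for fully observed streaming data.

    `stream` yields d-dimensional samples; U is a d x k orthonormal basis
    estimate updated with a first-order rule and re-orthonormalized by QR.
    """
    rng = np.random.default_rng(seed)
    U = None
    for x in stream:
        x = np.asarray(x, dtype=float)
        if U is None:                              # initialize a random orthonormal basis
            U, _ = np.linalg.qr(rng.normal(size=(x.size, k)))
        w = U.T @ x                                # coefficients in the current basis
        U = U + step * np.outer(x - U @ w, w)      # first-order (gradient-style) update
        U, _ = np.linalg.qr(U)                     # keep the basis orthonormal
    return U
```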

Robust Principal Component Analysis (RPCA) [Outliers][Missing data]

Robust PCA (RPCA) methods can be classified into two main categories: classical methods and L+S decomposition methods.

1. Classical RPCA [Basic outliers][Low dimensions]

RPCA was first addressed in the fields of statistics and neural networks.

1.1 Classical RPCA in Statistics

In the statistics literature, several approaches were developed to be robust in the presence of outliers.

A) Removing and weighting Outliers Techniques

The first approach relies on removing outliers or weighting the data according to their reliability.

B) Robust Estimators Techniques

The second approach, which is the predominant one, consists of replacing the standard estimate of the covariance matrix with a robust estimator of the covariance matrix. This formulation weights the mean and the outer products which form the covariance matrix. Calculating the eigenvalues and eigenvectors of this robust covariance matrix gives eigenvalues and eigenvectors that are robust to sample outliers. Several robust estimators of the covariance matrix were employed in the literature, such as 1) affine-equivariant M-estimators of scatter, which are not robust to many outliers, 2) positive-breakdown estimators such as the Minimum Covariance Determinant (MCD), S-estimators and the Minimum Volume Ellipsoid (MVE) estimator, which are more robust but limited to small to moderate dimensions, and 3) the Orthogonalized Gnanadesikan-Kettenring (OGK) estimator. The result is more robust, but unfortunately limited to relatively low-dimensional data.
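
As one concrete instance of this second approach, the sketch below eigendecomposes a Minimum Covariance Determinant estimate (here via scikit-learn's MinCovDet, an assumed dependency); as noted above, this is only suitable for small to moderate dimensions.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def robust_pca_via_mcd(X, k):
    """Robust-estimator PCA sketch: eigendecompose an MCD covariance estimate."""
    mcd = MinCovDet().fit(X)                       # robust location/scatter estimate
    eigvals, eigvecs = np.linalg.eigh(mcd.covariance_)
    order = np.argsort(eigvals)[::-1]              # sort eigenpairs by decreasing variance
    return eigvals[order][:k], eigvecs[:, order[:k]]
```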

C) Projection Pursuit Techniques

Since principal component directions are also those that provide projections with the largest variability, robust estimators can alternatively be obtained as the directions that maximize a robust estimator of scale of the projected data. This third approach is known as projection pursuit, and the resulting methods are commonly referred to as PP-PCA. The idea is to first find the top robust principal component, then project it out and compute the residual for each data point. The second robust PC is obtained by applying the same idea to the residuals, and this process is repeated.
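
A crude sketch of this deflation scheme is given below, using the median absolute deviation as the robust scale and a random search over candidate directions; actual PP-PCA algorithms search the directions far more carefully, so the candidate-sampling scheme here is purely illustrative.

```python
import numpy as np

def pp_pca(X, k, n_candidates=500, seed=0):
    """Projection pursuit PCA sketch: among random candidate directions, pick
    the one maximizing a robust scale (MAD) of the projected data, project it
    out, and repeat."""
    rng = np.random.default_rng(seed)
    R = X - np.median(X, axis=0)                   # robustly centered data / residuals
    directions = []
    for _ in range(k):
        cand = rng.normal(size=(n_candidates, R.shape[1]))
        cand /= np.linalg.norm(cand, axis=1, keepdims=True)
        proj = R @ cand.T                          # projections onto each candidate
        mad = np.median(np.abs(proj - np.median(proj, axis=0)), axis=0)
        v = cand[np.argmax(mad)]                   # direction with the largest robust scale
        directions.append(v)
        R = R - np.outer(R @ v, v)                 # deflate: project the direction out
    return np.array(directions)
```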

D) Mixed Techniques

In 2005, Hubert et al. proposed an approach called ROBPCA which combines the advantages of the robust estimation techniques and the projection pursuit techniques.

E) Robust Error Estimation

In 2001, De la Torre and Black used a bounded loss function applied to column-wise standardized residuals. In 2008, Maronna and Yohai employed a similar loss function, but modified in such a way that the method reduces to classical PCA when one uses a squared loss function. These methods remove possible outliers followed by estimation of the underlying subspace by PCA. These methods are highly non-convex. Nevertheless, Xu et al. provided in 2010 a probabilistic analysis for their near recovery of the underlying subspace.

F) Regression-type Optimization Problem

This approach seeks the subspace that minimizes an objective function based on the squared orthogonal distances of the observations to this subspace. In 2001, Hawkins et al. considered a robust fit in alternating regression using Least Absolute Deviation (LAD) regression. In 2003, Liu et al. used the Least Trimmed Squares (LTS) method in alternating fitting. Croux et al. implemented the weighted LAD estimator in the factor analysis model. In 2008, Chen et al. compared the above robust estimators with respect to computational issues. They concluded that the LTS estimator is computationally expensive, even though it is robust against high-leverage outliers.

1.2 Classical RPCA in Neural Networks

In the neural network literature, Xu and Yuille in 1995 first addressed the problem of robust PCA by designing a neural network that relied on self-organizing rules based on statistical physics. The PCA energy function proposed by Oja was generalized by adding a binary decision field with a given prior distribution in order to take the outliers into account. The binary variables are zero when a data sample is considered an outlier. In 1999, Yang and Wang extended Xu and Yuille's objective function by using fuzzy membership and derived improved algorithms that can extract the appropriate principal components from a spoiled data set. The difficulty of selecting an appropriate hard threshold in Xu and Yuille's approach is alleviated by replacing the threshold with an automatically selected soft threshold in Fuzzy RPCA (FRPCA). In 2010, Lv et al. employed fuzzy learning algorithms for a RPCA neural network. An adaptive learning procedure overcomes the difficulty of selecting learning parameters in Xu and Yuille’s algorithms. Furthermore, the robustness of the proposed algorithms is investigated by using the theory of influence functions.

1.3 Classical RPCA with Fuzzy Concepts

Classical RPCA can be divided roughly into two groups: subset-based methods and fuzzy methods. Developed in statistics, subset-based methods utilize one or more subsets of data to robustly estimate principal components (PCs). However, subset-based methods tend to suffer from the small sample size problem and from instability in calculating PCs. To find robust PCs, RPCA with fuzzy concepts adopts fuzzy memberships and reduces the effect of outliers by assigning small membership values to outliers. Although fuzzy methods are efficient in reducing noise sensitivity, they bring about another problem, that of deciding the membership values.

2. RPCA via L+S Decomposition [Element-wise Outliers][Column/row-wise outliers][Missing data][High dimensions]

In 2009, Candès et al. proposed a decomposition into low-rank and sparse matrices for Robust Principal Component Analysis, which works well with respect to grossly corrupted observations distributed in an element-wise manner. This L+S decomposition is also called Principal Component Pursuit (PCP). To separate outliers from noise, Zhou et al. designed in 2010 a stable PCP which gives rise to a decomposition into three terms, L+S+E. To adapt the decomposition when data arrive sequentially, Guo and Vaswani proposed in 2010 a real-time PCP method called ReProCS, which was progressively improved over the years by addressing both its performance guarantees (PracReProCS) and its memory efficiency (MEDRoP). In 2015, Rodriguez and Wohlberg designed an incremental PCP (incPCP) algorithm. Several other dynamic RPCA methods (also called Robust Subspace Tracking methods) were also developed in the literature.
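
A minimal sketch of PCP is given below, solved with a simple augmented-Lagrangian scheme that alternates singular value thresholding for L and entry-wise soft thresholding for S; the default values of lambda and mu follow common choices and are not tuned, and production solvers (inexact ALM, and the dynamic methods above) are considerably more refined.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entry-wise soft thresholding operator."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def pcp_alm(M, lam=None, mu=None, n_iter=200):
    """Principal Component Pursuit sketch: min ||L||_* + lam * ||S||_1  s.t.  L + S = M."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / np.sum(np.abs(M))
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                           # dual variable
    for _ in range(n_iter):
        # L-update: singular value thresholding of M - S + Y/mu
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # S-update: entry-wise soft thresholding
        S = soft_threshold(M - L + Y / mu, lam / mu)
        # dual ascent on the constraint L + S = M
        Y = Y + mu * (M - L - S)
    return L, S
```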

Numerous works give rise to different variants of RPCA that are robust against different types of outliers (other than element-wise outliers) or missing data. In the case of column- or row-wise outliers, these methods are called Robust Subspace Recovery. In the case of missing data, these methods are called Robust Matrix Completion. When the low-rank matrix L is constrained to be non-negative, these methods are called Robust Non-negative Matrix Factorization. When there is only a strict low-rank constraint on L and no specific sparsity constraint on S, these methods are called Robust Low-Rank Approximation.

In addition, regular PCA and RPCA via decomposition into matrices can also be unified into a general decomposition framework as developed in (Bouwmans et al. 2017).

Stochastic Principal Component Analysis

In 2012, Arora et al. studied PCA in a stochastic optimization framework and reviewed and extended common approaches for stochastic PCA. They formulated the PCA objective as a stochastic optimization problem of seeking the k-dimensional subspace. Various stochastic approximation algorithms were studied and categorized into three different types: stochastic gradient descent (SGD), incremental and online algorithms. These algorithms are computationally efficient and have good empirical performance. In 2014, Goes et al. designed a robust algorithm in the presence of a high percentage of outliers.

Sparse Principal Component Analysis [High dimensions]

A particular disadvantage of PCA is that the principal components are usually linear combinations of all input variables. Thus, it encounters fundamental challenges under high dimensionality and may produce misleading results. Sparse PCA overcomes this disadvantage by finding linear combinations that contain just a few input variables. In 2006, Zou et al. designed a sparse PCA method using the Lasso (elastic net) to produce modified principal components with sparse loadings.
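
For illustration, scikit-learn's SparsePCA (an assumed dependency, whose formulation differs in detail from Zou et al.'s elastic-net SPCA) can be used to obtain sparse loadings; the data and parameter values below are arbitrary.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                     # placeholder data

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
scores = spca.fit_transform(X)
loadings = spca.components_                        # each row involves only a few input variables
```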

Probabilistic Principal Component Analysis

In regular PCA, the linear algebraic solution has the geometric interpretation of minimizing the sum of the squared distances from the noisy data points. In addition to these algebraic and geometric interpretations, PCA can also be understood in a probabilistic manner, which addresses a notable weakness of regular PCA: the absence of an associated probabilistic model for the observed data. In 1999, Tipping and Bishop addressed this limitation by demonstrating that PCA may indeed be derived within a density-estimation framework. A probabilistic formulation of PCA (PPCA) from a Gaussian latent variable model is then provided. In PPCA, the noise is assumed to be drawn from an unknown distribution and the problem becomes one of identifying the subspace and distribution parameters in a maximum-likelihood sense. When the noise distribution is Gaussian, the algebro-geometric and probabilistic interpretations coincide. However, when the noise distribution is non-Gaussian, the solution to PPCA is no longer linear, as shown by Collins et al. in 2001, who generalized PCA to arbitrary distributions in the exponential family.
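
In the Gaussian case, the maximum-likelihood PPCA solution is available in closed form from the eigendecomposition of the sample covariance; the sketch below follows the Tipping-Bishop solution W = U_k (Lambda_k - sigma^2 I)^{1/2} (up to an arbitrary rotation), with the function name chosen for illustration.

```python
import numpy as np

def ppca_ml(X, k):
    """Closed-form maximum-likelihood PPCA sketch (Gaussian noise case).

    Returns the loading matrix W and the noise variance sigma2.
    """
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)                   # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[k:].mean()                    # average of the discarded eigenvalues
    W = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k] - sigma2, 0.0))
    return W, sigma2
```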

Distributed Principal Component Analysis

In distributed data acquisition systems, distributed PCA algorithms can harness local communications and network connectivity to overcome the need to communicate and access the entire data array locally. A key feature of distributed PCA algorithms is that they defy the conventional notion that the first step towards computing the principal vectors is to form a sample covariance matrix.

Kernel Principal Component Analysis

Regular PCA only allows linear dimensionality reduction. However, if the data has a more complicated structure which cannot be well represented in a linear subspace, regular PCA is not suitable. In 1997, Scholkopf et al. presented Kernel PCA (KPCA), which is the nonlinear form of PCA. Thus, kernel PCA better exploits the complicated spatial structure of high-dimensional features. In KPCA, data in the input space are mapped to a higher dimensional feature space where the data can be linearly modeled. The feature space is typically induced implicitly by a kernel function, and linear PCA in the feature space is performed via the kernel trick. However, due to the implicitness of the feature space, some extensions of PCA, such as robust PCA, cannot be directly generalized to KPCA. In 2008, Nguyen and De La Torre presented a technique to overcome this problem and extended it to a unified framework to deal with noise, missing data, and outliers in KPCA.
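
As a small illustration, scikit-learn's KernelPCA (an assumed dependency) with an RBF kernel recovers a useful low-dimensional representation of data lying near a circle, a structure regular PCA cannot capture linearly; the data and kernel parameter below are arbitrary.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=300)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.05 * rng.normal(size=(300, 2))

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
Z = kpca.fit_transform(X)                          # coordinates in the kernel principal subspace
```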

Local Principal Component Analysis

In regular PCA, a small change or error in the input data influences the whole eigen-representation because regular PCA focuses on the global structure, as do all linear integral transforms. To address this problem, Charlton et al. designed in 2010 a Locally Weighted PCA (LWPCA) for an application in geography. In 2004, Gottumukkal and Asari proposed for face recognition a modular PCA (ModPCA) which divides a pattern into sub-patterns and extracts local Principal Components (PCs) from the sub-patterns. It aims to discriminate patterns better, compared to regular PCA which is global, by exploiting local variations that are confined to sub-patterns or sub-images. In 2014, Kadappa and Negi proposed to combine the global PCA and ModPCA in a method called Global Modular PCA (GModPCA).
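
The sketch below illustrates the ModPCA idea of extracting local PCs from sub-patterns, assuming image data split on a regular grid; the grid size, function name, and per-block component count are illustrative choices.

```python
import numpy as np

def modular_pca(images, grid=(2, 2), k=5):
    """Modular PCA sketch: split each image into a grid of sub-patterns and
    extract k local principal components per sub-pattern.

    `images` has shape (N, H, W); returns one component matrix per grid cell.
    """
    N, H, W = images.shape
    gh, gw = grid
    h, w = H // gh, W // gw
    local_components = []
    for i in range(gh):
        for j in range(gw):
            block = images[:, i * h:(i + 1) * h, j * w:(j + 1) * w].reshape(N, -1)
            block = block - block.mean(axis=0)     # center each sub-pattern
            _, _, Vt = np.linalg.svd(block, full_matrices=False)
            local_components.append(Vt[:k])        # local PCs for this sub-pattern
    return local_components
```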