A theory of structured PCA

The assumption of unstructured “noise,” i.e., that the complement of what is considered an interesting “signal” in a given dataset is pure randomness, has pervaded analytical studies of algorithms for processing big data sets. However, this hypothesis is often too simplistic to capture realistic scenarios. We thus need to understand the role of noise structure, namely the presence of correlations and dependencies within the noise, when extracting a signal from corrupted data. We address this problem in a prototypical model in which a simple matrix of numbers representing the sought information/signal is hidden inside a complex matrix whose entries exhibit non-trivial statistical dependencies, representing the unwanted noise. Extracting such a signal is the task of principal component analysis (PCA), probably the most frequently employed signal processing and data cleaning technique across all fields of science for making sense of noisy high-dimensional data. Despite its apparent simplicity, our model of PCA is theoretically challenging to analyse and captures features observed in real scenarios. We also propose a new algorithm able to saturate the performance limits predicted by our theory by optimally capturing the statistical correlations present in the noise. The resulting picture shows that algorithms should exploit the structure of both signal and noise jointly in order to better clean the data at hand.
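To make the setup concrete, here is a minimal, purely illustrative Python sketch of a "signal matrix hidden in structured noise" problem. It is not the model or algorithm from our paper: the rank-one spike, the way correlations are injected into the noise (a polynomial of a Wigner matrix), and the signal-to-noise value are all assumptions chosen for illustration, and the estimator shown is plain PCA (top eigenvector), not our proposed algorithm.

```python
# Illustrative sketch only: a rank-one "signal" matrix is buried in noise whose
# entries are statistically dependent, and vanilla PCA (the top eigenvector of
# the observed matrix) is used to try to recover the hidden direction.
import numpy as np

rng = np.random.default_rng(0)
n = 2000      # dimension (illustrative)
snr = 16.0    # signal-to-noise ratio (illustrative; large enough for PCA to detect the spike)

# Rank-one signal: outer product of a random +/-1 vector with itself.
x = rng.choice([-1.0, 1.0], size=n)
signal = np.outer(x, x) / n

# "Structured" noise: a symmetric matrix with correlated entries, here built as
# a polynomial in a Wigner matrix (one simple way to induce dependencies).
W = rng.normal(size=(n, n)) / np.sqrt(n)
W = (W + W.T) / np.sqrt(2)
noise = W + 0.5 * (W @ W - (np.trace(W @ W) / n) * np.eye(n))

# Observed data: spiked matrix = signal + structured noise.
Y = np.sqrt(snr) * signal + noise

# Vanilla PCA: eigenvector of Y associated with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Y)
v = eigvecs[:, -1]

# Overlap with the hidden spike (1 = perfect recovery, ~0 = failure).
overlap = abs(v @ x) / np.sqrt(n)
print(f"overlap between PCA estimate and hidden signal: {overlap:.3f}")
```

In this toy setting one can vary the noise construction and the signal strength to see how correlations in the noise change what plain PCA can recover, which is the kind of question our theory addresses quantitatively.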

Check out our interviews on "Fundamental limits in structured principal component analysis and how to reach them":

"Exploiting the Structure of Noise in Big Data", by Charlotte Philips

See also "Performance limits of Principal Components Analysis for large structured data sets" on Kudos!