1. Presence of Outliers or Missing Data [Robustness]
1.1 Notion of inliers and outliers
Inliers are measurements that represent the true data samples and usually constitute the majority of the data. Outliers, on the other hand, are measurements with arbitrarily large values that do not characterize the true data samples. An outlier can be viewed as an observation that deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism. Outliers are also defined as unusual observations lying outside the majority of the data set and behaving differently from the rest of the observations. Outliers arise from perturbations in the data such as the following:
Noise in signal, image and video data.
Error in signal, image and video data.
Intrusion in signal data (computer science).
Defect in industrial process data.
Data are thus more or less corrupted by noise, outliers and missing entries. They are considered grossly corrupted if the percentage of corruptions exceeds 30%.
1.2 Classification of outliers
Outliers can be classified by considering their location in the observation matrix (or tensor) (Brahma et al., 2017):
e-type outliers: Element-wise outliers/noise, also named cell-wise outliers/noise in (Rousseeuw et al., 2018), (Rousseeuw, 2023), (Raymaekers et al., 2024) and (Centofanti et al., 2024).
r-type outliers: Row-wise or column-wise outliers/noise.
Missing data (Matrix completion).
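As a rough illustration of these three categories, the following minimal sketch (Python/NumPy; the matrix sizes, corruption rates and magnitudes are arbitrary choices) injects cell-wise outliers, row-wise outliers and missing entries into a clean low-rank data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 200, 50, 5
U = rng.standard_normal((n, r))
V = rng.standard_normal((d, r))
X = U @ V.T                                    # clean rank-5 data matrix

# e-type (element-wise / cell-wise) outliers: corrupt a random fraction of individual cells
X_cell = X.copy()
mask_cells = rng.random(X.shape) < 0.05        # 5% of the cells
X_cell[mask_cells] += rng.normal(0, 50, mask_cells.sum())   # arbitrarily large values

# r-type (row-wise) outliers: replace whole observations
X_row = X.copy()
rows = rng.choice(n, size=int(0.05 * n), replace=False)
X_row[rows] = rng.normal(0, 50, (rows.size, d))

# Missing data: set a random fraction of the cells to NaN (matrix completion setting)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan
```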
Outliers can also be classified by the subspace in which they are most prominent, as developed in (Brahma et al., 2017):
Outliers in the original observation space (OS), which are the most commonly addressed, are relatively visible due to their outlying values.
Outliers in the orthogonal complement (OC) subspace, which is the space orthogonal to the primary principal component subspace. OC outliers are observations that have arbitrarily large magnitude when projected onto the OC subspace.
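A minimal sketch of this second classification, assuming the primary principal subspace is estimated by a truncated SVD (NumPy only; the rank k and the function name are illustrative):

```python
import numpy as np

def os_oc_scores(X, k):
    """Split each observation into its energy within the primary principal
    subspace (OS view) and within its orthogonal complement (OC view)."""
    Xc = X - X.mean(axis=0)
    # primary principal subspace spanned by the top-k right singular vectors
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T                      # (d, k) basis of the principal subspace
    P = Vk @ Vk.T                      # projector onto the principal subspace
    proj_os = Xc @ P                   # component inside the principal subspace
    proj_oc = Xc - proj_os             # component in the orthogonal complement
    return np.linalg.norm(proj_os, axis=1), np.linalg.norm(proj_oc, axis=1)

# Observations with an unusually large OC score are candidate OC outliers:
# they can look ordinary in the principal subspace yet have large magnitude
# when projected onto the orthogonal complement.
```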
2. High dimensional data [Sparsity]
3. Large-scale Dataset [Distributed] [Scalability]
Regular PCA is a batch algorithm that requires all the data at once. It therefore needs a large amount of memory for large-scale modern datasets, which makes it impractical: PCA does not scale beyond small-to-medium sized datasets. One way to address this problem is to employ distributed algorithms, which harness local communication and network connectivity to avoid communicating and accessing the entire data array locally. Another way consists in using scalable algorithms. For example, Hauberg et al. introduced in 2014 the Grassmann Average (GA), which expresses dimensionality reduction as an average of the subspaces spanned by the data. Because averages can be computed efficiently, scalability is immediately obtained.
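A minimal sketch of the idea, assuming the sign-flipping averaging iteration for the leading component and the magnitude-weighted variant of the Grassmann Average (simplified; the initialization, stopping rule and function name are illustrative, not the authors' reference implementation):

```python
import numpy as np

def grassmann_average(X, n_iter=100, tol=1e-8, seed=0):
    """Leading component as an average of the 1-D subspaces spanned by the rows of X
    (simplified sketch of the Grassmann Average of Hauberg et al., 2014)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=1)
    norms[norms == 0] = 1.0
    U = Xc / norms[:, None]                  # unit vectors spanning the 1-D subspaces
    q = rng.standard_normal(X.shape[1])
    q /= np.linalg.norm(q)
    for _ in range(n_iter):
        signs = np.sign(U @ q)               # align each subspace representative with q
        signs[signs == 0] = 1.0
        q_new = (signs * norms) @ U          # weighted average of the aligned vectors
        q_new /= np.linalg.norm(q_new)
        if np.abs(q_new @ q) > 1 - tol:      # converged up to sign
            return q_new
        q = q_new
    return q
```

Each iteration only takes one pass over the observations, so the cost grows linearly with the number of samples, which is where the scalability comes from.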
4. Euclidean Structure/Geometry Structure
PCA and LDA effectively discover only the Euclidean structure. They fail to discover the underlying structure if the data lie on a manifold. Besides, a manifold of data usually exhibits significant non-linear structure, while PCA and LDA are both linear dimension reduction methods. To discover the intrinsic geometric structure of a data set, many non-linear manifold learning methods have been proposed, such as ISOMAP, LLE and Laplacian Eigenmaps (Pang et al., 2017).
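For illustration, a short sketch comparing linear PCA with these non-linear manifold learners on data lying on a 2-D manifold, assuming scikit-learn is available (the swiss-roll data set and neighborhood sizes are arbitrary choices):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding

# Data lying on a 2-D manifold embedded non-linearly in 3-D
X, t = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Linear method: captures only the Euclidean (linear) structure
Y_pca = PCA(n_components=2).fit_transform(X)

# Non-linear manifold learning methods: aim at the intrinsic geometry
Y_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
Y_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
Y_le = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)  # Laplacian Eigenmaps
```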
5. Out-of-sample Problem
The out-of-sample problem occurs when a subspace learning method cannot handle new data points that are not included in the training set. This is the case for ISOMAP, LLE and Laplacian Eigenmaps (Pang et al., 2017). Besides, their non-linear nature makes them computationally expensive. Thus, they might not be suitable for many real-world tasks.
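A small sketch of the contrast, using plain PCA in NumPy as a stand-in for any linear subspace method (the data and the subspace dimension are arbitrary): the learned projection matrix embeds an unseen point with a single matrix product, whereas a method that only returns embedding coordinates for its training points has no mapping to apply to new data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((500, 20))
x_new = rng.standard_normal(20)           # point not seen during training

# Learn a linear subspace (top-k principal directions) from the training set
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
W = Vt[:3].T                              # explicit (20, 3) projection matrix

# Out-of-sample embedding is just a projection with the learned matrix
y_new = (x_new - mean) @ W

# Methods such as Laplacian Eigenmaps return only the embedding coordinates of
# the training points; without an explicit mapping, a new point would require
# re-running the algorithm or an additional out-of-sample extension.
```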
6. Small-sample-size Problem
The small-sample-size problem occurs when the dimension of the samples is much larger than the number of samples. The generalized eigenvalue problem may then be unsolvable because the scatter matrices become singular. In 2018, Ran et al. designed an effective method, called exponential neighborhood preserving embedding (ENPE), to deal with the small-sample-size problem.
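A quick numerical illustration (NumPy; the sizes are arbitrary) of why the generalized eigenvalue problem breaks down: with far fewer samples than dimensions, the scatter matrices are rank-deficient and therefore singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 100                            # many fewer samples than dimensions
X = rng.standard_normal((n, d))

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc                             # d x d scatter matrix

print(np.linalg.matrix_rank(S))           # at most n - 1 = 9, far below d = 100
print(np.linalg.matrix_rank(S) < d)       # True: S is singular, so a generalized
                                          # eigenvalue problem involving its inverse
                                          # cannot be solved directly
```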
References
P. Brahma, Y. She, S. Li, D. Wu, "Reinforced Robust Principal Component Pursuit", IEEE Transactions on Neural Networks and Learning Systems, 2017.
T. Pang, F. Nie, J. Han, "Flexible Orthogonal Neighborhood Preserving Embedding", International Joint Conference on Artificial Intelligence, IJCAI 2017, 2017.
P. Rousseeuw, W. Van den Bossche, "Detecting Deviating Data Cells", Technometrics, Volume 60, 135-145, 2018.
P. Rousseeuw, "Analyzing Cellwise Weighted Data", Econometrics and Statistics, 2023.
J. Raymaekers, P. Rousseeuw, "Challenges of Cellwise Outliers", Econometrics and Statistics, 2024.
F. Centofanti, M. Hubert, P. Rousseeuw, "Robust Principal Components by Casewise and Cellwise Weighting", Preprint, 2024.