Principal component analysis (PCA) reduces the dimensionality of a dataset. It does so by transforming the input features into a new set of features called principal components. The reduction comes from compressing the input features that account for only a small share of the total variation in the dataset. The principal components are ordered so that each successive component explains less of the total variation than the one before it.
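The ordering of the components can be seen in a minimal sketch. This is not the implementation used for the analysis: it assumes NumPy, uses synthetic data, and the function name `pca_explained_variance` is hypothetical. It computes PCA through a singular value decomposition of the centred data and returns the share of variation explained by each component.

```python
import numpy as np

def pca_explained_variance(X):
    """Return the fraction of total variance explained by each principal component."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    # np.linalg.svd returns singular values sorted in descending order
    _, s, _ = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)          # variance along each component
    return variances / variances.sum()

# Synthetic data (an assumption for illustration): 100 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

ratios = pca_explained_variance(X)
print(ratios)  # values are non-increasing and sum to 1
```

Dropping the trailing components, which carry the smallest ratios, is what gives the dimension reduction.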
PCA will allow comparison of the four datasets and their relative complexity.
Justification
PCA is based on the decomposition of the data into two matrices, the scores and the loadings. The loadings reflect the contribution of each of the original features to the principal components. The standard deviation of the loadings for a given principal component reflects how much the contributions of individual features to that component fluctuate. Thus, the standard deviation of the loadings within a dataset can be used as an indicator of the complexity of the dataset. The values of standard deviation are proportional to dataset complexity. That is to say, a simple dataset will have lower values of standard deviation, particularly in lower principal components, while a complex dataset will have higher values of standard deviation, particularly in lower principal components.
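The measure described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact workflow used here: it assumes NumPy, standardises each feature before the decomposition, and defines the loadings as the eigenvectors scaled by the square root of their eigenvalues (other conventions use the unscaled eigenvectors). The function name `loading_std` is hypothetical.

```python
import numpy as np

def loading_std(X):
    """Standard deviation of the loadings, one value per principal component."""
    Xc = X - X.mean(axis=0)
    # Assumption: features are standardised; a constant feature would divide by zero
    Xs = Xc / Xc.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(Xs, full_matrices=False)
    eigvals = s**2 / (X.shape[0] - 1)
    # Loadings matrix: rows are original features, columns are components
    loadings = vt.T * np.sqrt(eigvals)
    # Spread of the feature contributions within each component
    return loadings.std(axis=0)

# Synthetic data (an assumption for illustration): 50 samples, 8 features
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))

stds = loading_std(X)
print(stds)  # one standard deviation per principal component
```

Plotting `stds` against the component index for each dataset gives the kind of profile that is compared across datasets.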
Many platforms offer a multitude of options for PCA. I will be implementing my PCA in Python, based on the workflow presented in Towards Data Science.
The datasets
Two of the datasets are expression datasets from different strains of P. infestans. The other two are from the output of DNAShapeR: one from the prediction of first order shapes and one from the prediction of second order shapes.
Dimension analysis
With regard to dimensions, the expression datasets are composed of fewer features: they have 20 features, whereas the DNAShapeR based datasets have 386 features.
Standard deviation of loadings from PCA
Snapshots of plots of the standard deviation of the loadings are shown below.
The variation of the loadings of the principal components is substantially greater in the expression datasets. For the first two components, the scale of variation of the 88069 expression dataset is approximately 3 times that of the DNAShapeR datasets. The scale of variation of the 1306 dataset is approximately 50 times that of the DNAShapeR datasets.
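One way such a scale comparison could be computed is sketched below. The helper `scale_ratio` is hypothetical, and the input values are arbitrary placeholders for illustration only, not the measured standard deviations from these datasets.

```python
def scale_ratio(std_a, std_b, n_components=2):
    """Mean ratio of loading standard deviations over the leading components."""
    pairs = list(zip(std_a, std_b))[:n_components]
    return sum(a / b for a, b in pairs) / len(pairs)

# Placeholder values, NOT results from the analysis
dataset_a_std = [0.6, 0.4, 0.1]
dataset_b_std = [0.2, 0.1, 0.05]

ratio = scale_ratio(dataset_a_std, dataset_b_std)
print(ratio)
```

Averaging over the first two components mirrors the comparison in the text; a different number of components can be passed through `n_components`.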
As discussed in the aims section, the values of standard deviation of loadings from PCA are proportional to dataset complexity. Thus, analysis of PCA loadings across the four datasets indicates that the expression datasets are substantially more complex than the DNAShapeR datasets despite having considerably fewer features.