Principal Component Analysis (PCA)

Background

Principal component analysis (PCA) reduces the dimensionality of a dataset. It does this by transforming the input features into a new set of features called principal components. The reduction comes from dropping the components that account for only a small share of the total variation in the dataset. The principal components are ordered so that each successive component explains less of the total variation than the one before it.
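The ordering of the components can be illustrated with a minimal sketch. It assumes NumPy is available and uses randomly generated toy data, not any of the datasets analysed on this page. The names `X`, `explained_ratio` and `k` are illustrative only.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 samples, 5 features,
# with most of the variation concentrated in the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Centre each feature, then decompose the centred data with SVD.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each component, as a fraction of the total.
explained_variance = S**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Keep only the first k components to reduce the dimensions.
k = 2
scores = X_centered @ Vt[:k].T

print(explained_ratio)  # values decrease from the first component to the last
print(scores.shape)     # (200, 2)
```

Because the singular values come back sorted from largest to smallest, the explained fractions are already in decreasing order, and keeping the first `k` rows of `Vt` keeps the components that carry the most variation.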

Learn more about PCA theory

Aim of PCA analysis

PCA will allow comparison of the four datasets and their relative complexity.

Justification

PCA is based on the decomposition of the data into two matrices. One of those matrices, referred to as the loadings, reflects the contribution of each original feature to each principal component. The standard deviation of the loadings for a given principal component reflects how much the contribution of individual features to the variation of the dataset fluctuates. Thus, the standard deviation of the loadings within a dataset can be used as an indicator of the complexity of the dataset, and is taken here to be proportional to it. That is to say, a simple dataset will have lower standard deviations, particularly in the lower principal components, and a complex dataset will have higher standard deviations, particularly in the lower principal components.
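The quantity described above can be sketched as follows. This is not the code used for the results on this page. It assumes NumPy, assumes the data are centred but not rescaled, and assumes that loadings are defined as the eigenvectors of the covariance matrix scaled by the square roots of their eigenvalues, which is one common convention. The function names `pca_loadings` and `loading_std_per_component` are hypothetical.

```python
import numpy as np

def pca_loadings(X):
    """Return a (n_features, n_components) loading matrix for X.

    Assumption: loadings = eigenvectors * sqrt(eigenvalues) of the
    covariance matrix, computed here through an SVD of the centred data.
    """
    X_centered = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    eigenvalues = S**2 / (X.shape[0] - 1)
    # Broadcasting scales each column (component) by its own sqrt(eigenvalue).
    return Vt.T * np.sqrt(eigenvalues)

def loading_std_per_component(X):
    """Standard deviation of the loadings across features, one value per component."""
    return pca_loadings(X).std(axis=0, ddof=1)

# Toy data (an assumption for illustration): 50 samples, 4 features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))

loadings_demo = pca_loadings(X_demo)
std_demo = loading_std_per_component(X_demo)
print(std_demo)
```

With this convention the loading matrix multiplied by its own transpose reproduces the covariance matrix of the data, which is a quick check that the decomposition was done correctly.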

Implementation

Many platforms offer a multitude of options for PCA. I will implement my PCA in Python, based on the workflow presented in Towards Data Science.

Import packages from abpy environment on biocluster

#Packages were already installed as part of previous projects

Setup inputs and lists

Generate two outputs for each file

##csv of summary statistics of principal component loadings

##plot of standard deviation of loadings
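The CSV output step could be sketched as below, using only the Python standard library. The choice of statistics (mean, standard deviation, minimum, maximum), the column names and the function names are assumptions made for illustration. The example loading values are hypothetical and the summary is written to an in-memory buffer, not to a real file.

```python
import csv
import io
import statistics

FIELDNAMES = ["component", "mean", "std", "min", "max"]

def summarise_loadings(loadings):
    """Summarise a loading matrix given as a list of rows (one row per feature).

    Each row holds one loading per principal component, so the columns
    are transposed with zip() to get the values for each component.
    """
    rows = []
    for i, column in enumerate(zip(*loadings), start=1):
        rows.append({
            "component": f"PC{i}",
            "mean": statistics.mean(column),
            "std": statistics.stdev(column),
            "min": min(column),
            "max": max(column),
        })
    return rows

def write_summary_csv(rows, handle):
    """Write the per-component summary rows to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical loading matrix: 3 features, 2 components.
example_loadings = [[0.9, 0.1], [0.8, -0.2], [0.1, 0.7]]
summary = summarise_loadings(example_loadings)

buffer = io.StringIO()
write_summary_csv(summary, buffer)
print(buffer.getvalue())
```

To write one summary per input file, the buffer could be swapped for a file opened with `open(path, "w", newline="")`, where `path` is whatever output name is chosen for that dataset.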

Results

The datasets

Four datasets are compared. Two are expression datasets from different strains of P. infestans. The other two are outputs of DNAShapeR: one from the prediction of first-order shapes and one from the prediction of second-order shapes.

Dimension analysis

With regard to dimensions, the expression datasets are composed of fewer features: they have 20 features, whereas the DNAShapeR-based datasets have 386 features.

Standard deviation of loadings from PCA

Snapshots of plots of the standard deviation of the loadings are shown below.

The variation of the loadings of the principal components is substantially greater in the expression datasets. For the first two components, the scale of variation of the 88069 expression dataset is approximately 3 times that of the DNAShapeR datasets. The scale of variation of the 1306 expression dataset is approximately 50 times that of the DNAShapeR datasets.

As discussed in the Justification section, the standard deviation of the loadings from PCA is taken to be proportional to dataset complexity. Thus, analysis of PCA loadings across the four datasets indicates that the expression datasets are substantially more complex than the DNAShapeR datasets, despite having considerably fewer features.

1306 expression features

88069 expression features

1Shape features

2Shape features
