## Introduction
What happens when a data set has too many variables? Here are a few situations you might come across:

- You find that most of the variables are correlated.
- You lose patience and decide to run a model on the whole data set, which returns poor accuracy and leaves you frustrated.
- You become indecisive about what to do.
- You start thinking of a strategic method to find a few important variables.
Trust me, dealing with such situations isn’t as difficult as it sounds. Statistical techniques such as factor analysis and principal component analysis help overcome them. In this post, I’ve explained the concept of principal component analysis in detail, keeping the explanation simple and informative. For practical understanding, I’ve also demonstrated the technique in R with interpretations.
## What is Principal Component Analysis?

In simple words, principal component analysis (PCA) is a method of extracting important variables (in the form of components) from a large set of variables available in a data set. It extracts a low-dimensional set of features from a high-dimensional data set with the aim of capturing as much information as possible. With fewer variables, visualization also becomes much more meaningful. PCA is most useful when dealing with data of 3 or more dimensions.

It is always performed on a symmetric correlation or covariance matrix. This means the data should be numeric and standardized.

Let’s understand it using an example. Say we have a data set of 300 observations with a large number p of predictors. Examining every variable individually would be tedious; a more lucid approach is to select a small set of derived features that captures most of the information. The image below shows the transformation of high-dimensional data (3 dimensions) to low-dimensional data (2 dimensions) using PCA. Not to forget, each resultant dimension is a linear combination of the original p features.

Source: nlpca
## What are principal components?

A principal component is a normalized linear combination of the original predictors in a data set. The first principal component can be written as:

Z¹ = Φ¹¹X¹ + Φ²¹X² + Φ³¹X³ + … + Φp¹Xp
where,

- Z¹ is the first principal component.
- Φp¹ is the loading vector, comprising the loadings (Φ¹¹, Φ²¹, …) of the first principal component. The loadings are constrained so that their sum of squares equals 1; without this constraint, loadings of large magnitude could inflate the variance arbitrarily. The loading vector defines the direction of the principal component (Z¹) along which the data vary the most. It results in a line in p-dimensional space that is closest to the n observations, where closeness is measured by average squared Euclidean distance.
- X¹, …, Xp are the normalized predictors, each with mean equal to zero and standard deviation equal to one.
Therefore, the first principal component results in the line that is closest to the data, i.e. it minimizes the sum of squared distances between each data point and the line. Similarly, we can compute the second principal component, which captures the largest share of the remaining variance.
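To make this concrete, here is a minimal sketch (in Python with NumPy, since the article also covers Python tooling) that computes the first principal component of a small simulated two-predictor data set and checks the unit sum-of-squares constraint on the loadings. The data and names are illustrative, not from the article.

```python
import numpy as np

# Simulate two correlated predictors (purely illustrative data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # normalize: mean 0, sd 1

# The first loading vector is the top eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
phi1 = eigvecs[:, -1]

# Loadings are constrained to a unit sum of squares.
print(round(float(np.sum(phi1 ** 2)), 6))     # -> 1.0

# The score vector Z1 = X @ phi1 has at least as much variance as any
# single normalized predictor.
z1 = X @ phi1
print(bool(z1.var(ddof=1) >= X[:, 0].var(ddof=1)))   # -> True
```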
If the two components are uncorrelated, their directions must be orthogonal (image below). This image is based on simulated data with 2 predictors. Notice the directions of the components: as expected, they are orthogonal, which confirms that the correlation between the components is zero. All succeeding principal components follow the same idea, i.e. each captures the remaining variation without being correlated with the previous components. In general, for an n × p data set, min(n − 1, p) distinct principal components can be constructed. The directions of these components are identified in an unsupervised way, i.e. the response variable (Y) is not used to determine them. Therefore, PCA is an unsupervised approach.
## Why is normalization of variables necessary?

The principal components are computed from normalized versions of the original predictors, because the original predictors may be on very different scales. For example, imagine a data set whose variables are measured in gallons, kilometers, light years, and so on; the variances of these variables will differ enormously. Performing PCA on un-normalized variables leads to very large loadings for the variables with high variance, and in turn to principal components that depend mostly on those variables. This is undesirable.

As shown in the image below, PCA was run twice on a data set of ~40 variables, once with unscaled and once with scaled predictors. With unscaled predictors, the first principal component is dominated by the variable Item_MRP, and the second principal component by the variable Item_Weight. This domination is purely an artifact of the large variances associated with those variables. When the variables are scaled, we get a much better representation of the variables in 2D space.
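The effect described above is easy to reproduce on toy data. Below is a hedged sketch with two synthetic variables, loosely named after the article’s Item_Weight and Item_MRP: on unscaled data, PC1 is dominated by the high-variance variable, while scaling balances the contributions.

```python
import numpy as np

# Two independent synthetic variables on very different scales.
rng = np.random.default_rng(1)
weight = rng.normal(10, 1, size=500)       # small-scale variable
price = rng.normal(1000, 250, size=500)    # large-scale variable
X = np.column_stack([weight, price])

def first_loading(M):
    """Return the PC1 loading vector of M via eigendecomposition."""
    M = M - M.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(M, rowvar=False))
    return vecs[:, -1]

unscaled = first_loading(X)
scaled = first_loading(X / X.std(axis=0))

# Unscaled: PC1 is almost entirely the high-variance 'price' variable.
print(bool(abs(unscaled[1]) > 0.99))                     # -> True
# Scaled: both variables contribute comparably.
print(bool(abs(abs(scaled[0]) - abs(scaled[1])) < 0.2))  # -> True
```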
## Implement PCA in R & Python (with interpretation)

How many principal components should we choose? I could dive deep into theory, but it is better to answer this question practically. For this demonstration, I’ll be using the data set from the Big Mart Prediction Challenge III.

Remember, PCA can be applied only to numerical data. Therefore, if the data has categorical variables, they must be converted to numerical ones. Also, make sure you have done the basic data cleaning prior to implementing this technique. Let’s quickly finish the initial data loading and cleaning steps:
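The original R cleaning code is not reproduced above. As a stand-in, here is a hedged sketch of the imputation step in Python with pandas; the column names mimic the Big Mart data, but the values and the median/label imputation choices are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the Big Mart data (values invented).
combi = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.9],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
    "Outlet_Size": ["Medium", None, "Small", "High", None],
})

# Impute numeric NAs with the median, categorical NAs with a label.
combi["Item_Weight"] = combi["Item_Weight"].fillna(combi["Item_Weight"].median())
combi["Outlet_Size"] = combi["Outlet_Size"].fillna("Other")

print(int(combi.isna().sum().sum()))   # -> 0 (no missing values remain)
```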
Up to this point, we’ve imputed the missing values. Now we are left with removing the dependent (response) variable and any other identifier variables. As said above, we are practicing an unsupervised learning technique, hence the response variable must be removed.
Let’s check the available variables (a.k.a. predictors) in the data set.
Since PCA works only on numeric variables, let’s see whether we have any variables that are not numeric.
Sadly, 6 out of 9 variables are categorical in nature, so we have some additional work to do. We’ll convert these categorical variables into numeric ones using one-hot encoding.
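The article performs this step in R; a hedged Python sketch of the same one-hot encoding with `pd.get_dummies` looks like this (the frame and column names are illustrative):

```python
import pandas as pd

# Illustrative frame with two categorical columns and one numeric column.
df = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
    "Outlet_Size": ["Medium", "Small", "Medium"],
    "Item_MRP": [249.8, 48.3, 141.6],
})
new_df = pd.get_dummies(df, columns=["Item_Fat_Content", "Outlet_Size"])

# Every column is now numeric (the dummies are 0/1 indicators).
print(new_df.shape[1])   # -> 5 (1 numeric + 2 fat levels + 2 size levels)
print(all(pd.api.types.is_numeric_dtype(t) for t in new_df.dtypes))
```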
To check whether we now have a data set of integer values, simply write:
And we now have all numerical values. Let’s divide the data into train and test sets.
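A minimal sketch of such a split in Python; the frame, the 80/20 ratio, and the row-order split are illustrative assumptions, not the article’s exact code.

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame standing in for the encoded data.
rng = np.random.default_rng(7)
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))

n_train = 80                       # illustrative 80/20 split
train, test = data.iloc[:n_train], data.iloc[n_train:]
print(train.shape, test.shape)     # -> (80, 4) (20, 4)
```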
We can now go ahead with PCA. The base R function prcomp() is used to perform PCA. By default, it centers each variable to have mean equal to zero. With the parameter `scale. = T`, it also normalizes the variables to have standard deviation equal to one.
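For readers following along in Python, here is a rough counterpart of `prcomp(..., scale. = TRUE)` sketched with NumPy’s SVD on synthetic data; `rotation`, `scores`, and `sdev` mirror prcomp’s `$rotation`, `$x`, and `$sdev`.

```python
import numpy as np

# Synthetic predictors on deliberately different scales.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 0.1, 5.0])

# Center and scale (what prcomp does with center = TRUE, scale. = TRUE),
# then decompose.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

rotation = Vt.T                       # like prcomp's $rotation (loadings)
scores = Xs @ rotation                # like prcomp's $x (component scores)
sdev = S / np.sqrt(len(X) - 1)        # like prcomp's $sdev

print(rotation.shape, scores.shape)   # -> (4, 4) (50, 4)
```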
The prcomp() function results in 5 useful measures:

1. The center and scale components hold the means and standard deviations of the variables that were used for normalization.
2. The rotation measure provides the principal component loadings. Each column of the rotation matrix contains a principal component loading vector. This is the most important measure we should be interested in.
This returns 44 principal component loadings. Is that correct? Absolutely: in a data set, the maximum number of principal component loadings is min(n − 1, p). Let’s look at the first 4 principal components and the first 5 rows.
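The min(n − 1, p) rule is easy to verify numerically; here is a small Python sketch on synthetic data with fewer observations than predictors.

```python
import numpy as np

# 5 observations of 8 predictors: after centering, the data matrix has
# rank at most n - 1 = 4, so at most 4 components carry any variance.
rng = np.random.default_rng(5)
X = rng.normal(size=(5, 8))
Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)

print(int(np.sum(S > 1e-10)))    # -> 4 components with nonzero variance
```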
3. To obtain the principal component score vectors, we don’t need to multiply the loadings with the data ourselves: the matrix x already holds them, here with dimension 8523 × 44.

> dim(prin_comp$x)

Let’s plot the resultant principal components.
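In Python terms, that score matrix contains exactly the normalized data multiplied by the rotation matrix; this sketch on synthetic data verifies the identity and the orthogonality of the rotation.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # center + scale
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
rotation = Vt.T

scores = Xs @ rotation              # what prcomp stores in $x
print(scores.shape)                 # -> (30, 6)

# The rotation is orthogonal, so the scores reproduce the data exactly.
print(bool(np.allclose(scores @ rotation.T, Xs)))   # -> True
```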
From the plot, we infer that the first principal component corresponds to a measure of Outlet_TypeSupermarket and Outlet_Establishment_Year 2007. Similarly, the second component corresponds to a measure of Outlet_Location_TypeTier1 and Outlet_Sizeother. For the exact contribution of a variable to a component, look at the rotation matrix (above) again.

4. The prcomp() function also provides the standard deviation of each principal component.
We aim to find the components that explain the maximum variance, because we want to retain as much information as possible using these components. So, the higher the explained variance, the more information those components contain. To compute the proportion of variance explained by each component, we simply divide each component’s variance by the sum of all component variances. This results in:
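The original R output is not shown above, but the computation itself can be sketched in Python on synthetic data (the resulting percentages will of course differ from the Big Mart numbers quoted next): square the component standard deviations, then divide by their sum.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
X[:, 1] += 2 * X[:, 0]             # add correlation so PC1 dominates
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

S = np.linalg.svd(Xs, compute_uv=False)
var = S ** 2 / (len(X) - 1)        # component variances (sdev^2)
prop_varex = var / var.sum()       # proportion of variance explained

print(bool(np.isclose(prop_varex.sum(), 1.0)))       # proportions sum to 1
print(bool(np.all(np.diff(prop_varex) <= 1e-12)))    # descending order
```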
This shows that the first principal component explains 10.3% of the variance, the second 7.3%, the third 6.2%, and so on. So, how do we decide how many components to select for the modeling stage? The answer is provided by a scree plot. A scree plot is used to assess the components or factors that explain most of the variability in the data; it plots the values in descending order.
The plot above shows that ~30 components explain around 98.4% of the variance in the data set. In other words, using PCA we have reduced 44 predictors to 30 without compromising on explained variance. This is the power of PCA. Let’s do a confirmation check by plotting a cumulative variance plot, which will give us a clear picture of the number of components.
This plot shows that 30 components capture close to ~98% of the variance. Therefore, in this case we’ll select 30 components [PC1 to PC30] and proceed to the modeling stage. This completes the steps to implement PCA on the train data. For modeling, we’ll use these 30 components as predictor variables and follow the normal procedure.
## Predictive Modeling with PCA Components

Having computed the principal components on the training set, let’s now understand the process of predicting on test data using these components. The process is simple: we project the test set onto the components obtained from the training set, and then train the model. But there are a few important points to understand:

- We should not combine the train and test sets to obtain PCA components of the whole data at once, because this would violate the assumption of generalization: test data would get ‘leaked’ into the training set. In other words, the test set would no longer remain ‘unseen’, and this would hammer down the generalization capability of the model.
- We should not perform PCA on the train and test sets separately, because the resultant vectors from the two PCAs will have different directions (due to unequal variance). We would end up comparing data registered on different axes; the vectors for train and test data must share the same axes.
So, what should we do? We should apply exactly the same transformation to the test set as we did to the training set, including the centering and scaling. Let’s do it in R:
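The R code is not reproduced above; as a hedged Python sketch of the same rule, fit the centering, scaling, and rotation on the training data only and reuse those exact parameters on the test data (all data and the choice of k here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
train = rng.normal(size=(80, 6))
test = rng.normal(size=(20, 6))

# Estimate center, scale, and rotation from the TRAINING data only.
mu, sd = train.mean(axis=0), train.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd((train - mu) / sd, full_matrices=False)
rotation = Vt.T

k = 3                                             # keep first k components
train_pc = ((train - mu) / sd) @ rotation[:, :k]
test_pc = ((test - mu) / sd) @ rotation[:, :k]    # same axes as train

print(train_pc.shape, test_pc.shape)              # -> (80, 3) (20, 3)
```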
That’s the complete modeling process after PCA extraction. I’m sure you wouldn’t be happy with your leaderboard rank after you upload the solution. Try using random forest!
For more information on PCA in Python, visit the scikit-learn documentation.
## Points to Remember

- PCA is used to overcome feature redundancy in a data set.
- These features are low dimensional in nature.
- These features, a.k.a. components, are a result of normalized linear combinations of the original predictor variables.
- These components aim to capture as much information as possible with high explained variance.
- The first component has the highest variance followed by second, third and so on.
- The components must be uncorrelated (remember the orthogonal directions?). See above.
- Normalizing data becomes extremely important when the predictors are measured in different units.
- PCA works best on data sets with 3 or more dimensions, because with higher dimensions it becomes increasingly difficult to make interpretations from the resultant cloud of data.
- PCA is applied on a data set with numeric variables.
- PCA is a tool which helps to produce better visualizations of high dimensional data.
## End Notes

This brings me to the end of this tutorial. Without delving deep into the mathematics, I’ve tried to make you familiar with the most important concepts required to use this technique. It’s simple, but it needs special attention when deciding the number of components. Practically, we should strive to retain only the first k components. The idea behind PCA is to construct a small number of principal components (k << p) that satisfactorily explain most of the variability in the data, as well as the relationship with the response variable.

Did you like reading this article? Did you understand this technique? Do share your suggestions / opinions in the comments section below.