Principal component analysis (PCA) is a dimension reduction method that helps us better understand a data set by representing it in fewer dimensions. It is one of the oldest and most popular methods in machine learning: it not only lets us grasp the behavior of the data by looking at just a few numbers, but it also helps when applying other machine learning methods, since training on fewer dimensions reduces the variance of the fitted model.
Let us consider a two-dimensional data set, meaning every point has two features; for instance, a set of people whose education level and unemployment status are reported. Now think of a line drawn through the center of the data set that points in the direction containing the most information. In essence, we are asking which linear combination of the features helps us understand the data best. To find the best direction, we project all the data points onto different directions and see which one captures the most variability. Variability can be interpreted as information: more variability means more ability to distinguish between different data points. Conversely, in a direction that captures less of the variability there is more overlap and less power to distinguish between points. In statistical terms, variability is the same as variance (or volatility). So we need to find the direction in which the projection of the points has the highest variance.
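To make this concrete, here is a minimal sketch, assuming NumPy and a small synthetic two-feature data set (not the original example), that projects centered 2-D points onto a range of directions and picks the one with the highest projected variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-feature data set (e.g., two reported indicators), centered below.
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=500)
X = X - X.mean(axis=0)  # center the data at the origin

def projected_variance(X, angle):
    """Variance of the data projected onto the unit direction at the given angle."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    return np.var(X @ direction)

# Scan a range of directions and keep the one whose projection varies the most.
angles = np.linspace(0, np.pi, 180)
variances = [projected_variance(X, a) for a in angles]
best = angles[int(np.argmax(variances))]
print(f"Best direction: {np.degrees(best):.1f} degrees, variance {max(variances):.2f}")
```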
As shown in the figure, projecting onto two different directions gives two different variabilities (compare the lengths of the projections). Once we find the direction that captures the most variability, i.e., the highest variance, we take it as the direction of the first principal component (the black arrow in the figure). The rest of the information lies in a direction uncorrelated with this one, which means the remaining variability sits in the orthogonal direction (the red arrow in the figure).
Now think of more dimensions, i.e., three or higher, say p features in total. The same idea gives the first principal component. Once it is found, the rest of the variability lies, as discussed above, in a (p-1)-dimensional hyperplane orthogonal to the first principal component's direction. We can then repeat the argument: project the points onto that orthogonal hyperplane and proceed in the same manner to find the second component, and so on.
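As a sketch of how this plays out in practice, the principal directions can be read off from an eigendecomposition of the covariance matrix, which is equivalent to the iterative projection argument above. The data and function names below are illustrative assumptions, not part of the original text:

```python
import numpy as np

def principal_components(X):
    """Return principal directions (columns) and the variance each one captures."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance matrices are symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing captured variance
    return eigvecs[:, order], eigvals[order]

# Illustrative 3-D data set.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])
directions, captured = principal_components(X)
print("First principal direction:", directions[:, 0])
print("Variance captured by each component:", captured)
```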
It is important to know how much information is captured by the principal components; for instance, what percentage of the data's information is captured by the first component, or by the first and second together. This also raises the question of how much information we need to keep, which can be fairly subjective. For instance, one can ask how many principal components are needed to retain 90 or 95 percent of the information. Clearly, keeping more components retains more information. The way to measure this is through variability: we measure how much of the variance is captured by the retained components.
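A rough sketch of this bookkeeping, assuming we already have the per-component variances (the eigenvalues of the covariance matrix; the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical per-component variances (the eigenvalues of the covariance matrix),
# e.g., as returned by the sketch above; the numbers are made up.
variances = np.array([4.2, 1.1, 0.5, 0.15, 0.05])

ratios = variances / variances.sum()    # fraction of the total variance per component
cumulative = np.cumsum(ratios)          # fraction retained by the first k components
k_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print("Explained variance ratios:", np.round(ratios, 3))
print("Cumulative:", np.round(cumulative, 3))
print(f"Components needed to retain at least 90% of the variance: {k_90}")
```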
There are two main applications for PCA.
1- The first is simply to see whether the major part of the data's information can be represented in a space with only a few dimensions. This can be used to introduce new indexes; for instance, one can construct an index for a financial market that helps in judging the overall market behavior.
2- The other main application is dimension reduction. Imagine data with hundreds of features, i.e., hundreds of dimensions. Training a machine learning method on all of them increases the variance of the model. PCA can first reduce the dimension so that training takes place in a much smaller feature space, which helps reduce that variance (see the sketch below).
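As an illustration of this workflow, here is a sketch using scikit-learn, an assumed choice of library and data set rather than anything prescribed by the text, where PCA keeps just enough components to retain 95% of the variance before a classifier is trained:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)      # 64 pixel features per sample
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_components=0.95 asks PCA to keep just enough components for 95% of the variance.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=0.95),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Components kept:", model.named_steps["pca"].n_components_)
print("Test accuracy:", model.score(X_test, y_test))
```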
PCA has many other useful interpretations. Going back to the original definition, one can show that the direction of the first principal component (PC1) is not only the direction that captures the most variability, but also the direction whose line has the least mean squared error: the average squared distance from the points to their projections onto that line is minimized.
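One way to see the connection: for centered data and any unit direction, the variance captured along the direction plus the mean squared distance to the corresponding line adds up to the total variance, so maximizing one minimizes the other. A small numerical sketch with made-up 2-D data illustrates this:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=1000)
X = X - X.mean(axis=0)                      # center the data
total_variance = np.var(X, axis=0).sum()    # total variance across both features

for angle in np.linspace(0.0, np.pi, 7):
    d = np.array([np.cos(angle), np.sin(angle)])    # unit direction
    proj = np.outer(X @ d, d)                       # projection of each point onto the line
    captured = np.var(X @ d)                        # variance captured along the direction
    mse = np.mean(np.sum((X - proj) ** 2, axis=1))  # mean squared distance to the line
    # captured + mse stays equal to the total variance, so maximizing the
    # captured variance is the same as minimizing the mean squared error.
    print(f"{np.degrees(angle):6.1f} deg: variance {captured:5.2f}, "
          f"MSE {mse:5.2f}, sum {captured + mse:5.2f} (total {total_variance:.2f})")
```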