Performing Factor Analysis and PCA

SVD operates directly on the numeric values in data, but you can also express data as a relationship between variables. By comparing how two variables vary together, you can determine whether they correlate.

When you check all the possible correlations of a variable with the others in the set, you may find two types of variance:

Unique variance: Some variance is unique to the variable under examination. It cannot be associated with what happens to any other variable.
Shared variance: Some variance is shared with one or more other variables, creating redundancy in the data. Redundancy implies that you can find the same information in more than one variable, with slightly different values.

The need to determine the reason for shared variance led to the creation of factor analysis and principal component analysis (PCA).

Looking for hidden factors

A good way to show how to use factor analysis is to start with the Iris dataset.

After loading the data and storing all the predictive features, the FactorAnalysis class is initialized with a request to look for four factors. The data is then fitted.

Documentation: FactorAnalysis

You can explore the results by observing the components_ attribute, which returns an array containing measures of the relationship between the newly created factors, placed in rows, and the original features, placed in columns. You can interpret the numbers as if they were correlations.
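The steps described above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn's bundled copy of the Iris dataset; the `random_state=0` setting and the variable names are assumptions added here for reproducibility, not part of the original lesson.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

# Load the data and keep only the predictive features (150 rows x 4 columns)
iris = load_iris()
X = iris.data

# Request four factors, as in the text; random_state is an assumption
fa = FactorAnalysis(n_components=4, random_state=0)
fa.fit(X)

# components_ has one row per factor and one column per original feature
print("columns:", iris.feature_names)
for i, row in enumerate(fa.components_):
    print(f"factor {i}:", [round(float(v), 3) for v in row])
```

Each printed row is one factor; values near zero suggest that the factor has little relationship with that feature.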

You'll have to test different values of n_components because you can't know in advance how many factors exist in the data. If you ask the algorithm for more factors than exist, it will generate factors with low or zero values in the components_ array.
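One possible way to compare several values of n_components is sketched below. The helper `count_active_factors` and the 0.1 cutoff are hypothetical choices made for this illustration; what counts as a "low" value is a judgment call, so treat the threshold as an assumption to tune.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data
THRESHOLD = 0.1  # hypothetical cutoff for a "non-negligible" value


def count_active_factors(X, n_components, threshold=THRESHOLD):
    """Count factors whose largest absolute value in components_ exceeds threshold."""
    fa = FactorAnalysis(n_components=n_components, random_state=0).fit(X)
    strongest = np.abs(fa.components_).max(axis=1)  # one value per factor
    return int((strongest > threshold).sum())


# Try every possible number of factors for the four Iris features
active_counts = {k: count_active_factors(X, k) for k in range(1, 5)}
for k, active in active_counts.items():
    print(f"requested {k} factors -> {active} with a value above {THRESHOLD}")
```

If the count stops growing as you request more factors, the extra factors are probably not capturing anything real.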

> In the test on this dataset, you should keep a maximum of two factors, not four, because only two factors have significant connections with the original features.

Achieving dimensionality reduction

The procedure for obtaining a PCA is quite similar to the one for factor analysis. The difference is that you don't specify the number of components to extract.

You decide later how many components to keep after checking the explained_variance_ratio_ attribute, which quantifies the informative value of each extracted component as a fraction of the total variance (multiply by 100 to read it as a percentage).
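A minimal sketch of this procedure, assuming scikit-learn's bundled Iris data, might look like the following. The cumulative sum is an addition made here to make the "how many to keep" decision easier to read; it is not part of the original lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# No n_components given, so PCA keeps every component
pca = PCA()
pca.fit(X)

ratios = pca.explained_variance_ratio_  # fractions that sum to 1
cumulative = np.cumsum(ratios)          # running total of explained variance

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"component {i}: {r:.1%} of variance, {c:.1%} cumulative")
```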

Documentation: PCA

> In this decomposition of the Iris dataset, the vector array provided by explained_variance_ratio_ indicates that most of the information is concentrated into the first component (92.5%). You saw this same sort of result after the factor analysis. It's therefore possible to reduce the entire dataset to just two components, providing a reduction of noise and redundant information from the original dataset.
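The reduction to two components mentioned above could be carried out along these lines. This is a sketch under the same assumptions as before (scikit-learn's bundled Iris data); the name `X_reduced` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Keep only the first two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained: {retained:.1%}")
```

The reduced array has the same number of rows as the original data but only two columns, which also makes it easy to plot.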

Exercise 4.2

SVD on Homes Database
Using homes.csv, try to do the following. Do these exercises in part_A.py and part_B.py.

Part A
Perform factor analysis on all the columns in homes.csv. Print the result, then determine the right number of components.

Part B
Perform PCA on all the columns in homes.csv. Print the result, then estimate the optimal number of components from it. (No need to calculate this; just fill in the print statement manually.)