SVD operates directly on the numeric values in data, but you can also express data as a relationship between variables. Each variable has its own variance, and by comparing how two variables vary together you can determine whether they correlate.
When you check all the possible correlations of a variable with the others in the set, you find that its variance can be of two types:
Unique variance: Some variance is unique to the variable under examination. It cannot be associated with what happens to any other variable.
Shared variance: Some variance is shared with one or more other variables, creating redundancy in the data. Redundancy implies that you can find the same information, with slightly different values, in more than one variable.
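As a rough illustration of shared variance, the following sketch computes the pairwise correlations between the four Iris features. It assumes scikit-learn's bundled copy of the dataset (load_iris) and uses the squared correlation as a simple proxy for the fraction of variance two variables share; both choices are made for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the four numeric features of the Iris dataset (150 rows, 4 columns).
iris = load_iris()
X = iris.data

# Pairwise Pearson correlations between features (columns are variables).
corr = np.corrcoef(X, rowvar=False)

# The squared correlation approximates the share of variance that two
# variables have in common (an assumption used here for illustration).
shared = corr ** 2

print(iris.feature_names)
print(np.round(corr, 2))
print(np.round(shared, 2))
```

Feature pairs with a squared correlation close to 1 carry largely redundant information, which is the situation factor analysis and PCA try to exploit.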
The effort to determine the reason for shared variance led to the creation of factor analysis and principal component analysis (PCA).
A good way to show how to use factor analysis is to start with the Iris dataset.
After loading the data and storing all the predictive features, the FactorAnalysis class is initialized with a request to look for four factors. The data is then fitted.
Documentation: FactorAnalysis
You can explore the results by observing the components_ attribute, which returns an array containing measures of the relationship between the newly created factors, placed in rows, and the original features, placed in columns. You can interpret the numbers as if they were correlations.
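A minimal sketch of the steps just described might look like the following. The variable names, the random_state value, and the use of a pandas DataFrame to label the columns are assumptions made for readability, not part of the original listing.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

# Load the data and keep only the predictive features.
iris = load_iris()
X = iris.data

# Ask for four factors, then fit the model to the data.
# random_state is fixed only to make the run repeatable (an assumption).
factor = FactorAnalysis(n_components=4, random_state=101)
factor.fit(X)

# components_ has one row per factor and one column per original feature.
loadings = pd.DataFrame(factor.components_, columns=iris.feature_names)
print(loadings.round(3))
```

Rows whose values are all close to zero indicate factors that capture little or nothing of the original features, which is one way to judge how many factors are worth keeping.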
The procedure for obtaining a PCA is quite similar to that for factor analysis. The difference is that you don't specify the number of components to extract.
You decide later how many components to keep after checking the explained_variance_ratio_ attribute, which quantifies the informative value of each extracted component as its share of the total variance (a fraction between 0 and 1; multiply by 100 for a percentage).
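The PCA procedure can be sketched in the same way on the Iris data. The 95 percent cumulative-variance cutoff used below to pick the number of components is an illustrative assumption, not a rule from the text; choose whatever threshold suits your problem.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# No n_components argument: PCA extracts as many components as features.
pca = PCA()
pca.fit(X)

# Share of total variance explained by each component (fractions summing to 1).
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(np.round(ratios, 4))
print(np.round(cumulative, 4))

# Hypothetical rule of thumb: keep the fewest components whose cumulative
# explained variance reaches 95 percent.
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print("Components to keep:", n_keep)
```

Once you have settled on a number, you can refit with PCA(n_components=n_keep) and call transform to obtain the reduced dataset.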
SVD on Homes Database
Using homes.csv, try to do the following. Do these exercises in part_A.py and part_B.py.
Part A
Perform factor analysis on all the columns in homes.csv. Print the result, then determine the right number of components.
Part B
Perform PCA on all the columns in homes.csv. Print the result, then estimate the optimal number of components from it. (No need to calculate this; just fill in the print statement manually.)