The core of data reduction lies in an operation of linear algebra called Singular Value Decomposition. SVD is a mathematical method that takes data as input in the form of a single matrix and gives back three resulting matrices that, multiplied together, return the original input matrix.
The formula of SVD is
M = U * s * Vh
U : Contains all the information about the rows (observations)
Vh: Contains all the information about the columns (features)
s : Contains the singular values, which indicate how much information each component of U and Vh carries
To understand how this all works, you need to look at individual values. For instance, if the first value of s accounts for 99 percent of the sum of s, that means that 99 percent of the information is now stored in the first column of U and the first row of Vh. Therefore, you can discard all the remaining columns of U (and the matching rows of Vh) without losing any important information for your data science knowledge discovery process.
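The arithmetic behind that statement can be sketched in a few lines. The values of s below are made up purely for illustration, and the share is computed the way the text describes it, by comparing each value with the sum of s.

```python
import numpy as np

# Hypothetical singular values, chosen only to illustrate the arithmetic.
s = np.array([99.0, 0.6, 0.3, 0.1])

# Share of the total carried by each component, following the text's
# convention of dividing each value by the sum of s.
share = s / s.sum()

# Running total: how much is retained if you keep the first k components.
cumulative = np.cumsum(share)

print(share)
print(cumulative)
```

Here the first entry of share is 0.99, so keeping only the first component retains 99 percent of the total under this measure.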
Let's start with our nat_gas.csv. This file contains a list of countries with 10 variables related to their natural gas and oil situation.
Using the linalg module from NumPy, you can access the svd function, which splits the original matrix into three variables: U, s, and Vh.
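A minimal sketch of that call is shown here. Since nat_gas.csv is not reproduced in this text, a random matrix with 11 rows and 10 columns stands in for its numeric columns; the name M and the choice of full_matrices=False (the compact form) are assumptions made for the illustration.

```python
import numpy as np

# Stand-in for the numeric columns of nat_gas.csv: 11 rows, 10 columns.
# The values are random; only the shape mirrors the text.
rng = np.random.default_rng(0)
M = rng.normal(size=(11, 10))

# full_matrices=False asks for the compact form, so U has 10 columns.
U, s, Vh = np.linalg.svd(M, full_matrices=False)
print(U.shape, s.shape, Vh.shape)

# s comes back as a 1-D array, so it is expanded into a diagonal
# matrix before the three factors are multiplied back together.
M_rebuilt = U @ np.diag(s) @ Vh
print(np.allclose(M, M_rebuilt))
```

With the default full_matrices=True, U would instead be square (11 by 11), and the product above would need s padded to a matching shape.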
The output enumerates the shapes of U, s, and Vh, respectively.
In our example, matrix U, representing the rows, has 11 rows, one per observation. Matrix Vh is a square matrix, and its 10 rows represent the original columns. Matrix s represents a diagonal matrix, although NumPy returns it as a one-dimensional array holding only the diagonal values.
Inside s, you find that most of the weight is in the first elements, indicating that the first component holds the most information (about 86.8 percent of the total), while the second and third hold very little (about 5 percent each).
When working with SVD, you usually care about the resulting matrix U, the matrix representing the rows, because it is a replacement of your initial dataset.
In our data, the first five values of s add up to around 99.5% of the total. If we drop the sixth component and everything after it, we won't lose much information.
Compare a slice of the old and new data (the first two rows and columns). The new data is generated using only the first 5 components.
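The truncation step can be sketched as follows. Again, a random 11-by-10 matrix stands in for nat_gas.csv, so the numbers will not match the ones discussed above, and random data will not compress as well as the real table; the names M, k, and M_approx are chosen for this illustration.

```python
import numpy as np

# Stand-in data with 11 rows and 10 columns (random, not nat_gas.csv).
rng = np.random.default_rng(0)
M = rng.normal(size=(11, 10))

U, s, Vh = np.linalg.svd(M, full_matrices=False)

k = 5  # number of components to keep, as in the text

# Keep the first k columns of U, the first k values of s,
# and the first k rows of Vh, then multiply them back together.
M_approx = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]

# Compare the first two rows and columns of the old and new data.
print(M[:2, :2])
print(M_approx[:2, :2])

# Relative reconstruction error, measured with the Frobenius norm.
error = np.linalg.norm(M - M_approx) / np.linalg.norm(M)
print(error)
```

The reconstructed matrix has the same shape as the original, but it is built from only five components, so the closer the discarded values of s are to zero, the smaller the error.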
SVD on Homes Database
Using homes.csv, try to do the following:
Set the matrix A to be all the columns in homes. (You can use .values to convert it to a NumPy array.) Then print it.
Perform SVD on matrix A. Then print out the matrices U, s, and Vh.
Try to delete the last 3 columns of matrix U. Adjust s and Vh accordingly. Then try to multiply all of them and see the difference with the original homes table.