Understanding SVD

At the core of data reduction is a linear algebra operation called Singular Value Decomposition (SVD). SVD takes data in the form of a single matrix and returns three matrices that, multiplied together, reproduce the original input matrix.

The formula of SVD is

M = U * s * Vh

U : Contains all the information about the rows (observations)
Vh: Contains all the information about the columns (features)
s : Contains the singular values, which record how much information each component carries
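
The formula can be checked on a small example. The sketch below uses a made-up 4x3 matrix (not data from this lesson) and assumes NumPy is installed. Note that in NumPy the matrix product is written with `@`, and that `s` comes back as a 1-D array, so it is expanded with `np.diag` before multiplying.

```python
import numpy as np

# Hypothetical 4x3 data matrix: 4 observations (rows), 3 features (columns).
M = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 10.0],
    [2.0, 1.0, 0.0],
])

# full_matrices=False returns the compact form, so the shapes line up directly.
U, s, Vh = np.linalg.svd(M, full_matrices=False)

# s is a 1-D array of singular values; np.diag turns it into a diagonal matrix.
M_rebuilt = U @ np.diag(s) @ Vh

print(U.shape, s.shape, Vh.shape)
print(np.allclose(M, M_rebuilt))  # True when the product reproduces M
```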

To understand how this all works, you need to look at the individual values in s. For instance, if the values in s sum to 100 and the first value is 99, then 99 percent of the information is stored in the first column of U and the first row of Vh. You can therefore discard everything after that first component without losing any important information for your data science knowledge discovery process.
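
As a rough illustration of reading s this way, the sketch below turns a set of singular values into percentages. The values are invented so that they sum to 100, and the share is computed the way this lesson does it, as each value's fraction of the sum of s.

```python
import numpy as np

# Hypothetical singular values, chosen so that they sum to 100.
s = np.array([99.0, 0.6, 0.3, 0.1])

share = s / s.sum() * 100      # percentage contributed by each singular value
cumulative = np.cumsum(share)  # running total, component by component

print(share)
print(cumulative)
```

Here the first entry of `share` is about 99, so keeping only the first component would retain almost everything.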

Dimensionality Reduction

Let's start with our nat_gas.csv. This file contains a list of countries with 10 variables related to their natural gas and oil situation.

Using NumPy's linalg module, you can call the svd function, which splits the original matrix into three variables: U, s, and Vh.

The output enumerates the shapes of U, s, and Vh, respectively.

Documentation: linalg.svd

> One way to make our data contain only numerical values is to set the country as the index.
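
A minimal sketch of these steps is shown below, assuming pandas and NumPy are installed. Because nat_gas.csv is not reproduced here, a small made-up table stands in for it; the column names and numbers are hypothetical, and with the real file you would start from `pd.read_csv("nat_gas.csv")` instead.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv("nat_gas.csv"); the column names and values here
# are made up for illustration and are not the real file's contents.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gas_production": [10.0, 20.0, 5.0, 8.0],
    "gas_consumption": [7.0, 25.0, 6.0, 3.0],
    "oil_reserves": [1.5, 0.2, 3.1, 2.2],
})

# Move the text column into the index so only numeric columns remain.
df = df.set_index("country")

A = df.values  # NumPy array with 4 rows and 3 columns
U, s, Vh = np.linalg.svd(A)

# With the default full_matrices=True, U is square (rows x rows),
# s is 1-D, and Vh is square (columns x columns).
print(U.shape, s.shape, Vh.shape)
```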

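In our example, matrix U, representing the rows, has 11 rows, one per country. Matrix Vh is a square matrix, and its 10 rows represent the original columns. NumPy returns s as a one-dimensional array holding the diagonal of the singular value matrix, not as a full diagonal matrix.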

Inside s, you find that most of the total is concentrated in the first elements. The first component holds the most information (about 86.8 percent), while the second and third contribute far less (about 5 percent each).

> We call np.set_printoptions(suppress=True) so that NumPy prints numbers in plain decimal form rather than scientific notation (such as 1e+12).
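
The percentages quoted above come from dividing each singular value by the sum of s. The sketch below shows that calculation on invented singular values (not the ones from nat_gas.csv), with the print option switched on so the result is easy to read.

```python
import numpy as np

np.set_printoptions(suppress=True)  # print small values as plain decimals

# Hypothetical singular values spanning several orders of magnitude.
s = np.array([1.2e6, 7.0e4, 6.5e4, 3.0e2, 1.5e-3])

percent = s / s.sum() * 100  # each value's share of the sum, in percent
print(percent.round(1))
```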

When working with SVD, you usually care about the resulting matrix U, the matrix representing the rows, because it is a replacement of your initial dataset.

In our data, the first five components together account for around 99.5% of the total. If we drop the sixth component onward, we won't lose much information.

See the slice of old and new data (the first two rows and columns). The new data is generated using only the first 5 components.
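
One way to produce such a comparison is sketched below. Since the real file is not available here, the code builds a hypothetical 11x10 matrix that is close to rank 5 (a low-rank product plus a little noise), keeps the first `k = 5` components, and multiplies the truncated pieces back together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 11x10 dataset that is close to rank 5: a low-rank product
# plus a small amount of noise. It stands in for the numeric nat_gas table.
A = rng.normal(size=(11, 5)) @ rng.normal(size=(5, 10))
A += 0.01 * rng.normal(size=A.shape)

U, s, Vh = np.linalg.svd(A, full_matrices=False)

k = 5  # number of components to keep
# Keep the first k columns of U, the first k singular values, and the
# first k rows of Vh, then multiply them back together.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]

np.set_printoptions(suppress=True)
print(A[:2, :2])              # slice of the original data
print(A_k[:2, :2])            # same slice rebuilt from k components
print(np.abs(A - A_k).max())  # largest absolute reconstruction error
```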

The output is really close. It means that you could drop the last five components and use the reduced U as a substitute for the original dataset.

One of the difficult issues to consider is determining how many columns to keep. As a general rule, you should consider solutions that retain 70 to 99 percent of the original information. The right choice depends on how important it is for you to be able to reconstruct the original dataset.
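
The rule of thumb above can be turned into a small helper. The function name `components_needed` and the singular values below are hypothetical, and the share is measured against the sum of s, as elsewhere in this lesson. Treat it as a sketch, not a fixed recipe.

```python
import numpy as np

def components_needed(s, threshold):
    """Return the smallest k whose first k singular values reach `threshold`
    (a fraction between 0 and 1) of the sum of all singular values."""
    cumulative = np.cumsum(s) / np.sum(s)
    # Index of the first cumulative share that reaches the threshold.
    # The small tolerance guards against floating-point round-off.
    return int(np.searchsorted(cumulative, threshold - 1e-12) + 1)

# Hypothetical singular values, sorted in descending order as svd returns them.
s = np.array([60.0, 20.0, 10.0, 6.0, 3.0, 1.0])

print(components_needed(s, 0.70))  # components needed for 70 percent
print(components_needed(s, 0.99))  # components needed for 99 percent
```

A higher threshold keeps more components and reconstructs the original data more faithfully, at the cost of less reduction.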

Exercise 4.1

SVD on Homes Database
Using homes.csv, try to do the following:

Set the matrix A to all the columns in homes (you can use .values to get a NumPy array), then print it.
Perform SVD on matrix A, then print the matrices U, s, and Vh.
Delete the last 3 columns of matrix U and adjust s and Vh accordingly. Then multiply the three together and compare the result with the original homes table.