Linear Regression
Data science practitioners regard linear regression as a simple, understandable, yet effective algorithm for estimation and, in its logistic regression variant, for classification as well.
Linear regression is a statistical model that defines the relationship between a target variable and a set of predictive features by using the following formula:
y = bx + a
Here, a is the intercept (the value of y when x is zero), and b is a coefficient that expresses the strength and direction of the relationship between x and y.
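To make the formula concrete, here is a minimal sketch with made-up numbers that estimates b and a using NumPy (np.polyfit is one of several ways to fit a line):

```python
import numpy as np

# Hypothetical data: x is the single predictor, y is the target.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# A degree-1 polynomial fit returns the slope b and the intercept a.
b, a = np.polyfit(x, y, 1)
print(b, a)  # b is close to 2, a is close to 0 for this data
```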
Using more variables
When you use a single variable to predict y, you perform simple linear regression; when you work with many variables, you perform multiple linear regression, which extends the formula to
y = b1x1 + b2x2 + ... + bnxn + a
The following example predicts Boston housing prices using linear regression. The example also tries to determine which variables influence the result the most, so it standardizes the predictors (rescaling each to mean 0 and variance 1) to make their coefficients directly comparable.
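A minimal sketch of this setup follows. Note that load_boston was removed from recent scikit-learn releases (version 1.2 and later), so this code assumes an older installation or another source for the same data:

```python
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2+
from sklearn.preprocessing import scale

boston = load_boston()
X = scale(boston.data)  # standardize each predictor to mean 0, variance 1
y = boston.target
```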
The regression class in Scikit-learn is part of the linear_model module. Having previously scaled the X variable, you have no other preparations to make or special parameters to set when using this algorithm.
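Continuing the sketch, fitting the model takes two lines:

```python
from sklearn.linear_model import LinearRegression

regression = LinearRegression()
regression.fit(X, y)
```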
Now that the algorithm is fitted, you can use the score method to report the R^2 measure, which ranges from 0 to 1 and indicates how much better a particular regression model is at predicting y than simply using the mean would be.
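In the sketch, that is a single call:

```python
print(regression.score(X, y))  # R^2 on the training data, about 0.74 here
```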
A score of 0.74 means that the model accounts for the larger part of the variance in the target and that only 26 percent of it remains unexplained.
To understand what drives the estimates in the multiple regression model, you have to look at the coef_ attribute, which is an array containing the regression beta coefficients. The zip function generates an iterable that pairs each feature name with its coefficient, which you can print for reporting.
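For example, the following line (continuing the sketch above) pairs each feature name with its rounded beta and prints the result:

```python
# Pair each standardized predictor with its beta coefficient.
print([name + ':' + str(round(beta, 1))
       for name, beta in zip(boston.feature_names, regression.coef_)])
```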
DIS, the weighted distance to five employment centers, shows the largest absolute coefficient: a house that's too far from the places people care about (such as work) loses value. By contrast, AGE (the age of the building) and INDUS (the proportion of nonretail business acres in the area) don't influence the result as much, because the absolute values of their beta coefficients are lower than that of DIS.
Exercise 5.1
SVD on Homes Database
Using homes.csv, try to do the following:
Set the matrix A to be all the columns in homes. (You can use .values to make it a NumPy array.) Then print it.
Perform SVD on matrix A. Then print U, s, and Vh.
Delete the last 3 columns of matrix U and adjust s and Vh accordingly. Then multiply them all together and compare the result with the original homes table (see the sketch below).
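If you get stuck, the following sketch outlines one possible solution; it assumes homes.csv is in the working directory and that all of its columns are numeric:

```python
import numpy as np
import pandas as pd

# Step 1: turn the table into a NumPy matrix and print it.
homes = pd.read_csv('homes.csv')
A = homes.values
print(A)

# Step 2: the (thin) SVD decomposes A into U, s, and Vh.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U, s, Vh)

# Step 3: drop the last 3 columns of U, the last 3 singular values,
# and the last 3 rows of Vh, then rebuild a low-rank approximation.
k = U.shape[1] - 3
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
print(A - A_approx)  # element-wise difference from the original table
```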