Linear models, such as linear and logistic regression, are linear combinations that sum your weighted features into a prediction, providing a simple but effective model. In most situations they offer a good approximation of the complex reality they represent. Even though they're characterized by high bias, using a large number of observations can improve their coefficient estimates and make them competitive with more complex algorithms.
The next few sections rely on the Boston dataset. The problem is a regression problem, and the data originally has 13 variables to explain the different housing prices in Boston during the 1970s. The dataset also has an implicit ordering. Fortunately, ordering doesn't influence most algorithms because they learn the data as a whole. However, when an algorithm learns in a progressive manner, ordering can interfere with effective model building.
By using seed (to fix a predetermined sequence of random numbers) and shuffle from the random package (to shuffle the index), you can reindex the dataset, as shown here:
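```python
# A minimal sketch (the exact listing isn't reproduced in the text).
# Note that load_boston has been removed from recent scikit-learn
# releases; with a modern version you need to obtain the Boston data
# from another source.
import random
from sklearn.datasets import load_boston

boston = load_boston()
random.seed(101)                            # fix the random sequence
new_index = list(range(boston.data.shape[0]))
random.shuffle(new_index)                   # shuffle the index
X, y = boston.data[new_index], boston.target[new_index]
```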
Converting the array of predictors and the target variable into a pandas DataFrame supports the series of explorations and operations on the data that follow.
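Continuing from the previous snippet, one plausible way to build the DataFrame looks like this (the column names come from the dataset's feature_names attribute):

```python
import pandas as pd

# Assemble the shuffled predictors and target into a single DataFrame
df = pd.DataFrame(X, columns=boston.feature_names)
df['target'] = y
print(df.head())
```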
The best way to spot possible transformations is by graphical exploration, and using a scatterplot can tell you a lot about two variables. You need to make the relationship between the predictors and the target outcome as linear as possible, so you should try various combinations such as the following.
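For instance, a sketch along these lines plots one predictor against the target; LSTAT (the percentage of lower-status population) is an illustrative choice, and df is the DataFrame from the earlier snippet:

```python
import matplotlib.pyplot as plt

# Plot a single predictor against the target to judge linearity
plt.scatter(df['LSTAT'], df['target'], s=10)
plt.xlabel('LSTAT')
plt.ylabel('target')
plt.show()
```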
Logarithmic transformation can help in such conditions. It works best when the values are positive proportions, such as the percentages used in this example. In other cases, other useful transformations for your x variable include x**2, x**3, 1/x, 1/x**2, 1/x**3, and sqrt(x). The key is to try them and test the result.
The code prints the F score, a measure that evaluates how predictive a feature is in a machine learning problem, for both the original and the transformed feature. The score for the transformed feature is a great improvement over the untransformed one.
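A sketch of such a test, assuming the df DataFrame from earlier and using scikit-learn's f_regression to score the LSTAT feature before and after a logarithmic transformation:

```python
import numpy as np
from sklearn.feature_selection import f_regression

F_orig, _ = f_regression(df[['LSTAT']], df['target'])
F_log, _ = f_regression(np.log(df[['LSTAT']]), df['target'])
print('F score with the original feature: %.1f' % F_orig[0])
print('F score with the transformed feature: %.1f' % F_log[0])
```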
In a linear combination, the model responds to changes in each variable independently of changes in the other variables. In statistics, this kind of model is called a main effects model.
The following example shows how to test and detect interactions in the Boston dataset. The first task is to load a few helper classes, as shown here:
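```python
# One plausible set of helpers (an assumption; the original listing
# isn't reproduced here): a linear model plus cross-validation utilities
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
```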
The code reinitializes the pandas DataFrame using only the predictor variables. A for loop then pairs the different predictors and creates a new variable containing each interaction. The mathematical formulation of an interaction is simply a multiplication of the two variables.
The code starts by printing the baseline R^2 score for the regression; then it reports the top ten interactions whose addition to the model increases the score.
The code tests the specific addition of each interaction to the model using a 10-fold cross-validation. The code records the change in the R^2 measure into a stack that an application can order and explore later.
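Putting the pieces together, a sketch of this search might look as follows; it assumes the df, y, and boston objects from the earlier snippets, and the variable names are illustrative:

```python
regression = LinearRegression()
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

X = df[list(boston.feature_names)]        # predictors only
baseline = np.mean(cross_val_score(regression, X, y,
                                   scoring='r2', cv=crossvalidation))
print('Baseline R2: %.3f' % baseline)

improvements = []                          # the stack of recorded changes
for a in boston.feature_names:
    for b in boston.feature_names:
        if a > b:                          # each unordered pair once
            Xi = X.copy()
            Xi['interaction'] = Xi[a] * Xi[b]   # interaction = product
            score = np.mean(cross_val_score(regression, Xi, y,
                                            scoring='r2',
                                            cv=crossvalidation))
            if score > baseline:
                improvements.append((score - baseline, a, b))

# Report the ten interactions that raise R2 the most
for gain, a, b in sorted(improvements, reverse=True)[:10]:
    print('%s * %s -> R2 change: %+.4f' % (a, b, gain))
```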
SVD on Homes Database
Using homes.csv, try to do the following:
Set the matrix A to be all the columns in homes. (You can use .values to turn the DataFrame into a NumPy array.) Then print it.
Perform SVD on matrix A. Then print out the matrices U, s, and Vh.
Delete the last three columns of matrix U and adjust s and Vh accordingly. Then multiply the three matrices together and compare the result with the original homes table.
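One possible solution sketch follows; it assumes homes.csv contains only numeric columns (the file name comes from the exercise itself):

```python
import numpy as np
import pandas as pd

homes = pd.read_csv('homes.csv')
A = homes.values                      # all columns as a NumPy array
print(A)

# Full SVD decomposition: A = U * diag(s) * Vh
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U)
print(s)
print(Vh)

# Drop the last three components, then rebuild an approximation of A
k = U.shape[1] - 3
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
print(np.round(A - A_approx, 2))      # difference from the original table
```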