Linear models, such as linear and logistic regression, are linear combinations that sum your weighted features into a prediction, providing a simple but effective model. In most situations they offer a good approximation of the complex reality they represent. Even though they're characterized by high bias, using a large number of observations can improve their coefficient estimates and make them competitive with more complex algorithms.
The next few sections rely on the Boston dataset. The problem is a regression problem, and the data originally has 13 variables to explain the different housing prices in Boston during the 1970s. The dataset also has an implicit ordering. Fortunately, ordering doesn't influence most algorithms because they learn the data as a whole. However, when an algorithm learns in a progressive manner, ordering can interfere with effective model building.
By using seed (to fix a predetermined sequence of random numbers) and shuffle from the random package (to shuffle the index), you can reindex the dataset, as shown here:
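```python
# A minimal sketch (the exact listing isn't reproduced in the text).
# Note that load_boston has been removed from recent scikit-learn
# releases; with a modern version you need to obtain the Boston data
# from another source.
import random
from sklearn.datasets import load_boston

boston = load_boston()
random.seed(101)                            # fix the random sequence
new_index = list(range(boston.data.shape[0]))
random.shuffle(new_index)                   # shuffle the index
X, y = boston.data[new_index], boston.target[new_index]
```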
Converting the array of predictors and the target variable into a pandas DataFrame supports the series of explorations and operations on the data that follow.
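Continuing from the previous snippet, one plausible way to build the DataFrame looks like this (the column names come from the dataset's feature_names attribute):

```python
import pandas as pd

# Assemble the shuffled predictors and target into a single DataFrame
df = pd.DataFrame(X, columns=boston.feature_names)
df['target'] = y
print(df.head())
```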
The best way to spot possible transformations is by graphical exploration, and using a scatterplot can tell you a lot about two variables. You need to make the relationship between the predictors and the target outcome as linear as possible, so you should try various combinations such as the following.
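For instance, a sketch along these lines plots one predictor against the target; LSTAT (the percentage of lower-status population) is an illustrative choice, and df is the DataFrame from the earlier snippet:

```python
import matplotlib.pyplot as plt

# Plot a single predictor against the target to judge linearity
plt.scatter(df['LSTAT'], df['target'], s=10)
plt.xlabel('LSTAT')
plt.ylabel('target')
plt.show()
```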
Logarithmic transformation can help in such conditions. It works best when the values are positive proportions, such as the percentages used in this example. In other cases, other useful transformations for your x variable include x**2, x**3, 1/x, 1/x**2, 1/x**3, and sqrt(x). The key is to try them and test the result.
The code prints the F score, a measure that evaluates how predictive a feature is in a machine learning problem, for both the original and the transformed feature. The score for the transformed feature is a great improvement over the untransformed one.
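A sketch of such a test, assuming the df DataFrame from earlier and using scikit-learn's f_regression to score the LSTAT feature before and after a logarithmic transformation:

```python
import numpy as np
from sklearn.feature_selection import f_regression

F_orig, _ = f_regression(df[['LSTAT']], df['target'])
F_log, _ = f_regression(np.log(df[['LSTAT']]), df['target'])
print('F score with the original feature: %.1f' % F_orig[0])
print('F score with the transformed feature: %.1f' % F_log[0])
```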
In a linear combination, the model responds to changes in each variable independently of changes in the other variables. In statistics, this kind of model is called a main effects model.
The following example shows how to test and detect interactions in the Boston dataset. The first task is to load a few helper classes, as shown here:
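```python
# One plausible set of helpers (an assumption; the original listing
# isn't reproduced here): a linear model plus cross-validation utilities
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
```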
The code reinitializes the pandas DataFrame using only the predictor variables. A for loop then pairs the different predictors and creates a new variable containing each interaction. The mathematical formulation of an interaction is simply a multiplication of the two variables.
The code starts by printing the baseline R^2 score for the regression; then it reports the top ten interactions whose addition to the model increases the score.
The code tests the specific addition of each interaction to the model using a 10-fold cross-validation. The code records the change in the R^2 measure into a stack that an application can order and explore later.
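Putting the pieces together, a sketch of this search might look as follows; it assumes the df, y, and boston objects from the earlier snippets, and the variable names are illustrative:

```python
regression = LinearRegression()
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

X = df[list(boston.feature_names)]        # predictors only
baseline = np.mean(cross_val_score(regression, X, y,
                                   scoring='r2', cv=crossvalidation))
print('Baseline R2: %.3f' % baseline)

improvements = []                          # the stack of recorded changes
for a in boston.feature_names:
    for b in boston.feature_names:
        if a > b:                          # each unordered pair once
            Xi = X.copy()
            Xi['interaction'] = Xi[a] * Xi[b]   # interaction = product
            score = np.mean(cross_val_score(regression, Xi, y,
                                            scoring='r2',
                                            cv=crossvalidation))
            if score > baseline:
                improvements.append((score - baseline, a, b))

# Report the ten interactions that raise R2 the most
for gain, a, b in sorted(improvements, reverse=True)[:10]:
    print('%s * %s -> R2 change: %+.4f' % (a, b, gain))
```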
SVD on Homes Database
Using homes.csv, try to do the following:
Set the matrix A to be all the columns in homes. (You can use .values to turn the DataFrame into a NumPy array.) Then print it.
Perform SVD on matrix A. Then print out the matrices U, s, and Vh.
Delete the last three columns of matrix U and adjust s and Vh accordingly. Then multiply the three matrices together and compare the result with the original homes table.
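One possible solution sketch follows; it assumes homes.csv contains only numeric columns (the file name comes from the exercise itself):

```python
import numpy as np
import pandas as pd

homes = pd.read_csv('homes.csv')
A = homes.values                      # all columns as a NumPy array
print(A)

# Full SVD decomposition: A = U * diag(s) * Vh
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U)
print(s)
print(Vh)

# Drop the last three components, then rebuild an approximation of A
k = U.shape[1] - 3
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
print(np.round(A - A_approx, 2))      # difference from the original table
```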