Forward selection creates a sequence of models of increasing complexity
Sequential selection – Backward
Backward selection creates a sequence of models of decreasing complexity
Sequential selection – Stepwise
Optimize complexity (Best model from sequence)
Model Fit versus Complexity (Evaluate each sequence step)
Select Model with Optimal Validation Fit (Choose simplest optimal model)
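The sequence idea above can be sketched outside SAS as a greedy forward pass: at each step, add the input that most improves training fit, producing nested models of increasing complexity. The data, variable names, and SSE criterion below are illustrative assumptions, not the Regression node's actual algorithm.

```python
# Greedy forward selection: at each step, add the input that most reduces
# training SSE, yielding a nested sequence of models of increasing complexity.
# Pure-stdlib ordinary least squares via the normal equations (illustration only).

def ols_sse(x_cols, y):
    """Fit y ~ intercept + x_cols by least squares and return the SSE."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in x_cols]   # prepend intercept column
    k = len(cols)
    # Normal equations A b = c with A = X'X and c = X'y.
    A = [[sum(cols[i][r] * cols[j][r] for r in range(n)) for j in range(k)]
         for i in range(k)]
    c = [sum(cols[i][r] * y[r] for r in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        c[p], c[piv] = c[piv], c[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for j in range(p, k):
                A[r][j] -= f * A[p][j]
            c[r] -= f * c[p]
    b = [0.0] * k
    for p in reversed(range(k)):
        b[p] = (c[p] - sum(A[p][j] * b[j] for j in range(p + 1, k))) / A[p][p]
    return sum((y[r] - sum(b[i] * cols[i][r] for i in range(k))) ** 2
               for r in range(n))

def forward_sequence(inputs, y):
    """Return the nested sequence [(chosen_input_names, training_sse), ...]."""
    chosen, remaining, sequence = [], dict(inputs), []
    while remaining:
        best = min(remaining, key=lambda name: ols_sse(
            [inputs[c] for c in chosen] + [remaining[name]], y))
        chosen.append(best)
        del remaining[best]
        sequence.append((list(chosen), ols_sse([inputs[c] for c in chosen], y)))
    return sequence
```

Backward selection is the mirror image (start full, drop the least useful input each step), and stepwise alternates add and drop moves.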
SAS Applications - Regression
4.1 Managing Missing Values
4.1.1 Data Assessment
Select the Data Partition node
Select Exported Data from the Data Partition node Properties panel
Select the "TRAIN" data port and select "Explore"
The display below emphasizes the Sample Statistics table in which the column heading for Percent Missing is expanded
4.1.2 Imputation
The defaults of the Impute node are as follows:
For interval inputs, replace any missing values with the mean of the nonmissing values
For categorical inputs, replace any missing values with the most frequent category
With these settings, each input with missing values generates a new input
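These defaults can be sketched in a few lines of Python. The column names are hypothetical, `None` stands in for a missing value, and the `IMP_` prefix mirrors the Impute node's convention of writing each imputed input to a new variable.

```python
from statistics import mean, mode

def impute_defaults(column, is_interval):
    """Mimic the Impute node defaults: mean of the nonmissing values for
    interval inputs, most frequent category for categorical inputs."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if is_interval else mode(observed)
    return [fill if v is None else v for v in column]

# Hypothetical training columns; each input with missing values
# generates a new, imputed input (here prefixed IMP_).
raw = {"GiftAvg36": [10.0, None, 14.0, None],
       "HomeOwner": ["Y", None, "Y", "N"]}
imputed = {"IMP_" + name: impute_defaults(col, name == "GiftAvg36")
           for name, col in raw.items()}
```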
4.1.3 Missing Indicators
Use the steps below to create missing value indicators. The settings for missing value indicators are found in the Score property group:
Select Indicator Variables -> Type -> Unique
Select Indicator Variables -> Role -> Input
Run the Impute node and review the Results window. Three inputs had missing values
With all of the missing values imputed, the entire training data set is available for building the logistic regression model
In addition, a method is in place for scoring new cases with missing values
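A unique missing indicator is simply a 0/1 flag per input that has missing values. The sketch below uses the same hypothetical columns as above; the `M_` prefix mirrors the node's naming for indicator variables, which then receive the role Input.

```python
def missing_indicator(column):
    """1 if the original value was missing (None), else 0."""
    return [1 if v is None else 0 for v in column]

# Hypothetical raw training columns before imputation.
raw = {"GiftAvg36": [10.0, None, 14.0, None],
       "HomeOwner": ["Y", None, "Y", "N"]}

# Unique indicators: one new input (here M_*) per input with missing values,
# so the model can use "was missing" as information in its own right.
indicators = {"M_" + name: missing_indicator(col)
              for name, col in raw.items() if any(v is None for v in col)}
```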
4.2. Running the Regression Node
To build a simple regression model
The Regression node can create several types of regression models, including linear and logistic
The default regression type is determined by the target's measurement level
The initial lines of the Output window summarize the roles of variables used (or not) by the Regression node
This section gives more information about the model, including the training data set name, the target variable name, the number of target categories, and, most importantly, the number of model parameters
The Type 3 Analysis tests the statistical significance of adding the indicated input to a model that already contains the other listed inputs
A value near 0 in the Pr > ChiSq (associated p-value) column approximately indicates a significant input; a value near 1 indicates an extraneous input
The Fit Statistics window
If the decision predictions are of interest, model fit can be judged by misclassification
If estimate predictions are the focus, model fit can be assessed by average squared error
There is some discrepancy between the values of these two statistics on the training and validation data
This discrepancy indicates that the model might be overfit
Overfitting can be mitigated by using an input selection procedure
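Both fit statistics are easy to state precisely. In the sketch below, the targets, predicted probabilities, and the 0.5 decision cutoff are hypothetical values, not output from the demonstration model.

```python
def misclassification(targets, decisions):
    """Fraction of cases where the decision differs from the target."""
    return sum(t != d for t, d in zip(targets, decisions)) / len(targets)

def average_squared_error(targets, estimates):
    """Mean of (target - estimated probability)^2 over all cases."""
    return sum((t - p) ** 2 for t, p in zip(targets, estimates)) / len(targets)

# Hypothetical validation cases: 0/1 target, predicted probability,
# and a decision made at a 0.5 cutoff.
targets = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.4, 0.6]
decisions = [int(p >= 0.5) for p in probs]
```

Comparing each statistic between the training and validation partitions is what reveals the overfitting discussed above.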
4.3. Selecting Inputs
Select Selection Model -> "Stepwise" on the Regression node Properties panel
The Regression node is now configured to use stepwise selection to choose inputs for the model
Run the Regression node and view the results
Maximize the Output window
Hold down the Ctrl key and press the G key. The Go to Line window appears
Enter 79 in the Enter line number field and click OK. Scroll down one page from line 79
The output next compares the model fit in Step 1 with the model fit in Step 0
The objective functions of both models are multiplied by 2 and differenced
The difference is assumed to have a chi-squared distribution with one degree of freedom
The hypothesis that the two models are identical is tested
A large value for the chi-squared statistic makes this hypothesis unlikely
Step 1 adds one input to the intercept-only model
The input and corresponding parameter are chosen to produce the largest decrease in the objective function
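This comparison is a likelihood ratio test. The sketch below computes the chi-squared statistic and its one-degree-of-freedom p-value; the log-likelihood values are made up for illustration, and the survival function uses the identity P(X > x) = erfc(sqrt(x/2)) for a chi-squared variable with 1 df.

```python
import math

def lr_test(loglik_reduced, loglik_full):
    """Likelihood ratio test for adding one parameter: the difference of the
    doubled objective functions, -2*(logL_reduced - logL_full), is assumed to
    follow a chi-squared distribution with one degree of freedom."""
    stat = -2.0 * (loglik_reduced - loglik_full)
    # Chi-squared(1 df) tail probability via the complementary error function.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# E.g. Step 0 (intercept only) versus Step 1 (one input added);
# the log-likelihoods here are invented numbers.
stat, p = lr_test(-700.0, -650.0)
```

A large statistic (tiny p-value) makes the hypothesis that the two models are identical unlikely, which is why the input is kept.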
To estimate the values of the model parameters, the modeling algorithm makes an initial guess for their values
The initial guess is combined with the training data measurements in the objective function
Based on statistical theory, the objective function is assumed to take its minimum value at the correct estimate for the parameters
The algorithm decides whether changing the values of the initial parameter estimates can decrease the value of the objective function
If so, the parameter estimates are changed to decrease the value of the objective function and the process iterates
The algorithm continues iterating
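The iterative scheme above can be sketched for a one-input logistic model. Plain gradient descent stands in for the node's actual optimizer, and the toy data and learning rate are assumptions.

```python
import math

def neg_log_likelihood(w0, w1, xs, ys):
    """The objective function: negative log-likelihood of a
    one-input logistic regression model."""
    nll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
        nll -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return nll

def fit(xs, ys, lr=0.1, iters=2000):
    """Start from an initial guess, then repeatedly change the parameter
    estimates in the direction that decreases the objective function."""
    w0 = w1 = 0.0                       # initial guess for the parameters
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
            g0 += p - y                 # gradient of the negative log-likelihood
            g1 += (p - y) * x
        w0 -= lr * g0                   # step downhill and iterate
        w1 -= lr * g1
    return w0, w1

xs, ys = [0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1]   # toy training measurements
w0, w1 = fit(xs, ys)
```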
The hypothesis test is summarized
The output summarizes an analysis of the statistical significance of individual model effects
An analysis of individual parameter estimates is made
The summary shows the step in which each input was added and the statistical significance of each input in the final eight-input model
For convenience, the output from Step 8 is repeated
An excerpt from the analysis of individual parameter estimates is shown
The parameter with the largest standardized estimate (in absolute value) is GiftTimeLast
5.1. Optimizing Complexity
5.1.1. Iteration Plot
Select View -> Model -> Iteration Plot
The Iteration Plot window shows (by default) average squared error (training and validation) from the model that is selected in each step of the stepwise selection process
Select Select Chart -> Misclassification Rate
The iteration plot shows that the model with the smallest misclassification rate occurs in Step 3
If your analysis objective requires decision predictions, the predictions from the Step 3 model are as accurate as the predictions from the final Step 8 model
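Choosing the best sequence model from the iteration plot amounts to finding the step with the smallest validation statistic. The per-step misclassification rates below are invented numbers, shaped to echo the plot's minimum at Step 3.

```python
# Hypothetical validation misclassification rate at each stepwise step.
valid_misclassification = {0: 0.260, 1: 0.251, 2: 0.248, 3: 0.244,
                           4: 0.246, 5: 0.247, 8: 0.249}

# The best sequence model is the step with the smallest validation statistic.
best_step = min(valid_misclassification, key=valid_misclassification.get)
```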
5.1.2. Full Model Selection
Select Use Selection Default -> "No" from the Regression node Properties panel
Select Selection Options
Enter 1.0 as the Entry Significance Level value
With an Entry Significance Level of 1.0, any input is allowed to enter the model
Enter 0.5 as the Stay Significance Level value
With a Stay Significance Level of 0.5, any input whose p-value is less than 0.5 is kept in the model
Change the Maximum Number of Steps value to a large value, such as 30
This allows the stepwise procedure to run up to 30 steps. If the value remains at 0, the result is an intercept-only model
Select View -> Model -> Iteration Plot
The iteration plot shows that the smallest average squared errors occur in Steps 4 or 12
The iteration plot shows that the smallest validation misclassification rates occur at Step 3
5.1.3. Best Sequence Model
If the predictions are decisions, use the following setting:
If your predictions are estimates (or rankings), use the following setting:
Select Selection Criterion -> Validation Error
The continuing demonstration assumes validation error selection criteria. Validation error, also known as Error Function, equals negative log-likelihood for logistic regression models and error sum of squares (SSE) for linear regression models
The vertical blue line shows the model with the optimal validation error (Step 12)
The model at iteration 12 is selected because it has the minimum Error Function value
Error Function is a statistic that is calculated from the likelihood and thus does not exist for tree-based models.
Although not all the p-values are less than 0.05, the model seems to have a better validation average squared error (and misclassification) than the model that is selected using the default Significance Level settings