Forward selection creates a sequence of models of increasing complexity
Sequential selection – Backward
Backward selection creates a sequence of models of decreasing complexity
Sequential selection – Stepwise
Optimize complexity (Best model from sequence)
Model Fit versus Complexity (Evaluate each sequence step)
Select Model with Optimal Validation Fit (Choose simplest optimal model)
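The sequence idea above can be sketched outside SAS as a greedy forward pass: at each step, add the input that most improves training fit, producing nested models of increasing complexity. The data, variable names, and SSE criterion below are illustrative assumptions, not the Regression node's actual algorithm.

```python
# Greedy forward selection: at each step, add the input that most reduces
# training SSE, yielding a nested sequence of models of increasing complexity.
# Pure-stdlib ordinary least squares via the normal equations (illustration only).

def ols_sse(x_cols, y):
    """Fit y ~ intercept + x_cols by least squares and return the SSE."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in x_cols]   # prepend intercept column
    k = len(cols)
    # Normal equations A b = c with A = X'X and c = X'y.
    A = [[sum(cols[i][r] * cols[j][r] for r in range(n)) for j in range(k)]
         for i in range(k)]
    c = [sum(cols[i][r] * y[r] for r in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        c[p], c[piv] = c[piv], c[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for j in range(p, k):
                A[r][j] -= f * A[p][j]
            c[r] -= f * c[p]
    b = [0.0] * k
    for p in reversed(range(k)):
        b[p] = (c[p] - sum(A[p][j] * b[j] for j in range(p + 1, k))) / A[p][p]
    return sum((y[r] - sum(b[i] * cols[i][r] for i in range(k))) ** 2
               for r in range(n))

def forward_sequence(inputs, y):
    """Return the nested sequence [(chosen_input_names, training_sse), ...]."""
    chosen, remaining, sequence = [], dict(inputs), []
    while remaining:
        best = min(remaining, key=lambda name: ols_sse(
            [inputs[c] for c in chosen] + [remaining[name]], y))
        chosen.append(best)
        del remaining[best]
        sequence.append((list(chosen), ols_sse([inputs[c] for c in chosen], y)))
    return sequence
```

Backward selection is the mirror image (start full, drop the least useful input each step), and stepwise alternates add and drop moves.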
SAS Applications - Regression
4.1 Managing Missing Values
4.1.1 Data Assessment
Select the Data Partition node
Select Exported Data from the Data Partition node Properties panel
Select the "TRAIN" data port and select "Explore"
The display below emphasizes the Sample Statistics table in which the column heading for Percent Missing is expanded
4.1.2 Imputation
The defaults of the Impute node are as follows:
For interval inputs, replace any missing values with the mean of the nonmissing values
For categorical inputs, replace any missing values with the most frequent category
With these settings, each input with missing values generates a new input
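These defaults can be sketched in a few lines of Python. The column names are hypothetical, `None` stands in for a missing value, and the `IMP_` prefix mirrors the Impute node's convention of writing each imputed input to a new variable.

```python
from statistics import mean, mode

def impute_defaults(column, is_interval):
    """Mimic the Impute node defaults: mean of the nonmissing values for
    interval inputs, most frequent category for categorical inputs."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if is_interval else mode(observed)
    return [fill if v is None else v for v in column]

# Hypothetical training columns; each input with missing values
# generates a new, imputed input (here prefixed IMP_).
raw = {"GiftAvg36": [10.0, None, 14.0, None],
       "HomeOwner": ["Y", None, "Y", "N"]}
imputed = {"IMP_" + name: impute_defaults(col, name == "GiftAvg36")
           for name, col in raw.items()}
```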
4.1.3 Missing Indicators
Use the steps below to create missing value indicators. The settings for missing value indicators are found in the Score property group:
Select Indicator Variables -> Type -> Unique
Select Indicator Variables -> Role -> Input
Run the Impute node and review the Results window. Three inputs had missing values
With all of the missing values imputed, the entire training data set is available for building the logistic regression model
In addition, a method is in place for scoring new cases with missing values
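A unique missing indicator is simply a 0/1 flag per input that has missing values. The sketch below uses the same hypothetical columns as above; the `M_` prefix mirrors the node's naming for indicator variables, which then receive the role Input.

```python
def missing_indicator(column):
    """1 if the original value was missing (None), else 0."""
    return [1 if v is None else 0 for v in column]

# Hypothetical raw training columns before imputation.
raw = {"GiftAvg36": [10.0, None, 14.0, None],
       "HomeOwner": ["Y", None, "Y", "N"]}

# Unique indicators: one new input (here M_*) per input with missing values,
# so the model can use "was missing" as information in its own right.
indicators = {"M_" + name: missing_indicator(col)
              for name, col in raw.items() if any(v is None for v in col)}
```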
4.2. Running the Regression Node
To build a simple regression model
The Regression node can create several types of regression models, including linear and logistic
The default regression type is determined by the target's measurement level
The initial lines of the Output window summarize the roles of variables used (or not) by the Regression node
This section gives more information about the model, including the training data set name, the target variable name, the number of target categories, and, most importantly, the number of model parameters
The Type 3 Analysis tests the statistical significance of adding the indicated input to a model that already contains the other listed inputs
A value near 0 in the Pr > ChiSq (associated p-value) column approximately indicates a significant input; a value near 1 indicates an extraneous input
The Fit Statistics window
If the decision predictions are of interest, model fit can be judged by misclassification
If estimate predictions are the focus, model fit can be assessed by average squared error
There is some discrepancy between the values of these two statistics on the training and validation data
This discrepancy indicates that the model might be overfit
Overfitting can be mitigated by using an input selection procedure
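Both fit statistics are easy to state precisely. In the sketch below, the targets, predicted probabilities, and the 0.5 decision cutoff are hypothetical values, not output from the demonstration model.

```python
def misclassification(targets, decisions):
    """Fraction of cases where the decision differs from the target."""
    return sum(t != d for t, d in zip(targets, decisions)) / len(targets)

def average_squared_error(targets, estimates):
    """Mean of (target - estimated probability)^2 over all cases."""
    return sum((t - p) ** 2 for t, p in zip(targets, estimates)) / len(targets)

# Hypothetical validation cases: 0/1 target, predicted probability,
# and a decision made at a 0.5 cutoff.
targets = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.4, 0.6]
decisions = [int(p >= 0.5) for p in probs]
```

Comparing each statistic between the training and validation partitions is what reveals the overfitting discussed above.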
4.3. Selecting Inputs
Select Selection Model -> "Stepwise" on the Regression node Properties panel
The Regression node is now configured to use stepwise selection to choose inputs for the model
Run the Regression node and view the results
Maximize the Output window
Hold down the Ctrl key and press the G key. The Go to Line window appears
Enter 79 in the Enter line number field and click OK. Scroll down one page from line 79
The output next compares the model fit in Step 1 with the model fit in Step 0
The objective functions of both models are multiplied by 2 and differenced
The difference is assumed to have a chi-squared distribution with one degree of freedom
The hypothesis that the two models are identical is tested
A large value for the chi-squared statistic makes this hypothesis unlikely
Step 1 adds one input to the intercept-only model
The input and corresponding parameter are chosen to produce the largest decrease in the objective function
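This comparison is a likelihood ratio test. The sketch below computes the chi-squared statistic and its one-degree-of-freedom p-value; the log-likelihood values are made up for illustration, and the survival function uses the identity P(X > x) = erfc(sqrt(x/2)) for a chi-squared variable with 1 df.

```python
import math

def lr_test(loglik_reduced, loglik_full):
    """Likelihood ratio test for adding one parameter: the difference of the
    doubled objective functions, -2*(logL_reduced - logL_full), is assumed to
    follow a chi-squared distribution with one degree of freedom."""
    stat = -2.0 * (loglik_reduced - loglik_full)
    # Chi-squared(1 df) tail probability via the complementary error function.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# E.g. Step 0 (intercept only) versus Step 1 (one input added);
# the log-likelihoods here are invented numbers.
stat, p = lr_test(-700.0, -650.0)
```

A large statistic (tiny p-value) makes the hypothesis that the two models are identical unlikely, which is why the input is kept.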
To estimate the values of the model parameters, the modeling algorithm makes an initial guess for their values
The initial guess is combined with the training data measurements in the objective function
Based on statistical theory, the objective function is assumed to take its minimum value at the correct estimate for the parameters
The algorithm decides whether changing the values of the initial parameter estimates can decrease the value of the objective function
If so, the parameter estimates are changed to decrease the value of the objective function and the process iterates
The algorithm continues iterating
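The iterative scheme above can be sketched for a one-input logistic model. Plain gradient descent stands in for the node's actual optimizer, and the toy data and learning rate are assumptions.

```python
import math

def neg_log_likelihood(w0, w1, xs, ys):
    """The objective function: negative log-likelihood of a
    one-input logistic regression model."""
    nll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
        nll -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return nll

def fit(xs, ys, lr=0.1, iters=2000):
    """Start from an initial guess, then repeatedly change the parameter
    estimates in the direction that decreases the objective function."""
    w0 = w1 = 0.0                       # initial guess for the parameters
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
            g0 += p - y                 # gradient of the negative log-likelihood
            g1 += (p - y) * x
        w0 -= lr * g0                   # step downhill and iterate
        w1 -= lr * g1
    return w0, w1

xs, ys = [0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1]   # toy training measurements
w0, w1 = fit(xs, ys)
```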
The hypothesis test is summarized
The output summarizes an analysis of the statistical significance of individual model effects
An analysis of individual parameter estimates is made
The summary shows the step in which each input was added and the statistical significance of each input in the final eight-input model
For convenience, the output from Step 8 is repeated
An excerpt from the analysis of individual parameter estimates is shown
The parameter with the largest standardized estimate (in absolute value) is GiftTimeLast
5.1. Optimizing Complexity
5.1.1. Iteration Plot
Select View -> Model -> Iteration Plot
The Iteration Plot window shows (by default) average squared error (training and validation) from the model that is selected in each step of the stepwise selection process
Select Select Chart -> Misclassification Rate
The iteration plot shows that the model with the smallest misclassification rate occurs in Step 3
If your analysis objective requires decision predictions, the predictions from the Step 3 model are as accurate as the predictions from the final Step 8 model
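Choosing the best sequence model from the iteration plot amounts to finding the step with the smallest validation statistic. The per-step misclassification rates below are invented numbers, shaped to echo the plot's minimum at Step 3.

```python
# Hypothetical validation misclassification rate at each stepwise step.
valid_misclassification = {0: 0.260, 1: 0.251, 2: 0.248, 3: 0.244,
                           4: 0.246, 5: 0.247, 8: 0.249}

# The best sequence model is the step with the smallest validation statistic.
best_step = min(valid_misclassification, key=valid_misclassification.get)
```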
5.1.2. Full Model Selection
Select Use Selection Default -> "No" from the Regression node Properties panel
Select Selection Options
Enter 1.0 as the Entry Significance Level value
With an Entry Significance Level of 1.0, any input is allowed to enter the model
Enter 0.5 as the Stay Significance Level value
With a Stay Significance Level of 0.5, any input whose p-value is less than 0.5 is kept in the model
Change the Maximum Number of Steps value to a large value, such as 30
This allows the stepwise procedure to run up to 30 steps. If the value remains at 0, the result is an intercept-only model
Select View -> Model -> Iteration Plot
The iteration plot shows that the smallest average squared errors occur in Steps 4 or 12
The iteration plot shows that the smallest validation misclassification rates occur at Step 3
5.1.3. Best Sequence Model
If the predictions are decisions, use the following setting:
If your predictions are estimates (or rankings), use the following setting:
Select Selection Criterion -> Validation Error
The continuing demonstration assumes validation error selection criteria. Validation error, also known as Error Function, equals negative log-likelihood for logistic regression models and error sum of squares (SSE) for linear regression models
The vertical blue line shows the model with the optimal validation error (Step 12)
The model at iteration 12 is selected because it has the minimum Error Function value
Error Function is a statistic that is calculated from the likelihood and thus does not exist for tree-based models.
Although not all the p-values are less than 0.05, the model seems to have a better validation average squared error (and misclassification) than the model that is selected using the default Significance Level settings