Extract a sample of data that is relevant to the business problem
Data Preparation
Manipulate the data to put it in a form suitable for formal modeling
Model Construction
Apply the appropriate data-mining technique (regression, classification trees, k-means) to accomplish the desired data-mining task (prediction, classification, clustering, etc.)
Model Assessment
Evaluate models by comparing performance on appropriate data sets
Data Sampling
Simple Random Sampling
Simple random sampling, the most commonly used sampling method, randomly chooses a sample of records (of size n) from a population, with every record equally likely to be chosen
Stratified Sampling
Stratified sampling applies when the population is already divided into a distinct number of subgroups (strata), which may be of the same or different sizes; records are then drawn from each subgroup.
For example, if 1,000 random candidates are to be picked from across the country for a sporting event, it might be a good idea to pick them proportionately from each state.
Systematic Sampling
Systematic sampling is based on a fixed rule, like picking every fifth or seventh observation from a given population
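The three sampling methods above can be sketched in a few lines of Python. This is a minimal illustration, not part of the SAS workflow: the function names, the two-state population, and the proportional-allocation rule in the stratified case are all assumptions made for the example.

```python
import random
from collections import defaultdict

def simple_random_sample(records, n, seed=0):
    """Choose n records uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, n)

def systematic_sample(records, step, start=0):
    """Pick every `step`-th record, beginning at index `start`."""
    return records[start::step]

def stratified_sample(records, key, n, seed=0):
    """Sample from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can make the total differ slightly from n.
        k = round(n * len(members) / len(records))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 600 candidates from state "A" and 400 from state "B".
population = [("A", i) for i in range(600)] + [("B", i) for i in range(400)]

srs = simple_random_sample(population, 100)
every_fifth = systematic_sample(population, 5)
strat = stratified_sample(population, key=lambda rec: rec[0], n=100)
```

With this population, the stratified sample keeps the 60/40 split between the two states, whereas a simple random sample only approximates it.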
Variables
Numeric Variables (Quantitative)
Continuous
A continuous variable can take any value between two limits.
For example, a height variable can be anything between 0.8 and 2.2 meters. It can take continuous values such as 1.1m or 1.51m etc
Discrete
These numeric variables take values in steps only. They can take only integer values or other predefined values between the given limits
For example, the number of children in a family can be 1, 2, 3, or 4, but never 1.5 or 2.34
Categorical or Non-numeric Variables (Qualitative)
Categorical (qualitative, non-numeric) variables, including binary variables, represent a quality or characteristic rather than a measured quantity
Examples are shirt sizes expressed as S, M, L, XL, and XXL, or distance, which is expressed as near and far.
It can as well be a Boolean value like a "pass or a fail" or a "yes or no" field.
Types of Variables
Independent Variables:
Active
Manipulated independent variables that are given to a group of participants, within a specific period of time during the study
Attribute
Measured independent variables that cannot be manipulated, that is, cannot be systematically changed during the study
E.g. Gender, ethnic group
Dependent Variables
The dependent variable is assumed to measure or assess the effect of the independent variable
It is thought of as the presumed outcome or criterion
E.g. Test scores, ratings
Extraneous Variables
Also known as nuisance variables.
These variables are not of interest in the study but could influence the dependent variable
E.g. Temperature, time
Data Cleaning and Preparation
Major Tasks in Data Pre-processing:
Data cleaning
Data exploration must always be done when receiving new data
Data cleaning tasks:
Business and sanity check on every attribute
Handling missing data
Identify outliers
Correct inconsistent/wrong data
Resolve redundancy caused by integration
Treatment of missing data / outliers
Delete the records with missing data / outliers
Ignore the data sample with missing data
Data imputation (Fill in the missing value)
Global Constant
E.g. impute every missing value as “unknown” or 0
Attribute mean (or median, mode)
Impute every missing value with a value computed from the non-missing data
Prediction model
Train a prediction model to predict the most probable value
Need training data and additional modeling
Data integration (Combine data from disparate sources into meaningful and valuable information etc.)
Data transformation (Convert values to a common currency, or map detailed nominal values to broader categories, e.g. address to district, etc.)
Data reduction (Reduce the number of variables (columns) or records (rows) etc.)
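The simpler missing-data treatments listed under data cleaning (deleting records, imputing a global constant, imputing the attribute mean, median, or mode) can be sketched as follows. This is an illustrative sketch only: missing values are represented as None, the function names and sample values are hypothetical, and the model-based imputation option is not shown.

```python
from statistics import mean, median, mode

def drop_missing(values):
    """Delete the records whose value is missing."""
    return [v for v in values if v is not None]

def impute_constant(values, constant):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def impute_statistic(values, statistic=mean):
    """Replace every missing value with a statistic of the non-missing values."""
    fill = statistic(drop_missing(values))
    return [fill if v is None else v for v in values]

# Hypothetical attribute with two missing entries.
incomes = [40, None, 60, 50, None, 50]

deleted = drop_missing(incomes)
with_zero = impute_constant(incomes, 0)
with_mean = impute_statistic(incomes, mean)
with_median = impute_statistic(incomes, median)
with_mode = impute_statistic(incomes, mode)
```

Deleting records shrinks the data set, while imputing a constant or a summary statistic keeps every record at the cost of distorting the variable's distribution.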
Data Mining:
Finding patterns or relationships among elements of the data
Unsupervised learning: Clustering / k-means
Predictive Analysis:
Finding a pattern (from historical data) so that an outcome or opportunity can be identified before it occurs
Supervised learning: Classification / Regression
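As a small illustration of predictive analysis with supervised learning, the sketch below fits a straight line to historical data by ordinary least squares and uses it to predict an outcome that has not yet been observed. The data (advertising spend against sales) and the function name are hypothetical, and simple linear regression is only one of many possible techniques.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical historical data: the input (spend) and the known outcome (sales).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(spend, sales)

# Predict the outcome for a spend level not present in the historical data.
predicted_sales = intercept + slope * 5.0
```

The known outcomes in the historical data are what make this supervised; clustering, by contrast, works without any outcome column.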
SAS Applications
1. Accessing Prepared Data
1.1. Creating a SAS Enterprise Miner Project
A SAS Enterprise Miner project contains materials that are related to a particular analysis task
These materials include analysis process flows, intermediate analysis data sets, and analysis results
Select File -> New -> Project from the main menu
Location of the project: Physical location where the project folder is created
1.2. Creating a SAS Library
A SAS library connects SAS Enterprise Miner with the raw data sources, which are the basis of analysis
A library can link to a directory on the SAS Foundation Server, a relational database, or even an Excel workbook
Select File -> New -> Library from the main menu
Path -> "C:\SDBA\Data_PA" etc.
1.3. Creating a SAS Enterprise Miner Diagram
A SAS Enterprise Miner diagram workspace contains and displays the steps that are involved in the analysis
To define a diagram, you need to specify its name
Select File -> New -> Diagram from the main menu
1.4. SAS Data Source
A data source is a link between an existing SAS table and SAS Enterprise Miner
To define a data source, you need to select the analysis table and define metadata that is appropriate for your analysis task
1.5. Defining a Data Source
A data source links SAS Enterprise Miner to an existing analysis table
To specify a data source, you need to define a SAS library and know the name of the table that you want to link to SAS Enterprise Miner
File -> New -> Data Source
Source: "SAS Table"
Table: Select the SAS table that you want to make available to SAS Enterprise Miner
1.6. Defining Column Metadata
After a data set is specified, the next task is to set the column metadata
You need to know the modeling role and proper measurement level of each variable in the source data set
Select "Advanced" -> "Customize"
Class Levels Count Threshold value: 2 (Only binary numeric variables are treated as categorical variables)
Reject Levels Count Threshold value: 100 (Only character variables with more than 100 distinct values are rejected)
Select "Next" -> "Label" check box
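The effect of the two thresholds can be sketched as a simple rule. This is a simplified reading of the notes above, not SAS Enterprise Miner's actual metadata advisor: the function name, the returned labels, and the treatment of missing values are assumptions.

```python
def suggest_metadata(values, class_threshold=2, reject_threshold=100):
    """Return (measurement_level, role) for one column under the two thresholds."""
    present = [v for v in values if v is not None]
    n_levels = len(set(present))
    is_numeric = all(isinstance(v, (int, float)) for v in present)
    if is_numeric:
        # Numeric columns count as categorical only with very few distinct levels.
        level = "class" if n_levels <= class_threshold else "interval"
        return level, "input"
    # Character columns are always categorical; too many levels means rejection.
    role = "rejected" if n_levels > reject_threshold else "input"
    return "class", role

binary_flag = suggest_metadata([0, 1, 1, 0])
age = suggest_metadata([23, 45, 67, 45])
state = suggest_metadata(["NY", "CA", "TX", "NY"])
customer_id = suggest_metadata([f"id{i}" for i in range(101)])
```

With a class threshold of 2, a numeric column with three or more distinct values stays an interval variable, and with a reject threshold of 100, a character column is rejected only once it exceeds 100 distinct values.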
1.7. Finalizing the Data Source Specification
Click Next to proceed to Decision Configuration
2. Assaying Prepared Data
2.1. Accessing the Explore Window
Open the Data Sources folder in the Project Panel and right-click and "explore" the data source of interest
The Data Source Option menu appears
2.2. Changing the Explore Sample Size
The Sample Method property indicates that the sample is drawn from the top (first 2000 rows) of the dataset
"Sample Method" value field: Random
"Fetch Size" property: Max
A random sample preserves the distributional properties of the full data set and gives you an idea of the general characteristics of the variables. If the goal is to examine the data for potential problems, it is wise to examine the entire data set
2.3. Creating a Histogram for a Single Variable
The primary purpose is to create statistical analysis plots
Select Actions -> Plot from the Explore window menu. The Chart Wizard appears, and it is at the Select a Chart Type step
Select "Histogram". Histograms are useful for exploring the distribution of values in a variable
Click "Next". The Chart Wizard proceeds to the next step, "Select Chart Roles"
To draw a histogram, one variable must be selected to have the role X
Select "Role -> X" for the DemAge variable etc.
Select "Finish". The Explore window is filled with a histogram of the DemAge variable
Variable descriptions, rather than variable names, are used to label the axes of plots in the Explore window
From the histogram, we can observe that:
Age has a minimum value of 0 and a maximum value of 87
The mode occurs in the ninth bin, which ranges from approximately 70 to 78. The "Frequency" axis shows that there are approximately 1400 observations in this range
2.4. Changing the Graph Properties for a Histogram
Right-click in the data area of the Age histogram and select "Graph Properties" from the Option menu. The Properties - Histogram window appears
Enter "87" in the Number of X Bins field
Because Age is integer-valued and the original distribution plot had a maximum of 87, each bin is now one year wide, giving roughly one bin per possible Age value
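The bin arithmetic behind these two histograms can be checked with a short calculation. The sketch assumes equal-width bins and an initial histogram of 10 bins; the bin count of 10 is inferred from the ninth-bin range quoted in section 2.3 rather than stated in the notes.

```python
def bin_edges(minimum, maximum, bins):
    """Edges of `bins` equal-width bins covering [minimum, maximum]."""
    width = (maximum - minimum) / bins
    return [minimum + i * width for i in range(bins + 1)]

# Assumed initial histogram: 10 bins over the observed Age range 0 to 87.
default_edges = bin_edges(0, 87, 10)
ninth_bin = (default_edges[8], default_edges[9])  # lower and upper edge of bin 9

# After setting Number of X Bins to 87, each bin spans one year.
fine_edges = bin_edges(0, 87, 87)
fine_width = fine_edges[1] - fine_edges[0]
```

Under these assumptions the ninth bin runs from about 69.6 to about 78.3, matching the "approximately 70 and 78" read off the plot, and the 87-bin histogram has a bin width of exactly 1.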
2.5. Adding a “Missing” Bin to a Histogram
Not all observations appear in the histogram for Age. There are many observations with missing values for this variable
Right-click on the graph and select "Graph Properties" from the Option menu
Select the "Show Missing Bin" check box
With the missing value bin added, it is easy to see that nearly a quarter of the observations are missing an "Age" value
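A histogram with a separate missing bin can be sketched as a counting routine that sets missing values aside instead of silently dropping them. The sample values below are hypothetical and chosen so that a quarter of them are missing; missing values are represented as None.

```python
def histogram_with_missing(values, minimum, maximum, bins):
    """Count values per equal-width bin, plus a separate count of missing values."""
    width = (maximum - minimum) / bins
    counts = [0] * bins
    missing = 0
    for v in values:
        if v is None:
            missing += 1
            continue
        # The maximum value belongs to the last bin rather than one past it.
        index = min(int((v - minimum) / width), bins - 1)
        counts[index] += 1
    return counts, missing

# Hypothetical Age values; None marks a missing value.
ages = [25, None, 70, 87, None, 0, 43, 71]

counts, missing = histogram_with_missing(ages, 0, 87, 10)
missing_share = missing / len(ages)
```

Reporting the missing count alongside the regular bins makes the share of unusable observations visible at a glance.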
2.6. Adding Plots to the Explore Window
Select "Actions -> Plot" from the Explore window menu
Select "Pie"
The chart shows an equal number of cases for TARGET_B=0 (top) and TARGET_B=1 (bottom)
Select "Window -> Tile" to simultaneously view all sub-windows of the Explore window
2.7. Changing the Explore Window Sampling Defaults
Change the SAS Enterprise Miner preference settings so that the Explore window uses a random sample or all of the data source data by default
Select "Options" -> "Preferences" from the main menu
Select "Sample Method": Random
Select "Fetch Size": Max
2.8. Modifying and Correcting Source Data
Tools to modify the source data for your analysis
Use the "Replacement" node to modify incorrect or improper values for a variable
The process flow reads the raw data and replaces the unwanted values of the observations. However, you must specify which variables have unwanted values and what the correct values are
2.9. Changing the Replacement Node Properties
Modify the default settings of the Replacement node
To replace improper values with missing values
Select the "Default Limits Method" property and select "None" from the Options menu
Select the "Replacement Values" property and select "Missing" from the Options menu
Select "User Specified" as the Limit Method value for DemMedIncome
Enter "1" as the Replacement Lower Limit value
Any DemMedIncome values that fall below the lower limit of 1 are set to missing. All other values of this variable do not change
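The replacement rule configured above amounts to a one-line transformation. The following sketch mimics it outside SAS Enterprise Miner: the income values are hypothetical, None stands in for a missing value, and the function name is invented for the example.

```python
def replace_below_limit(values, lower_limit):
    """Set values below `lower_limit` to missing (None); leave all others unchanged."""
    replaced = [
        None if (v is not None and v < lower_limit) else v
        for v in values
    ]
    # Count only values that were present before and are missing now.
    n_replaced = sum(
        1 for old, new in zip(values, replaced)
        if old is not None and new is None
    )
    return replaced, n_replaced

# Hypothetical DemMedIncome values, where 0 is an improper placeholder.
median_income = [52000, 0, 38750, 0, 61200]

rep_median_income, n_replaced = replace_below_limit(median_income, lower_limit=1)
```

The original list is left untouched and the result is returned as a new list, so the raw values remain available for comparison.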
2.10. Running the Analysis and Viewing the Results
Right-click the Replacement node and click "Run" from the Option menu
Select "Results" to review the analysis outcome
The "Total Replacement Counts" window shows that 2357 observations were modified by the Replacement node.
The "Interval Variables" window summarizes the replacement that was conducted.
The "Output" window provides the same information as the Total Replacement Counts window and the Interval Variables window
2.11. Examining Exported Data
Select the Replacement node in your process flow diagram
Select "Exported Data" -> "..."
The Exported Data - Replacement window appears
This window lists the types of data sets that can be exported from a SAS Enterprise Miner process flow node.
As indicated, only a "Train" data set exists at this stage of the analysis
Select the "TRAIN" table from the Exported Data - Replacement window
Click "Explore" to access the Explore window again
Scroll completely to the right in the data table:
A new column is added to the analysis data: Replacement: Median Income Region.
Notice that the values of this variable match the Median Income Region variable, except when the original variable's value equals zero
The replaced zero value is represented by a period, which indicates a missing value