1. Data Mining and Predictive Analytics

Data-Mining Process
1. Data Sampling
  - Extract a sample of data that is relevant to the business problem
2. Data Preparation
  - Manipulate the data to put it in a form suitable for formal modeling
3. Model Construction
  - Apply the appropriate data-mining technique (regression, classification trees, k-means) to accomplish the desired data-mining task (prediction, classification, clustering, etc.)
4. Model Assessment
  - Evaluate models by comparing performance on appropriate data sets

Data Sampling
1. Simple Random Sampling
  - Simple random sampling, the most commonly used sampling method, randomly chooses some records (a sample of size n) from a population
2. Stratified Sampling
  - Stratified sampling applies when the population has pre-existing segments, of the same or different sizes, that are already classified into a distinct number of subgroups; the sample is drawn from each subgroup.
  - For example, if 1,000 random candidates are to be picked from across the country for a sporting event, it might be a good idea to pick them proportionately from each state.
3. Systematic Sampling
  - Systematic sampling is based on a fixed rule, like picking every fifth or seventh observation from a given population
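
The three sampling methods above can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function names, the fixed seeds, and the population of 1,000 candidates spread over three states are assumptions made for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, seed=0):
    """Pick n records so that every record is equally likely to be chosen."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(records, n)


def stratified_sample(records, key, n, seed=0):
    """Pick roughly n records, proportionately from each subgroup (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding means the total can differ slightly from n.
        share = round(n * len(members) / len(records))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample


def systematic_sample(records, step, start=0):
    """Pick every step-th record, beginning at index start."""
    return records[start::step]


# Hypothetical population: 600 candidates from state "A", 300 from "B", 100 from "C".
population = (
    [("A", i) for i in range(600)]
    + [("B", i) for i in range(300)]
    + [("C", i) for i in range(100)]
)

srs = simple_random_sample(population, 100)
strat = stratified_sample(population, key=lambda r: r[0], n=100)
every_fifth = systematic_sample(population, step=5)
```

In this sketch the stratified sample keeps the 60/30/10 split of the population, whereas the simple random sample only approximates it.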

Variables
1. Numeric Variables (Quantitative)
  - Continuous
    - A continuous variable can take any value between two limits.
    - For example, a height variable can be anything between 0.8 and 2.2 meters. It can take values such as 1.1 m or 1.51 m
  - Discrete
    - These numeric variables take values in steps only. They can take only integer or other predefined values between the given limits
    - For example, the number of children in a family can be 1, 2, 3, or 4, but never 1.5 or 2.34
2. Categorical or Non-numeric Variables (Qualitative)
  - Non-numeric (qualitative, categorical, or binary) variables represent a quality or a characteristic
  - Examples are shirt sizes expressed as S, M, L, XL, and XXL, or distance expressed as near and far.
  - A categorical variable can also be a Boolean value, such as a "pass or fail" or a "yes or no" field.
3. Types of Variables
  - Independent Variables:
    - Active
      - Manipulated independent variables that are given to a group of participants, within a specific period of time during the study
    - Attribute
      - Measured independent variables that cannot be manipulated or systematically changed during the study
      - E.g. Gender, ethnic group
  - Dependent Variables
    - The dependent variable is assumed to measure or assess the effect of the independent variable
    - It is thought of as the presumed outcome or criterion
    - E.g. Test scores, ratings
  - Extraneous Variables
    - Also known as nuisance variables.
    - These variables are not of interest in the study but could influence the dependent variable
    - E.g. Temperature, time

Data Cleaning and Preparation
- Major Tasks in Data Pre-processing:
  1. Data cleaning
    - Data exploration must always be done when receiving new data
    - Data cleaning tasks:
      1. Business and sanity check on every attribute
      2. Handling missing data
      3. Identify outliers
      4. Correct inconsistent/wrong data
      5. Resolve redundancy caused by integration
    - Treatment of missing data / outliers
      - Delete the records with missing data / outliers
      - Ignore the data samples with missing data
      - Data imputation (fill in the missing values)
        - Global constant: impute all missing values with a constant, e.g. “unknown” or 0
        - Attribute mean (or median, mode): impute all missing values with a value computed from the data that are present
        - Prediction model: train a prediction model to predict the most probable value; this needs training data and additional modeling
  2. Data integration (Combine data from disparate sources into meaningful and valuable information etc.)
  3. Data transformation (e.g. express all amounts in the same currency, or transform data from nominal to categorical, such as address to district)
  4. Data reduction (reduce the number of parameters or records (rows), etc.)
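
The first two imputation options listed under data cleaning (a global constant and an attribute statistic) can be sketched as below. The function names and the `incomes` values are hypothetical, missing values are assumed to be represented by `None`, and the prediction-model option is left out because it needs a separate modeling step.

```python
from statistics import mean, median


def impute_constant(values, constant):
    """Replace every missing value (None) with one global constant."""
    return [constant if v is None else v for v in values]


def impute_statistic(values, statistic=mean):
    """Replace every missing value with a statistic of the values that are present."""
    present = [v for v in values if v is not None]
    fill = statistic(present)
    return [fill if v is None else v for v in values]


# Hypothetical attribute with two missing entries.
incomes = [40, None, 60, 50, None, 90]

with_zero = impute_constant(incomes, 0)
with_mean = impute_statistic(incomes, mean)
with_median = impute_statistic(incomes, median)
```

The median is often preferred over the mean when the attribute contains outliers, because extreme values pull the mean but not the median.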

Data Mining:
- Finding patterns or relationships among elements of the data
- Unsupervised learning: Clustering / k-means
Predictive Analysis:
- Finding a pattern (from historical data) so that an opportunity or outcome can be identified before it occurs
- Supervised learning: Classification / Regression
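
To make the contrast concrete, the sketch below fits a least-squares line to labelled history (supervised) and groups unlabelled values with a one-dimensional k-means (unsupervised). It is a toy illustration under stated assumptions: the function names, the data, and the starting centers are invented for the example, and real projects would normally rely on a dedicated tool rather than hand-written routines.

```python
def fit_line(xs, ys):
    """Supervised: learn y = slope * x + intercept from labelled history."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def assign(values, centers):
    """Put each value in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters


def k_means_1d(values, centers, iterations=10):
    """Unsupervised: group unlabelled values around k centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = assign(values, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, assign(values, centers)


# Supervised: historical inputs with known outcomes.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # outcome predicted for an unseen input

# Unsupervised: no outcomes, only the values themselves.
centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
```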

SAS Applications

1. Accessing Prepared Data

1.1. Creating a SAS Enterprise Miner Project
- A SAS Enterprise Miner project contains materials that are related to a particular analysis task
- These materials include analysis process flows, intermediate analysis data sets, and analysis results
- Select File -> New -> Project from the main menu
- Location of the project: Physical location where the project folder is created
1.2. Creating a SAS Library
- A SAS library connects SAS Enterprise Miner with the raw data sources, which are the basis of analysis
- A library can link to a directory on the SAS Foundation Server, a relational database, or even an Excel workbook
- Select File -> New -> Library from the main menu
- Path -> "C:\SDBA\Data_PA" etc.
1.3. Creating a SAS Enterprise Miner Diagram
- A SAS Enterprise Miner diagram workspace contains and displays the steps that are involved in the analysis
- To define a diagram, the user needs to specify its name
- Select File -> New -> Diagram from the main menu
1.4. SAS Data Source
- A data source is a link between an existing SAS table and SAS Enterprise Miner
- To define a data source, you need to select the analysis table and define metadata that is appropriate for your analysis task
1.5. Defining a Data Source
- A data source links SAS Enterprise Miner to an existing analysis table
- To specify a data source, the user needs to define a SAS library and know the name of the table to link to SAS Enterprise Miner
- File -> New -> Data Source
- Source: "SAS Table"
- Table: Select the SAS table that you want to make available to SAS Enterprise Miner
1.6. Defining Column Metadata
- After a data set is specified, the next task is to set the column metadata
- The user needs to know the modeling role and proper measurement level of each variable in the source data set
- Select "Advanced" -> "Customize"
  - Class Levels Count Threshold value: 2 (Only binary numeric variables are treated as categorical variables)
  - Reject Levels Count Threshold value: 100 (Only character variables with more than 100 distinct values are rejected)
- Select "Next" -> "Label" check box
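
The effect of the two thresholds, as described in these notes, can be illustrated with a small Python sketch. This is not SAS Enterprise Miner's actual logic: the function name, the return labels, and the exact comparisons are assumptions that simply encode the two parenthetical rules above.

```python
def suggest_level(values, class_threshold=2, reject_threshold=100):
    """Suggest a measurement level from the number of distinct values.

    Hypothetical rule of thumb based on the notes:
    - numeric column with at most class_threshold levels -> categorical
    - character column with more than reject_threshold levels -> rejected
    """
    present = [v for v in values if v is not None]
    distinct = len(set(present))
    is_numeric = all(isinstance(v, (int, float)) for v in present)
    if is_numeric:
        return "categorical" if distinct <= class_threshold else "interval"
    return "rejected" if distinct > reject_threshold else "categorical"


binary_flag = suggest_level([0, 1, 1, 0])                      # binary numeric
age = suggest_level([23, 45, 67])                              # many numeric levels
shirt = suggest_level(["S", "M", "L"])                         # few character levels
ids = suggest_level([f"id{i}" for i in range(101)])            # 101 character levels
```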

1.7. Finalizing the Data Source Specification
- Click Next to proceed to Decision Configuration

2. Assaying Prepared Data

2.1. Accessing the Explore Window
- Open the Data Sources folder in the Project Panel and right-click the data source of interest. The Data Source Option menu appears
- Select "Explore" from the Option menu

2.2. Changing the Explore Sample Size
- The Sample Method property indicates that the sample is drawn from the top (first 2000 rows) of the data set
- Change the "Sample Method" value field to Random
- Set the "Fetch Size" property to Max
- A random sample reflects the distributional properties of the data and gives the user an idea about the general characteristics of the variables. If the goal is to examine the data for potential problems, it is wise to examine the entire data set
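
Why the sampling method matters can be seen in a short Python sketch that compares the first 2000 rows with a random sample of 2000 rows. The table of ages is invented for the example and is assumed to be stored in sorted order, which is the situation in which a top-of-table sample misleads most.

```python
import random
from statistics import mean

# Hypothetical table of 10,000 ages, stored sorted so the top rows are the youngest.
rng = random.Random(1)
ages = sorted(rng.randint(18, 87) for _ in range(10_000))

top_rows = ages[:2000]                             # "top" sample: first 2000 rows
random_rows = random.Random(2).sample(ages, 2000)  # random sample of the same size

top_mean = mean(top_rows)
random_mean = mean(random_rows)
full_mean = mean(ages)
```

Here the mean of the top rows falls well below the mean of the full table, while the mean of the random sample stays close to it.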

2.3. Creating a Histogram for a Single Variable
- The primary purpose is to create statistical analysis plots
- Select Actions -> Plot from the Explore window menu. The Chart Wizard appears at the "Select a Chart Type" step
- Select "Histogram". Histograms are useful for exploring the distribution of values in a variable
- Click "Next". The Chart Wizard proceeds to the next step, "Select Chart Roles"
- To draw a histogram, one variable must be selected to have the role X
- Select "Role -> X" for the DemAge variable
- Select "Finish". The Explore window is filled with a histogram of the DemAge variable
- Variable descriptions, rather than variable names, are used to label the axes of plots in the Explore window
From the histogram, we can observe that:
- Age has a minimum value of 0 and a maximum value of 87
- The mode occurs in the ninth bin, which ranges between approximately 70 and 78; "Frequency" shows that there are approximately 1400 observations in this range

2.4. Changing the Graph Properties for a Histogram
- Right-click in the data area of the Age histogram and select "Graph Properties" from the Option menu. The Properties - Histogram window appears
- Enter "87" in the Number of X Bins field
- Because Age is integer-valued and the original distribution plot had a maximum of 87, there is one bin per possible Age value
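
The bin arithmetic behind both histograms can be checked with a short Python sketch. It assumes equal-width bins over the observed range 0 to 87, and it assumes that the first plot used ten bins, which is consistent with the ninth bin spanning roughly 70 to 78; the function name is hypothetical.

```python
def bin_edges(minimum, maximum, bins):
    """Return (low, high) edges for equal-width histogram bins."""
    width = (maximum - minimum) / bins
    return [(minimum + i * width, minimum + (i + 1) * width) for i in range(bins)]


# Assumed original plot: ten bins over 0..87, so each bin is 8.7 wide.
ten_bins = bin_edges(0, 87, 10)
ninth_low, ninth_high = ten_bins[8]  # ninth bin (index 8)

# After entering 87 in "Number of X Bins": each bin is exactly 1 wide.
fine_bins = bin_edges(0, 87, 87)
```

With 87 bins the width is one year, so each bar corresponds to a single integer Age value, apart from the final bin, which also includes the maximum of 87.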

2.5. Adding a “Missing” Bin to a Histogram
- Not all observations appear in the histogram for Age. There are many observations with missing values for this variable
- Right-click on the graph and select "Graph Properties" from the Option menu
- Select the "Show Missing Bin" check box
  - With the missing value bin added, it is easy to see that nearly a quarter of the observations are missing an "Age" value

2.6. Adding Plots to the Explore Window
- Select "Actions -> Plot" from the Explore window menu
- Select "Pie"
  - The chart shows an equal number of cases for TARGET_B=0 (top) and TARGET_B=1 (bottom)

- Select "Window -> Tile" to simultaneously view all sub-windows of the Explore window

2.7. Changing the Explore Window Sampling Defaults
- The preference settings of SAS Enterprise Miner can be changed so that the Explore window uses a random sample or all of the data source data
- Select "Options" -> "Preferences" from the main menu
- Select "Sample Method": Random
- Select "Fetch Size": Max

2.8. Modifying and Correcting Source Data
- SAS Enterprise Miner provides tools to modify the source data for your analysis
- Use the "Replacement" node to modify incorrect or improper values for a variable
- The process flow reads the raw data and replaces the unwanted values of the observations. However, the user must specify which variables have unwanted values and what the correct values are

2.9. Changing the Replacement Node Properties
- Modify the default settings of the Replacement node
- To replace improper values with missing values
  - Select the "Default Limits Method" property and select "None" from the Options menu
  - Select the "Replacement Values" property and select "Missing" from the Options menu
  - Select "User Specified" as the Limit Method value for DemMedIncome
    - Enter "1" as the Replacement Lower Limit value
    - Any DemMedIncome values that fall below the lower limit of 1 are set to missing. All other values of this variable do not change
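
The rule configured above (values below a lower limit become missing, everything else is unchanged) can be sketched in Python as follows. This only mimics the behaviour described in these notes; the function name and the income values are hypothetical, and `None` stands in for a missing value.

```python
def replace_below_limit(values, lower_limit):
    """Return a copy in which values below lower_limit become None (missing),
    together with the number of values that were replaced."""
    replaced = [None if v is not None and v < lower_limit else v for v in values]
    count = sum(
        1 for old, new in zip(values, replaced) if old is not None and new is None
    )
    return replaced, count


# Hypothetical median-income column; 0 is an improper value, None is already missing.
median_income = [0, 38750, 0, 71250, None, 52000]

rep_income, replaced_count = replace_below_limit(median_income, lower_limit=1)
```

The original list is left untouched and the result is returned as a new list, which mirrors keeping the original variable alongside its replacement.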

2.10. Running the Analysis and Viewing the Results
- Right-click the Replacement node and click "Run" from the Option menu
- Select "Results" to review the analysis outcome
  - The "Total Replacement Counts" window shows that 2357 observations were modified by the Replacement node.
  - The "Interval Variables" window summarizes the replacement that was conducted.
  - The "Output" window provides the same information as the Total Replacement Counts window and the Interval Variables window

2.11. Examining Exported Data
- Select the Replacement node in your process flow diagram
- Select "Exported Data" -> "..."
- The Exported Data - Replacement window appears
  - This window lists the types of data sets that can be exported from a SAS Enterprise Miner process flow node.
  - As indicated, only a "Train" data set exists at this stage of the analysis
- Select the "TRAIN" table from the Exported Data - Replacement window
- Click "Explore" to access the Explore window again

Scroll completely to the right in the data table:
- A new column is added to the analysis data: Replacement: Median Income Region.
- Notice that the values of this variable match the Median Income Region variable, except when the original variable's value equals zero
- The replaced zero value is represented by a period, which indicates a missing value
