Landscape

Tools

    • Scikit-learn: a great entry point
    • TensorFlow: more complex but powerful; supports distributed training; by Google

Definition

    • A program
    • Learns from experience E (the training set; each item is a training instance or sample)
    • With respect to some task T
    • And a performance measure P
    • Its performance at T, measured by P, improves with experience E

Why ML

    • Automatically adapting to change
    • Deals with complex problems beyond traditional approaches
    • Helps humans learn (data mining)

Types

    • By human supervision
      • Supervised : labeled data
      • Unsupervised
      • Semi-supervised: partially labeled data
        • e.g. photo-hosting services: label one person in a few photos, the system names them in the rest
      • Reinforcement
        • Agent
        • Observes the environment
        • Performs actions
        • Gets rewards or penalties
        • Learns a policy (action strategy) over time
    • By whether they learn incrementally (on the fly)
      • Online
        • Out-of-core learning: train on a huge data set in sequential chunks
        • Learning rate
          • A high rate adapts fast but also forgets old experience fast
        • Challenge: garbage in, garbage out; bad incoming data degrades the system, so monitor performance closely
      • Batch
    • Results
      • Instance-based: compare new data points to known data (find the values of some close known instances and average them)
      • Model based : build a model
        • Model selection
        • Measure the model: a utility (or fitness) function measures how good it is; a cost function measures how bad
        • Model parameters (θ)
        • Training: obtain model parameters
        • Inference: make predictions with model
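
The instance-based vs model-based split above can be sketched on a toy 1-D dataset (an illustrative sketch using NumPy; the data, the k value, and the function names are made up for this example):

```python
import numpy as np

# Toy 1-D dataset: x = feature, y = label (y is roughly 2x + 1 plus a little noise)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def knn_predict(x_new, k=2):
    """Instance-based: average the labels of the k closest training instances."""
    nearest = np.argsort(np.abs(X - x_new))[:k]
    return y[nearest].mean()

# Model-based: training = obtain the model parameters theta that minimize squared error
A = np.column_stack([np.ones_like(X), X])        # design matrix [1, x]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)    # theta = (intercept, slope)

def model_predict(x_new):
    """Inference: plug the new instance into the learned model."""
    return theta[0] + theta[1] * x_new

print(knn_predict(2.5))    # averages the two nearest labels
print(model_predict(2.5))  # evaluates the fitted line
```

Both predictors answer the same question; the instance-based one keeps the whole training set around at prediction time, while the model-based one compresses it into two parameters.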

Data terminology

    • Label: desired solution in supervised learning
    • Features (predictors): the attributes of the data used for prediction

Tasks

    • Classification
    • Regression: predict a target value
    • Clustering
    • Anomaly detection
    • Visualization & dimensionality reduction
      • Feature extraction: merge several features into one (e.g. car mileage and age into wear and tear)
    • Association rule learning
      • e.g. people who buy A also tend to buy B
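
The mileage-and-age feature extraction above can be sketched by projecting onto the first principal component (a sketch assuming NumPy; the synthetic data and the name "wear" are invented for illustration):

```python
import numpy as np

# Synthetic car data: mileage and age are highly correlated,
# so we merge them into one "wear and tear" feature
rng = np.random.default_rng(0)
age = rng.uniform(1, 10, size=100)                  # years
mileage = 1.2 * age + rng.normal(0, 0.3, size=100)  # 10k-mile units, tracks age
X = np.column_stack([mileage, age])

Xc = X - X.mean(axis=0)                     # center each feature
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
wear = Xc @ Vt[0]                           # project onto the first principal component

# How much of the original variance the single extracted feature keeps
explained = wear.var() / Xc.var(axis=0).sum()
print(f"variance kept by 1 component: {explained:.1%}")
```

Because the two features nearly move together, one extracted feature retains almost all the information, which is the point of dimensionality reduction.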

What can go wrong

    • Data
      • Not enough (unreasonable effectiveness of data)
      • Non-representative data
        • Sampling noise: small samples are non-representative by chance
        • Sampling bias: the sampling method itself is flawed
        • Cleaning data
          • Remove bad instances or fix errors?
          • Some instances miss a feature: ignore the feature? Ignore those instances? Fill in values?
      • Irrelevant features
        • Feature engineering
          • Selection
          • Extraction: combine features
          • Gather new data
      • Overfitting: performs well on the training data but generalizes poorly to new data
        • Reason
          • Model pick up noise
          • Model too complex
        • Fix
          • Choose simpler model
          • Regularization: constraining the model
            • Hyperparameter: a parameter of the learning algorithm itself, not of the model
          • More data
          • Reduce noise
      • Underfitting: performs poorly even on the training data
        • Select more powerful model
        • Feature engineering
        • Reduce constraints
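
The regularization fix above can be sketched with ridge regression on a deliberately over-complex polynomial model (a hedged sketch using NumPy; the degree, the alpha value, and the synthetic sine data are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = rng.uniform(-1, 1, 15)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.1, 15)  # noisy samples of sin(3x)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)                             # noise-free new cases

def poly_features(x, degree=14):
    # Degree 14: complex enough to pass through all 15 training points
    return np.vander(x, degree + 1, increasing=True)

def fit(X, y, alpha=0.0):
    # Ridge via the augmented least-squares system; alpha=0 is plain least squares
    n = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(alpha) * np.eye(n)])
    y_aug = np.concatenate([y, np.zeros(n)])
    return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

def mse(theta, x, y):
    return np.mean((poly_features(x) @ theta - y) ** 2)

X_train = poly_features(x_train)
theta_free = fit(X_train, y_train, alpha=0.0)   # unconstrained model
theta_reg = fit(X_train, y_train, alpha=1e-2)   # regularized (constrained) model

# Compare how each model trades training error against error on new cases
print("train MSE:", mse(theta_free, x_train, y_train), mse(theta_reg, x_train, y_train))
print("test MSE: ", mse(theta_free, x_test, y_test), mse(theta_reg, x_test, y_test))
```

The unconstrained model fits the training points almost exactly (it also fits the noise); the regularization term keeps the parameters small, accepting some training error to behave better between the training points.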

Testing & Validating

    • Testing: split the data into a training set and a test set (commonly 80/20)
      • Generalization error (out-of-sample error): the error rate on new cases
      • Overfitting: training error is low but generalization error is high
    • Validation: tuning multiple models and hyperparameters against the same test set can overfit that set; hold out a further portion of the data as a validation set
      • Cross-validation saves data: each slice of the training set takes a turn as the validation set
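
The split-and-validate workflow above can be sketched end to end (a minimal sketch assuming NumPy; the synthetic linear data and the 5-fold choice are illustrative):

```python
import numpy as np

# Synthetic data: 3 features, linear target plus small noise
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# 80/20 train/test split
idx = rng.permutation(len(X))
train, test = idx[:80], idx[80:]

def fit(Xs, ys):
    theta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return theta

def mse(theta, Xs, ys):
    return np.mean((Xs @ theta - ys) ** 2)

theta = fit(X[train], y[train])
train_err = mse(theta, X[train], y[train])
gen_err = mse(theta, X[test], y[test])          # generalization (out-of-sample) error
print("training error:", train_err)
print("generalization error:", gen_err)

# 5-fold cross-validation on the training portion: every instance serves
# once as validation data, so no extra set has to be held out permanently
folds = np.array_split(train, 5)
cv_errors = []
for i in range(5):
    val = folds[i]
    tr = np.concatenate([folds[j] for j in range(5) if j != i])
    cv_errors.append(mse(fit(X[tr], y[tr]), X[val], y[val]))
print("cross-validation error:", np.mean(cv_errors))
```

A large gap between the training error and the generalization (or cross-validation) error is the symptom of overfitting described above.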