Landscape
Tools
- Scikit-Learn: great entry point
- TensorFlow: more complex and more powerful; supports distributed computing; by Google
Definition
- A program
- Learns from experience E (the training set; one item is a training instance or sample)
- With respect to some task T
- And a performance measure P
- If its performance on T, as measured by P, improves with experience E
Why ML
- Adapts automatically to change
- Handles complex problems for which traditional rule-based approaches have no good solution
- Helps humans learn (data mining: discovering patterns in large datasets)
Types
- By amount of human supervision
  - Supervised: training data includes labels (the desired solutions)
  - Unsupervised: training data is unlabeled
  - Semisupervised: partially labeled data (e.g., a photo-hosting service clusters faces unsupervised, then you label one photo per person)
  - Reinforcement
    - An agent
    - Observes the environment
    - Performs actions
    - Gets rewards or penalties
    - Learns a policy (the best strategy of actions) over time; see the sketch below
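A minimal sketch of that agent loop; ToyEnv and its reward scheme are made up purely for illustration (real work would use an environment library such as Gymnasium):

```python
import random

class ToyEnv:
    """Hypothetical environment: the agent's state is a number; the target is 10."""
    def __init__(self):
        self.state = 0

    def step(self, action):                   # action is +1 or -1
        before = abs(10 - self.state)
        self.state += action
        reward = 1 if abs(10 - self.state) < before else -1  # reward moving closer
        return self.state, reward

env = ToyEnv()
policy = {}                                   # state -> action the agent has learned
state = env.state
for _ in range(200):
    action = policy.get(state, random.choice([+1, -1]))  # exploit if known, else explore
    next_state, reward = env.step(action)
    if reward > 0:                            # crude update: remember actions that paid off
        policy[state] = action
    state = next_state
```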
- By ability to learn incrementally on the fly
  - Online: learns from a stream of data, instance by instance or in small mini-batches; see the sketch below
    - Out-of-core learning: trains on huge datasets in steps, loading one chunk at a time
    - Learning rate: how fast the model adapts; a high rate learns fast but also forgets old experience fast
    - Challenge: garbage in, garbage out; bad incoming data degrades a live system, so monitor performance closely
  - Batch: trained offline on the full dataset, then launched; learning new data means retraining from scratch
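A minimal sketch of online / out-of-core learning, assuming scikit-learn's SGDRegressor (which supports incremental fitting via partial_fit); the data stream here is synthetic, and eta0 stands in for the learning rate:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)
rng = np.random.default_rng(42)

for step in range(100):                      # pretend each loop is a fresh mini-batch
    X_batch = rng.uniform(0, 10, size=(32, 1))
    y_batch = 3.0 * X_batch.ravel() + rng.normal(0, 1, 32)  # noisy linear signal
    model.partial_fit(X_batch, y_batch)      # learn incrementally, batch by batch

print(model.predict([[5.0]]))                # roughly 15 once it has converged
```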
- By how they generalize
  - Instance-based: memorize the examples; compare new data points to known ones using a similarity measure (e.g., predict by averaging the values of the closest known instances); contrasting sketch below
  - Model-based: build a model of the data, then use it to make predictions
    - Model selection
    - Measuring the model: a utility (or fitness) function measures how good it is; a cost function measures how bad
    - Model parameters (θ)
    - Training: find the parameter values that minimize the cost function on the training data
    - Inference: make predictions on new cases using the trained model
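A minimal sketch contrasting the two strategies on a synthetic 1-D regression problem, assuming scikit-learn's KNeighborsRegressor (instance-based) and LinearRegression (model-based):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor  # instance-based
from sklearn.linear_model import LinearRegression  # model-based

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 50)

# Instance-based: prediction averages the targets of the k closest known instances.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Model-based: training finds parameter values minimizing a cost function
# (mean squared error here); inference just applies the learned parameters.
lin = LinearRegression().fit(X, y)

x_new = [[4.0]]
print(knn.predict(x_new), lin.predict(x_new))  # both should be near 8
```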
Data terminology
- Label: the desired solution, in supervised learning
- Features (a.k.a. predictors): an instance's attribute values
Tasks
- Classification: predict a class
- Regression: predict a numeric target value
- Clustering
- Anomaly detection
- Visualization & dimensionality reduction
  - Feature extraction: merge several features into one (e.g., car mileage and age into wear and tear); see the sketch below
- Association rule learning
  - E.g., people who buy A also tend to buy B
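A minimal sketch of feature extraction via dimensionality reduction, assuming scikit-learn's PCA; the two synthetic columns stand in for mileage and age, and the single extracted component for "wear and tear":

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
age = rng.uniform(0, 15, 200)                        # years
mileage = 12_000 * age + rng.normal(0, 5_000, 200)   # strongly correlated with age
X = np.column_stack([age, mileage])

pca = PCA(n_components=1)
wear = pca.fit_transform(X)                          # one combined feature per car
print(pca.explained_variance_ratio_)                 # close to 1.0: little info lost
```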
What can go wrong
- Data
  - Not enough data (the "unreasonable effectiveness of data": more data often beats cleverer algorithms)
  - Non-representative data
    - Sampling noise: the sample is too small, so chance introduces noise
    - Sampling bias: the sampling method itself is flawed
  - Dirty data (cleaning)
    - Remove bad instances, or fix the errors?
    - Some instances miss a feature: ignore the feature? Ignore those instances? Fill in the values? (sketch below)
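A minimal sketch of the fill-in option, assuming scikit-learn's SimpleImputer on a tiny hand-made array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],                 # instance with a missing feature
              [7.0, 6.0]])

imputer = SimpleImputer(strategy="median")   # alternatives: drop the row or the column
print(imputer.fit_transform(X))              # NaN replaced by the column median, 4.0
```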
  - Irrelevant features: the fix is feature engineering
    - Feature selection: keep the most useful features
    - Feature extraction: combine existing features into more useful ones
    - Gathering new data to create new features
- Overfitting: the model performs well on the training data but generalizes poorly
  - Reasons
    - The model picks up noise in the training data
    - The model is too complex relative to the amount and noisiness of the data
  - Fixes
    - Choose a simpler model
    - Regularization: constrain the model (sketch below)
      - Hyperparameter: a parameter of the training algorithm itself, not of the model (e.g., how much regularization to apply)
    - Gather more training data
    - Reduce noise in the training data (fix errors, remove outliers)
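A minimal sketch of regularization, assuming scikit-learn's Ridge regression (not named above, just one common choice); alpha is the hyperparameter that constrains the model, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(20, 1))
y = 2.0 * X.ravel() + rng.normal(0, 2, 20)   # small, noisy training set

for alpha in (0.0, 1.0, 100.0):              # stronger alpha -> more constrained model
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)                # the coefficient shrinks as alpha grows
```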
- Underfitting: the model is too simple to learn the underlying structure
  - Fixes
    - Select a more powerful model
    - Feature engineering: feed better features
    - Reduce constraints (e.g., lower the regularization hyperparameter)
Testing & Validating
- Testing: split the data into a training set and a test set (commonly 80/20)
  - Generalization error (out-of-sample error): the error rate on new cases, estimated on the test set
  - Overfitting shows up as low training error but high generalization error
- Validation: tuning multiple models and hyperparameters against the same test set ends up overfitting that set; hold out a separate validation set for model selection
- Cross-validation: split the training set into folds and validate on each in turn, so no data is permanently set aside for validation (sketch below)
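A minimal sketch of this workflow, assuming scikit-learn's train_test_split and cross_val_score on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 100)

# Common 80/20 split; the test set is touched only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)  # 5-fold cross-validation
print(scores.mean())                                     # model-selection estimate

model.fit(X_train, y_train)
print(model.score(X_test, y_test))                       # final check on held-out data
```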