Landscape
Tools
- Scikit-Learn: great entry point
- TensorFlow: more complex and more powerful; supports distributed computing; by Google
Definition
- A program
- Learns from experience E (the training set; one item is a training instance or sample)
- With respect to some task T
- And a performance measure P
- If its performance on T, as measured by P, improves with experience E
Why ML
- Adapts automatically to change
- Handles complex problems for which traditional rule-based approaches have no good solution
- Helps humans learn (data mining: discovering patterns in large datasets)
Types
- By amount of human supervision
  - Supervised: training data includes labels (the desired solutions)
  - Unsupervised: training data is unlabeled
  - Semisupervised: partially labeled data (e.g., a photo-hosting service clusters faces unsupervised, then you label one photo per person)
  - Reinforcement
    - An agent
    - Observes the environment
    - Performs actions
    - Gets rewards or penalties
    - Learns a policy (the best strategy of actions) over time; see the sketch below
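A minimal sketch of that agent loop; ToyEnv and its reward scheme are made up purely for illustration (real work would use an environment library such as Gymnasium):

```python
import random

class ToyEnv:
    """Hypothetical environment: the agent's state is a number; the target is 10."""
    def __init__(self):
        self.state = 0

    def step(self, action):                   # action is +1 or -1
        before = abs(10 - self.state)
        self.state += action
        reward = 1 if abs(10 - self.state) < before else -1  # reward moving closer
        return self.state, reward

env = ToyEnv()
policy = {}                                   # state -> action the agent has learned
state = env.state
for _ in range(200):
    action = policy.get(state, random.choice([+1, -1]))  # exploit if known, else explore
    next_state, reward = env.step(action)
    if reward > 0:                            # crude update: remember actions that paid off
        policy[state] = action
    state = next_state
```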
- By ability to learn incrementally on the fly
  - Online: learns from a stream of data, instance by instance or in small mini-batches; see the sketch below
    - Out-of-core learning: trains on huge datasets in steps, loading one chunk at a time
    - Learning rate: how fast the model adapts; a high rate learns fast but also forgets old experience fast
    - Challenge: garbage in, garbage out; bad incoming data degrades a live system, so monitor performance closely
  - Batch: trained offline on the full dataset, then launched; learning new data means retraining from scratch
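A minimal sketch of online / out-of-core learning, assuming scikit-learn's SGDRegressor (which supports incremental fitting via partial_fit); the data stream here is synthetic, and eta0 stands in for the learning rate:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)
rng = np.random.default_rng(42)

for step in range(100):                      # pretend each loop is a fresh mini-batch
    X_batch = rng.uniform(0, 10, size=(32, 1))
    y_batch = 3.0 * X_batch.ravel() + rng.normal(0, 1, 32)  # noisy linear signal
    model.partial_fit(X_batch, y_batch)      # learn incrementally, batch by batch

print(model.predict([[5.0]]))                # roughly 15 once it has converged
```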
- By how they generalize
  - Instance-based: memorize the examples; compare new data points to known ones using a similarity measure (e.g., predict by averaging the values of the closest known instances); contrasting sketch below
  - Model-based: build a model of the data, then use it to make predictions
    - Model selection
    - Measuring the model: a utility (or fitness) function measures how good it is; a cost function measures how bad
    - Model parameters (θ)
    - Training: find the parameter values that minimize the cost function on the training data
    - Inference: make predictions on new cases using the trained model
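A minimal sketch contrasting the two strategies on a synthetic 1-D regression problem, assuming scikit-learn's KNeighborsRegressor (instance-based) and LinearRegression (model-based):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor  # instance-based
from sklearn.linear_model import LinearRegression  # model-based

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 50)

# Instance-based: prediction averages the targets of the k closest known instances.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Model-based: training finds parameter values minimizing a cost function
# (mean squared error here); inference just applies the learned parameters.
lin = LinearRegression().fit(X, y)

x_new = [[4.0]]
print(knn.predict(x_new), lin.predict(x_new))  # both should be near 8
```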
Data terminology
- Label: the desired solution, in supervised learning
- Features (a.k.a. predictors): an instance's attribute values
Tasks
- Classification: predict a class
- Regression: predict a numeric target value
- Clustering
- Anomaly detection
- Visualization & dimensionality reduction
  - Feature extraction: merge several features into one (e.g., car mileage and age into wear and tear); see the sketch below
- Association rule learning
  - E.g., people who buy A also tend to buy B
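A minimal sketch of feature extraction via dimensionality reduction, assuming scikit-learn's PCA; the two synthetic columns stand in for mileage and age, and the single extracted component for "wear and tear":

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
age = rng.uniform(0, 15, 200)                        # years
mileage = 12_000 * age + rng.normal(0, 5_000, 200)   # strongly correlated with age
X = np.column_stack([age, mileage])

pca = PCA(n_components=1)
wear = pca.fit_transform(X)                          # one combined feature per car
print(pca.explained_variance_ratio_)                 # close to 1.0: little info lost
```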
What can go wrong
- Data
  - Not enough data (the "unreasonable effectiveness of data": more data often beats cleverer algorithms)
  - Non-representative data
    - Sampling noise: the sample is too small, so chance introduces noise
    - Sampling bias: the sampling method itself is flawed
  - Dirty data (cleaning)
    - Remove bad instances, or fix the errors?
    - Some instances miss a feature: ignore the feature? Ignore those instances? Fill in the values? (sketch below)
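A minimal sketch of the fill-in option, assuming scikit-learn's SimpleImputer on a tiny hand-made array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],                 # instance with a missing feature
              [7.0, 6.0]])

imputer = SimpleImputer(strategy="median")   # alternatives: drop the row or the column
print(imputer.fit_transform(X))              # NaN replaced by the column median, 4.0
```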
  - Irrelevant features: the fix is feature engineering
    - Feature selection: keep the most useful features
    - Feature extraction: combine existing features into more useful ones
    - Gathering new data to create new features
- Overfitting: the model performs well on the training data but generalizes poorly
  - Reasons
    - The model picks up noise in the training data
    - The model is too complex relative to the amount and noisiness of the data
  - Fixes
    - Choose a simpler model
    - Regularization: constrain the model (sketch below)
      - Hyperparameter: a parameter of the training algorithm itself, not of the model (e.g., how much regularization to apply)
    - Gather more training data
    - Reduce noise in the training data (fix errors, remove outliers)
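A minimal sketch of regularization, assuming scikit-learn's Ridge regression (not named above, just one common choice); alpha is the hyperparameter that constrains the model, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(20, 1))
y = 2.0 * X.ravel() + rng.normal(0, 2, 20)   # small, noisy training set

for alpha in (0.0, 1.0, 100.0):              # stronger alpha -> more constrained model
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)                # the coefficient shrinks as alpha grows
```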
- Underfitting: the model is too simple to learn the underlying structure
  - Fixes
    - Select a more powerful model
    - Feature engineering: feed better features
    - Reduce constraints (e.g., lower the regularization hyperparameter)
Testing & Validating
- Testing: split the data into a training set and a test set (commonly 80/20)
  - Generalization error (out-of-sample error): the error rate on new cases, estimated on the test set
  - Overfitting shows up as low training error but high generalization error
- Validation: tuning multiple models and hyperparameters against the same test set ends up overfitting that set; hold out a separate validation set for model selection
- Cross-validation: split the training set into folds and validate on each in turn, so no data is permanently set aside for validation (sketch below)
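A minimal sketch of this workflow, assuming scikit-learn's train_test_split and cross_val_score on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 100)

# Common 80/20 split; the test set is touched only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)  # 5-fold cross-validation
print(scores.mean())                                     # model-selection estimate

model.fit(X_train, y_train)
print(model.score(X_test, y_test))                       # final check on held-out data
```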