Structuring Machine Learning Projects
April 2019
Module 1: Introduction to ML Strategy
Best ways to use resources (e.g. data collection, model choice etc.)
Orthogonalization
Have one "knob" per objective so each aspect of performance can be tuned independently:
- Fit training set well on cost function (bigger network, Adam)
- Fit dev set well on cost function (regularization, bigger training set)
- Fit test set well on cost function (bigger dev set)
- Performs well in the real world (change dev set or cost function)
Single number evaluation metric
Precision - of images recognized as cats, what % are actually cats
Recall - What % of actual cats are correctly recognized
F1 score - combines P and R into one number: F1 = 2 / (1/P + 1/R), the harmonic mean of precision and recall
Average scores across regions, for example.
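A minimal sketch of the harmonic-mean formula above (the function name is my own):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2 / (1/P + 1/R)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)

# A classifier with P = 0.95 and R = 0.90:
print(f1_score(0.95, 0.90))  # ~0.9243
```

Because it is a harmonic mean, F1 is dragged down by whichever of P or R is worse, so a classifier cannot score well by maximizing only one of them.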
Satisficing and optimizing metrics
Take into account run time. e.g. accuracy - (0.5 x running time)
Maximize algorithm subject to run time < 100 ms
Wake words/Trigger words - max accuracy subject to < 1 false positive every 24 hours.
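The satisficing/optimizing split above can be sketched as follows (the model list and field names are hypothetical):

```python
def pick_model(models, max_runtime_ms=100):
    """Optimize accuracy subject to a satisficing runtime constraint."""
    feasible = [m for m in models if m["runtime_ms"] < max_runtime_ms]
    return max(feasible, key=lambda m: m["accuracy"]) if feasible else None

models = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},  # best accuracy, but too slow
]
print(pick_model(models)["name"])  # "B"
```

The satisficing metric (runtime) only needs to be good enough; among models that clear it, the optimizing metric (accuracy) decides.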
Train/dev/test distributions
dev ~= hold out cross validation set
If data comes from multiple regions, randomly shuffle it into the dev and test sets. Choose dev/test data to reflect what you expect to get in the future and consider important to do well on.
Size of dev and test sets
Old way: 70 / 30 split (train test)
or 60 train 20 dev 20 test
With 1,000,000 examples: 98% train / 1% dev / 1% test
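A sketch of that 98/1/1 split on a million examples (helper name and seed are my own):

```python
import random

def split_indices(n, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle example indices, then carve off small dev/test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]
    return train, dev, test

train, dev, test = split_indices(1_000_000)
print(len(train), len(dev), len(test))  # 980000 10000 10000
```

The point of the small fractions: 10,000 dev examples are already enough to distinguish classifiers, so there is no need to spend 20% of a million examples on it.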
When to change dev/test sets and metrics
Even if algorithm A has lower error, prefer B if A serves image types you don't want shown.
Weighted error: (1 / m_dev) Σ_i w^(i) L{ŷ^(i) ≠ y^(i)}, where w^(i) = 10 for unwanted image types and 1 otherwise.
- Define a metric to evaluate classifiers
- How to do well on this metric
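A sketch of such a weighted dev-set metric. Following the note above, mistakes on unwanted images get weight 10; here I normalize by the total weight so the score stays in [0, 1] (one common variant of the formula). Data and names are hypothetical:

```python
def weighted_error(y_pred, y_true, unwanted):
    """Weighted error: mistakes on unwanted images count 10x."""
    weights = [10 if u else 1 for u in unwanted]
    mistakes = sum(w for w, p, t in zip(weights, y_pred, y_true) if p != t)
    return mistakes / sum(weights)

# 4 examples; the one mistake on an unwanted image dominates the score.
y_pred   = [1, 0, 1, 1]
y_true   = [1, 1, 1, 0]
unwanted = [False, False, False, True]
print(weighted_error(y_pred, y_true, unwanted))  # 11/13 ~ 0.846
```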
Why human-level performance?
Bayes optimal error - the best theoretically possible error for any classifier
If performance is less than human-level performance:
- Get labelled data from humans
- Why did a person get this right?
- Better analysis of bias/variance
Avoidable bias
If Human error is 1% and train error is 8% focus on bias (underfits)
If Human error is 7% and train error is 8% focus on variance (overfits)
Human-level error as a proxy for Bayes error.
Difference between training error and human error (avoidable bias)
Difference between training error and dev error (variance)
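The two gaps above can be compared directly to decide where to focus (helper name is my own; error values from the examples above):

```python
def diagnose(human_err, train_err, dev_err):
    """Compare avoidable bias (train - human) against variance (dev - train)."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    return "bias" if avoidable_bias > variance else "variance"

print(diagnose(0.01, 0.08, 0.10))  # "bias": 7% avoidable bias vs 2% variance
print(diagnose(0.07, 0.08, 0.12))  # "variance": 1% vs 4%
```

Note the same 8% training error leads to opposite conclusions depending on the human-level (proxy Bayes) error.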
Understanding human-level performance
Surpassing human-level performance
Online advertising
Product recommendation
Logistics (predicting transit time)
Loan approvals
Improving your model performance
Fit the training set pretty well (low avoidable bias): Train bigger model, longer/better optimization algorithm, NN architecture/hyperparameter search
Training set performance generalizes pretty well to the dev/test set (low variance): More data, Regularization, NN architecture/hyperparameter search
Quiz
Error = 100% - Accuracy
Module 2: Introduction to ML Strategy (2)
Carrying out error analysis
e.g. a cat classifier that mislabels some dogs as cats
Get ~100 mislabeled dev set examples. Count up how many are dogs. If 50% are dogs, it is worth looking into them.
Create a spreadsheet of the mislabeled dev set examples with one column per category (e.g. Dog, Great cats, Blurry, Comments)
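The spreadsheet tally can be sketched as a simple count; the tags and counts below are hypothetical (and one example can carry several tags, so percentages can sum to more than 100%):

```python
from collections import Counter

# Hypothetical tags from manually reviewing 100 mislabeled dev examples.
tags = ["dog"] * 8 + ["great cat"] * 43 + ["blurry"] * 61

counts = Counter(tags)
total_reviewed = 100
for tag, n in counts.most_common():
    # n / total_reviewed bounds how much fixing this category can help.
    print(f"{tag}: {n}/{total_reviewed} of errors")
```

The fraction per category is a ceiling on the improvement: if only 8% of errors are dogs, perfect dog handling cuts a 10% error rate to 9.2% at best, so effort is better spent on the larger categories.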
Incorrectly labeled examples
Get ~100 mislabeled dev set examples where a cat was identified but it was not a cat.
Apply same process to your dev and test sets to make sure they come from the same distribution.
Build your first system quickly, then iterate
speech recognition issues: noisy background, far from microphone
Set up the dev/test set and metric.
Build initial system quickly
Use Bias/Variance analysis and error analysis to prioritize next steps.
Training and testing on different distributions
e.g. care about doing well on mobile app images
option 1: Combine 200k images from the web and 10k images from the mobile app, shuffle (train 205k, dev 2.5k, test 2.5k)
option 2: train on web and app then dev and test on mobile app
Speech recognition examples: training 500k (purchased data, smart speaker control, voice keyboard), dev/test 20k (speech activated)
Bias and Variance with mismatched data distribution
Assume humans get ~0% error. Training error 1%, dev error 10% - is the 9% gap variance or data mismatch? Since the dev set comes from a different distribution, you can't tell without a training-dev set.
Training-dev set: Same distribution as training set but not used for training.
Training, training-dev, dev, test
Human-level error -> training set error (avoidable bias) -> training-dev error (variance) -> dev set error (data mismatch) -> test error (degree of overfitting to the dev set)
Compare to other applications, e.g. rear-view mirror speech data vs. general speech recognition.
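The error chain above can be turned into named gaps so the biggest problem stands out (helper name and the example error values are my own, chosen so data mismatch dominates):

```python
def error_gaps(human, train, train_dev, dev, test):
    """Decompose the error chain into its four named gaps (fractions)."""
    return {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
        "overfitting_to_dev": test - dev,
    }

gaps = error_gaps(human=0.00, train=0.01, train_dev=0.015, dev=0.10, test=0.105)
print(max(gaps, key=gaps.get))  # "data_mismatch" (8.5% vs 1%, 0.5%, 0.5%)
```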
Addressing data mismatch
Understand the difference between the training and dev/test sets (e.g. car noise in the background). Take clean audio + car noise and synthesize in-car audio.
Make training data more similar to, or collect more data similar to, the dev/test sets.
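The audio-synthesis idea reduces to adding a noise signal to clean speech; a minimal sketch with stand-in random signals (real use would load actual recordings):

```python
import numpy as np

rng = np.random.default_rng(0)
clean_speech = rng.standard_normal(16000)     # stand-in for 1 s of speech at 16 kHz
car_noise = 0.1 * rng.standard_normal(16000)  # stand-in for quiet background noise

# Mix noise into clean audio to approximate the dev/test distribution.
synthesized = clean_speech + car_noise
print(synthesized.shape)  # (16000,)
```

One caveat: if you only have a small amount of noise (e.g. one clip looped over many hours of speech), the model may overfit to that particular clip even though the audio sounds fine to a human.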
Transfer learning
Train on image recognition, then swap out the last layer and retrain it on radiology images and diagnoses (fine-tuning). Can also retrain multiple layers.
Especially if new problem only has few images.
Makes sense when:
- Task A and B have the same input x
- More data for Task A than Task B
- Low level features from A could be helpful for learning B
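A toy sketch of the fine-tuning idea: a frozen "pretrained" body produces features, and only a newly initialized last layer is trained on the new task. Everything here (the random frozen weights, the synthetic Task B data) is hypothetical, standing in for a real pretrained network:

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.standard_normal((2, 4))  # pretend these weights came from Task A

def features(x):
    # Frozen body: W_frozen is NOT updated during fine-tuning.
    return np.tanh(x @ W_frozen)

# Tiny synthetic Task B dataset.
X = rng.standard_normal((64, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# New last layer, trained from scratch on Task B.
w, b, lr = np.zeros(4), 0.0, 0.5
for _ in range(200):
    h = features(X)
    p = 1 / (1 + np.exp(-(h @ w + b)))  # sigmoid output
    grad = p - y                        # d(logistic loss)/d(logit)
    w -= lr * h.T @ grad / len(y)       # only the new head is updated
    b -= lr * grad.mean()

acc = ((p > 0.5) == y).mean()
print(acc)
```

This mirrors the "swap out the last layer" recipe: the low-level features are reused as-is, so only a handful of parameters need the (small) Task B dataset.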
Multi-task learning
e.g. self driving car needs to identify pedestrians, cars, stop signs, traffic lights. y^(i) is (4, 1)
Y is a (4, m) matrix.
Last output layer has 4 units (pedestrians, etc.)
Loss: ŷ^(i) is (4, 1); J = (1/m) Σ_i Σ_{j=1}^{4} L(ŷ_j^(i), y_j^(i)) with logistic loss L. If some labels are missing, sum only over the labeled j.
One image can have multiple labels
Makes sense when:
- Training on a set of tasks that could benefit from shared lower-level features
- Amount of data you have for each task is quite similar
- Can train a big enough NN to do well on all the tasks
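The multi-task loss above (sum of per-task logistic losses, averaged over examples) can be sketched directly; the 4x2 label/prediction arrays are hypothetical:

```python
import numpy as np

def multitask_loss(y_hat, y):
    """Mean over m examples of the summed per-task logistic losses.

    y_hat, y: (tasks, m) arrays; y entries are 0/1, y_hat in (0, 1).
    """
    eps = 1e-12  # guards log(0)
    per_label = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return per_label.sum(axis=0).mean()  # sum over tasks, average over m

# 4 tasks (pedestrian, car, stop sign, traffic light), m = 2 images.
y     = np.array([[1, 0], [1, 1], [0, 0], [1, 0]], dtype=float)
y_hat = np.array([[0.9, 0.1], [0.8, 0.7], [0.2, 0.1], [0.6, 0.4]])
print(multitask_loss(y_hat, y))  # ~1.0703
```

With partially labeled data, a mask of labeled entries would multiply `per_label` before summing, so question-mark labels contribute nothing.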
What is end-to-end deep learning?
Audio -> features -> phonemes -> words -> transcript
Audio -> transcript (if you have enough data)
Face recognition: detect the face -> zoom in and feed the face to a NN. Done in two steps because there is a lot of data for each sub-task.
Estimating a child's age from an X-ray: image -> bones -> age (two sub-tasks, each with data) vs. image -> age directly (needs a lot of data).
Whether to use end-to-end deep learning?
pros:
- Let the data speak
- Less hand designing
Cons:
- Need large amount of data
- Excludes potentially useful hand-designed components
Do you have sufficient data to learn a function of the complexity needed to map x to y?
image (radar/lidar) -> (DL) cars, pedestrians -> (motion planning) route -> (control) steering