April 2019
Best ways to use resources (e.g. data collection, model choice etc.)
Tune in maximum space
Precision - of images recognized as cats, what % are actually cats
Recall - What % of actual cats are correctly recognized
F1 score - combines these ("average" of P and R): F1 = 2 / (1/P + 1/R), the "harmonic mean" of P and R
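A minimal sketch of these metrics for a binary cat classifier, assuming y_true and y_pred are 0/1 numpy arrays (names are mine, not from the notes):

    import numpy as np

    def precision_recall_f1(y_true, y_pred):
        """Precision, recall, and F1 for binary 0/1 labels."""
        tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted cat, actually cat
        fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted cat, not a cat
        fn = np.sum((y_pred == 0) & (y_true == 1))  # actual cat that was missed
        p = tp / (tp + fp)        # of images recognized as cats, % actually cats
        r = tp / (tp + fn)        # % of actual cats correctly recognized
        f1 = 2 / (1 / p + 1 / r)  # harmonic mean of P and R
        return p, r, f1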
Average scores across regions, for example.
Take into account run time. e.g. accuracy - (0.5 x running time)
Maximize algorithm subject to run time < 100 ms
Wake words/Trigger words - max accuracy subject to < 1 false positive every 24 hours.
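A sketch of picking a model with one optimizing metric (accuracy) and one satisficing metric (runtime); the candidate models and numbers below are made up:

    # (accuracy, runtime_ms) per candidate model -- hypothetical numbers
    models = {"A": (0.92, 80), "B": (0.95, 150), "C": (0.90, 60)}

    # Maximize accuracy subject to runtime < 100 ms (satisficing constraint)
    feasible = {name: (acc, ms) for name, (acc, ms) in models.items() if ms < 100}
    best = max(feasible, key=lambda name: feasible[name][0])
    print(best)  # "A": best accuracy among models that meet the runtime constraint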
dev set ~= hold-out / cross-validation set
If you have data from multiple regions, randomly shuffle it into the dev/test sets (dev/test should reflect the data you expect to get in the future and consider important to do well on).
Old way: 70/30 split (train/test)
or 60% train / 20% dev / 20% test
If you have 1,000,000 examples: 98% train, 1% dev, 1% test
When to change the metric: even if its error is higher, prefer the classifier that doesn't serve images that are not allowed.
Error = (1 / m_dev) Σ_i w^(i) · L{y_pred^(i) ≠ y^(i)}, where w^(i) = 10 if the image is of a type that is not allowed, 1 otherwise (can also normalize by Σ_i w^(i)).
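A sketch of that weighted dev-set error, assuming a 0/1 not_allowed flag per image (names are hypothetical):

    import numpy as np

    def weighted_error(y_pred, y, not_allowed, penalty=10.0):
        """Weighted misclassification error: mistakes on disallowed images count 10x."""
        w = np.where(not_allowed == 1, penalty, 1.0)
        mistakes = (y_pred != y).astype(float)
        return np.sum(w * mistakes) / np.sum(w)  # normalize by sum of weights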
Bayes optimal error - theoretical limit of classifying
If performance is worse than human performance:
If Human error is 1% and train error is 8% focus on bias (underfits)
If Human error is 7% and train error is 8% focus on variance (overfits)
Human-level error as a proxy for Bayes error.
Difference between training error and human error (avoidable bias)
Difference between training error and dev error (variance)
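A tiny sketch that turns these two gaps into a "focus on bias vs variance" decision (the error values are placeholders):

    human_err, train_err, dev_err = 0.01, 0.08, 0.10  # placeholder error rates

    avoidable_bias = train_err - human_err  # human-level error as proxy for Bayes error
    variance = dev_err - train_err

    if avoidable_bias > variance:
        print("Focus on bias: bigger model, train longer, better optimizer/architecture")
    else:
        print("Focus on variance: more data, regularization, architecture search")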
Online advertising
Product recommendation
Logistics (predicting transit time)
Loan approvals
Fit the training set pretty well (low avoidable bias): Train bigger model, longer/better optimization algorithm, NN architecture/hyperparameter search
Training set performance generalizes pretty well to the dev/test set (low variance): More data, Regularization, NN architecture/hyperparameter search
Error = 100% - Accuracy
e.g. identify dogs in pictures of cats
Get ~100 mislabeled dev set examples. Count up how many are dogs. If 50% are dogs then it's worth looking into them.
Create a spreadsheet to categorize the mislabeled dev set examples (e.g. Dog, Great Cats, Blurry, Comments)
Get ~100 mislabeled dev set examples where a cat was identified but it was not a cat.
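A sketch of the error-analysis spreadsheet tally, assuming you have hand-tagged ~100 mislabeled dev examples with categories (the tags below are made up):

    from collections import Counter

    # Hand-assigned tags for mislabeled dev-set examples (hypothetical data)
    tags = [["dog"], ["blurry"], ["great_cat"], ["dog", "blurry"], ["blurry"]]

    counts = Counter(tag for example in tags for tag in example)
    total = len(tags)
    for tag, n in counts.most_common():
        # Ceiling on improvement: fixing this category removes at most n/total of the errors
        print(f"{tag}: {n}/{total} = {100 * n / total:.0f}% of mislabeled examples")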
Apply same process to your dev and test sets to make sure they come from the same distribution.
speech recognition issues: noisy background, far from microphone
Setup dev/test set and metric.
Build initial system quickly
Use Bias/Variance analysis and error analysis to prioritize next steps.
Want the model to do well on mobile app images
Option 1: combine 200k web images and 10k mobile app images, shuffle (train 205k, dev 2.5k, test 2.5k)
Option 2: train on web + app images, then dev and test on mobile app images only (better: dev/test match the distribution you care about)
Speech recognition example: training 500k (purchased data, smart speaker control, voice keyboard), dev/test 20k (speech-activated rear-view mirror)
Assume humans get ~0% error. Training error 1%, dev error 10% - is the 9% gap variance or train/dev data mismatch? Can't tell when train and dev come from different distributions.
Training-dev set: Same distribution as training set but not used for training.
Training, training-dev, dev, test
Human-level error -> training set error (avoidable bias) -> training-dev error (variance) -> dev set error (data mismatch) -> test error (degree of overfitting to the dev set)
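A sketch of reading that chain of gaps off measured errors, assuming a training-dev set as above (numbers are placeholders):

    # Placeholder error rates
    human_err, train_err, train_dev_err, dev_err, test_err = 0.00, 0.01, 0.015, 0.10, 0.11

    gaps = {
        "avoidable bias": train_err - human_err,
        "variance": train_dev_err - train_err,
        "data mismatch": dev_err - train_dev_err,
        "overfitting to dev set": test_err - dev_err,
    }
    # The largest gap suggests what to work on next
    print(max(gaps, key=gaps.get), gaps)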
Compare to other applications, e.g. rear-view mirror speech data compared to general speech recognition
Understand the difference between training and dev/test sets (e.g. dev/test audio has car noise). Take clean audio + car noise and create synthesized in-car audio (sketch below).
Make training data more similar to, or collect more data similar to, the dev/test sets
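A minimal sketch of that kind of artificial data synthesis with numpy, assuming clean speech and car-noise clips as 1-D arrays at the same sample rate (names are hypothetical):

    import numpy as np

    def synthesize_in_car(clean_speech, car_noise, noise_gain=0.3):
        """Mix clean audio with car noise to approximate in-car recordings."""
        # Loop/trim the noise clip to the length of the speech clip
        reps = int(np.ceil(len(clean_speech) / len(car_noise)))
        noise = np.tile(car_noise, reps)[: len(clean_speech)]
        return clean_speech + noise_gain * noise

Caveat from the course: if you only have a short noise clip, the synthesized data covers a tiny slice of all possible car noise and the model may overfit to it.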
Transfer learning: train on image recognition, then swap out the last layer and retrain on radiology images -> diagnosis (fine-tuning). Or retrain multiple layers / the whole network (sketch below).
Especially useful if the new problem only has a few images.
Makes sense when: tasks A and B have the same input x; there is a lot more data for task A (pre-training) than for task B (fine-tuning); low-level features from A could help with learning B.
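A sketch of the "swap out the last layer" idea in Keras; the base model, input size, and single-output radiology head are assumptions, not from the notes:

    import tensorflow as tf

    # Pre-trained image-recognition base; freeze it and replace the output layer
    base = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                          weights="imagenet")
    base.trainable = False  # fine-tune only the new head at first

    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = base(inputs, training=False)
    diagnosis = tf.keras.layers.Dense(1, activation="sigmoid")(features)  # new radiology head
    model = tf.keras.Model(inputs, diagnosis)

    model.compile(optimizer="adam", loss="binary_crossentropy")
    # model.fit(radiology_images, diagnoses)  # small radiology dataset
    # With more data, set base.trainable = True and retrain multiple layers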
e.g. self-driving car needs to identify pedestrians, cars, stop signs, traffic lights. y^(i) is (4, 1)
Y is a (4, m) matrix.
Last output layer has 4 units (pedestrians, etc.)
Loss: ŷ^(i) is (4, 1); cost = (1/m) Σ_{i=1..m} Σ_{j=1..4} L(ŷ_j^(i), y_j^(i)), where L is the logistic loss (sum only over labels j that are actually defined)
One image can have multiple labels
Makes sense when: training on a set of tasks that can benefit from shared lower-level features; (usually) the amount of data per task is similar; you can train a big enough network to do well on all the tasks.
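A sketch of a 4-label multi-task head with a loss that skips missing labels (marked here as -1); the architecture details are assumptions:

    import tensorflow as tf

    def masked_multitask_loss(y_true, y_pred):
        """Logistic loss summed over the 4 labels, skipping entries labeled -1 (unknown)."""
        mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
        y_safe = tf.where(mask > 0, y_true, tf.zeros_like(y_true))
        eps = 1e-7
        per_label = -(y_safe * tf.math.log(y_pred + eps)
                      + (1.0 - y_safe) * tf.math.log(1.0 - y_pred + eps))
        return tf.reduce_mean(tf.reduce_sum(per_label * mask, axis=-1))

    # Shared network with one sigmoid unit per task in the last layer
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(224, 224, 3)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(4, activation="sigmoid"),  # pedestrians, cars, stop signs, lights
    ])
    model.compile(optimizer="adam", loss=masked_multitask_loss)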
Audio -> features -> phonemes -> words -> transcript
Audio -> transcript (end-to-end, if you have enough data)
Face recognition: detect the face -> zoom in and feed the cropped face to a NN. Done in two steps because there is a lot of data for each sub-task.
Image -> bones -> age, compared to end-to-end image -> age (which needs a lot of data)
Pros: lets the data speak; less hand-designing of components needed
Cons: may need a large amount of data; excludes potentially useful hand-designed components
Do you have sufficient data to learn a function of the complexity needed to map x to y?
image (radar/lidar) -> (DL) cars, pedestrians -> (motion planning) route -> (control) steering