Structuring Machine Learning Projects

April 2019

Module 1: Introduction to ML Strategy

Best ways to use resources (e.g. data collection, model choice etc.)

Orthogonalization

Have separate "knobs" so each one tunes a single dimension of performance; the goals below can then be adjusted independently:

  1. Fit training set well on cost function (bigger network, Adam)
  2. Fit dev set well on cost function (regularization, bigger training set)
  3. Fit test set well on cost function (bigger dev set)
  4. Performs well in the real world (change dev set or cost function)

Single number evaluation metric

Precision - of images recognized as cats, what % are actually cats

Recall - What % of actual cats are correctly recognized

F1 score - combines precision and recall into a single number: F1 = 2 / (1/P + 1/R), the harmonic mean of P and R
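A quick sketch of computing these from counts (the counts and function name are illustrative, not from the course):

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = true_pos / (true_pos + false_pos)   # of images flagged as cats, % actually cats
    recall = true_pos / (true_pos + false_neg)      # of actual cats, % correctly flagged
    f1 = 2 / (1 / precision + 1 / recall)           # harmonic mean of P and R
    return precision, recall, f1

# e.g. 90 cats flagged correctly, 10 non-cats flagged, 30 cats missed
print(precision_recall_f1(90, 10, 30))  # (0.9, 0.75, ~0.818)
```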

Average scores across regions, for example.

Satisficing and optimizing metrics

Take running time into account, e.g. a combined metric: accuracy - (0.5 x running time)

Or: maximize accuracy (the optimizing metric) subject to running time < 100 ms (a satisficing metric)

Wake words/Trigger words - max accuracy subject to < 1 false positive every 24 hours.
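A sketch of model selection with one optimizing metric and one satisficing constraint (the classifiers and numbers are made up):

```python
# (name, accuracy, running time in ms) -- illustrative numbers only
classifiers = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

# Satisficing: running time < 100 ms. Optimizing: maximize accuracy.
feasible = [c for c in classifiers if c[2] < 100]
best = max(feasible, key=lambda c: c[1])
print(best)  # ('B', 0.92, 95): C is more accurate but violates the constraint
```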

Train/dev/test distributions

dev ~= hold out cross validation set

If you have data from multiple regions, randomly shuffle it into the dev and test sets so they come from the same distribution. Choose dev/test sets to reflect the data you expect to get in the future and consider important to do well on.

Size of dev and test sets

Old way: 70/30 train/test split,

or 60/20/20 train/dev/test.

With 1,000,000 examples: 98% train, 1% dev, 1% test.
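A minimal sketch of shuffling all regions together and carving off small dev/test slices (fractions follow the note above; the helper is hypothetical):

```python
import random

def train_dev_test_split(examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle examples from every region together, then split into train/dev/test."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_dev = int(len(examples) * dev_frac)
    n_test = int(len(examples) * test_frac)
    dev = examples[:n_dev]
    test = examples[n_dev:n_dev + n_test]
    train = examples[n_dev + n_test:]
    return train, dev, test

# With 1,000,000 examples this gives roughly 980,000 / 10,000 / 10,000.
```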

When to change dev/test sets and metrics

Even if a classifier's error is higher, it can be preferable if it never serves unwanted images; when the current metric no longer ranks classifiers the way you want, change the metric.

Weighted error = (1/m_dev) · Σ_i w^(i) · 1{y_pred^(i) ≠ y^(i)}, where w^(i) = 1 for ordinary images and 10 for the unwanted image type.
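A small sketch of that weighted error (the weight values and names are illustrative):

```python
def weighted_dev_error(y_pred, y_true, weights):
    """(1 / m_dev) * sum_i w^(i) * 1{y_pred^(i) != y^(i)}"""
    m_dev = len(y_true)
    return sum(w * (p != y) for p, y, w in zip(y_pred, y_true, weights)) / m_dev

# w^(i) = 10 for the image type that must not be served, 1 otherwise
```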

  1. Define a metric to evaluate classifiers
  2. Separately, work out how to do well on this metric

Why human-level performance?

Bayes optimal error - the best possible (theoretical) error for the classification task

If performance is worse than human-level performance, you can:

  • Get labelled data from humans
  • Why did a person get this right?
  • Better analysis of bias/variance

Avoidable bias

If human error is 1% and training error is 8%, focus on bias (the model underfits).

If human error is 7% and training error is 8%, focus on variance (the model overfits).

Human-level error as a proxy for Bayes error.

Difference between training error and human error (avoidable bias)

Difference between training error and dev error (variance)
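The two scenarios above, worked through (a sketch; the 10% dev error is added for illustration):

```python
def diagnose(human_error, train_error, dev_error):
    """Use human-level error as a proxy for Bayes error to split the gaps."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

print(diagnose(0.01, 0.08, 0.10))  # (~0.07, ~0.02, 'bias')     -> underfitting
print(diagnose(0.07, 0.08, 0.10))  # (~0.01, ~0.02, 'variance') -> overfitting
```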

Understanding human-level performance

Surpassing human-level performance

Online advertising

Product recommendation

Logistics (predicting transit time)

Loan approvals

Improving your model performance

Fit the training set pretty well (low avoidable bias): Train bigger model, longer/better optimization algorithm, NN architecture/hyperparameter search

Training set performance generalizes pretty well to the dev/test set (low variance): More data, Regularization, NN architecture/hyperparameter search

Quiz

Error = 100% - Accuracy

Module 2: Introduction to ML Strategy (2)

Carrying out error analysis

e.g. identify dogs in pictures of cats

Get ~100 misclassified dev set examples and count how many are dogs. If 50% are dogs, it is worth working on the dog problem.

Create a spreadsheet to categorize the misclassified dev set examples (e.g. columns: Dog, Great Cats, Blurry, Comments).
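A tiny sketch of the tally behind that spreadsheet (the category tags and Counter approach are just one way to do it):

```python
from collections import Counter

# One entry per misclassified dev example; tags are filled in by hand while eyeballing ~100 images.
mislabeled_examples = [
    {"id": 1, "tags": ["dog"], "comment": "pitbull"},
    {"id": 2, "tags": ["great cat", "blurry"], "comment": "lion in rain"},
    # ... roughly 100 rows in practice
]

counts = Counter(tag for ex in mislabeled_examples for tag in ex["tags"])
total = len(mislabeled_examples)
for tag, n in counts.most_common():
    # The percentage is a ceiling on how much fixing that category could help.
    print(f"{tag}: {n} ({100 * n / total:.0f}% of errors)")
```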

Incorrectly labeled examples

Get ~100 misclassified dev set examples and count those where the label itself is wrong, e.g. labeled as a cat when it is not a cat.

Apply the same process to your dev and test sets to make sure they continue to come from the same distribution.

Build your first system quickly, then iterate

speech recognition issues: noisy background, far from microphone

Set up a dev/test set and a metric.

Build initial system quickly

Use Bias/Variance analysis and error analysis to prioritize next steps.

Training and testing on different distributions

Goal: the model should do well on images from the mobile app.

Option 1: combine 200k images from the web and 10k images from the mobile app, then shuffle (train 205k, dev 2.5k, test 2.5k).

Option 2: train on the web images plus some app images; dev and test on mobile app images only.

Speech recognition example: training 500k (purchased data, smart speaker control, voice keyboard), dev/test 20k (speech-activated rearview mirror).

Bias and Variance with mismatched data distribution

Assume humans get 0% error. Training error 1%, dev error 10%.

Training-dev set: Same distribution as training set but not used for training.

Training, training-dev, dev, test

Human-level error -> training set error (gap: avoidable bias) -> training-dev error (gap: variance) -> dev set error (gap: data mismatch) -> test error (gap: degree of overfitting to the dev set)
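The same bookkeeping written out with the training-dev set included (a sketch; the example error values are made up):

```python
def error_gaps(human, train, train_dev, dev, test):
    """Gaps between successive error levels point to the dominant problem."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
        "overfitting to dev": test - dev,
    }

# e.g. human 0%, train 1%, training-dev 9%, dev 10%, test 10.5% -> variance dominates
print(error_gaps(0.0, 0.01, 0.09, 0.10, 0.105))
```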

Can also compare across tasks, e.g. rear-view mirror speech data vs. general speech recognition.

Addressing data mismatch

Understand the differences between the training and dev/test sets (e.g. noisy in-car audio). Then take clean audio + car noise and synthesize in-car audio.

Make the training data more similar to the dev/test sets, or collect more data similar to them.
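A minimal sketch of the synthesis step with NumPy (the arrays, sample rate handling, and mixing gain are all assumptions):

```python
import numpy as np

def synthesize_in_car(clean_speech, car_noise, noise_gain=0.3, seed=0):
    """Mix a random slice of car noise into clean speech to mimic the dev/test audio."""
    # Assumes 1-D float arrays at the same sample rate, with car_noise at least as long as clean_speech.
    rng = np.random.default_rng(seed)
    start = rng.integers(0, len(car_noise) - len(clean_speech) + 1)
    noise_slice = car_noise[start:start + len(clean_speech)]
    return clean_speech + noise_gain * noise_slice

# Caveat: synthesizing many hours from a small noise clip risks overfitting to that clip.
```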

Transfer learning

Pre-train on image recognition, then swap out the last layer and retrain it on radiology images and diagnoses (fine-tuning); with more data, retrain multiple layers.

Especially useful when the new problem has only a few images.

Makes sense when:

  • Task A and B have the same input x
  • More data for Task A than Task B
  • Low level features from A could be helpful for learning B
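A minimal fine-tuning sketch of the "swap out the last layer" step in PyTorch (PyTorch/torchvision and ResNet-18 are my choice here, not the notes'; num_classes is a placeholder for the number of diagnoses):

```python
import torch.nn as nn
from torchvision import models

num_classes = 3  # placeholder: number of radiology diagnosis classes

# Start from weights pre-trained on a large image recognition task (ImageNet).
model = models.resnet18(weights="IMAGENET1K_V1")

# With only a few radiology images, freeze the pre-trained layers...
for param in model.parameters():
    param.requires_grad = False

# ...and swap out the last layer, training only it (unfreeze more layers if more data is available).
model.fc = nn.Linear(model.fc.in_features, num_classes)
```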

Multi-task learning

e.g. self driving car needs to identify pedestrians, cars, stop signs, traffic lights. y^(i) is (4, 1)

Y is a (4, m) matrix.

Last output layer has 4 units (pedestrians, etc.)

Loss: ŷ^(i) is (4, 1); J = (1/m) · Σ_i Σ_{j=1..4} L(ŷ_j^(i), y_j^(i)), the logistic loss summed over the 4 labels.

One image can have multiple labels
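A sketch of that loss in PyTorch (the random tensors stand in for real network outputs and labels; shapes follow the notes, 4 labels per image):

```python
import torch
import torch.nn as nn

m, num_tasks = 8, 4  # batch of m images; labels: pedestrian, car, stop sign, traffic light
logits = torch.randn(m, num_tasks)                 # stand-in for the 4-unit output layer
y = torch.randint(0, 2, (m, num_tasks)).float()    # each image can have several labels set to 1

# Logistic loss for each of the 4 labels, summed over labels and averaged over the m examples.
loss = nn.BCEWithLogitsLoss(reduction="sum")(logits, y) / m
```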

Makes sense when:

  • Training on a set of tasks that could benefit from sharing lower-level features
  • Amount of data you have for each task is quite similar
  • Can train a big enough NN to do well on all the tasks

What is end-to-end deep learning?

Audio -> features -> phonemes -> words -> transcript

Audio -> transcript (works if you have a big enough dataset)

Face recognition: first detect the face, then zoom in and feed the cropped face to a NN. Done in two steps because there is a lot of data for each sub-task.

Estimating age from an X-ray: image -> bones -> age works with less data, compared to end-to-end image -> age (needs a lot of data).

Whether to use end-to-end deep learning?

Pros:

  • Let the data speak
  • Less hand designing

Cons:

  • Need large amount of data
  • Excludes potentially useful hand-designed components

Do you have sufficient data to learn a function of the complexity needed to map x to y?

Image (radar/lidar) -> (DL) cars, pedestrians -> (motion planning) route -> (control) steering