Applied Machine Learning in Python

October 2018

Module 1: Fundamentals of Machine Learning - Intro to Scikit Learn

ML - Learn by labelled examples (training data). ML should generalize the problem. ML learns from experience.

Introduction to Machine Learning with Python book by Andreas Müller is recommended

Key concepts in ML

Supervised learning (predict target values from labelled data). Classification (target values are discrete class), Regression (target values are continuous values).

Classification example: Data item (x) and Target Value (y). f: x -> y.

Training labels are given by human judges (explicit). Search engine: user clicking on a link (implicit)

Unsupervised learning (unlabeled data). Find structure (groups) or outliers.

ML workflow: Represent the learning problem (choose the type of classifier, e.g. image pixels with a k-nearest neighbor classifier) -> evaluation (what distinguishes a good classifier from a bad one, e.g. % correct predictions) -> optimization (search for the settings/parameters that give the best classifier, e.g. the range of the k parameter in a k-nearest neighbor classifier).

Feature representations e.g. E-mail features (word count), picture (pixels), sea creatures (size, stripes etc.), apple (weight, height, color, taste).

Classifiers have trade-off (e.g. speed vs. accuracy)

Python tools for ML

scikit-learn http://scikit-learn.org/stable/modules/classes.html

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Notebook 1: A simple classification task

I downloaded the notebooks and the util file is here.

Shift + Tab to see doc!

Default is a 75% / 25% train / test split.

from sklearn.neighbors import KNeighborsClassifier # instance-based/memory-based (remembers the labelled examples in the training data). An estimator: the general scikit-learn class of objects with a fit method that updates the model.

#X features
#y labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) # random_state seeds the RNG for the split

knn = KNeighborsClassifier(n_neighbors = 5)

knn.fit(X_train, y_train)

knn.score(X_test, y_test)

fruit_prediction = knn.predict([[20, 4.3, 5.5]])
lookup_fruit_name[fruit_prediction[0]]

Clean up/filter data.

Feature space (e.g. all lemons should have similar feature values).

k-NN: training set X_train with labels y_train and given a new instance x_test to be classified.

  1. Find similar instances (X_NN) to x_test that are in X_train.
  2. Get the labels y_NN for the instances in X_NN
  3. Predict the label for x_test by combining the labels y_NN, e.g. by majority vote (with k > 1, take the most common class among the k nearest neighbors) - see the sketch below.
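A minimal numpy sketch of these three steps (toy data made up for illustration, not the course notebook):

import numpy as np
from collections import Counter

# toy training set: 2-D points with class labels 0/1
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 0])
x_test = np.array([5.5, 5.2])
k = 3

dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every training point
nn_idx = np.argsort(dists)[:k]                    # 1. indices of the k nearest neighbors (X_NN)
y_NN = y_train[nn_idx]                            # 2. their labels
prediction = Counter(y_NN).most_common(1)[0][0]   # 3. majority vote
print(prediction)                                 # -> 1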

Choose a query point and find the nearest neighbor.

Decision boundary - points on either side get assigned to different classes.

knn settings:

  • distance metric (Euclidean)
  • number of nearest neighbors (five)
  • optional weighting (ignored)
  • method for aggregating the classes of neighbor points

In this case a n_neighbors parameter of 1 was best (1 nearest neighbor).

In this case training on 80% of the data gave the best results, but accuracy levelled off after 50%.

Optional reading of http://approximatelycorrect.com/2016/11/07/the-foundations-of-algorithmic-bias/

Module 2: Supervised Machine Learning

Supervised = using labelled examples to learn to predict labels for new data.

Objectives:

  • Algorithms learn by estimating their parameters from data to make new predictions
  • Strengths and weakness of supervised learning methods
  • How to apply specific algorithms in scikit-learn
  • Overfitting and how to avoid it

Feature representation : e.g. fruit (mass, width, height, ...)

Data instances/samples/examples (X): e.g. one row

Target value (y): label

Training and test sets: (75/25)

Model/Estimator:

  • Model fitting: trained model
  • Training is estimating model parameters

Evaluation scores (e.g. knn.score)

Classification and regression take training instances and learn a mapping to a target value (e.g. 0 or 1 for classification).

Multi-class (e.g. fruits).

Multi-label (e.g. label topics in a website).

Regression (e.g. predict sale price of house given n rooms, location etc.).

Predictions: k-NN (sensitive to changes in the training data), linear model fit using least-squares (stable but potentially inaccurate predictions).

Optimal model is in-between overfitting and under-fitting.

Over-fitting and under-fitting

Models should be general (predict new previously unseen data; generalization).

Assumption: future data has the same properties as the training set; this breaks down with overfitting (e.g. from a lack of training data, or a model that is too complex for the training data).

Higher variance (captures local extrema) = over-fitting.

Under-fitting: the model doesn't even fit the training data well.

Datasets

from sklearn.datasets import make_regression # to make data for regression
from sklearn.datasets import make_classification # to make classification data (e.g. with two classes)
from sklearn.datasets import make_blobs # to make clustered data
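A quick sketch of how these generators might be called (parameter values are just illustrative):

from sklearn.datasets import make_regression, make_classification, make_blobs

# simple 1-feature regression data with noise
X_R, y_R = make_regression(n_samples=100, n_features=1, n_informative=1, noise=30, random_state=0)

# two-class classification data with 2 informative features
X_C, y_C = make_classification(n_samples=100, n_features=2, n_informative=2,
                               n_redundant=0, n_classes=2, random_state=0)

# clustered data: 8 blobs in 2-D
X_B, y_B = make_blobs(n_samples=100, n_features=2, centers=8, cluster_std=1.3, random_state=4)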

K-NN: Classification and regression

k-NN can be used for regression. The prediction is a very jagged line for k = 1; it is smoother for higher values of k.

from sklearn.neighbors import KNeighborsRegressor

Use R^2 to score model.

Model fitting. Metric: distance function between data points (default is the Minkowski distance with power parameter p=2, i.e. Euclidean).
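Minimal k-NN regression sketch on synthetic data (values illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=100, n_features=1, noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knnreg = KNeighborsRegressor(n_neighbors=5)  # default metric: Minkowski with p=2 (Euclidean)
knnreg.fit(X_train, y_train)
print(knnreg.score(X_test, y_test))          # R^2 on the test set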

Linear regression: Least-squares

Can have multiple parameters e.g. Yprice = 212000 + 109*Xtax - 2000*Xage.

Estimate the coefficients/parameters of the model.

y = w0*x0 + w1*x1 + ... + wn*xn + b. The w are feature weights/model coefficients; b is the constant bias term.

Ordinary least-squares minimizes the mean squared error: the sum of squared differences (RSS; residual sum of squares) between predicted and actual target values.

The algorithm optimizes an objective function, minimizing some loss function (a penalty value for incorrect predictions) of predicted vs. actual target values.

from sklearn.linear_model import LinearRegression
# coef_ : the trailing _ means it is a quantity estimated from the data (a model output)
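Sketch of an ordinary least-squares fit and reading off w and b (synthetic data, illustrative only):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=3, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linreg = LinearRegression().fit(X_train, y_train)
print(linreg.coef_)                  # w: one weight per feature
print(linreg.intercept_)             # b: bias term
print(linreg.score(X_test, y_test))  # R^2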

Linear regression: Ridge, Lasso and Polynomial Regression

Ridge regression - adds a penalty for large values of w; new estimates of w and b are learned, then prediction works like ordinary linear regression. This is called regularization, which prevents over-fitting. Uses L2 regularization. Controlled with the alpha parameter (higher means more regularization). Good if many features have small/medium effects.

from sklearn.linear_model import Ridge
# specify alpha parameter e.g. 20.0

If features are on different scales, the ridge penalty is applied unevenly; feature normalization makes it apply more fairly (and gives faster convergence in learning). Important for regularized regression, k-NN, SVM, NN.

Scale data by Min and Max.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

To avoid data leakage, fit the scaler on the training set only and apply that same fitted scaler to the test set.
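Sketch of leakage-free scaling: fit the scaler on the training split only, reuse it on the test split (synthetic data, alpha value illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit min/max on training data only
X_test_scaled = scaler.transform(X_test)        # same scaler, no refit on test data

ridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)
print(ridge.score(X_test_scaled, y_test))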

Regularization is important when the amount of training data is small compared to the number of features.

Lasso regression - uses L1 regularization: minimizes the sum of absolute values of the weights. Gives sparse solutions; weights go to 0 for some features. alpha default is 1.0. Good if only a few variables have medium/large effects. Helps identify the most important features (e.g. the top 5).

from sklearn.linear_model import Lasso

Polynomial features - 2-D points expanded to 5 dimensions (x0, x1, x0^2, x0*x1, x1^2) for degree two (non-linear basis functions). Beware: higher degree = more complex model = over-fitting.

from sklearn.preprocessing import PolynomialFeatures
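Sketch of degree-2 polynomial features feeding a regularized linear fit (illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=2, noise=20, random_state=0)

poly = PolynomialFeatures(degree=2)    # x0, x1 -> 1, x0, x1, x0^2, x0*x1, x1^2
X_poly = poly.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_poly, y, random_state=0)
ridge = Ridge().fit(X_train, y_train)  # regularization guards against over-fitting
print(ridge.score(X_test, y_test))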

Logistic regression

Binary target (0 or 1). Predicted probability above 0.5 -> positive class, below 0.5 -> negative class.

Logistic regression can separate two clusters; the logistic function gives an S-shaped probability surface.

from sklearn.linear_model import LogisticRegression

Regularization is controlled by the C parameter (L2 by default, C=1; smaller C means more regularization).

Linear Classifiers: Support Vector Machines

Linear classifier: sign function (two outcomes: +1 or -1) applied to the dot product of the weights and x: sign(w.x + b). b is a bias term.

Classifier margin: the width by which the decision boundary area can be increased before hitting a data point.

The best classifier will have the biggest margin (SVM with a linear kernel: LSVM).

from sklearn.svm import SVC
# kernel = 'linear'
# C is the regularization parameter, default 1.0

Smaller C gives a more regularized (smoother) boundary.
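Quick sketch comparing a few C values for a linear SVM on synthetic two-class data (values illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 1.0, 100.0]:  # smaller C = more regularization, smoother boundary
    clf = SVC(kernel='linear', C=C).fit(X_train, y_train)
    print(C, clf.score(X_test, y_test))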

Linear models: Pros and Cons

+ Simple and easy to train

+ fast

+ scale well to large data

+ works well with sparse data

+ predictions are easy to interpret

  • - for low-dimensional data, other models may have superior generalization
  • - for classification, the data may not be linearly separable (need a kernelized SVM or other non-linear model; see below)

Multi-class classification

e.g. type of fruit (4 classes): fit 4 linear models, one per class (one-vs-rest), and predict the class whose model gives the highest score.

Kernelized SVM

e.g. linear classifier is impossible for multiple overlapping clusters.

Map the data into a higher-dimensional space, then use a linear classifier there.

Transform by adding a new dimension, e.g. xi^2: 1-D data becomes a parabola in 2-D, which may then be linearly separable.

For 2-D data: add a third feature 1 - (x0^2 + x1^2).

RBF: Radial Basis Function kernel. K(x, x') = exp(-gamma * ||x - x'||^2). Small gamma = wider kernel (smoother decision boundary).

from sklearn.svm import SVC
# kernel = 'rbf'
# kernel = 'poly', degree=3
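Sketch of a kernelized SVM, varying gamma (small gamma = wider kernel, smoother boundary); data and values are illustrative:

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.01, 1.0, 10.0]:
    clf = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X_train, y_train)
    print(gamma, clf.score(X_test, y_test))

poly_clf = SVC(kernel='poly', degree=3).fit(X_train, y_train)  # polynomial-kernel variant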

+ works well on a range of datasets

+ versatile

  • - slow
  • - careful normalization and parameter tuning (gamma and C)
  • - no probability estimates
  • - difficult to interpret

Cross-validation

Run multiple train-test splits.

k-fold (k=5 or k=10): split the data into k parts and train/validate k models (for k=5, each part serves once as the 20% test split).

from sklearn.model_selection import cross_val_score
# default cv = 3 #k-fold
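Sketch of k-fold cross-validation scores for a classifier (synthetic data; cv value illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
clf = SVC(kernel='linear', C=1.0)

scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold
print(scores, scores.mean())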

Plain k-fold can cause issues though, e.g. a fold can miss an entire class (if the data is sorted by class).

Stratified cross-validation - each fold preserves the class proportions of the whole dataset.

Validation curves show sensitivity to changes in an important parameter (gamma and C).

from sklearn.model_selection import validation_curve
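Sketch of a validation curve over gamma for an RBF SVM (parameter range illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
param_range = np.logspace(-3, 3, 4)

train_scores, test_scores = validation_curve(SVC(), X, y, param_name='gamma',
                                             param_range=param_range, cv=3)
print(train_scores.mean(axis=1))  # mean training score for each gamma
print(test_scores.mean(axis=1))   # mean cross-validation score for each gamma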

Decision trees

Regression and classification. Can help identify influential features.

Start with questions that rule out as many classes as possible; ask more specific questions later.

The root node is the top question; answers are the arrows (branches); leaf nodes at the bottom give the classification.

Ideally separate one class from another.

At each node, split on the feature that gives the most informative split.

from sklearn.tree import DecisionTreeClassifier
# can set max_depth (pre-pruning) or max_leaf_nodes or min_samples_leaf

Feature importance value (0 = not used in prediction, 1 = predicts the target perfectly). Ideally average this over several train-test splits.
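Sketch of a pre-pruned tree and its feature importances (synthetic data; max_depth illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # pre-pruning
print(tree.score(X_test, y_test))
print(tree.feature_importances_)  # 0 = feature not used; values sum to 1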

+ easily visualized and interpreted

+ no need to normalize or scale features

+ work well with datasets using a mixture of feature types

  • - can overfit
  • - need an ensemble of trees for better generalization.

Paper: A Few Useful Things to Know about Machine Learning by Pedro Domingos.

Popular-science article by Ed Yong on Genetic Test of Autism Refuted.

Other methods:

from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

Module 3: Evaluation

Evaluation and selection

Accuracy (% of correct targets) not always best metric.

Represent -> Train -> Evaluate -> Refine.

Reducing false-negative predictions could be one choice, e.g. when detecting cancer cells.

Accuracy = #correct predictions / #total instances

Imbalanced classes (e.g. many normal transactions and only a few fraudulent ones). This is where accuracy fails: a dummy classifier that always predicts the majority class is 99% accurate when there is 1 instance in one class and 99 in the other.

Dummy classifier to see how good model is based on class imbalance. Provide a null accuracy baseline.

from sklearn.dummy import DummyClassifier

The 'constant' strategy always predicts a single specified class (e.g. the positive class); 'most_frequent' always predicts the majority class.
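Sketch of a null-accuracy baseline with DummyClassifier on imbalanced toy data (data made up for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

X = np.random.RandomState(0).rand(200, 3)   # features are ignored by the dummy
y = np.array([0] * 180 + [1] * 20)          # 90% negative, 10% positive
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print(dummy.score(X_test, y_test))          # null accuracy baseline (~0.9 here)

always_pos = DummyClassifier(strategy='constant', constant=1).fit(X_train, y_train)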

If classifier accuracy close to null:

  • Ineffective, erroneous or missing features
  • Poor choice of kernel or hyperparameter
  • Large class imbalance (don't use accuracy metric).

Can use DummyRegressor to predict the mean of the training targets (a regression baseline).

Confusion matrix: TN (true negative), FP (false positive; type I error), FN (false negative; type II error), TP (true positive).

from sklearn.metrics import confusion_matrix

Confusion Matrices & Basic Evaluation Metrics

Accuracy = (TN + TP) / (TN + TP + FN + FP)

Classification error = (FP + FN) / (TN + TP + FN + FP) = 1 - Accuracy

Recall (true positive rate/sensitivity/probability of detection/hit rate) = TP / (TP + FN)

Precision = TP / (TP + FP): use when you don't want FPs. Of the instances predicted positive, the fraction that are actually positive.

False positive rate = FP / (TN + FP) = 1 - specificity: the fraction of actual negatives incorrectly predicted as positive.

Trade off between precision and recall

Recall:

  • Tumor detection
  • Paired with human expert to filter out FP

Precision:

  • Search engine ranking

F1 score = 2 x Precision x Recall / (Precision + Recall) = 2TP / (2TP + FN + FP). The more general F-beta score adds a beta parameter to favour recall or precision: beta = 0.5 is precision-oriented (FPs hurt performance more than FNs), beta = 2 is recall-oriented (FNs hurt performance more than FPs).

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
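Sketch of computing these metrics from predictions (synthetic imbalanced data; logistic regression stands in for any classifier):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, classification_report)

X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))   # rows: true class, columns: predicted class
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred),
      recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))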

Classifier Decision Functions

Uncertainty: decision_function - how confidently the classifier predicts the positive (negative) class: large-magnitude positive (negative) values.

predict_proba - predict class 1 if the predicted probability exceeds the threshold (default 0.5). A higher threshold results in a more conservative classifier. Not all models provide realistic probability estimates.

Can vary decision threshold to adjust precision and recall. As precision goes up recall goes down. Precision-recall curve.

Precision-Recall and ROC Curves

Steepness of P-R curves is important: maximize precision while maximizing recall

ROC curve: true positive rate (Y axis) vs. false positive rate (X axis). AUC: Area Under the Curve (1 = perfect classifier, 0.5 = random guessing).
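Sketch of building both curves from a classifier's decision scores (synthetic data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.decision_function(X_test)  # confidence scores for the positive class

precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)
print(roc_auc_score(y_test, scores))    # area under the ROC curve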

Multi-Class Evaluation

Collect true vs. predicted labels; confusion matrices generalize (e.g. 10 x 10 for the digits dataset).

Overall evaluation metrics are averages across classes.

Multi-label classification is another problem.

Micro vs. macro average precision. Macro: each class has equal weight - compute the metric per class, then average across classes (e.g. 3 classes, even if one class has many more instances). Micro: each instance has equal weight - aggregate counts across all classes, then compute the metric.

If the micro-average is << the macro-average, examine the larger classes for poor performance.

If the macro-average is << the micro-average, examine the smaller classes for poor performance.

Regression Evaluation

Usually it doesn't matter whether the prediction is larger or smaller than the true value - errors in both directions are treated the same.

r^2 score. MAE, MSE, median_absolute_error (robust to outliers).

Can compare to dummy regressors.

Further reading on controlled experiments on the web Kohavi (2007).

Model Selection: Optimizing Classifiers for Different Evaluation Metrics

Accuracy is default score.

Add scoring parameter in code.

from sklearn.model_selection import GridSearchCV

Searches over specified parameter values for an estimator.

from sklearn.metrics.scorer import SCORERS
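Sketch of a grid search over gamma, optimizing a non-default metric via the scoring parameter (values illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid_values = {'gamma': [0.001, 0.01, 0.1, 1, 10]}
grid_clf = GridSearchCV(SVC(), param_grid=grid_values, scoring='roc_auc', cv=3)
grid_clf.fit(X_train, y_train)

print(grid_clf.best_params_, grid_clf.best_score_)
print(roc_auc_score(y_test, grid_clf.decision_function(X_test)))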

Decision boundaries can change depending on which evaluation metric the classifier is optimized for.

Datasets:

  • Training set (model build)
  • Validation set (model selection)
  • Test set (final evaluation)

Accuracy may not be the right metric (FPs and FNs may need to be treated differently, e.g. tumor detection vs. fraud detection).

Module 4: Supervised Machine Learning Part 2

Naive Bayes Classifier

Probabilistic models of how the data might have been generated. Each feature is assumed to be independent of the others (given the class).

Efficient learning and prediction. May not work well in general

Bernoulli: binary features (word present/absent)

Multinomial: discrete features, e.g. word counts

Gaussian: Continuous (assume gaussian distribution).

Decision boundary is parabolic.

Can use partial_fit if the data doesn't fit into memory.
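Sketch of Gaussian Naive Bayes on continuous features (synthetic data); partial_fit would allow training in batches:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nbclf = GaussianNB().fit(X_train, y_train)
print(nbclf.score(X_test, y_test))
# nbclf.partial_fit(X_batch, y_batch, classes=[0, 1])  # incremental alternative when data doesn't fit in memory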

Related to linear models:

  • + easy to understand.
  • + efficient parameter estimation
  • + good with high dimensional data
  • + good baseline against sophisticated model
  • - assume features are independent
  • - confidence estimates are not very accurate

Random Forests

Ensemble (multi models).

Ensemble of trees.

Many decision trees have better generalization.

Ensemble of trees should be diverse.

Random choosing of data in training. Random feature splits.

Bootstrapping: sampling the training data with replacement.

Sensitive to the max_features parameter. When it's 1, the trees differ a lot (a more diverse forest); when it's close to the number of features, the trees will be quite similar.

Each tree makes a prediction. For regression, take the mean of the tree predictions. For classification, each tree gives a probability for each class; the probabilities are averaged across trees and the class with the highest probability is predicted.

The decision boundary has a box-like shape but captures local changes.

  • + widely used, with good performance.
  • + doesn't require extensive parameter tuning
  • + handles a mixture of feature types
  • + easily parallelized.
  • - results are difficult for humans to interpret
  • - may not be good for high-dimensional tasks.

n_estimators: number of trees (default is 10); max_features; max_depth (maximum depth of each tree); n_jobs (how many cores to use).
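Sketch of a random forest using a few of these parameters (synthetic data; values illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_features=3,
                                random_state=0, n_jobs=-1)  # -1 = use all cores
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))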

Gradient Boosted Decision Trees

Build a series of small decision trees. Each tree attempts to correct errors from the previous tree.

Learning rate: how strongly each new tree tries to correct the errors of the previous trees, e.g. high = more correction (a more complex ensemble).

  • + good accuracy off the shelf.
  • + prediction is fast and not memory intensive
  • + doesn't require normalization of features
  • + handles a mixture of feature types
  • - difficult to interpret
  • - tuning of learning_rate.
  • - training requires lots of CPU
  • - not recommended for text classification and other problems with high-dimensional sparse features

n_estimators (number of small decision trees); learning_rate (if small, need a higher n_estimators); max_depth
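Sketch of gradient boosting with those three parameters (values illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                  random_state=0)
gbdt.fit(X_train, y_train)
print(gbdt.score(X_test, y_test))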

Neural Networks

Family of algorithms for deep learning.

Review:

Linear regression: output y = b + w1*x1 + ... + wn*xn (a weighted sum of the input features)

Logistic regression: y = logistic(b + w1*x1 + ... + wn*xn) - constrained to be between 0 and 1.

Multi-layer Perceptron with one hidden layer and tanh activation function (feed forward neural networks).

Output y = v0*h0 + ... + vk*hk, where each hidden unit hi = tanh(w0i*x0 + ... + wni*xn) and the v are the hidden-to-output weights.

Each box in the hidden layer is called a hidden unit and it computes a non-linear weighted sum of the input features.
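A toy numpy forward pass for one hidden layer with tanh, just to make the formula concrete (weights here are random, not trained):

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(4)           # 4 input features
W = rng.rand(3, 4)        # weights into 3 hidden units
b_h = rng.rand(3)         # hidden-unit biases
v = rng.rand(3)           # hidden-to-output weights
b = rng.rand()            # output bias

h = np.tanh(W @ x + b_h)  # each hidden unit: non-linear weighted sum of the inputs
y = v @ h + b             # output: weighted sum of the hidden-unit activations
print(h, y)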

Need more training data and more CPU.

Activation functions: tanh; rectified linear unit; logistic

hidden_layer_sizes (how many hidden units). solver (lbfgs algorithm). alpha (regularization parameter e.g. l2 penalty).

from sklearn.neural_network import MLPRegressor

Increasing alpha constrains the model and makes it more general (simpler).

  • + state-of-the-art models. capture complex features
  • - significant training, data and customization.
  • - careful preprocessing of data needed
  • - difficult when features are different types

hidden_layer_sizes ([100, 100] gives two layers of 100 units); alpha (controls the regularization / weight on fit); activation (e.g. 'relu').
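Sketch of a two-hidden-layer MLP classifier with these settings (synthetic data; values illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()                    # NNs benefit from feature scaling
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

mlp = MLPClassifier(hidden_layer_sizes=[100, 100], activation='relu',
                    alpha=1.0, solver='lbfgs', random_state=0)
mlp.fit(X_train_s, y_train)
print(mlp.score(X_test_s, y_test))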

Further reading on neural networks by Carter and Tanz (2013)

http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.66517&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false

Deep Learning

Feature engineering - finding the right features to use.

Feature learning/feature extraction.

Convolution: filter for a specific pattern.

Subsampling (e.g. pooling): gives robustness to translated or rotated features.

The first layer finds edges and blobs; the second layer creates features built from the first layer's features.

  • + feature detection reducing need for guesswork.
  • - require huge amount of training data.
  • - hard to interpret

Keras and Lasagne use TensorFlow and Theano as back-ends.

Can use GPUs.

Deep Learning in a Nutshell by Tim Dettmers.

Assisting Pathologists in Detecting Cancer with Deep Learning, google AI blog

Data Leakage

Data leakage: the training data contains information about the target that won't be available at prediction time, e.g. using the label itself as a feature, or including test data in the training data.

Using future data, e.g. the length of time a user stayed on the site.

Some features are effectively giveaways, e.g. whether surgery was performed for the condition being predicted.

Are the features highly correlated with the target?

The treachery of leakage by Colin Fraser.

Leakage in data mining Kaufman et al (2011)

Data leakage in a ML competition

Rules of ML by Martin Zinkevich

Unsupervised learning

Data without labels.

Capture structure and information.

Clustering e.g. number of product pages browsed (basic users) vs. number of advanced features used (advanced users). Then recommend products. Find groups in the data. Assign every point in the data to one of the groups.

Transformations: processes that extract or compute information, e.g. density estimation.

Use sklearn.neighbors.KernelDensity for kernel density estimation.

Dimensionality reduction and Manifold Learning

Use fewer features in your data.

e.g. PCA.

Manifold learning algorithms handle data lying on a lower-dimensional curved surface, where PCA struggles. Multi-dimensional scaling (MDS) attempts to find a distance-preserving low-dimensional projection.

t-SNE finds a 2-D representation of higher-dimensional data, preserving information about each point's neighbors. See https://github.com/lvdmaaten/bhtsne and http://lvdmaaten.github.io/tsne/
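Sketch reducing the same data to 2-D with PCA, MDS and t-SNE (parameters illustrative):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                    # linear projection
X_mds = MDS(n_components=2).fit_transform(X)                    # distance-preserving embedding
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)  # neighbor-preserving embedding
print(X_pca.shape, X_mds.shape, X_tsne.shape)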

Clustering

Divide data into groups

Hard clustering: each data point belongs to exactly one cluster.

Soft clustering: each data point is assigned a weight, score or probability of membership for each cluster.

k-means: choose k and it finds k cluster centers, assigning each data point to the nearest center.

It starts by guessing cluster centers, assigns each point to the nearest center, then updates each center by replacing it with the mean of all points assigned to that cluster, repeating until the centers stop moving. See https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
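Sketch of k-means on blob data (k chosen to match the generated blobs; illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y_true = make_blobs(n_samples=200, centers=3, random_state=10)

kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X)   # cluster assignment for every point
print(kmeans.cluster_centers_)   # the k learned cluster centers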

Doesn't work well for irregular, complex clusters.

k-medoids can work with categorical features

Agglomerative clustering - e.g. keep merging clusters until 3 clusters remain. Ward (default), average and complete linkage criteria decide which clusters to merge.

Can visualize using a dendrogram. Can help figure out right number of clusters.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Doesn't need the number of clusters to be specified in advance. Identifies noise points (labelled -1).
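Sketch of DBSCAN; eps and min_samples set the density threshold (values illustrative), noise points get label -1:

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=9)

dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X)   # -1 marks noise; no need to choose the number of clusters
print(set(labels))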

Multiple clusterings can be valid for the same data; it's hard to know how many clusters to use.

Distill (2016) How to Use t-SNE Effectively

Gleesen (2017) How Machines Make Sense of Big Data: an Introduction to Clustering Algorithms