October 2018
Taught by Kevyn Collins-Thompson
ML - learns from labelled examples (training data). ML should generalize beyond the training data. ML learns from experience.
Introduction to Machine Learning with Python book by Andreas Müller is recommended
Supervised learning (predict target values from labelled data). Classification (target values are discrete classes), Regression (target values are continuous values).
Classification example: Data item (x) and Target Value (y). f: x -> y.
Training labels are given by human judges (explicit). Search engine: user clicking on a link (implicit)
Unsupervised learning (unlabeled data). Find structure (groups) or outliers.
ML workflow: represent the learning problem (choose a type of classifier, e.g. image pixels with a k-nearest neighbor classifier) -> evaluation (what separates good from bad, e.g. % correct predictions) -> optimization (search for the settings/parameters that give the best classifier, e.g. the range of values for the k parameter in the k-nearest neighbor classifier).
Feature representations e.g. E-mail features (word count), picture (pixels), sea creatures (size, stripes etc.), apple (weight, height, color, taste).
Classifiers have trade-off (e.g. speed vs. accuracy)
scikit-learn http://scikit-learn.org/stable/modules/classes.html
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
I downloaded the notebooks and the util file is here.
Shift + Tab to see doc!
Default is a 75% / 25% train/test split.
from sklearn.neighbors import KNeighborsClassifier # instance based/memory based (remembers the labelled examples in the training data). Estimator (general class, has a fit method to train the knn)
#X features
#y labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) # random_state seeds the RNG
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
fruit_prediction = knn.predict([[20, 4.3, 5.5]])
lookup_fruit_name[fruit_prediction[0]]
Clean up/filter data.
Feature space (e.g. lemons should have similar feature values).
k-NN: training set X_train with labels y_train and given a new instance x_test to be classified.
Choose a query point and find its nearest neighbor(s).
Decision boundary - points on either side get assigned to different classes.
knn settings:
In this case an n_neighbors parameter of 1 was best (1 nearest neighbor).
In this case training on 80% of the data gave the best results, but accuracy levelled off after 50%.
Optional reading of http://approximatelycorrect.com/2016/11/07/the-foundations-of-algorithmic-bias/
Supervised = using labelled examples to learn how to predict labels.
Objectives:
Feature representation : e.g. fruit (mass, width, height, ...)
Data instances/samples/examples (X): e.g. one row
Target value (y): label
Training and test sets: (75/25)
Model/Estimator:
Evaluation scores (e.g. knn.score)
Classification and regression take training instances and learn a mapping to a target value (e.g. 0 or 1 for classification).
Multi-class (e.g. fruits).
Multi-label (e.g. label topics in a website).
Regression (e.g. predict sale price of house given n rooms, location etc.).
Predictions: k-NN (sensitive to changes in the training data), linear model fit using least-squares (stable but potentially inaccurate predictions).
Optimal model is in-between overfitting and under-fitting.
Models should be general (predict new previously unseen data; generalization).
Assumption: future data has the same properties as the training set. Generalization fails with overfitting (often due to a lack of training data, or a model that is too complex for the training data).
Higher variance (captures local extrema) = over-fitting.
Under-fitting: doesn't fit the training data well.
from sklearn.datasets import make_regression # to make data for regression
from sklearn.datasets import make_classification # to make two classifications data, for example
from sklearn.datasets import make_blobs # to make clustered data
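A minimal sketch of generating synthetic datasets with these helpers (the parameter values here are illustrative, not from the course notebooks):
from sklearn.datasets import make_regression, make_classification, make_blobs
# 1-feature regression data with noise
X_R, y_R = make_regression(n_samples=100, n_features=1, noise=30, random_state=0)
# 2-feature binary classification data
X_C, y_C = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, random_state=0)
# 3 clustered blobs
X_B, y_B = make_blobs(n_samples=100, centers=3, random_state=0)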
k-NN can be used for regression. The prediction line is very jagged if k = 1; it is smoother for higher values of k.
from sklearn.neighbors import KNeighborsRegressor
Use R^2 to score model.
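A minimal k-NN regression sketch (synthetic data via make_regression; the noise level and k are illustrative):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
X, y = make_regression(n_samples=100, n_features=1, noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knnreg = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print(knnreg.score(X_test, y_test))  # score() returns R^2 on the test set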
Model fitting. Metric: distance function between data points (the default is the Minkowski distance with power parameter p=2, i.e. Euclidean).
Can have multiple parameters e.g. Yprice = 212000 + 109*Xtax - 2000*Xage.
Estimate coefficient or parameters for the model
y = w0*x0 + w1*x1 + ... + wn*xn + b. The w's are feature weights/model coefficients; b is a constant bias term.
Ordinary least-squares minimizes the sum of squared differences (RSS; residual sum of squares) between predicted and actual target values (equivalently, the mean squared error).
The algorithm optimizes an objective function: it minimizes some kind of loss function (a penalty value for incorrect predictions) of predicted vs. actual target values.
from sklearn.linear_model import LinearRegression
# coef_ : the trailing _ means it's a model output (an attribute learned from the data)
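A least-squares linear regression sketch on synthetic data (an assumed setup, not the course dataset):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X, y = make_regression(n_samples=100, n_features=1, noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)
print(linreg.coef_, linreg.intercept_)  # learned w (feature weights) and b (bias term)
print(linreg.score(X_test, y_test))     # R^2 on the test set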
Ridge regression - adds a penalty for large values of w; you get different estimates of w and b, but otherwise it's the same linear model as ordinary regression. This is called regularization, which prevents over-fitting. Uses L2 regularization. Control it with the alpha parameter (higher means more regularization). Good if there are many features with small/medium effects.
from sklearn.linear_model import Ridge
# specify alpha parameter e.g. 20.0
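A ridge regression sketch (synthetic many-feature data; alpha=20.0 follows the note above but is otherwise arbitrary):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
X, y = make_regression(n_samples=200, n_features=30, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
linridge = Ridge(alpha=20.0).fit(X_train, y_train)  # higher alpha = more regularization
print(linridge.score(X_test, y_test))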
If features have different scales, the ridge penalty applies unevenly, so normalize the features (feature normalization also gives faster convergence in learning), e.g. for regularized regression, k-NN, SVM, NN.
Scale data by Min and Max.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
Avoid data leakage: fit the scaler on the training set only and use the same scaler on the test set.
Regularization matters most when the training set is small compared to the number of features.
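A min-max scaling sketch that avoids leakage (synthetic data; the point is fitting the scaler on the training split only):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
X, y = make_regression(n_samples=200, n_features=30, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same scaler (no data leakage)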
Lasso regression - uses L1 regularization. Minimizes the sum of the absolute values of the weights. Gives sparse solutions: the weights go to 0 for some features. The alpha default is 1.0. Good if only a few variables have medium/large effects. Helps identify the most important features (e.g. the top 5).
from sklearn.linear_model import Lasso
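A lasso sketch on scaled synthetic data showing the sparse weights (alpha=2.0 and max_iter are arbitrary choices):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
X, y = make_regression(n_samples=200, n_features=30, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler().fit(X_train)
linlasso = Lasso(alpha=2.0, max_iter=10000).fit(scaler.transform(X_train), y_train)
print((linlasso.coef_ != 0).sum(), 'features with non-zero weights')  # sparse solution
print(linlasso.score(scaler.transform(X_test), y_test))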
Polynomial features - 2d points to 5 dimensions (x0, x1, x0^2, x0*x1, x1^2) (degree two) (non-linear basis functions). Beware: a higher degree = more complex = over-fitting.
from sklearn.preprocessing import PolynomialFeatures
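A polynomial-features sketch (degree 2, combined with ridge to tame the added complexity; data and parameters are illustrative):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
X, y = make_regression(n_samples=100, n_features=2, noise=30, random_state=0)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # expands 2 features to 1, x0, x1, x0^2, x0*x1, x1^2
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, random_state=0)
print(Ridge(alpha=1.0).fit(X_train, y_train).score(X_test, y_test))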
Binary target (0 or 1). The logistic output is a probability: positive class if > 0.5, negative class if < 0.5.
Can do logistic regression between two clusters. (S-shaped sheet).
from sklearn.linear_model import LogisticRegression
Regularization is controlled by the C parameter (L2 by default, C=1; smaller C means more regularization).
Linear classifier: the sign function (two outcomes: +1 or -1) of the dot product of the weights and x: w.x + b. b is a bias term.
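A logistic regression sketch on synthetic binary data (C=1 is the default; smaller C would regularize more):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(C=1.0).fit(X_train, y_train)
print(clf.predict(X_test[:5]))        # hard 0/1 predictions
print(clf.predict_proba(X_test[:5]))  # the probabilities behind the 0.5 threshold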
Classifier margin: width a decision boundary area can be increased before hitting a data point.
The best classifier will have the biggest margin (SVM with a linear kernel: LSVM).
from sklearn.svm import SVC
# kernel = 'linear'
# C is the regularization parameter (default 1.0)
Smaller C gives a more regularized, smoother boundary.
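A linear SVM sketch (synthetic data; C=1.0 is the default value mentioned above):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X_train, y_train)  # smaller C = smoother boundary
print(clf.score(X_test, y_test))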
+ Simple and easy to train
+ fast
+ scale well to large data
+ works well with sparse data
+ predictions are easy to interpret
e.g. type of fruit (4 classes): get 4 one-vs-rest linear models.
e.g. a linear classifier is impossible for multiple overlapping clusters.
Convert the data to a higher-dimensional space and then use a linear classifier.
Transform to another dimension, e.g. add xi^2: a 1d problem becomes a parabola.
For 2d: add the feature 1 - (x0^2 + x1^2).
RBF: Radial Basis Function kernel. K(x, x') = exp(-gamma * ||x - x'||^2). A small gamma gives a wide kernel (smoother boundary).
from sklearn.svm import SVC
# kernel = 'rbf'
# kernel = 'poly', degree=3
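A kernelized SVM sketch (rbf and poly kernels; the gamma/degree values are illustrative, not tuned):
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf_rbf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X_train, y_train)  # smaller gamma = wider kernel
clf_poly = SVC(kernel='poly', degree=3).fit(X_train, y_train)
print(clf_rbf.score(X_test, y_test), clf_poly.score(X_test, y_test))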
+ works well on a range of datasets
+ versatile
Run multiple train-test-split.
k-fold (k=5 or k=10): split the data into k parts and train/validate k models (with k=5, each model is tested on a different 20% split).
from sklearn.model_selection import cross_val_score
# default cv = 3 #k-fold
Plain k-fold can cause issues though, e.g. a fold can miss an entire class.
Stratified cross-validation - each fold keeps the class proportions of the full dataset.
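A cross-validation sketch (cross_val_score uses stratified k-fold for classifiers by default; cv=5 here is illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores, scores.mean())  # one score per fold, plus the average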
Validation curves show sensitivity to changes in an important parameter (gamma and C).
from sklearn.model_selection import validation_curve
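A validation_curve sketch showing sensitivity to gamma for an RBF SVM (the parameter range is an arbitrary example):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=0)
param_range = np.logspace(-3, 3, 4)
train_scores, test_scores = validation_curve(SVC(kernel='rbf'), X, y, param_name='gamma', param_range=param_range, cv=3)
print(train_scores.mean(axis=1))  # high train scores with low test scores indicate over-fitting
print(test_scores.mean(axis=1))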
Decision trees: regression and classification. Can help identify influential features.
Start with splits that rule out as many possibilities as possible; ask more specific questions later.
Root node is top question. Answer is arrow. Leaf node is bottom classification.
Ideally separate one class from another.
Choose the feature that leads to the most informative split.
from sklearn.tree import DecisionTreeClassifier
# can set max_depth (pre-pruning) or max_leaf_nodes or min_samples_leaf
Feature importance values (0 = not used in prediction, 1 = predicts the target perfectly). Ideally average these over a few train-test splits.
+ easily visualized and interpreted
+ no need to normalize or scale
+ work well with datasets using a mixture of feature types
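A decision tree sketch with pre-pruning (max_depth/min_samples_leaf values are illustrative; the data is synthetic):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.feature_importances_)  # 0 = unused feature; values sum to 1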
Paper: A Few Useful Things to Know about Machine Learning by Pedro Domingos.
Popular-science article by Ed Yong on Genetic Test of Autism Refuted.
Other methods:
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
Accuracy (% of correct predictions) is not always the best metric.
Represent -> Train -> Evaluate -> Refine.
Reducing false-negative predictions could be one choice, e.g. when detecting cancer cells.
Accuracy = #correct predictions / #total instances
Imbalanced classes (e.g. many normal transactions and only a few fraudulent). This is where accuracy fails: always predicting the majority class can look accurate (e.g. 1 instance in one class and 99 in the other gives 99% accuracy).
Use a dummy classifier to see how good the model is relative to the class imbalance. It provides a null accuracy baseline.
from sklearn.dummy import DummyClassifier
The 'constant' strategy (with the positive class as the constant) always predicts the positive class.
If the classifier's accuracy is close to the null baseline, revisit the features, model, or parameters.
Can use a DummyRegressor to predict e.g. the mean of the training targets.
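A dummy baseline sketch ('most_frequent' for classification, 'mean' for regression; the data is synthetic and deliberately imbalanced):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier, DummyRegressor
X, y = make_classification(n_samples=200, n_features=4, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print(dummy.score(X_test, y_test))  # null accuracy baseline from class imbalance alone
dummy_reg = DummyRegressor(strategy='mean').fit(X_train, y_train.astype(float))
print(dummy_reg.predict(X_test[:3]))  # always predicts the training-target mean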
TN (true negative), FP (false positive; type 1 error), FN (false negative; type 2 error), TP (true positive). Called confusion matrix.
from sklearn.metrics import confusion_matrix
Accuracy = (TN + TP) / (TN + TP + FN + FP)
Classification error = (FP + FN) / (TN + TP + FN + FP) = 1 - Accuracy
Recall (true positive rate/sensitivity/probability of detection/hit rate) = TP / (TP + FN)
Precision = TP / (TP + FP): penalizes false positives. Of the instances labelled positive, N % are correctly labelled.
False positive rate (fall-out/false alarm rate) = FP / (TN + FP). Specificity = TN / (TN + FP) = 1 - FPR.
Trade off between precision and recall
Recall-oriented tasks: missing a positive is costly (e.g. tumor detection, legal discovery).
Precision-oriented tasks: false positives are costly (e.g. search engine ranking, spam filtering).
F1 score = 2 * Precision * Recall / (Precision + Recall) = 2TP / (2TP + FN + FP). The F-beta score adds a beta parameter to weight recall vs. precision: beta = 0.5 for precision-oriented tasks (FPs hurt performance more than FNs); beta = 2 for recall-oriented tasks (FNs hurt performance more than FPs).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
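A sketch computing these metrics on a held-out test set (binary synthetic data; the classifier choice is arbitrary):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = LogisticRegression().fit(X_train, y_train).predict(X_test)
print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted class
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))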
Uncertainty: decision_function - how confidently the classifier predicts the positive (negative) class (large-magnitude positive (negative) values).
predict_proba - predict class 1 if the probability > 0.5. A higher threshold results in a more conservative classifier. Not all models provide realistic probability estimates.
Can vary decision threshold to adjust precision and recall. As precision goes up recall goes down. Precision-recall curve.
Steepness of P-R curves is important: maximize precision while maximizing recall
ROC curve: TP rate (hit rate) on the Y axis, FP rate (false alarm rate) on the X axis. AUC: Area Under the Curve (0.5 = random, 1.0 = perfect).
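A sketch of precision-recall and ROC curves from decision_function scores (the same kind of synthetic binary setup as above):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
y_scores = clf.decision_function(X_test)                     # classifier confidence scores
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_scores)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_scores)       # TP rate vs. FP rate
print(roc_auc_score(y_test, y_scores))                       # area under the ROC curve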
Multi-class evaluation: a collection of true vs. predicted labels. Confusion matrices (e.g. 10 x 10 for the digits dataset).
Overall evaluation metrics are averages across classes.
Multi-label classification is another problem.
Micro vs. macro average precision. Macro: each class has equal weight - compute the metric per class, then average across classes (even if one class has more instances). Micro: each instance has equal weight - aggregate the counts across all classes, then compute the metric.
If the micro-average is << the macro-average, examine the larger classes for poor performance.
If the macro-average is << the micro-average, examine the smaller classes for poor performance.
Regression evaluation: predictions can be larger or smaller than the actual values (over- vs. under-prediction).
r^2 score. MAE (mean absolute error), MSE (mean squared error), median_absolute_error (robust to outliers).
Can compare to dummy regressors.
Further reading on controlled experiments on the web Kohavi (2007).
Accuracy is default score.
Add scoring parameter in code.
from sklearn.model_selection import GridSearchCV
Searches over specified parameter values for an estimator.
from sklearn.metrics.scorer import SCORERS # dictionary of the available scoring metrics
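A GridSearchCV sketch tuning gamma for an RBF SVM and optimizing AUC instead of accuracy (the grid values are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
grid = GridSearchCV(SVC(kernel='rbf'), param_grid={'gamma': [0.001, 0.01, 0.1, 1, 10]}, scoring='roc_auc', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)  # best gamma by cross-validated AUC
print(grid.score(X_test, y_test))           # evaluated with the same 'roc_auc' scorer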
Decision boundaries can change depending on which evaluation metric the model is optimized for.
Datasets:
Accuracy may not be the right metric (FP and FN may need to be treated differently, e.g. tumor detection vs. fraud detection).
Naive Bayes: probabilistic models of how the data might have been generated. Assumes each feature is independent of the others.
Efficient learning and prediction, but may not generalize as well.
Bernoulli: binary features (word present/absent)
Multinomial: discrete features, e.g. word counts
Gaussian: continuous features (assumes a Gaussian distribution).
Decision boundary is parabolic.
Can use partial_fit if the data doesn't fit into memory.
Naive Bayes classifiers are related to linear models.
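A Gaussian Naive Bayes sketch (synthetic continuous features; partial_fit shown as the out-of-memory option mentioned above):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
nbclf = GaussianNB().fit(X_train, y_train)
print(nbclf.score(X_test, y_test))
# streaming variant: feed the data in chunks
nb2 = GaussianNB()
for chunk in np.array_split(np.arange(len(X_train)), 3):
    nb2.partial_fit(X_train[chunk], y_train[chunk], classes=np.unique(y))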
Ensembles (combine multiple models).
Random forest: an ensemble of trees.
Many decision trees combined give better generalization than a single tree.
The trees in the ensemble should be diverse.
Randomly choose the training data for each tree (bootstrapping: sampling with replacement). Random feature splits.
Sensitive to the max_features parameter: when it's 1 the forest is more diverse; when it's close to the number of features the trees will be quite similar.
Make a prediction with every tree. For regression, take the mean of the predictions. For classification, each tree gives a probability for each class; the probabilities are averaged across trees and the class with the highest probability is predicted.
The decision boundary has a box-like shape but with local variations.
n_estimators: number of trees (default is 10); max_features; max_depth (splitting of the trees); n_jobs (how many cores to use).
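A random forest sketch (n_estimators=10 matches the stated default; max_features and the data are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_features=3, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.feature_importances_)  # importances averaged over the trees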
Build a series of small decision trees. Each tree attempts to correct errors from the previous tree.
Learning rate: how strongly each new tree corrects the errors of the previous ones (high = more correction).
n_estimators (number of small decision trees); learning_rate (if small, need a higher n_estimators); max_depth.
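A gradient boosting sketch (the learning_rate, n_estimators and max_depth values shown are the usual scikit-learn defaults, written out explicitly):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # a lower learning_rate usually needs more estimators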
Family of algorithms for deep learning.
Review:
Linear regression: output y = b + w1*x1 + ... + wn*xn from the input features.
Logistic regression: y = logistic(b + w1*x1 + ... + wn*xn) - constrained to be between 0 and 1.
Multi-layer Perceptron with one hidden layer and tanh activation function (feed forward neural networks).
Output y = v0*h0 + ... + vk*hk, where each hidden unit computes hi = tanh(wi0*x0 + ... + win*xn) and the v's weight the hidden-unit outputs.
Each box in the hidden layer is called a hidden unit and it computes a non-linear weighted sum of the input features.
Need more training data and more CPU.
Activation functions: tanh; rectified linear unit; logistic
hidden_layer_sizes (how many hidden units); solver (e.g. the lbfgs algorithm); alpha (regularization parameter, e.g. L2 penalty).
from sklearn.neural_network import MLPRegressor
Increasing alpha constrains the model and makes it more general.
hidden_layer_sizes ([100, 100] gives two layers of 100 units); alpha (controls the weight on the fit); activation (e.g. 'relu').
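An MLP sketch with two hidden layers (scaled inputs and the lbfgs solver; the alpha value and layer sizes are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler().fit(X_train)  # neural nets benefit from feature normalization
clf = MLPClassifier(hidden_layer_sizes=[100, 100], activation='relu', alpha=5.0, solver='lbfgs', random_state=0)
clf.fit(scaler.transform(X_train), y_train)
print(clf.score(scaler.transform(X_test), y_test))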
Further reading on neural networks by Carter and Tanz (2013)
Feature engineering - finding the right features to use.
Feature learning/feature extraction.
Convolution: filter for a specific pattern.
Subsampling, e.g. pooling: gives robustness to translated or rotated features.
The first layer finds edges and blobs. The second layer creates features from the first layer's features.
Can use GPUs.
Deep learning in a nutshell Tim Dettmers .
Assisting Pathologists in Detecting Cancer with Deep Learning, google AI blog
Data leakage: the training data contains information about the thing you are trying to predict, e.g. using the label itself as a feature, or including test data in the training data.
Using future data, e.g. the length of time a user stayed on the site.
Some features are giveaways, e.g. having had surgery for the condition being predicted.
Are the features highly correlated with the target?
The treachery of leakage by Colin Fraser.
Leakage in data mining Kaufman et al (2011)
Data leakage in a ML competition
Rules of ML by Martin Zinkevich
Data without labels.
Capture structure and information.
Clustering e.g. number of product pages browsed (basic users) vs. number of advanced features used (advanced users). Then recommend products. Find groups in the data. Assign every point in the data to one of the groups.
Transformations (processes that extract or compute information), e.g. density estimation.
Use sklearn.neighbors.KernelDensity for kernel density estimation.
Dimensionality reduction: use fewer features to represent your data, e.g. PCA.
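A PCA sketch reducing normalized data to 2 components (the StandardScaler step and the synthetic data are assumptions for illustration):
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_normalized = StandardScaler().fit_transform(X)  # PCA works best on normalized features
pca = PCA(n_components=2).fit(X_normalized)
X_pca = pca.transform(X_normalized)               # 2-dimensional representation
print(X_pca.shape, pca.explained_variance_ratio_)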
Manifold learning algorithms: for data where PCA struggles. Multi-dimensional scaling (MDS) attempts to find a distance-preserving low-dimensional projection.
t-SNE finds a 2d representation of higher-dimensional data while preserving information about neighbors. See https://github.com/lvdmaaten/bhtsne and http://lvdmaaten.github.io/tsne/
Divide data into groups
Hard clustering: each data point belongs to exactly one cluster.
Soft clustering: each data point is assigned a weight, score or probability of membership for each cluster.
k-means: choose k and it finds k clusters, assigning each point to the nearest cluster center.
It starts from a guess at the cluster centers, then repeatedly assigns points to the nearest center and updates each center to the mean of the points assigned to it. See https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
Doesn't work well for irregular, complex clusters.
k-medoids can work with categorical features.
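A k-means sketch on blob data (k=3 matches the number of generated centers; on real data k has to be chosen):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment for each point
print(kmeans.cluster_centers_)  # the k cluster means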
Agglomerative clustering - merges clusters bottom-up, e.g. until 3 clusters remain. Ward (default), average, and complete linkage criteria decide which clusters to merge.
Can visualize using a dendrogram. Can help figure out right number of clusters.
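An agglomerative clustering sketch plus a dendrogram via scipy (linkage='ward' matches the default; using scipy for the dendrogram is my assumption):
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import ward, dendrogram
X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
cluster_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
Z = ward(X)    # scipy's hierarchical merge tree for the same data
dendrogram(Z)  # visualize the merges (needs matplotlib to actually display)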
DBSCAN: Density-Based Spatial Clustering of Applications with Noise. No need to specify the number of clusters in advance. Identifies noise points (label -1).
Multiple clusterings can be valid; it's hard to know how many clusters there are.
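A DBSCAN sketch (the eps and min_samples values are illustrative and dataset-dependent):
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
dbscan = DBSCAN(eps=2, min_samples=2)
labels = dbscan.fit_predict(X)  # cluster labels; -1 marks noise points
print(set(labels))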
Distill (2016) How to Use t-SNE Effectively
Gleesen (2017) How Machines Make Sense of Big Data: an Introduction to Clustering Algorithms