Machine Learning


Get into the habit of collecting representative examples for each error you fix, and you will be able to improve both measures. (IBM Watson NLP Unit Testing)

Based on Data:
Lots of labeled car/motorcycle images and nothing else: supervised learning
A small set of labeled car/motorcycle images but lots of unlabeled car/motorcycle images: semi-supervised learning
Unlabeled images only: unsupervised feature learning

Courses

Machine Learning - Andrew Ng - Stanford
Pattern Analysis / Neural Networks (lecture notes) - Ricardo Gutierrez-Osuna - TAMU
Data Mining - Penn State, 2010

Conference
http://nips.cc/   NIPS conference

Naive Bayes === Maximum Likelihood

In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood.
prior probability: e.g. prob(red) = 20/60, prob(green) = 40/60     -- if we don't know anything else, this would be our guess
likelihood: based on the # of neighbors of X that belong to each class: prob(X | red) = 3/20, prob(X | green) = 1/40
Draw a circle around the test point X that encompasses a number of points (chosen a priori) irrespective of their class labels, then count how many points in the circle belong to each class.
Form the posterior probability using Bayes' rule:
posterior probability ∝ prior * likelihood
posterior probability of X being green ∝ prior probability of green * likelihood of X given green = 40/60 * 1/40 = 1/60
posterior probability of X being red ∝ 20/60 * 3/20 = 1/20, so X is classified as red (1/20 > 1/60)
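A quick sketch of this arithmetic in Python (the counts 40 green / 20 red and the 1 green / 3 red neighbors inside the circle are the numbers from the example above; the function name is just illustrative):

# posterior ∝ prior * likelihood, for the red/green example above
def posterior(class_count, total_count, neighbors_in_circle):
    prior = class_count / total_count               # e.g. 40/60 for green
    likelihood = neighbors_in_circle / class_count  # e.g. 1/40 for green
    return prior * likelihood

post_green = posterior(40, 60, 1)   # 40/60 * 1/40 = 1/60 ≈ 0.017
post_red = posterior(20, 60, 3)     # 20/60 * 3/20 = 1/20 = 0.05
print("classify X as", "red" if post_red > post_green else "green")   # red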


Confusion matrix (rows = actual class, columns = predicted class):

                 Predicted
Actual       Cat   Dog   Rabbit
Cat           5     3      0
Dog           2     3      1
Rabbit        0     2     11
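A small sketch that reads per-class precision and recall off this matrix (assuming, as laid out above, that rows are the actual class and columns the predicted class):

labels = ["Cat", "Dog", "Rabbit"]
cm = [[5, 3, 0],      # actual Cat
      [2, 3, 1],      # actual Dog
      [0, 2, 11]]     # actual Rabbit

for i, label in enumerate(labels):
    tp = cm[i][i]                           # predicted correctly as this class
    col_sum = sum(row[i] for row in cm)     # everything predicted as this class
    row_sum = sum(cm[i])                    # everything that actually is this class
    print(f"{label}: precision={tp/col_sum:.2f} recall={tp/row_sum:.2f}")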


Lecture notes based on the Stanford Machine Learning video series, starting with "Lecture 1 | Machine Learning (Stanford)".


LECTURE 2
-----------------

m: # training examples
x: "input" variable/features
y: "output" variable/target variable
(x, y) one training example
(x(i), y(i)): the ith training example (the ith row in the table)
training set -> learning algorithm -> hypothesis h: input a new living area, output the estimated price
For this example the features are x1 = size (ft^2) and x2 = # bedrooms.
Linear hypothesis class: h(x) = hθ(x) = θ0 + θ1*x1 + θ2*x2. To be concise in notation, set x0 = 1 and let n = # features, so h(x) = Σ(i=0 to n) θi xi = θT x. The θ's are the parameters, and the goal is to learn the right parameters.
min_θ 1/2 Σ(i=1 to m) (hθ(x(i)) - y(i))^2 : minimize the error on the training set, whose output values we already know; the 1/2 is for convenience and will simplify later formulas
J(θ) = 1/2 Σ(i=1 to m) (hθ(x(i)) - y(i))^2
min_θ J(θ)
Start with some θ (e.g. θ = the zero vector) and keep changing θ to reduce J(θ); a small sketch of hθ and J follows.
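A tiny sketch of hθ(x) and J(θ) as defined above (plain Python; the feature vectors are assumed to already include x0 = 1, and the numbers are made-up toy data):

def h(theta, x):
    # hθ(x) = θT x
    return sum(t * xi for t, xi in zip(theta, x))

def J(theta, X, y):
    # J(θ) = 1/2 Σ (hθ(x(i)) - y(i))^2
    return 0.5 * sum((h(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y))

X = [[1, 2104, 3], [1, 1600, 3], [1, 2400, 3]]   # [x0 = 1, size, # bedrooms]
y = [400, 330, 369]                              # price
print(J([0.0, 0.0, 0.0], X, y))                  # cost at θ = zero vector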
-------------
Gradient descent: θi := θi - α ∂/∂θi J(θ)   (":=" means assignment: replace the old value with the new one)

∂/∂θi J(θ) = Σ(k=1 to m) (hθ(x(k)) - y(k)) xi(k)    (the 2 from differentiating the square cancels the 1/2)
α is a parameter of the learning algorithm called the learning rate; it controls how large a step you take.
Repeat the following update until convergence:
θi := θi - α Σ(k=1 to m) (hθ(x(k)) - y(k)) xi(k)    (the sum is ∂/∂θi J(θ)). The least-squares cost is a convex, bowl-shaped quadratic, so it has no local minima other than the global minimum.
As you approach the minimum, the gradient goes to zero, so the steps automatically get smaller.
After multiple iterations of gradient descent (taking steps from an arbitrary initial point) we get the least-squares fit ==> linear regression, a linear fit. The negative gradient of a function (its derivative) points in the direction of steepest descent. ==> Batch gradient descent: on every step, look at all the training examples; a sketch follows.
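A minimal batch-gradient-descent sketch of the update above (self-contained; the learning rate α and iteration count are arbitrary choices, not values from the lecture):

def batch_gradient_descent(X, y, alpha=1e-8, iters=1000):
    m, n = len(X), len(X[0])
    theta = [0.0] * n                      # start from the zero vector
    for _ in range(iters):
        # every step sums the error over all m training examples
        errors = [sum(t * xj for t, xj in zip(theta, X[k])) - y[k] for k in range(m)]
        theta = [theta[i] - alpha * sum(errors[k] * X[k][i] for k in range(m))
                 for i in range(n)]
    return theta

With the toy X, y above this slowly moves θ toward the least-squares fit; in practice one would scale the features so that a larger α works.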
Stochastic gradient descent:

Repeat {
    for k = 1 to m {
        θi := θi - α (hθ(x(k)) - y(k)) xi(k)
        // update all parameters θi (every i) using only example k
    }
}
To update the parameters, look at only one training example at a time (the gradient term computed from just that one example); a code sketch follows.
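The same idea in code, one training example per update (again a sketch; α and the number of passes over the data are arbitrary):

def stochastic_gradient_descent(X, y, alpha=1e-8, passes=20):
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(passes):
        for x_k, y_k in zip(X, y):                 # one example per parameter update
            error = sum(t * xj for t, xj in zip(theta, x_k)) - y_k
            theta = [theta[i] - alpha * error * x_k[i] for i in range(n)]
    return theta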
 


    

LECTURE 3
-----------------

Linear Regression
Locally Weighted Regression
Probabilistic Interpretation
Logistic Regression (First classification algorithm)
Digression (Perceptron)
Newton's method
(x(i), y(i)): ith training example
hθ(x) = Σ(j=0 to n) θj xj = θT x, with x0 = 1 and n = # features (so n+1 parameters θ0, ..., θn)
J(θ) = 1/2 Σ(i=1 to m) (hθ(x(i)) - y(i))^2,   m = size of the training set
Closed-form solution (the normal equations): θ = (XT X)^-1 XT y
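The closed-form solution sketched with NumPy (the helper name is ours; solving the linear system is the numerically safer equivalent of forming the inverse explicitly):

import numpy as np

def normal_equation(X, y):
    # θ = (XT X)^-1 XT y, computed by solving the linear system instead of inverting
    X = np.asarray(X, dtype=float)   # shape (m, n+1); first column is x0 = 1
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)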

back to previous example
X1 : size of house
y : price of house
--
X1 : size of house
X2 : size^2
h = θ0 + θ1 X1 + θ2 X2 = θ0 + θ1 X1 + θ2 X1^2 
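Note the quadratic hypothesis is still linear in θ; only the features change, e.g. (a small sketch with a hypothetical helper name):

def add_square_feature(rows):
    # turn [1, x1] rows into [1, x1, x1^2] rows; the same linear-regression machinery then fits a quadratic
    return [[1.0, x1, x1 ** 2] for _, x1 in rows]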

Underfitting: the hypothesis is too simple to capture the structure in the data (e.g. a straight line through clearly non-linear data).
Overfitting: the hypothesis fits the idiosyncrasies/noise of this particular training set and does not generalize (e.g. a high-degree polynomial through every point).

Parametric learning algorithm: a fixed # of parameters (the θ's) is fit to the data.
Non-parametric: the # of parameters grows with m (the size of the training set).
Locally weighted regression, LWR (a.k.a. Loess/Lowess)

To evaluate the hypothesis h at a certain position x:
LR (linear regression): fit θ to minimize the least-squares cost, then return θT x
min_θ Σ(i=1 to m) (y(i) - θT x(i))^2
Locally weighted linear regression: look at the query point x and at the data points, and use only the data in a small vicinity of x (the place where we want to evaluate the hypothesis). Fit a linear regression to that vicinity; the value of that line at our x is the hypothesis (predicted) value.
======
LWR: fit θ to minimize Σ(i=1 to m) w(i) (y(i) - θT x(i))^2, where the w(i) are weights, e.g. w(i) = exp(-(x(i) - x)^2 / 2). [This looks like a Gaussian density but has nothing to do with one; it merely has that shape, and nothing here is assumed to be Gaussian.] For a training example x(i):
 if |x(i) - x| is small, the exponent is near 0 and w(i) ≈ 1
 if |x(i) - x| is large, w(i) ≈ 0. So we pay much more attention to fitting the nearby points accurately.

More precisely, there is a parameter τ (the bandwidth): w(i) = exp(-(x(i) - x)^2 / (2τ^2)). τ is not the variance of a Gaussian, just a parameter that controls how fast the weights fall off with distance (the width of the bell). A sketch follows.
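A sketch of LWR at a single query point, using the Gaussian-shaped weights above and the weighted form of the normal equations (not spelled out in the notes, but the standard closed form for this weighted objective; W is the diagonal matrix of the w(i), τ the bandwidth):

import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    X = np.asarray(X, dtype=float)          # shape (m, n+1), first column x0 = 1
    y = np.asarray(y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # w(i) = exp(-|x(i) - x|^2 / (2 τ^2)): ~1 for nearby points, ~0 for far ones
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # fit θ to minimize Σ w(i) (y(i) - θT x(i))^2, then evaluate at the query point
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return float(x_query @ theta)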

====
Probabilistic interpretation: assume the target is a linear combination of the features plus an error term ε(i) (features we don't capture, the function not being as linear as we think, or random noise). Assume y(i) = θT x(i) + ε(i), with ε(i) ~ N(0, σ^2), i.e. normally distributed.
This implies that the density of y(i) given x(i) is Gaussian.
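Written out (this is the standard continuation of the argument), the implied density and the resulting log-likelihood are:

p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^{T} x^{(i)})^{2}}{2\sigma^{2}}\right)

\ell(\theta) = \log \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)
             = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^{2}} \cdot \frac{1}{2}\sum_{i=1}^{m} (y^{(i)} - \theta^{T} x^{(i)})^{2}

so maximizing ℓ(θ) over θ is the same as minimizing J(θ): under the Gaussian-noise assumption, least squares is maximum likelihood.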