Machine Learning

Label propagation on the Manifold for one-shot learning: propagate labels on the manifold(connected samples) and instead of marking the nearest one (nearest neighbor) to mark it as blue mark it as red where the nearest propagated label is red: link

One shot learning Andrew Ng
Triplet Loss, for training choose samples where d(A,P) ~ d(A, N)

==========> Today I learned that
- one shot learning can be done via using embedding of classes taken from e.g. word2vec instead of multiclass and label propagation in manifold
  multi-million face recognition can be done using facenet and triplet loss as done in Google and taught by Ng in a single-shot learning lecture through siamese network


cross entropy:
the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the "true" distribution  p.
H(p,q) = - Sigma_x y log(y')

- The optimal number of bits for each symbol transmission is called entropy. For one class is equal to the one over log of its probability.  log(1/p)  the more probable the less bits the less probable the more bits. The overall entropy is the expected number of bits for all classes =  Sigma pi log(1/pi)
If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool y . This is optimal, in that we can't encode the symbols using fewer bits on average.

if we assume that seeing a Toyota is 128x as likely as seeing a Tesla, then we'd give the Toyota symbol 7 less bits than the Tesla symbol:


Cross entropy is the number of bits we'll need if we encode symbols from  y  using the predicted tool  ŷ  . which is always more than entropy itself.
cross entropy = Sigma_x y log(1 / y')       
We of course still take the expected value to the true distribution y , since it's the distribution that truly generates the symbols, and use the new number of bits

KL Divergence is simply the difference between cross entropy and entropy: Sigma [ pi log(1/p'i) -  pi log (1/pi) ]
It measures the number of extra bits we'll need on average if we encode symbols from y according to ŷ ; you can think of it as a bit tax for encoding symbols from y with an inappropriate distribution ŷ . It's never negative, and it's 0 only when y and ŷ are the same.

KL Divergence helps us to measure just how much information we lose when we choose an approximation. link

It may be tempting to think of KL Divergence as a distance metric, however we cannot use KL Divergence to measure the distance between two distributions. The reason for this is that KL Divergence is not symmetric.    Sigma [ pi log(1/p'i) -  pi log (1/pi) ]   =  pi Sigma [ log(1/p'i) - log (1/pi) ] = pi Sigma [log (1/p'i) / (1/pi) ] = pi Sigma log (pi/p'1)      (not symmetric)    Intuitively this makes sense as in each of these cases we're doing a very different form of approximation.

The key point here is that we can use KL Divergence as an objective function to find the optimal value for any approximating distribution we can come up with

If you are familiar with neural networks, you may have guessed where we were headed after the last section. Neural networks, in the most general sense, are function approximators. This means that you can use a neural network to learn a wide range of complex functions. The key to getting neural networks to learn is to use an objective function that can inform the network how well it's doing. You train neural networks by minimizing the loss of the objective function.

As we've seen, we can use KL divergence to minimize how much information loss we have when approximating a distribution. Combining KL divergence with neural networks allows us to learn very complex approximating distribution for our data. A common approach to this is called a "Variational Autoencoder" which learns the best way to approximate the information in a data set. Here is a great tutorial that dives into the details of building variational autoencoders.

Even more general is the area of Variational Bayesian Methods. In other posts we've seen how powerful Monte Carlo simulations can be to solve a range of probability problems. While Monte Carlo simulations can help solve many intractable integrals needed for Bayesian inference, even these methods can be very computationally expensive. Variational Bayesian method, including Variational Autoencoders, use KL divergence to generate optimal approximating distributions, allowing for much more efficient inference for very difficult integrals. To learn more about Variational Inference check out the Edward library for python.

Note that minimizing cross entropy is the same as minimizing the KL divergence from ŷ to y . (They're equivalent up to an additive constant, the entropy of y , which doesn't depend on ŷ .)

- From one perspective, minimizing cross entropy lets us find a ŷ that requires as few extra bits as possible when we try to encode symbols from y using ŷ .
- From another perspective, minimizing cross entropy is equivalent to minimizing the negative log likelihood of our data, which is a direct measure of the predictive power of our model.

here it shows that
we can look at it as maximizing the likelihood of model: L({y(n), y'(n)})
since log is monotonic its like maximizing log of likelihood
its like minimizing negative of log likelihood

SGD vs gradient descent:
 in SGD, we are using the cost gradient of 1 example at each iteration, instead of using the sum of the cost gradient of ALL examples.
Mini-batch gradient descent uses n data points (instead of 1 sample in SGD) at each iteration.

Ensemble models
generate multiple Bootstraped samplings (sample with replacement) use the majority vote or averaging concepts to get the final prediction. it’s done mainly to reduce the variance. (This is how bagging works.)
random forest actually uses this concept but it goes a step ahead to further reduce the variance by randomly choosing a subset of features as well for each bootstrapped sample to make the splits while training.

Boosting is a sequential technique in which, the first algorithm is trained on the entire dataset and the subsequent algorithms are built by fitting the residuals of the first algorithm, thus giving higher weight to those observations that were poorly predicted by the previous model.

It relies on creating a series of weak learners each of which might not be good for the entire dataset but is good for some part of the dataset. Thus, each model actually boosts the performance of the ensemble.

It’s really important to note that boosting is focused on reducing the bias. This makes the boosting algorithms prone to overfitting. Thus, parameter tuning becomes a crucial part of boosting algorithms to make them avoid overfitting. Some examples of boosting are XGBoost, GBM, ADABOOST, etc.

AdaBoost (Adaptive Boosting)
1. Use boosting (sampling with replacement) to generate a of bag data.     BAGGING: create boosted bags of data.
2. train a (simple) model and pass the model on training data to get errors (we call these models weak learners)
3. Use the errors as weights to boost another bag of data such that sampling with replacement puts more weights on data that had higher errors
4. Go to 2.
In this way you get simple models where each is good at identifying a special segment of the data
By ensembling such as majority vote or averaging combine the results.

In stacking multiple layers of machine learning models are placed one over another where each of the models passes their predictions to the model in the layer above it and the top layer model takes decisions based on the outputs of the models in layers below it.

Based on Data:
Lots of labeled car/motorcycle images and no-other: Supervised Learning
small labeled car/motorcycle images but lots of unlabeled car/motorcycle images: semi-supervised
unsupervised feature learning


Machine Learning - Ng - Stanford
Pattern Analysis (lecture notes). Neural Networks -  Ricardo Gutierrez Osuna - TAMU
Datamining PennState 2010

Conference   NIPS conference

Naive Bayes === Maximum Likelihood

In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood.
prior probability: e.g. prob(red) = 20/60, prob(green) = 40/60     -- if we don't know anything else, this would be our guess
likelihood: based on the # of neighbors, that belong to each class. prob(red)= 3/20  prob(green) = 1/40
Draw a circle around (test node)X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label.
To form a posterior probability using the so-called Bayes' rule
posterior prob = apriori * likelihood
posterior probability of X being green ~ prior probability of X being green * likelihood of X given green = 40/60 * 1/40
for red 2/60 * 3/20

Confusion matrix.

Cat Dog Rabbit
Actual class
Cat 5 3 0
Dog 2 3 1
Rabbit 0 2 11

Lecture notes based on the videos from Stanford course on Machine Learning video series: Lecture 1 | Machine Learning (Stanford)


m: # training examples
x: "input" variable/features
y: "output" variable/target variable
(x, y) one training example
ith training example (ith row in table) (x(i), y(i))
training set -> learning algorthm -> hypothesis h: input new living area, output estimated price
For this example h = θ + 
x1 size (ft2) x2 @ bedrooms
linear hypothesis class    h(x) = hθ(x) = θ0 + θ1*x1 + θ2*x2    to be concise in notation x0 = 1, n # features    h(x) = Σ(i=0 to n) θi xi = θT x, θ's are parameters and the purpose is to learn right parameters
minθ 1/2 Σ(i=0 to m) (hθ(x(i)) - y(i))2     minimize the error for the training set  that we already know their output value, 1/2 is for conveniece which will reduce our future formulas
j(θ) = 1/2 Σ(i=0 to m) (hθ(x(i)) - y(i))2 
minθ j(θ)
start with some θ (θ = 0 vector) keep changing θ to reduce j(θ)
Gradient decent: θi := θi - α  ∂/∂θi  j(θ)   (:= replace the value)

 ∂/∂θi  j(θ) = Σ(k=0 to m) (hθ(x(k)) - y(k))2 xi(k)
α is the parameter of the learning algorithm called learning rate. controls how large a step you take
Repeat until convergence the following steps
θi := θi - α Σ(k=0 to m) (hθ(x(k)) - y(k))2 xi(k) = ∂/∂θi j(θ)       least squares is bell shaped has no local minima
as you approach local minimum gradient converges to zero
after multiple iterations of the gradient decent (Taking steps started from an arbitrary initial point) we get the least square fit. ==> linear regression, linear fit. The gradient of a function(derivative) gives the direction of the steepest descent. ==> Batch Gradient Descent : on every step look at all the samples. 
Stochastic Gradient Descent: 
for k = 1 to m{
θi := θi - α hθ(x(k)) - y(k))2 xi(k) 
//update all parameters for all values of i
} to update parameters: only look at one training example each time. (the derivative w.r.t. just the first training example)



Linear Regression
Locally Weighted Regression
Probabilistic Interpretation
Logistic Regression (First classification algorithm)
Digression (Preceptron)
Newton's method
(x(i), y(i)): ith training example
hθ(x(i)) = Σ(i=0 to n) θi xi = θT x, x0 = 1   n parameters
j(θ) = 1/2 Σ(i=0 to m) (hθ(x(i)) - y(i))2   m training set
closed form solution: θ = (XTX)-1XTY

back to previous example
X1 : size of house
y : price of house
X1 : size of house
X2 : size^2
h = θ0 + θ1 X1 + θ2 X2 = θ0 + θ1 X1 + θ2 X1^2 


Parametric Learning Algorithm: fixed # of parameters that fits to the data (θ's)
Non-parametric : # of params grows with m (the size of training set)
Locally weighted regression LWR (Loess/Lowes)

evaluate hypothesis h at certain position x
LR (linear regression) fit θ to minimize least squares and return θT x
min_θ Σ(i=0 to m) ( y(i) - θT x(i))2 
Locally Weighted Linear Regression: look at point x, lokk at datapoints, only takes data that aare in the little vicinity of x: where I want to evaluate the hypothesis, in its vicinity take a linear regression and where we hit that line fro our x is going to be our hypothesis value (predicted value).
LWR: fit θ to minimize  Σ(i=0 to m) w(i) ( y(i) - θT x(i))2 , w(i): weights e.g. = exp(- (x(i)-x)2/2)  [looks like a gaussian dist. but has nothing to do with it! just looks like it no assumtions of nay kind on anything is assumed to be gaussian]. If a training example x(i) where x is close to it.
 |x(i)-x| is small, then exp(0) is one from the formula then w(i) ~ 1
 |x(i)-x| is large, then w(i) ~ 0 :pay more attention to the points more close by more accurately

To make the formula more detailed there is a parameter τ (bandwith) exp(- (x(i)-x)2/(2τ2)) it' not variance of a gaussian just a parameter, control how fast weights fall off with distance.  (width of the bell)

probabilistic interpretaton: assume hypothesis is a linear combination plus an error ε(i) (additional features that we dont capture , function is not as linear as we think or random noise). Assume y(i) θT x(i) + ε(i) .  ε(i) ~ N(0, σ2) normal distributed. 
This implies that density for gaussian