Gradient boosting Conjugate, beta distribution hypothesis testing SVM collaborative filtering L1 vs L2 reg SVM vs logistic regression how to avoid overfitting in decision tree to find threshold to cut off (cross validation) CRF vs markov models bagging boosting SGD vs gradient descent: normal batch gradient descent: usually
takes the mean squared error of all the training samples when it is
updating the weights of the network. for every gradient descent update
in the weights, you have to cycle through every training sample. Stochastic gradient descent: data is shuffled at each epoch. updates the weight parameters after evaluation the cost function after each sample. This
is easily implemented by a minor variation of the batch gradient
descent code in Python, by simply shifting the update component into the
sample loop Where does the
“stochastic” part come in? The stochastic component is in the selection
of the random selection of training sample. minibatch gradient descent the cost function (and therefore gradient)
is averaged over a small number of samples, from around 10500. This is
opposed to the SGD batch size of 1 sample, and the BGD size of all the training samples. What’s
the benefit of doing it this way? First, it smooths out some of the
noise in SGD, but not all of it, thereby still allowing the “kick” out
of local minimums of the cost function. Second, the minibatch size is
still small, thereby keeping the performance benefits of SGD. bootstrapping: is any test or metric that relies on random sampling with replacement Bootstrap aggregating (Bagging): Bootstrapping the data plus using the aggregate to make a decision is called Bagging. Flat 1. Generate multiple Bootstraped samplings (sample with replacement) 2. Use the majority vote or average to get the final prediction. it’s done to reduce the variance e.g. on decision trees random forest
actually uses this concept but it goes a step ahead to further reduce
the variance by randomly choosing a subset of features as well for each
bootstrapped sample to make the splits while training. (bootstraped
samples + random features) Sequential Boosting: A
sequential technique: the first algorithm is trained on the entire
dataset and the subsequent algorithms are built by fitting the residuals
of the first algorithm, thus giving higher weight to those observations
that were poorly predicted by the previous model. It relies on creating a series of weak learners each of which might not be good for the entire dataset but is good for some part of the dataset. Thus, each model actually boosts the performance of the ensemble. Focused on reducing the bias. This makes the boosting algorithms prone to overfitting. Thus, parameter tuning becomes a crucial part of boosting algorithms to make them avoid overfitting. Some examples of boosting are XGBoost, GBM, ADABOOST, etc. AdaBoost (Adaptive Boosting) 1. Create a sequence of weak learners (boosting). At each iteration get n samples based on weights (BAGGING: create boosted bags of data) 2. train a (simple) model and pass the model on training data to get errors (we call these models weak learners) 3.
Use the errors as weights to boost another bag of data such that
sampling with replacement puts more weights on data that had higher
errors 4. Go to 2. In this way you get simple models where each is good at identifying a special segment of the data By ensembling such as majority vote or averaging combine the results. A tree with one node and two leaves is called a stump So we get a forest of stumps rather than trees stumps votes are weighted for final resulyt iterate over features and see which feature does the best classification itself. for continuous variables pick one with maximum entropy (?) similar to decision tree pick one with best gini index in random forest each tree has equal vote as others order of tree gerenaration doesnt matter Stacked In
stacking multiple layers of machine learning models are placed one over
another where each of the models passes their predictions to the model
in the layer above it and the top layer model takes decisions based on
the outputs of the models in layers below it. Label propagation on the Manifold for oneshot learning: propagate labels on the manifold(connected samples) and instead of marking the nearest one (nearest neighbor) to mark it as blue mark it as red where the nearest propagated label is red: link Triplet Loss, for training choose samples where d(A,P) ~ d(A, N) ==========> Today I learned that  one shot learning can be done via using embedding of classes taken from e.g. word2vec instead of multiclass and label propagation in manifold multimillion face recognition can be done using facenet and triplet loss as done in Google and taught by Ng in a singleshot learning lecture through siamese network regularization cross entropy: the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the "true" distribution p. H(p,q) =  Sigma_x y log(y')  The optimal number of bits for each symbol transmission is called entropy. For one class is equal to the one over log of its probability. log(1/p) the more probable the less bits the less probable the more bits. The overall entropy is the expected number of bits for all classes = Sigma pi log(1/pi) If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool y . This is optimal, in that we can't encode the symbols using fewer bits on average. if we assume that seeing a Toyota is 128x as likely as seeing a Tesla, then we'd give the Toyota symbol 7 less bits than the Tesla symbol: btoyota=log1128ptesla=log1ptesla+log1128=btesla−7 Cross entropy is the number of bits we'll need if we encode symbols from y using the predicted tool ŷ . which is always more than entropy itself. cross entropy = Sigma_x y log(1 / y') We of course still take the expected value to the true distribution y , since it's the distribution that truly generates the symbols, and use the new number of bits KL Divergence is simply the difference between cross entropy and entropy: Sigma [ pi log(1/p'i)  pi log (1/pi) ] It measures the number of extra bits we'll need on average if we encode symbols from y according to ŷ ; you can think of it as a bit tax for encoding symbols from y with an inappropriate distribution ŷ . It's never negative, and it's 0 only when y and ŷ are the same. KL Divergence helps us to measure just how much information we lose when we choose an approximation. link It may be tempting to think of KL Divergence as a distance metric, however we cannot use KL Divergence to measure the distance between two distributions. The reason for this is that KL Divergence is not symmetric. Sigma [ pi log(1/p'i)  pi log (1/pi) ] = pi Sigma [ log(1/p'i)  log (1/pi) ] = pi Sigma [log (1/p'i) / (1/pi) ] = pi Sigma log (pi/p'1) (not symmetric) Intuitively this makes sense as in each of these cases we're doing a very different form of approximation. The key point here is that we can use KL Divergence as an objective function to find the optimal value for any approximating distribution we can come up with If you are familiar with neural networks, you may have guessed where we were headed after the last section. Neural networks, in the most general sense, are function approximators. This means that you can use a neural network to learn a wide range of complex functions. The key to getting neural networks to learn is to use an objective function that can inform the network how well it's doing. You train neural networks by minimizing the loss of the objective function. As we've seen, we can use KL divergence to minimize how much information loss we have when approximating a distribution. Combining KL divergence with neural networks allows us to learn very complex approximating distribution for our data. A common approach to this is called a "Variational Autoencoder" which learns the best way to approximate the information in a data set. Here is a great tutorial that dives into the details of building variational autoencoders. Even more general is the area of Variational Bayesian Methods. In other posts we've seen how powerful Monte Carlo simulations can be to solve a range of probability problems. While Monte Carlo simulations can help solve many intractable integrals needed for Bayesian inference, even these methods can be very computationally expensive. Variational Bayesian method, including Variational Autoencoders, use KL divergence to generate optimal approximating distributions, allowing for much more efficient inference for very difficult integrals. To learn more about Variational Inference check out the Edward library for python. Note that minimizing cross entropy is the same as minimizing the KL divergence from ŷ to y . (They're equivalent up to an additive constant, the entropy of y , which doesn't depend on ŷ .)  From one perspective, minimizing cross entropy lets us find a ŷ that requires as few extra bits as possible when we try to encode symbols from y using ŷ .  From another perspective, minimizing cross entropy is equivalent to minimizing the negative log likelihood of our data, which is a direct measure of the predictive power of our model. here it shows that we can look at it as maximizing the likelihood of model: L({y(n), y'(n)}) since log is monotonic its like maximizing log of likelihood its like minimizing negative of log likelihood Based on Data: Lots of labeled car/motorcycle images and noother: Supervised Learning small labeled car/motorcycle images but lots of unlabeled car/motorcycle images: semisupervised unsupervised feature learning Conference http://nips.cc/ NIPS conference Textbooks UCLA2013 Pattern Recognition and Machine Learning  2007 $31 Bishop Machine Learning: a Probabilistic Perspective 2012  $43 murphy Bayesian Reasoning and Machine Learning Data Mining Algorithms In R (with code in R) Mining of Massive Datasets A Programmer's Guide to Data Mining (python) Data Mining and Analysis: Fundamental Concepts and Algorithms Theory and Applications for Advanced Text Mining Naive Bayes === Maximum LikelihoodIn the Bayesian analysis, the final classification is produced by
combining both sources of information, i.e., the prior and the
likelihood. prior probability: e.g. prob(red) = 20/60, prob(green) = 40/60  if we don't know anything else, this would be our guess likelihood: based on the # of neighbors, that belong to each class. prob(red)= 3/20 prob(green) = 1/40 Draw a circle around (test node)X which encompasses a number (to be chosen a
priori) of points irrespective of their class labels. Then we calculate
the number of points in the circle belonging to each class label. To form a posterior probability using the socalled Bayes' rule posterior prob = apriori * likelihood posterior probability of X being green ~ prior probability of X being green * likelihood of X given green = 40/60 * 1/40 for red 2/60 * 3/20 Lecture notes based on the videos from Stanford course on Machine Learning video series: Lecture 1  Machine Learning (Stanford)

m: # training examples
x: "input" variable/features
y: "output" variable/target variable
(x, y) one training example
ith training example (ith row in table) (x^{(i)}, y^{(i)})
training set > learning algorthm > hypothesis h: input new living area, output estimated price
For this example h = θ +
x1 size (ft2) x2 @ bedrooms
linear hypothesis class h(x) = h_{θ}(x) = θ_{0} + θ_{1}*x_{1} + θ_{2}*x_{2} to be concise in notation x_{0} = 1, n # features h(x) = Σ_{(i=0 to n)} θ_{i} x_{i} = θ^{T} x, θ's are parameters and the purpose is to learn right parameters
min_{θ} 1/2 Σ_{(i=0 to m)} (h_{θ}(x^{(i)})  y^{(i)})^{2} minimize the error for the training set that we already know their output value, 1/2 is for conveniece which will reduce our future formulas
j(θ) = 1/2 Σ_{(i=0 to m)} (h_{θ}(x^{(i)})  y^{(i)})^{2}
min_{θ} j(θ)
start with some θ (θ = 0 vector) keep changing θ to reduce j(θ)

Gradient decent: θi := θi  α ∂/∂θi j(θ) (:= replace the value)
∂/∂θi j(θ) = Σ_{(k=0 to m)} (h_{θ}(x^{(k)})  y^{(k)})^{2} . x_{i}^{(k)}
α is the parameter of the learning algorithm called learning rate. controls how large a step you take
Repeat until convergence the following steps
θi := θi  α Σ_{(k=0 to m)} (h_{θ}(x^{(k)})  y^{(k)})^{2} . x_{i}^{(k) }= ∂/∂θ_{i} j(θ) least squares is bell shaped has no local minima
as you approach local minimum gradient converges to zero
after multiple iterations of the gradient decent (Taking steps started from an arbitrary initial point) we get the least square fit. ==> linear regression, linear fit. The gradient of a function(derivative) gives the direction of the steepest descent. ==> Batch Gradient Descent : on every step look at all the samples.
_{Stochastic Gradient Descent: }
LECTURE 3 
Linear Regression
(x^{(i)}, y^{(i)}): ith training example
h_{θ}(x^{(i)}) = Σ_{(i=0 to n)} θ_{i} x_{i} = θ^{T} x, x_{0} = 1 n parameters
j(θ) = 1/2 Σ_{(i=0 to m)} (h_{θ}(x^{(i)})  y^{(i)})^{2} m training set
closed form solution: θ = (X^{T}X)^{1}X^{T}Y
back to previous example
X1 : size of house
y : price of house

X1 : size of house
X2 : size^2
h = θ0 + θ1 X1 + θ2 X2 = θ0 + θ1 X1 + θ2 X1^2
uNDERFITTING :
oVERFITTING :
Parametric Learning Algorithm: fixed # of parameters that fits to the data (θ's)
Nonparametric : # of params grows with m (the size of training set)
evaluate hypothesis h at certain position x
Locally Weighted Linear Regression: look at point x, lokk at datapoints, only takes data that aare in the little vicinity of x: where I want to evaluate the hypothesis, in its vicinity take a linear regression and where we hit that line fro our x is going to be our hypothesis value (predicted value). ======
LWR: fit θ to minimize Σ_{(i=0 to m)} w^{(i)} ( y^{(i) }^{ }θ^{T} x^{(i)})^{2} , w^{(i)}: weights e.g. = exp( (x^{(i)}x)^{2}/2) [looks like a gaussian dist. but has nothing to do with it! just looks like it no assumtions of nay kind on anything is assumed to be gaussian]. If a training example x^{(i) }where x is close to it.
To make the formula more detailed there is a parameter τ (bandwith) exp( (x^{(i)}x)^{2}/(2τ^{2})) it' not variance of a gaussian just a parameter, control how fast weights fall off with distance. (width of the bell)
====
probabilistic interpretaton: assume hypothesis is a linear combination plus an error ε^{(i)} (additional features that we dont capture , function is not as linear as we think or random noise). Assume y^{(i) }= θ^{T} x^{(i)} + ε^{(i)} . ε^{(i) }~ N(0, σ^{2}) normal distributed.
This implies that density for gaussian
