mlclass

http://holehouse.org/mlclass/

http://www.ml-class.org/course/auth/welcome

http://www.reddit.com/r/mlclass/

http://cs229.stanford.edu/materials.html

http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning

http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=ufldl

http://see.stanford.edu/see/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1

http://171.64.93.201/ClassX/system/users/web/pg/view_subject.php?subject=CS229_FALL_2011_2012

http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml

More features -> overfitting of the training set

Regularization: add lambda*sum(theta^2) to the cost -> penalizes large theta (too large a lambda underfits the training set)

Introducing regularization -> shrinks the theta values

Large lambda -> underfitting

With regularization, the linear / logistic regression cost function stays convex
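
A minimal Octave sketch of the regularized logistic regression cost under the usual conventions; X, y, theta and lambda here are placeholder values just to make it runnable:

X = [ones(5,1) rand(5,2)]; y = [0;1;1;0;1]; theta = zeros(3,1); lambda = 1;  % placeholder data
m = size(X, 1);
h = 1 ./ (1 + exp(-X * theta));                       % sigmoid hypothesis h_theta(x)
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
    + (lambda / (2*m)) * sum(theta(2:end) .^ 2);      % regularize every theta except the bias term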

Matrix multiplication X(m x n) * Y(n x q) = Z(m x q):

The number of columns of X must agree with the number of rows of Y.

(X*Y)' = Y' * X' (the transpose of a product reverses the order of the factors)
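
Quick Octave check of the dimension rule and the transpose identity (random placeholder matrices):

X = rand(3, 4);                  % m=3 rows, n=4 columns
Y = rand(4, 2);                  % n=4 rows, q=2 columns
Z = X * Y;                       % size(Z) == [3 2], i.e. m x q
disp(norm((X * Y)' - Y' * X'))   % ~0: transposing a product reverses the factor order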

Neural Networks

Theta(j) = matrix of weights from layer j to layer j+1

If the network has s(j) units in layer j and s(j+1) units in layer j+1,

then the Theta(j) matrix has dimension s(j+1) x (s(j) + 1)

Suppose you have a neural network with one hidden layer, and that there are m input features and k hidden nodes in the hidden layer. Theta(1)_{1,0}, Theta(1)_{1,1}, ..., Theta(1)_{1,m} are the weights connecting inputs 0 through m to the first hidden node. Think of Theta(1)_1 as the vector of input weights for that node.

Theta(1)_{2,0}, Theta(1)_{2,1}, ..., Theta(1)_{2,m} are the weights for the inputs coming in to the second hidden node, or the vector Theta(1)_2.

This keeps going through Theta(1)_{k,0} to Theta(1)_{k,m}, the vector Theta(1)_k.

Collectively, you can think of Theta(1) as a k x (m+1) matrix of weights connecting all of the inputs (including input 0, which is always 1) to all of the hidden nodes.

Θ(2) holds the weights from layer 2 to each neuron in layer 3; Θ(1)_{2,x} is the x-th weight leading into the second neuron of layer 2 (the hidden layer).

Θ isn't a vector, but a matrix. Each row of this matrix contains the input weights needed to compute one node of the next layer.

So Θ(1)_1 is the row needed to compute a(2)_1:

a(2)_1 = g( Θ(1)_{1,0} * a(1)_0 + Θ(1)_{1,1} * a(1)_1 + ... + Θ(1)_{1,m} * a(1)_m )

In other words: each row of Θ contains the (transposed) theta vector of the corresponding classifier (node) in the next layer.
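
A minimal Octave sketch of computing layer 2 from layer 1 under this convention; the sizes and the random Theta1 are placeholders:

m = 4; k = 3;                   % hypothetical: m input features, k hidden nodes
x = rand(m, 1);                 % one training example (layer-1 activations without bias)
Theta1 = rand(k, m + 1);        % k x (m+1) weight matrix from layer 1 to layer 2
a1 = [1; x];                    % prepend the bias unit a(1)_0 = 1
z2 = Theta1 * a1;               % row i of Theta1 computes node i of layer 2
a2 = 1 ./ (1 + exp(-z2));       % a(2) = g(z(2)), sigmoid activation, k x 1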

Exercise #3

M=5000 training examples; each example has N=400 features (20x20 pixels)

Each example is a row in the X matrix

y is a vector of 5000 elements; each element's value is one of 1, 2, ..., 9, 10 (label 10 stands for the digit 0)

There are 10 classes, so we train 10 separate one-vs-all logistic regression classifiers

Theta is a matrix of size 10 x 401 (number of classes x (number of features + 1 for the bias term))
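
A minimal Octave sketch of the one-vs-all prediction step under these shapes; random placeholders stand in for the digit data and the trained classifiers:

X = rand(5000, 400);                    % placeholder for the 5000 x 400 digit images
all_theta = rand(10, 401);              % placeholder for the 10 trained classifier rows
X1 = [ones(size(X, 1), 1) X];           % prepend the bias column of ones -> 5000 x 401
H = 1 ./ (1 + exp(-X1 * all_theta'));   % 5000 x 10 matrix: probability of each class per example
[maxp, p] = max(H, [], 2);              % p(i) = predicted class (1..10) for example i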

Backpropagation in NN

......

https://bitbucket.org/sunng

http://dudarev.com/wiki/ml-class-logistic-regression.html

http://swizec.com/blog/first-steps-with-octave-and-machine-learning/swizec/2865

http://swizec.com/blog/i-suck-at-implementing-neural-networks-in-octave/swizec/2929

http://swizec.com/blog/i-think-i-finally-understand-what-a-neural-network-is/swizec/2891

https://github.com/gafiatulin/ml-class

https://github.com/SaveTheRbtz/ml-class

https://github.com/merwan/ml-class

https://github.com/peterwilliams97/ml_class

https://github.com/gkokaisel/MachineLearning/tree/master/SVM/mlclass-ex6

MODEL SELECTION

Which order of polynomial to choose for the model?

Training data 60%, Cross validation 20%, Test 20%

Training error (cost function) decreases as the polynomial degree increases

Cross validation error (cost function, i.e. average squared error) as a function of polynomial degree:

it has roughly the shape of a parabola (U-shaped)

Cross validation (CV) error is large for both small and large polynomial degrees

For a small polynomial degree we underfit; this is the BIAS regime (underfit); training error ≈ CV error (both high)

For a large polynomial degree we overfit (training error is small) but the cross validation error is LARGE; this is the VARIANCE regime (overfit)
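
Sketch of the selection step in Octave, with hypothetical error values per polynomial degree (the real numbers come from training a model for each degree and evaluating it on the CV set):

error_train = [10 6 3 1.5 0.8 0.4 0.2 0.1];   % hypothetical: keeps falling as degree grows
error_cv    = [11 7 4 3.0 2.5 2.8 3.6 5.0];   % hypothetical: U-shaped, minimum in the middle
[best_cv, best_degree] = min(error_cv);       % pick the degree with the lowest CV error
% then report generalization error for that degree on the held-out TEST set, not the CV set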

LEARNING CURVES: J(train), J(cross-validation) versus training set size

If the learning algorithm is suffering from high BIAS (high error, underfitting), getting more data will not help

If the learning algorithm is suffering from high VARIANCE (overfitting), getting more data can help
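
A small Octave sketch of a learning curve on synthetic linear-regression data; the normal-equation fit via pinv is just a stand-in for whatever training the course exercise uses:

m = 100;
X   = [ones(m, 1) rand(m, 1)];   y   = 3 + 2 * X(:, 2) + 0.3 * randn(m, 1);     % synthetic train set
Xcv = [ones(40, 1) rand(40, 1)]; ycv = 3 + 2 * Xcv(:, 2) + 0.3 * randn(40, 1);  % synthetic CV set
for i = 2:m
  theta = pinv(X(1:i, :)) * y(1:i);                             % fit on the first i examples only
  error_train(i) = mean((X(1:i, :) * theta - y(1:i)) .^ 2) / 2; % J(train) on those i examples
  error_cv(i)    = mean((Xcv * theta - ycv) .^ 2) / 2;          % J(cv) on the full CV set
end
plot(2:m, error_train(2:m), 2:m, error_cv(2:m))                 % J(train) and J(cv) vs training set size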

Blog to read:

http://mechanistician.blogspot.com/2009_05_01_archive.html

http://carlos.bueno.org/2011/10/fair-coin.html - fair coin from an unfair coin

http://www-formal.stanford.edu/jmc/modality.html

Octave

http://en.wikipedia.org/wiki/GNU_Octave

http://guioctave.com

http://www.floss4science.com/resources-for-learning-gnu-octave/

http://www.gnu.org/software/octave/doc/interpreter/

http://www.outsch.org/2011/01/29/qtoctave-0-10-1-for-windows/

http://codebright.wordpress.com/2011/10/07/linear-algebra-review-and-numpy/