This e-book can be found on machinelearningmastery.com and costs $37.
This is part of my study notes.
This book contains the following linear algorithms: simple linear regression, multivariate linear regression, logistic regression, and perceptron.
For each one, there are three basic questions: what it is, how to do it, and when to use it.
- Simple linear regression: a simple prediction method, to predict the result y from an input x. The formula is y = ax + c, where x is the independent variable, y is the dependent variable, and a and c are constants.
When given pairs of data (x1, y1), (x2, y2), ..., a and c can be worked out. After that, a y can be predicted for any new x input. Sounds simple! This has been used in science and engineering for hundreds of years.
a = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)**2)
c = y_mean - a * x_mean
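A minimal Python sketch of these two formulas (the function name and the tiny dataset are mine, for illustration):

```python
# Fit y = a*x + c with the least-squares formulas above.
def fit_simple_linear(xs, ys):
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # a = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)**2)
    a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    c = y_mean - a * x_mean
    return a, c

a, c = fit_simple_linear([1, 2, 4, 3, 5], [1, 3, 3, 2, 5])
print(a, c)  # a = 0.8, c = 0.4
```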
- Multivariate linear regression: this is more realistic, as real problems tend to have more than one input variable. It is also a prediction method, and the same training procedure (gradient descent) is widely used for optimisation elsewhere, e.g. the weights of neural networks. (Strictly speaking, "multivariate" means multiple output variables and "multivariable" means multiple input variables x1, x2, x3, ...; the book uses "multivariate" loosely for the multiple-input case.)
y = a1*x1 + a2*x2 + a3*x3 + ... + c
error = (predicted - expected) - defined this way round so that the minus sign in the update below moves the weights to reduce the error
l_rate and epochs are hyperparameters - chosen by humans, and adjustable to get the best result. l_rate is the fraction (percentage) of the error used for adjusting the weights (a1, a2, a3, ...). If l_rate = 1.0, the entire error is used in the update; if 0.1, just 10% of it.
a = a - l_rate * error * x; c = c - l_rate * error
One epoch is one pass of training in which all input data have been used once. So epochs is the number of times (loops) the entire training set is used for training. Usually, more epochs mean more chances to adjust the weights, hence a better model - but over-fitting can happen.
a(t+1) = a(t) - l_rate * error * x(t)
where t is the time step and t+1 is one step later.
dataset example: (x1, x2, x3,..., y)
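A hedged sketch of that stochastic gradient descent loop (the name train_sgd is mine; each row is (x1, ..., xn, y) as in the dataset example):

```python
# One weight update per record, repeated for the given number of epochs.
def train_sgd(rows, l_rate=0.01, epochs=50):
    n_inputs = len(rows[0]) - 1            # each row is (x1, ..., xn, y)
    coefs = [0.0] * n_inputs               # a1 .. an
    bias = 0.0                             # c
    for _ in range(epochs):
        for row in rows:
            *xs, y = row
            predicted = bias + sum(a * x for a, x in zip(coefs, xs))
            error = predicted - y          # error = predicted - expected
            bias -= l_rate * error         # c = c - l_rate * error
            coefs = [a - l_rate * error * x for a, x in zip(coefs, xs)]
    return coefs, bias
```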
Logistic regression: yhat = 1 / (1 + exp(-z)), where z = a1*x1 + a2*x2 + a3*x3 + ... + c
a = a + l_rate * (y - yhat) * yhat * (1 - yhat) * x (for each coefficient; the intercept c is updated the same way but without the * x factor)
yhat is the predicted result, x1, x2, ... are the variables, and y is the expected result.
dataset example: (x1, x2, x3,..., y)
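A sketch of logistic regression trained with that update rule (function names are mine; the structure mirrors the linear regression sketch above):

```python
from math import exp

def predict(xs, coefs, bias):
    z = bias + sum(a * x for a, x in zip(coefs, xs))
    return 1.0 / (1.0 + exp(-z))            # yhat = sigmoid(z)

def train_logistic(rows, l_rate=0.3, epochs=100):
    coefs = [0.0] * (len(rows[0]) - 1)
    bias = 0.0
    for _ in range(epochs):
        for row in rows:
            *xs, y = row
            yhat = predict(xs, coefs, bias)
            grad = (y - yhat) * yhat * (1 - yhat)   # shared factor of the update
            bias += l_rate * grad                    # intercept: no x factor
            coefs = [a + l_rate * grad * x for a, x in zip(coefs, xs)]
    return coefs, bias
```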
- Perceptron: 2-class classification with a one-neuron neural network; it does not use the sigmoid function, but outputs 1.0 or 0.0 through a step function.
activation = bias + sum(w_i * x_i)
prediction = 1.0 IF activation >= 0.0 ELSE 0.0
w = w + l_rate * (expected - predicted) * x
epochs - an epoch is one pass in which all the data records have been used for training once. Usually multiple epochs are used to train a model (learning algorithm), and the data records are fed in a shuffled (random) order.
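Putting the perceptron pieces together in a short sketch (function names are mine):

```python
# Step activation, then nudge the weights by the prediction error.
def perceptron_predict(xs, weights, bias):
    activation = bias + sum(w * x for w, x in zip(weights, xs))
    return 1.0 if activation >= 0.0 else 0.0

def train_perceptron(rows, l_rate=0.1, epochs=20):
    weights = [0.0] * (len(rows[0]) - 1)
    bias = 0.0
    for _ in range(epochs):
        for row in rows:
            *xs, expected = row
            predicted = perceptron_predict(xs, weights, bias)
            err = expected - predicted      # 0 when the prediction is right
            bias += l_rate * err
            weights = [w + l_rate * err * x for w, x in zip(weights, xs)]
    return weights, bias
```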
Decision tree: these days, if you ask what people dislike about AI and machine learning, most would probably say "it is a black box and the outcome is uninterpretable." That is true of most models, but not this one! A decision tree is easily traceable, and its output is interpretable.
gini_index = (1 - SUM(proportion_i**2)) * group_size / total_samples, where i runs over each class involved and proportion_i = count(class value i) / count(rows in group).

5 rows of contrived data: (smoker, moderate-exercise, weight, heart-disease)
(1, 0, 125, 1)
(1, 0, 65, 0)
(0, 1, 80, 0)
(1, 0, 80, 0)
(1, 1, 90, 1)

The tree could split on any of the three factors, but which is the most important one to be the root? Look at one factor at a time to get its Gini number, then compare.

smoker? Yes - (1, 0, 125, 1); (1, 0, 65, 0); (1, 0, 80, 0); (1, 1, 90, 1): 2 have heart disease, 2 do not. No - (0, 1, 80, 0): 1 row, no disease.
gini(yes group) = 1 - p(h_d=1)**2 - p(h_d=0)**2 = 1 - (2/4)**2 - (2/4)**2 = 0.5
gini(no group) = 1 - 0 - (1/1)**2 = 0
gini(smoker) = 0.5 * 4/5 + 0.0 * 1/5 = 0.4

exercise? Yes - (0, 1, 80, 0); (1, 1, 90, 1): gini = 1 - (1/2)**2 - (1/2)**2 = 0.5. No - (1, 0, 125, 1); (1, 0, 65, 0); (1, 0, 80, 0): gini = 1 - (2/3)**2 - (1/3)**2 = 4/9
gini(exercise) = 0.5 * 2/5 + (4/9) * 3/5 = 7/15 ≈ 0.467

weight? Sort by weight and take the mid-values as candidate split points:
(1, 0, 65, 0)    65-80  --> 72.5
(0, 1, 80, 0)    80-80  --> 80
(1, 0, 80, 0)    80-90  --> 85
(1, 1, 90, 1)    90-125 --> 107.5
(1, 0, 125, 1)
Yes (w < 72.5): gini = 1 - 1 - 0 = 0; No (w >= 72.5): gini = 1 - (2/4)**2 - (2/4)**2 = 0.5 --> gini(72.5) = 0.5 * 4/5 = 0.4
Yes (w < 80): gini = 0; No (w >= 80): gini = 0.5 --> gini(80) = 0.4
Yes (w < 85): gini = 1 - (3/3)**2 - 0 = 0; No (w >= 85): both rows (90 and 125) have heart disease, so gini = 1 - (2/2)**2 - 0 = 0 --> gini(85) = 0
Yes (w < 107.5): gini = 1 - (3/4)**2 - (1/4)**2 = 6/16; No (w >= 107.5): gini = 1 - 1**2 = 0 --> gini(107.5) = (6/16) * 4/5 = 0.3

So the best weight split is at 85, giving gini(weight) = 0.
Compare: gini(weight) = 0 < gini(smoker) = 0.4 < gini(exercise) = 7/15 ≈ 0.467, so the root is the weight split (w < 85). On these five rows it happens to separate the two classes perfectly, so no further splits on smoker or exercise are needed.
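A small sketch of the weighted Gini calculation, checked against the smoker split above (function name and layout are mine):

```python
# `groups` is a list of row groups produced by a candidate split;
# the class label is the last value of each row.
def gini_index(groups, classes):
    total = sum(len(g) for g in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue
        score = 0.0
        for c in classes:
            p = [row[-1] for row in group].count(c) / len(group)
            score += p ** 2
        gini += (1.0 - score) * len(group) / total
    return gini

# smoker split from the worked example: weighted gini = 0.4
yes = [(1, 0, 125, 1), (1, 0, 65, 0), (1, 0, 80, 0), (1, 1, 90, 1)]
no = [(0, 1, 80, 0)]
print(gini_index([yes, no], [0, 1]))  # 0.4
```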
Naive Bayes: a probability-based method, built on Bayes' theorem: P(class | data) = P(data | class) * P(class) / P(data).
The model training actually uses P(class | data) proportional to P(data | class) * P(class), since P(data) is the same for every class. It also treats the input variables as independent of each other given the class - that simplifying assumption is why it is called "naive".
mean = sum(x) / count(x),
stdev = sqrt(sum((x - x_mean)**2) / (count - 1)),
summarise the whole column,
summarise by class,
then use the Gaussian probability density p(x) = exp(-(x - x_mean)**2 / (2 * stdev**2)) / (sqrt(2 * Pi) * stdev)
P(class = 0 | X1, X2) = P(X1 | class = 0) * P(X2 | class = 0) * P(class = 0)
Example:
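A minimal Gaussian Naive Bayes scoring sketch (the class statistics in stats_class0 are made-up numbers, purely for illustration):

```python
from math import exp, pi, sqrt

# The Gaussian likelihood from the formula above.
def gaussian_pdf(x, mean, stdev):
    return exp(-((x - mean) ** 2) / (2 * stdev ** 2)) / (sqrt(2 * pi) * stdev)

def class_score(xs, class_stats, class_prior):
    # P(class | data) ~ P(class) * product of P(x_i | class)
    score = class_prior
    for x, (mean, stdev) in zip(xs, class_stats):
        score *= gaussian_pdf(x, mean, stdev)
    return score

# two features summarised for class 0 as (mean, stdev) pairs - made-up numbers
stats_class0 = [(2.7, 0.9), (1.4, 0.5)]
print(class_score([3.0, 1.2], stats_class0, class_prior=0.5))
```

Compute such a score per class and predict the class with the highest one.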
KNN (k-nearest neighbours): if history does repeat itself, you can predict the outcome more easily! KNN uses the data from the past that is most similar to the current data for prediction.
Either use THE most similar record, or the K neighbours closest to the data.
What is meant by "being similar"? What is the measure of similarity? The answer is the Euclidean distance (= sqrt(sum((x1_i - x2_i)**2))).
For a given data record, calculate its distance to each record in memory, choose the k nearest, and predict with their average (regression) or the most frequent value, the mode (classification).
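A compact KNN classification sketch (names are mine; swap the mode for a mean to get regression):

```python
from math import sqrt
from collections import Counter

def euclidean(a, b):
    return sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(a, b)))

def knn_predict(train_rows, new_xs, k=3):
    # rank every stored record by its distance to the new record
    ranked = sorted(train_rows, key=lambda row: euclidean(row[:-1], new_xs))
    labels = [row[-1] for row in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]   # mode; use mean for regression
```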
Learning vector quantization (LVQ): instead of using the original training dataset, which must be kept in memory during operation, a smaller set of records (the codebook) can be learned from the originals and used for the Euclidean distance calculation. Once the codebook is built, the rest is the same as KNN: find the k neighbours with the closest distance.
To obtain this codebook: first randomly initialise a set, then train it with the originals - for each training record, find its closest codebook vector and move that vector toward the record if their classes match (and away if they don't). The codebook ends up "similar enough to the originals" while being a much smaller set!
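A hedged sketch of that training loop (initialising the codebook from random copies of training rows is one common choice, not necessarily the book's exact method):

```python
import random
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(a, b)))

def train_lvq(rows, n_codebooks=4, l_rate=0.3, epochs=10):
    # randomly initialise the codebook from copies of training records
    codebook = [list(random.choice(rows)) for _ in range(n_codebooks)]
    for _ in range(epochs):
        for row in rows:
            *xs, label = row
            # best matching unit: the closest codebook vector
            bmu = min(codebook, key=lambda c: euclidean(c[:-1], xs))
            sign = 1.0 if bmu[-1] == label else -1.0   # attract or repel
            for i, x in enumerate(xs):
                bmu[i] += sign * l_rate * (x - bmu[i])
    return codebook
```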
- Backpropagation: this is the foundation of supervised learning! It was recently reported that someone invented a model training method without backpropagation. I am going to read it and place a summary underneath.
A couple of questions will pop up straight away if this is the first time you see this term. What is to be propagated back? Where is it "back" to?
An ANN (artificial neural network) consists of layers of artificial neurons. A deep ANN has multiple layers between the input and output layers. In operation, the input is fed into the input layer, then fed forward to the next layer, then the next, all the way to the output layer, where the human can read the result. All layers between the input and output layers are called hidden layers because they (their operations) are not visible to humans. When passing data to the next layer, two operations happen: a linear operation and a non-linear activation, as shown below.
z = w0 + w1*x1 + w2*x2 + w3*x3 + .... wn*xn
yhat = f(z)
where w0, w1, w2, w3, ..., wn are constants known as weights; they are the parameters of the neuron concerned (w0 is the bias).
x1, x2, x3, ..., xn are the inputs from the previous layer, or from outside (as for the input layer) - they are fed into the neuron concerned after weighting.
yhat is the output of the neuron concerned, and f(z) is the activation function - how the neuron decides to activate or not to activate at all. (No picture - can you imagine it? Because this is a study note, I can do without the image of the network for now.)
That process is called feed-forward: each layer's output is the input for the next layer.
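A short sketch of feed-forward with these two per-neuron operations (the list-of-layers representation is my own simplification):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def neuron_output(weights, inputs):
    # weights[0] is w0 (the bias); the rest pair up with the inputs
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)                       # yhat = f(z)

def feed_forward(network, inputs):
    # `network` is a list of layers; each layer is a list of weight vectors
    for layer in network:
        inputs = [neuron_output(w, inputs) for w in layer]
    return inputs                           # the output layer's values
```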
The final outcome coming out of the output layer, yhat, may be different from the true (actual, correct) output. So the error is used to indicate how good the prediction is, how good the model is, how well the algorithm has learned.
This error, together with the gradient of the activation function f, is used to work out how much each of x1, x2, x3, ..., xn has contributed to that error. The attributed error is passed backward so that each neuron can adjust its w0, w1, w2, w3, ..., wn to reduce its error contribution next time. This error attribution is carried out backward, layer by layer, until the last set of weights.
transfer_derivative(output) = output * (1 - output) for the sigmoid activation. Other activation functions have their own derivatives.
Remember error_j = (expected_j - output_j) * transfer_derivative(output_j), where error_j is the error of neuron j in the output layer.
(At first I could not see why it is not error = (expected - output) / transfer_derivative(output). The chain rule is the reason: the error signal we want is d(loss)/dz = d(loss)/d(output) * d(output)/dz, and d(output)/dz is exactly the transfer derivative, so it multiplies rather than divides.)
error_k = SUM_j(weight_jk * error_j) * transfer_derivative(output_k)
For neuron k in the hidden layer prior to the output layer, its error contribution is error_k, attributed through the weights weight_jk connecting it to each output-layer neuron j, where error_j is the error of that neuron j. This error_k then plays the role of the output error for the layer feeding into k.
With this error, the adjustment to each weight is weight_k = weight_k + learning_rate * error_k * input.
error_l = SUM_k(weight_kl * error_k) * transfer_derivative(output_l) - the same rule applied one more layer back, and so on until the first layer's weights.
to be completed later
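In the meantime, here is a minimal sketch of the backward pass as described above (the dict-per-neuron representation is my own; weights[0] is the bias, matching the feed-forward sketch earlier):

```python
def transfer_derivative(output):
    return output * (1.0 - output)          # sigmoid only

def backward_propagate(network, expected):
    # network: list of layers; each neuron is a dict with 'weights', 'output', 'delta'
    for i in reversed(range(len(network))):
        layer = network[i]
        if i == len(network) - 1:           # output layer
            errors = [expected[j] - neuron['output']
                      for j, neuron in enumerate(layer)]
        else:                               # hidden layer: sum weighted deltas
            # weights[k + 1] skips the bias at index 0
            errors = [sum(n['weights'][k + 1] * n['delta'] for n in network[i + 1])
                      for k in range(len(layer))]
        for j, neuron in enumerate(layer):
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])
```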
There will be many records (data samples) for training - the more, the better the performance of the model tends to be. After training, one naturally needs to test the model before even considering deployment or further improvement. The test data should be unseen by the model during training.
Dataset preparation is to divide the entire dataset into two subsets: train and test.
In this e-book, two methods are explained: train_test_split() and cross_validation_split().
train_test_split() is easier: decide the ratio between the two, for example 8000 records for training and 2000 for testing (an 80/20 split). But its reliability is not as high as the next method's.
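A sketch of such a split (the seed parameter is mine, to make the shuffle repeatable):

```python
import random

def train_test_split(dataset, split=0.8, seed=1):
    rows = list(dataset)
    random.Random(seed).shuffle(rows)       # shuffle so the split is random
    cut = int(len(rows) * split)            # e.g. 0.8 -> 80/20 split
    return rows[:cut], rows[cut:]           # (train, test)
```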
Cross validation is also called k-fold cross validation. This method offers better reliability than the simple train/test split. Decide the value of k (e.g. 3 for a smaller dataset, 10 for larger ones), then divide the records into k groups of equal size. If any remainder records are left over after forming the k groups, leave them out and do not use them.
Then train the model with k-1 groups and test it with the one that is held out. For example, with k = 10, 9 groups are used for training and 1 for testing.
Thus the training is repeated k times, so that every group of data gets a chance to be used for testing. Fairness!
Use the mean (average) of the k metric scores as the final measure.
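A sketch of the k-fold split itself (remainder rows are simply dropped, as described above):

```python
import random

def cross_validation_split(dataset, k=10, seed=1):
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    fold_size = len(rows) // k              # remainder rows are dropped
    return [rows[i * fold_size:(i + 1) * fold_size] for i in range(k)]

# then: for each fold, train on the other k-1 folds, test on that fold,
# and average the k metric scores.
```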
Dataset splitting is the second step of data preprocessing. First, the data inside the set needs to be checked and processed according to requirements. In production code, you don't need to write these functions yourself, as there are library functions for that - for example, sklearn.model_selection.train_test_split(*arrays, **options).
This (checking and processing the data) is the first step of preprocessing in model building, before dataset splitting.
To be completed. (missing data, normalise, standardise...)
4 main simple metrics: accuracy, confusion matrix, MAE (mean absolute error) and RMSE (root mean squared error). As an extension, I worked out the F1 score, recall, and precision based on the confusion matrix.
The first two are for classification problems, where the prediction (result) is categorical - classes, e.g. dog, cat or parrot; the last two are for regression, where the data and prediction (result) are continuous, e.g. 2019.5.
Accuracy = correctPredictions / totalPredictions * 100. For example, if out of 100 predictions the model predicted 50 correctly, the accuracy is 50%.
Confusion matrix: it counts the TruePositives, TrueNegatives, FalsePositives and FalseNegatives. This matrix is always square.
                Predicted
Actual          Positive    Negative
Positive        60          10
Negative        5           75
Accuracy = (TP + TN) / total = (60 + 75) / (60 + 10 + 5 + 75) = 135/150 = 90%
Precision = TP / (TP + FP) = 60 / (60 + 5) = 12/13 ≈ 0.923
Recall = TP / (TP + FN) = 60 / (60 + 10) = 6/7 ≈ 0.857
F1 score = 2 * precision * recall / (precision + recall) = 2 * (12/13) * (6/7) / (12/13 + 6/7) = 8/9 ≈ 0.889
100% is the best.
                Predicted
Actual          Positive    Negative
Positive        TP          FN    (FalseNegative - actually positive, falsely predicted negative)
Negative        FP          TN    (FalsePositive - actually negative, falsely predicted positive)
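The same metrics in a few lines of Python, using the numbers from the table above (TP = 60, FN = 10, FP = 5, TN = 75):

```python
TP, FN, FP, TN = 60, 10, 5, 75
accuracy = (TP + TN) / (TP + FN + FP + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)   # 0.9, ~0.923, ~0.857, ~0.889
```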
MAE = sum(abs(actual_i - predicted_i)) / total_predictions
RMSE = sqrt(sum((actual_i - predicted_i)**2) / total_predictions)
0 is the best.
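And the two regression metrics as short functions (the example numbers are mine):

```python
from math import sqrt

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(mae([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))   # 0.2
print(rmse([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))  # ~0.216
```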