This e-book can be found on machinelearningmastery.com and costs $37.
This is part of my study notes.
This book contains the following linear algorithms: simple linear regression, multivariate linear regression, logistic regression, and perceptron.
For each one, there are three basic questions: what it is, how to do it, and when to use it.
- Simple linear regression: a simple prediction method, to predict the result y from an input x. The formula is y = ax + c, where x is the independent variable, y is the dependent variable, and a and c are constants.
When given pairs of data (x1, y1), (x2, y2), ..., a and c can be worked out. After that, a y can be predicted for any new x input. Sounds simple! This has been used in science and engineering for hundreds of years.
a = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)**2)
c = y_mean - a * x_mean
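A minimal Python sketch of these two formulas (the function name and the tiny dataset are mine, for illustration):

```python
# Fit y = a*x + c with the least-squares formulas above.
def fit_simple_linear(xs, ys):
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # a = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)**2)
    a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    c = y_mean - a * x_mean
    return a, c

a, c = fit_simple_linear([1, 2, 4, 3, 5], [1, 3, 3, 2, 5])
print(a, c)  # a = 0.8, c = 0.4
```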
- Multivariate linear regression: this is more realistic, as real problems tend to have more than one input variable. It is also a prediction method, and the same training procedure (gradient descent) is widely used for optimisation elsewhere, e.g. the weights of neural networks. (Strictly speaking, "multivariate" means multiple output variables and "multivariable" means multiple input variables x1, x2, x3, ...; the book uses "multivariate" loosely for the multiple-input case.)
y = a1*x1 + a2*x2 + a3*x3 + ... + c
error = (predicted - expected) - defined this way round so that the minus sign in the update below moves the weights to reduce the error
l_rate and epochs are hyperparameters - chosen by humans, and adjustable to get the best result. l_rate is the fraction (percentage) of the error used for adjusting the weights (a1, a2, a3, ...). If l_rate = 1.0, the entire error is used in the update; if 0.1, just 10% of it.
a = a - l_rate * error * x; c = c - l_rate * error
One epoch is one pass of training in which all input data have been used once. So epochs is the number of times (loops) the entire training set is used for training. Usually, more epochs mean more chances to adjust the weights, hence a better model - but over-fitting can happen.
a(t+1) = a(t) - l_rate * error * x(t)
where t is the time step and t+1 is one step later.
dataset example: (x1, x2, x3,..., y)
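A hedged sketch of that stochastic gradient descent loop (the name train_sgd is mine; each row is (x1, ..., xn, y) as in the dataset example):

```python
# One weight update per record, repeated for the given number of epochs.
def train_sgd(rows, l_rate=0.01, epochs=50):
    n_inputs = len(rows[0]) - 1            # each row is (x1, ..., xn, y)
    coefs = [0.0] * n_inputs               # a1 .. an
    bias = 0.0                             # c
    for _ in range(epochs):
        for row in rows:
            *xs, y = row
            predicted = bias + sum(a * x for a, x in zip(coefs, xs))
            error = predicted - y          # error = predicted - expected
            bias -= l_rate * error         # c = c - l_rate * error
            coefs = [a - l_rate * error * x for a, x in zip(coefs, xs)]
    return coefs, bias
```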
Logistic regression: yhat = 1 / (1 + exp(-z)), where z = a1*x1 + a2*x2 + a3*x3 + ... + c
a = a + l_rate * (y - yhat) * yhat * (1 - yhat) * x (for each coefficient; the intercept c is updated the same way but without the * x factor)
yhat is the predicted result, x1, x2, ... are the variables, and y is the expected result.
dataset example: (x1, x2, x3,..., y)
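A sketch of logistic regression trained with that update rule (function names are mine; the structure mirrors the linear regression sketch above):

```python
from math import exp

def predict(xs, coefs, bias):
    z = bias + sum(a * x for a, x in zip(coefs, xs))
    return 1.0 / (1.0 + exp(-z))            # yhat = sigmoid(z)

def train_logistic(rows, l_rate=0.3, epochs=100):
    coefs = [0.0] * (len(rows[0]) - 1)
    bias = 0.0
    for _ in range(epochs):
        for row in rows:
            *xs, y = row
            yhat = predict(xs, coefs, bias)
            grad = (y - yhat) * yhat * (1 - yhat)   # shared factor of the update
            bias += l_rate * grad                    # intercept: no x factor
            coefs = [a + l_rate * grad * x for a, x in zip(coefs, xs)]
    return coefs, bias
```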
- Perceptron: 2-class classification with a one-neuron neural network; it does not use the sigmoid function, but outputs 1.0 or 0.0 through a step function.
activation = bias + sum(w_i * x_i)
prediction = 1.0 IF activation >= 0.0 ELSE 0.0
w = w + l_rate * (expected - predicted) * x
epochs - an epoch is one pass in which all the data records have been used for training once. Usually multiple epochs are used to train a model (learning algorithm), and the data records are fed in a shuffled (random) order.
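Putting the perceptron pieces together in a short sketch (function names are mine):

```python
# Step activation, then nudge the weights by the prediction error.
def perceptron_predict(xs, weights, bias):
    activation = bias + sum(w * x for w, x in zip(weights, xs))
    return 1.0 if activation >= 0.0 else 0.0

def train_perceptron(rows, l_rate=0.1, epochs=20):
    weights = [0.0] * (len(rows[0]) - 1)
    bias = 0.0
    for _ in range(epochs):
        for row in rows:
            *xs, expected = row
            predicted = perceptron_predict(xs, weights, bias)
            err = expected - predicted      # 0 when the prediction is right
            bias += l_rate * err
            weights = [w + l_rate * err * x for w, x in zip(weights, xs)]
    return weights, bias
```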
Decision tree: these days, if you ask what people dislike about AI and machine learning, most would probably say "it is a black box and the outcome is uninterpretable." That is true of most models, but not this one! A decision tree is easily traceable, and its output is interpretable.
gini_index = (1 - SUM(proportion_i**2)) * group_size / total_samples, where i runs over each class involved and proportion_i = count(class value i) / count(rows in group).

5 rows of contrived data: (smoker, moderate-exercise, weight, heart-disease)
(1, 0, 125, 1)
(1, 0, 65, 0)
(0, 1, 80, 0)
(1, 0, 80, 0)
(1, 1, 90, 1)

The tree could split on any of the three factors, but which is the most important one to be the root? Look at one factor at a time to get its Gini number, then compare.

smoker? Yes - (1, 0, 125, 1); (1, 0, 65, 0); (1, 0, 80, 0); (1, 1, 90, 1): 2 have heart disease, 2 do not. No - (0, 1, 80, 0): 1 row, no disease.
gini(yes group) = 1 - p(h_d=1)**2 - p(h_d=0)**2 = 1 - (2/4)**2 - (2/4)**2 = 0.5
gini(no group) = 1 - 0 - (1/1)**2 = 0
gini(smoker) = 0.5 * 4/5 + 0.0 * 1/5 = 0.4

exercise? Yes - (0, 1, 80, 0); (1, 1, 90, 1): gini = 1 - (1/2)**2 - (1/2)**2 = 0.5. No - (1, 0, 125, 1); (1, 0, 65, 0); (1, 0, 80, 0): gini = 1 - (2/3)**2 - (1/3)**2 = 4/9
gini(exercise) = 0.5 * 2/5 + (4/9) * 3/5 = 7/15 ≈ 0.467

weight? Sort by weight and take the mid-values as candidate split points:
(1, 0, 65, 0)    65-80  --> 72.5
(0, 1, 80, 0)    80-80  --> 80
(1, 0, 80, 0)    80-90  --> 85
(1, 1, 90, 1)    90-125 --> 107.5
(1, 0, 125, 1)
Yes (w < 72.5): gini = 1 - 1 - 0 = 0; No (w >= 72.5): gini = 1 - (2/4)**2 - (2/4)**2 = 0.5 --> gini(72.5) = 0.5 * 4/5 = 0.4
Yes (w < 80): gini = 0; No (w >= 80): gini = 0.5 --> gini(80) = 0.4
Yes (w < 85): gini = 1 - (3/3)**2 - 0 = 0; No (w >= 85): both rows (90 and 125) have heart disease, so gini = 1 - (2/2)**2 - 0 = 0 --> gini(85) = 0
Yes (w < 107.5): gini = 1 - (3/4)**2 - (1/4)**2 = 6/16; No (w >= 107.5): gini = 1 - 1**2 = 0 --> gini(107.5) = (6/16) * 4/5 = 0.3

So the best weight split is at 85, giving gini(weight) = 0.
Compare: gini(weight) = 0 < gini(smoker) = 0.4 < gini(exercise) = 7/15 ≈ 0.467, so the root is the weight split (w < 85). On these five rows it happens to separate the two classes perfectly, so no further splits on smoker or exercise are needed.
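A small sketch of the weighted Gini calculation, checked against the smoker split above (function name and layout are mine):

```python
# `groups` is a list of row groups produced by a candidate split;
# the class label is the last value of each row.
def gini_index(groups, classes):
    total = sum(len(g) for g in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue
        score = 0.0
        for c in classes:
            p = [row[-1] for row in group].count(c) / len(group)
            score += p ** 2
        gini += (1.0 - score) * len(group) / total
    return gini

# smoker split from the worked example: weighted gini = 0.4
yes = [(1, 0, 125, 1), (1, 0, 65, 0), (1, 0, 80, 0), (1, 1, 90, 1)]
no = [(0, 1, 80, 0)]
print(gini_index([yes, no], [0, 1]))  # 0.4
```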
Naive Bayes: a probability-based method, built on Bayes' theorem: P(class | data) = P(data | class) * P(class) / P(data).
The model training actually uses P(class | data) proportional to P(data | class) * P(class), since P(data) is the same for every class. It also treats the input variables as independent of each other given the class - that simplifying assumption is why it is called "naive".
mean = sum(x) / count(x),
stdev = sqrt(sum((x - x_mean)**2) / (count - 1)),
summarise the whole column,
summarise by class,
then use the Gaussian probability density p(x) = exp(-(x - x_mean)**2 / (2 * stdev**2)) / (sqrt(2 * Pi) * stdev)
P(class = 0 | X1, X2) = P(X1 | class = 0) * P(X2 | class = 0) * P(class = 0)
Example:
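A minimal Gaussian Naive Bayes scoring sketch (the class statistics in stats_class0 are made-up numbers, purely for illustration):

```python
from math import exp, pi, sqrt

# The Gaussian likelihood from the formula above.
def gaussian_pdf(x, mean, stdev):
    return exp(-((x - mean) ** 2) / (2 * stdev ** 2)) / (sqrt(2 * pi) * stdev)

def class_score(xs, class_stats, class_prior):
    # P(class | data) ~ P(class) * product of P(x_i | class)
    score = class_prior
    for x, (mean, stdev) in zip(xs, class_stats):
        score *= gaussian_pdf(x, mean, stdev)
    return score

# two features summarised for class 0 as (mean, stdev) pairs - made-up numbers
stats_class0 = [(2.7, 0.9), (1.4, 0.5)]
print(class_score([3.0, 1.2], stats_class0, class_prior=0.5))
```

Compute such a score per class and predict the class with the highest one.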
KNN (k-nearest neighbours): if history does repeat itself, you can predict the outcome more easily! KNN uses the data from the past that is most similar to the current data for prediction.
Either use THE most similar record, or the K neighbours closest to the data.
What is meant by "being similar"? What is the measure of similarity? The answer is the Euclidean distance (= sqrt(sum((x1_i - x2_i)**2))).
For a given data record, calculate its distance to each record in memory, choose the k nearest, and predict with their average (regression) or the most frequent value, the mode (classification).
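A compact KNN classification sketch (names are mine; swap the mode for a mean to get regression):

```python
from math import sqrt
from collections import Counter

def euclidean(a, b):
    return sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(a, b)))

def knn_predict(train_rows, new_xs, k=3):
    # rank every stored record by its distance to the new record
    ranked = sorted(train_rows, key=lambda row: euclidean(row[:-1], new_xs))
    labels = [row[-1] for row in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]   # mode; use mean for regression
```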
Learning vector quantization (LVQ): instead of using the original training dataset, which must be kept in memory during operation, a smaller set of records (the codebook) can be learned from the originals and used for the Euclidean distance calculation. Once the codebook is built, the rest is the same as KNN: find the k neighbours with the closest distance.
To obtain this codebook: first randomly initialise a set, then train it with the originals - for each training record, find its closest codebook vector and move that vector toward the record if their classes match (and away if they don't). The codebook ends up "similar enough to the originals" while being a much smaller set!
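A hedged sketch of that training loop (initialising the codebook from random copies of training rows is one common choice, not necessarily the book's exact method):

```python
import random
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(a, b)))

def train_lvq(rows, n_codebooks=4, l_rate=0.3, epochs=10):
    # randomly initialise the codebook from copies of training records
    codebook = [list(random.choice(rows)) for _ in range(n_codebooks)]
    for _ in range(epochs):
        for row in rows:
            *xs, label = row
            # best matching unit: the closest codebook vector
            bmu = min(codebook, key=lambda c: euclidean(c[:-1], xs))
            sign = 1.0 if bmu[-1] == label else -1.0   # attract or repel
            for i, x in enumerate(xs):
                bmu[i] += sign * l_rate * (x - bmu[i])
    return codebook
```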
- Backpropagation: this is the foundation of supervised learning! It was recently reported that someone invented a model training method without backpropagation. I am going to read it and place a summary underneath.
A couple of questions will pop up straight away if this is the first time you see this term. What is to be propagated back? Where is it "back" to?
An ANN (artificial neural network) consists of layers of artificial neurons. A deep ANN has multiple layers between the input and output layers. In operation, the input is fed into the input layer, then fed forward to the next layer, then the next, all the way to the output layer, where the human can read the result. All layers between the input and output layers are called hidden layers because they (their operations) are not visible to humans. When passing data to the next layer, two operations happen: a linear operation and a non-linear activation, as shown below.
z = w0 + w1*x1 + w2*x2 + w3*x3 + .... wn*xn
yhat = f(z)
where w0, w1, w2, w3, ..., wn are constants known as weights; they are the parameters of the neuron concerned (w0 is the bias).
x1, x2, x3, ..., xn are the inputs from the previous layer, or from outside (as for the input layer) - they are fed into the neuron concerned after weighting.
yhat is the output of the neuron concerned, and f(z) is the activation function - how the neuron decides to activate or not to activate at all. (No picture - can you imagine it? Because this is a study note, I can do without the image of the network for now.)
That process is called feed-forward: each layer's output is the input for the next layer.
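A short sketch of feed-forward with these two per-neuron operations (the list-of-layers representation is my own simplification):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def neuron_output(weights, inputs):
    # weights[0] is w0 (the bias); the rest pair up with the inputs
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)                       # yhat = f(z)

def feed_forward(network, inputs):
    # `network` is a list of layers; each layer is a list of weight vectors
    for layer in network:
        inputs = [neuron_output(w, inputs) for w in layer]
    return inputs                           # the output layer's values
```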
The final outcome coming out of the output layer, yhat, may be different from the true (actual, correct) output. So the error is used to indicate how good the prediction is, how good the model is, how well the algorithm has learned.
This error, together with the gradient of the activation function f, is used to work out how much each of x1, x2, x3, ..., xn has contributed to that error. The attributed error is passed backward so that each neuron can adjust its w0, w1, w2, w3, ..., wn to reduce its error contribution next time. This error attribution is carried out backward, layer by layer, until the last set of weights.
transfer_derivative(output) = output * (1 - output) for the sigmoid activation. Other activation functions have their own derivatives.
Remember error_j = (expected_j - output_j) * transfer_derivative(output_j), where error_j is the error of neuron j in the output layer.
(At first I could not see why it is not error = (expected - output) / transfer_derivative(output). The chain rule is the reason: the error signal we want is d(loss)/dz = d(loss)/d(output) * d(output)/dz, and d(output)/dz is exactly the transfer derivative, so it multiplies rather than divides.)
error_k = SUM_j(weight_jk * error_j) * transfer_derivative(output_k)
For neuron k in the hidden layer prior to the output layer, its error contribution is error_k, attributed through the weights weight_jk connecting it to each output-layer neuron j, where error_j is the error of that neuron j. This error_k then plays the role of the output error for the layer feeding into k.
With this error, the adjustment to each weight is weight_k = weight_k + learning_rate * error_k * input.
error_l = SUM_k(weight_kl * error_k) * transfer_derivative(output_l) - the same rule applied one more layer back, and so on until the first layer's weights.
to be completed later
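In the meantime, here is a minimal sketch of the backward pass as described above (the dict-per-neuron representation is my own; weights[0] is the bias, matching the feed-forward sketch earlier):

```python
def transfer_derivative(output):
    return output * (1.0 - output)          # sigmoid only

def backward_propagate(network, expected):
    # network: list of layers; each neuron is a dict with 'weights', 'output', 'delta'
    for i in reversed(range(len(network))):
        layer = network[i]
        if i == len(network) - 1:           # output layer
            errors = [expected[j] - neuron['output']
                      for j, neuron in enumerate(layer)]
        else:                               # hidden layer: sum weighted deltas
            # weights[k + 1] skips the bias at index 0
            errors = [sum(n['weights'][k + 1] * n['delta'] for n in network[i + 1])
                      for k in range(len(layer))]
        for j, neuron in enumerate(layer):
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])
```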
There will be many records (data samples) for training - the more, the better the performance of the model tends to be. After training, one naturally needs to test the model before even considering deployment or further improvement. The test data should be unseen by the model during training.
Dataset preparation is to divide the entire dataset into two subsets: train and test.
In this e-book, two methods are explained: train_test_split() and cross_validation_split().
train_test_split() is easier: decide the ratio between the two, for example 8000 records for training and 2000 for testing (an 80/20 split). But its reliability is not as high as the next method's.
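A sketch of such a split (the seed parameter is mine, to make the shuffle repeatable):

```python
import random

def train_test_split(dataset, split=0.8, seed=1):
    rows = list(dataset)
    random.Random(seed).shuffle(rows)       # shuffle so the split is random
    cut = int(len(rows) * split)            # e.g. 0.8 -> 80/20 split
    return rows[:cut], rows[cut:]           # (train, test)
```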
Cross validation is also called k-fold cross validation. This method offers better reliability than the simple train/test split. Decide the value of k (e.g. 3 for a smaller dataset, 10 for larger ones), then divide the records into k groups of equal size. If any remainder records are left over after forming the k groups, leave them out and do not use them.
Then train the model with k-1 groups and test it with the one that is held out. For example, with k = 10, 9 groups are used for training and 1 for testing.
Thus the training is repeated k times, so that every group of data gets a chance to be used for testing. Fairness!
Use the mean (average) of the k metric scores as the final measure.
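A sketch of the k-fold split itself (remainder rows are simply dropped, as described above):

```python
import random

def cross_validation_split(dataset, k=10, seed=1):
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    fold_size = len(rows) // k              # remainder rows are dropped
    return [rows[i * fold_size:(i + 1) * fold_size] for i in range(k)]

# then: for each fold, train on the other k-1 folds, test on that fold,
# and average the k metric scores.
```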
Dataset splitting is the second step of data preprocessing. First, the data inside the set needs to be checked and processed according to requirements. In production code, you don't need to write these functions yourself, as there are library functions for that - for example, sklearn.model_selection.train_test_split(*arrays, **options).
This (checking and processing the data) is the first step of preprocessing in model building, before dataset splitting.
To be completed. (missing data, normalise, standardise...)
4 main simple metrics: accuracy, confusion matrix, MAE (mean absolute error) and RMSE (root mean squared error). As an extension, I worked out the F1 score, recall, and precision based on the confusion matrix.
The first two are for classification problems, where the prediction (result) is categorical - classes, e.g. dog, cat or parrot; the last two are for regression, where the data and prediction (result) are continuous, e.g. 2019.5.
Accuracy = correctPredictions / totalPredictions * 100. For example, if out of 100 predictions the model predicted 50 correctly, the accuracy is 50%.
Confusion matrix: it counts the TruePositives, TrueNegatives, FalsePositives and FalseNegatives. This matrix is always square.
                Predicted
Actual          Positive    Negative
Positive        60          10
Negative        5           75
Accuracy = (TP + TN) / total = (60 + 75) / (60 + 10 + 5 + 75) = 135/150 = 90%
Precision = TP / (TP + FP) = 60 / (60 + 5) = 12/13 ≈ 0.923
Recall = TP / (TP + FN) = 60 / (60 + 10) = 6/7 ≈ 0.857
F1 score = 2 * precision * recall / (precision + recall) = 2 * (12/13) * (6/7) / (12/13 + 6/7) = 8/9 ≈ 0.889
100% is the best.
                Predicted
Actual          Positive    Negative
Positive        TP          FN    (FalseNegative - actually positive, falsely predicted negative)
Negative        FP          TN    (FalsePositive - actually negative, falsely predicted positive)
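The same metrics in a few lines of Python, using the numbers from the table above (TP = 60, FN = 10, FP = 5, TN = 75):

```python
TP, FN, FP, TN = 60, 10, 5, 75
accuracy = (TP + TN) / (TP + FN + FP + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)   # 0.9, ~0.923, ~0.857, ~0.889
```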
MAE = sum(abs(actual_i - predicted_i)) / total_predictions
RMSE = sqrt(sum((actual_i - predicted_i)**2) / total_predictions)
0 is the best.
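And the two regression metrics as short functions (the example numbers are mine):

```python
from math import sqrt

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(mae([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))   # 0.2
print(rmse([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))  # ~0.216
```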