After extracting and reducing features, we used them to train classification models. We identified the best combination of features and classifier by comparing the accuracy of each model's predictions.
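As a minimal sketch of that evaluation step (assuming scikit-learn, with a synthetic dataset standing in for our extracted features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the extracted/reduced features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train one candidate model and score its predictions on held-out data;
# the same loop is repeated for each feature/classifier combination.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```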
Logistic regression transforms its input into a probability value using the logistic sigmoid function, which maps any real number to a value between 0 and 1. The input values (X1, X2, …, Xn) are the features. The features are combined linearly using weights/coefficients (β1, β2, …, βn) plus a bias or intercept term β0 to obtain the formula:
t = β0 + β1X1 + β2X2 + … + βnXn
t is then used to calculate the probability of an element being classified as 1 given the observed features, using:
σ(t) = 1 / (1 + e^(-t))
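These two formulas can be computed directly; the following is a minimal sketch with NumPy, using purely illustrative feature values and coefficients:

```python
import numpy as np

# Illustrative values only: three features and hypothetical coefficients.
X = np.array([1.5, -0.3, 2.0])      # features X1..X3
beta = np.array([0.4, -1.2, 0.8])   # weights beta1..beta3
beta0 = 0.5                         # bias / intercept term

t = beta0 + np.dot(beta, X)         # t = beta0 + beta1*X1 + ... + betan*Xn
prob = 1.0 / (1.0 + np.exp(-t))     # sigma(t) = 1 / (1 + e^(-t))
print(prob)                         # probability of belonging to class 1
```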
KNN relies on the assumption that similar examples should receive the same classification. Examples are represented as points in a feature space, and similarity is measured by the distance between those points. Put simply, it gives a new example the same classification as its "closest" training sample(s).
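A minimal sketch of this idea, assuming scikit-learn's KNeighborsClassifier and a synthetic dataset for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# With k=1 the classifier copies the label of the single closest training
# sample; a larger k takes a majority vote among the k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
print(knn.predict(X[:3]))
```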
Random Forest is built from many decision trees (each of which classifies via a sequence of yes/no questions), trained on different subsets of the features and/or the training set. The prediction of each decision tree is recorded, and the final prediction is the class chosen by the majority of the trees.
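A minimal scikit-learn sketch of this ensemble (the dataset and tree count here are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 100 decision trees, each fit on a bootstrap sample of the data with
# random feature subsets; the forest predicts by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))
```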
The SVM tries to find the hyperplane that best separates the data points of different classes. It can also classify non-linearly separable data by using kernels, which implicitly map the data into a higher-dimensional space. In our code, it is called a Support Vector Classifier (SVC).
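A minimal SVC sketch, assuming scikit-learn and synthetic data; the RBF kernel choice here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The RBF kernel lets the SVC separate classes that are not linearly
# separable in the original feature space.
svc = SVC(kernel="rbf")
svc.fit(X, y)
print(svc.predict(X[:3]))
```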
A neural network is composed of a large number of interconnected neurons that work in unison to solve a specific problem. A CNN is a neural network whose inputs are images, which allows certain properties to be encoded directly into the architecture so that it can recognize specific elements within those images. CNNs process images as tensors, i.e., multi-dimensional arrays of numbers.
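As a rough sketch of such an architecture (the framework here, Keras, and all layer sizes and the input shape are our own illustrative choices, not necessarily those used in the project):

```python
from tensorflow.keras import layers, models

# Input: 64x64 RGB images as tensors of shape (height, width, channels).
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),  # learn local image features
    layers.MaxPooling2D((2, 2)),                   # downsample feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),         # binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```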