The Naive Bayes (NB) method is a classification method based on Bayes' theorem, one of the most fundamental theorems in probability.
To understand what an NB classifier does, recall that a classifier estimates the probability of each class given the observed features. In other words, we have to find the probability of every class conditioned on the features of interest, e.g., P(C|x).
We can always specify a prior distribution over the classes (their probabilities), as well as the likelihood of an observation given that it belongs to a particular class.
Bayes' theorem gives us a way to find the posterior distribution (probability) of class C when a feature x is observed, based on the prior and the likelihood.
Let us ask the following question: what is the probability that a specific class C occurs if we are given a feature X? In other words, what is P(C|X)?
This is precisely the question we ask whenever we do classification. If the probability of one class is greater than that of the others, then it is more likely that an individual with these features belongs to that class. In NB, we answer this question using Bayes' theorem.
For a given feature vector X, P(C|X) is proportional to the probability of class C times the probability of the features conditioned on the class label being C. Let us explain this in more detail. We know the probability of each class C occurring, i.e., P(C). But if we want the probability of class C given a particular set of features X, we have to modify this probability by the probability of observing X when the sample belongs to class C. In mathematical terms, P(C|X) is proportional to P(X|C) × P(C). P(C) is called the prior, as it is the probability of the class before any observation. P(X|C) is called the likelihood, as it is the likelihood of observing X given that the sample belongs to class C. P(C|X) is called the posterior, as it is the updated probability after observing X. In this terminology, the posterior is obtained by modifying the prior, multiplying it by the likelihood.
But NB makes an additional assumption: all features are independent given the class (hence "naive"). Even though this assumption is rarely true in practice, it is very useful. It lets us compute the likelihood much more easily, by simply multiplying the likelihoods of the individual features of a given sample, as in the sketch below.
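Here is a minimal Python sketch of this idea, assuming two binary features and made-up numbers for the priors and likelihoods (they are not taken from any data set). It only illustrates how "posterior is proportional to the prior times the product of per-feature likelihoods" turns into a predicted class.

```python
# A minimal sketch of the Naive Bayes update for a single sample with two
# binary features. All numbers are hypothetical, chosen only to illustrate
# "posterior proportional to prior times likelihood".

priors = {"C0": 0.6, "C1": 0.4}            # assumed P(C) for each class

# Assumed P(feature = 1 | C); independence lets us store one value per feature.
likelihoods = {
    "C0": {"x1": 0.2, "x2": 0.7},
    "C1": {"x1": 0.8, "x2": 0.3},
}

sample = {"x1": 1, "x2": 0}                 # the observed feature values

unnormalized = {}
for c, prior in priors.items():
    p = prior
    for feat, value in sample.items():
        p_feat_given_c = likelihoods[c][feat]
        # P(feature = value | C): use p for value 1, (1 - p) for value 0.
        p *= p_feat_given_c if value == 1 else 1.0 - p_feat_given_c
    unnormalized[c] = p                     # prior * product of likelihoods

total = sum(unnormalized.values())
posteriors = {c: p / total for c, p in unnormalized.items()}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```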
Consider the example from the introduction section. We have three features denoted by
x=education,
y=employment,
z=marriage
all belonging to {0,1}.
In this step we estimate the probabilities we need, namely the likelihood of each feature value within each class.
Then, using these likelihoods together with the priors, we can choose a class for any feature vector. The chosen label is highlighted in green.
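The following Python sketch walks through both steps for three binary features named after the example above (education, employment, marriage). The rows of the toy data set are hypothetical, not the data from the introduction section; the point is only to show how priors and likelihoods are estimated by counting, and how the class with the largest posterior score is picked.

```python
# Sketch of the two steps above: estimate priors and per-feature likelihoods
# from labeled data, then pick the class with the largest posterior score for
# a new sample. The rows below are made up; features are
# (education, employment, marriage) in {0, 1}.

data = [
    # (x, y, z), class label
    ((1, 1, 0), "A"),
    ((1, 0, 1), "A"),
    ((0, 1, 1), "B"),
    ((0, 0, 1), "B"),
    ((1, 1, 1), "A"),
    ((0, 0, 0), "B"),
]

classes = {label for _, label in data}
n = len(data)

# Priors: P(C) = count(C) / n.
priors = {c: sum(1 for _, lbl in data if lbl == c) / n for c in classes}

# Likelihoods: P(feature_i = 1 | C), estimated by counting within each class.
def likelihood_one(c, i):
    rows = [feats for feats, lbl in data if lbl == c]
    return sum(feats[i] for feats in rows) / len(rows)

likelihoods = {c: [likelihood_one(c, i) for i in range(3)] for c in classes}

def predict(sample):
    scores = {}
    for c in classes:
        score = priors[c]
        for i, value in enumerate(sample):
            p1 = likelihoods[c][i]
            score *= p1 if value == 1 else 1.0 - p1
        scores[c] = score
    return max(scores, key=scores.get)      # class with the largest score

print(predict((1, 0, 0)))
```

In practice one would also add Laplace smoothing, so that a feature value never seen in a class does not force the whole product to zero.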
Linear Discriminant Analysis (LDA) is used as a dimensionality reduction technique, mostly in the pre-processing step for pattern classification and ML applications. Predictions are made by calculating a discriminant value for each class and predicting the class with the largest value. The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from the data. It is a simple and powerful method for classification predictive modeling problems. The goal is to project a data set onto a lower-dimensional space with good class separability, in order to avoid overfitting (the "curse of dimensionality") and also reduce computational costs. The general LDA approach is very similar to Principal Component Analysis (PCA), but in addition to finding the component axes that maximize the variance of the data (PCA), we are interested in the axes that maximize the separation between multiple classes (LDA).
In short, LDA is a dimension reduction method that maximizes the separability of two (or more) classes of labeled n-dimensional data.
It consists of statistical properties of the data, calculated for each class. For a single input variable this includes:
The mean value for each class.
The variance calculated across all classes.
Here is a picture illustrating how LDA works: it seeks a projection that maximizes the distance between the class centroids while minimizing the within-class variance.
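As an illustration, here is a short sketch of LDA used as a supervised dimensionality reduction step. It assumes scikit-learn and NumPy are available and uses synthetic Gaussian data, so it is only a sketch of the workflow, not a recipe for any particular data set.

```python
# A minimal sketch of LDA as a supervised dimensionality reduction step,
# using scikit-learn on synthetic data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two Gaussian classes in 4 dimensions (LDA assumes roughly Gaussian classes).
X0 = rng.normal(loc=0.0, scale=1.0, size=(50, 4))
X1 = rng.normal(loc=2.0, scale=1.0, size=(50, 4))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Project onto at most (number of classes - 1) axes that maximize class separation.
lda = LinearDiscriminantAnalysis(n_components=1)
X_projected = lda.fit_transform(X, y)        # shape (100, 1)

# The same fitted model can also classify new points via its discriminant values.
print(X_projected.shape, lda.predict(X[:5]))
```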
kNN, or k-nearest neighbors, is one of the most popular methods because it is simple to implement. As we will discuss, it needs almost no sophisticated machinery to classify a sample; it simply uses the whole data set for the classification. It is also one of the most used methods for feature selection and feature importance.
Here we illustrate the 3- and 5-nearest-neighbor methods. The data are labeled red and blue. Given the new data point (the black point), we want to label it properly. If we use 3-NN, then among the 3 nearest neighbors red has the majority, so we label the point red. If we use 5-NN instead, the label will be blue. So k, which is a hyper-parameter, must be chosen very carefully using validation methods, as in the sketch below.
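As a sketch of such a validation step, the snippet below (assuming scikit-learn and NumPy, with synthetic data and an arbitrary list of candidate k values) scores a few choices of k with 5-fold cross-validation; one would keep the k with the best mean score.

```python
# Sketch: pick k by cross-validation on a synthetic two-class data set.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())                 # keep the k with the best mean score
```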
kNN can be used for both classification and regression. It works as follows:
For classification, we consider the feature space as a subset of R^n and assume the data are labeled. For a new sample, we find its k nearest neighbors and label the new sample with the label that has the majority among those k nearest samples.
For regression, similarly, we consider every sample as a point in the n-dimensional space of feature values. For any new sample, we again find the k nearest neighbors and assign the average of their values to the new sample. Both cases are sketched below.
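A from-scratch sketch of both variants, using plain NumPy and Euclidean distances, might look as follows; the tiny training set is made up for illustration.

```python
# Sketch of both uses of kNN: majority vote for classification, averaging for
# regression. Distances are plain Euclidean.
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]             # indices of the k nearest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]               # majority label

def knn_regress(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(np.mean(y_train[nearest]))         # average of the neighbors' values

# Tiny made-up example.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0], [1.2, 0.9]])
labels  = np.array(["red", "red", "blue", "blue", "blue"])
values  = np.array([1.0, 1.2, 3.0, 3.1, 2.9])

print(knn_classify(X_train, labels, np.array([1.0, 1.0]), k=3))
print(knn_regress(X_train, values, np.array([1.0, 1.0]), k=3))
```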
However, this method needs careful data cleaning (unlike, e.g., trees). This is essential because outliers can have an extreme impact, and failing to normalize the data can lead to degenerate results. The reason we have to normalize the data is that otherwise one dimension (one feature) can have a much larger scale than the others, and can therefore skew the prediction towards the dimension with the larger values.
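To see why scaling matters, here is a small sketch (with made-up numbers) in which one feature lives on a much larger scale than the other; standardizing each feature to zero mean and unit variance makes the two features contribute comparably to the distance.

```python
# Sketch: a large-scale feature dominates Euclidean distance until the
# features are standardized. The data and scales are made up for illustration.
import numpy as np

X = np.array([
    [1.0, 20000.0],     # feature 2 is in a much larger unit (e.g., income)
    [1.2, 21000.0],
    [5.0, 20500.0],
])
x_new = np.array([1.1, 20500.0])

# Raw distances are dominated by feature 2.
print(np.linalg.norm(X - x_new, axis=1))

# Standardize each feature to zero mean and unit variance; now both features
# contribute comparably to the distance.
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std
x_new_scaled = (x_new - mean) / std
print(np.linalg.norm(X_scaled - x_new_scaled, axis=1))
```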