What can LDA do?
Dimension Reduction
Classification
Contributions of predictors
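A minimal sketch of all three uses, assuming scikit-learn is available (the iris data is just a convenient stand-in, not from the notes):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 predictors, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)  # dimension reduction: 4 predictors -> 2 axes
pred = lda.predict(X)                # classification: one label per sample
weights = lda.coef_                  # per-class predictor weights (contributions)
```

The same fitted model serves all three purposes: `transform` gives the reduced axes, `predict` the labels, and `coef_` shows how each predictor contributes to each class's score.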
Idea behind LDA
Maximize the distance between class means
Minimize the variation within each class
What distance do we use here?
Usually Euclidean distance, which has three cons:
it depends on the units of measurement
it does not account for each predictor's variation
it does not account for correlation between predictors
So, we use statistical (Mahalanobis) distance here
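A small numpy sketch of the difference (the covariance values are made up for illustration): with correlated predictors, the statistical distance of a point from the center differs from its Euclidean distance because it is measured in units of the data's own spread.

```python
import numpy as np

rng = np.random.default_rng(0)
# two correlated predictors; correlation makes Euclidean distance misleading
cov = np.array([[4.0, 1.8], [1.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

point = np.array([2.0, 0.0])
diff = point - mu

euclidean = np.sqrt(diff @ diff)             # unit-dependent, ignores correlation
mahalanobis = np.sqrt(diff @ S_inv @ diff)   # statistical distance
```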
Let's take a 2-d example:
In the picture on the left, we have predictors X and Y, and color represents the class.
It is hard to identify the class using X or Y alone.
But when we use both X and Y together, LDA finds new axes along which a single line can separate the classes.
So, we define a ratio: the distance between the two class means divided by the sum of their standard deviations.
Goal: maximize this ratio (separation between classes relative to the spread within them)
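This ratio can be computed directly for a 1-d projection; the two toy samples below are made up for illustration:

```python
import numpy as np

# two classes' values after projecting onto one candidate axis (toy numbers)
a = np.array([1.0, 1.2, 0.8, 1.1, 0.9])
b = np.array([3.0, 3.3, 2.7, 3.1, 2.9])

# ratio: distance between class means over the sum of their standard deviations;
# LDA picks the axis that makes this ratio as large as possible
ratio = abs(a.mean() - b.mean()) / (a.std(ddof=1) + b.std(ddof=1))
```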
Method:
use the classification functions to compute a classification score for each class
compute each class's probability, then assign the label with the larger probability
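A sketch of the two steps with numpy, under the equal-covariance assumption (the simulated data and the point `x_new` are made up): each class gets a linear score, and the larger score wins.

```python
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], np.eye(2), size=100)  # class 0 sample
X1 = rng.multivariate_normal([3, 3], np.eye(2), size=100)  # class 1 sample

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# pooled covariance, shared across classes per the LDA assumption
S = ((len(X0) - 1) * np.cov(X0, rowvar=False)
     + (len(X1) - 1) * np.cov(X1, rowvar=False)) / (len(X0) + len(X1) - 2)
S_inv = np.linalg.inv(S)
prior0 = prior1 = 0.5

def score(x, mu, prior):
    # linear classification score: x' S^-1 mu - 0.5 mu' S^-1 mu + log(prior)
    return x @ S_inv @ mu - 0.5 * mu @ S_inv @ mu + np.log(prior)

x_new = np.array([2.5, 2.8])
s0, s1 = score(x_new, mu0, prior0), score(x_new, mu1, prior1)
label = 0 if s0 > s1 else 1  # assign the class with the larger score
```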
Assumptions:
Multivariate normal distribution
the correlation structure among the predictors within a class is the same across classes (a common covariance matrix)
Use EDA to check whether the data fits these assumptions.
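One simple EDA check (a sketch on simulated data, not a formal test): compare the per-class correlation matrices; if the equal-covariance assumption holds, they should be close.

```python
import numpy as np

rng = np.random.default_rng(2)
# simulated data where both classes share the same correlation structure
X0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=300)
X1 = rng.multivariate_normal([2, 2], [[1, 0.5], [0.5, 1]], size=300)

corr0 = np.corrcoef(X0, rowvar=False)
corr1 = np.corrcoef(X1, rowvar=False)
# a large entry here would suggest the common-correlation assumption fails
max_diff = np.abs(corr0 - corr1).max()
```

Histograms or Q-Q plots per predictor and class are the analogous check for the normality assumption.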
Performance: as with any classifier, we use accuracy, ROC, and AUC to evaluate it.
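A sketch of the evaluation, assuming scikit-learn (the synthetic dataset is just a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
acc = accuracy_score(y_te, lda.predict(X_te))                  # hard labels
auc = roc_auc_score(y_te, lda.predict_proba(X_te)[:, 1])       # ranking quality
```

Accuracy scores the hard labels, while AUC scores the ranking induced by the predicted probabilities; `roc_curve` from the same module gives the points for the ROC plot.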
Prior Probability
Situation: the probabilities of encountering records of the different classes in the future are not equal, so pass these prior probabilities to the model.
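A sketch with scikit-learn's `priors` parameter (the simulated data and the 90/10 prior are made up): if a class will be rare in the future, saying so lowers its posterior probability for every record.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1, size=(50, 2)),    # class 0
               rng.normal(1.5, 1, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

# default: priors estimated from the (balanced) training frequencies
lda_default = LinearDiscriminantAnalysis().fit(X, y)
# if class 1 will only be 10% of future records, pass that prior explicitly
lda_prior = LinearDiscriminantAnalysis(priors=[0.9, 0.1]).fit(X, y)

x_new = np.array([[0.8, 0.8]])
p_default = lda_default.predict_proba(x_new)[0, 1]
p_prior = lda_prior.predict_proba(x_new)[0, 1]  # lower, due to the smaller prior
```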
Cost
Situation: misclassification costs are not symmetrical
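With asymmetric costs, one common adjustment (a sketch; the probabilities and the 4:1 cost ratio are made up) is to move the classification cutoff away from 0.5: predict the positive class whenever its expected misclassification cost is lower.

```python
import numpy as np

# posterior probabilities of the "positive" class for four records (toy values)
p_pos = np.array([0.30, 0.45, 0.60, 0.85])

# symmetric costs: predict positive when p > 0.5
pred_symmetric = (p_pos > 0.5).astype(int)

# asymmetric: a missed positive costs 4x a false alarm; predicting positive is
# cheaper when (1 - p) * cost_fp < p * cost_fn, i.e. p > cost_fp / (cost_fp + cost_fn)
cost_fn, cost_fp = 4.0, 1.0
threshold = cost_fp / (cost_fp + cost_fn)   # 0.2 instead of 0.5
pred_costed = (p_pos > threshold).astype(int)
```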
Pros:
Easy to compute
Compared to many other classification models, LDA can tell the contribution of each predictor
Cons:
Need to check the two assumptions (multivariate normality and a common correlation structure)
LDA is sensitive to outliers, so remove them first
Predictors need transformation (e.g., standardization) because of the distance calculation
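The transformation in the last point is typically a z-score standardization, sketched here with numpy (the two made-up predictors sit on very different scales, which would otherwise dominate the distance):

```python
import numpy as np

rng = np.random.default_rng(4)
# two predictors on very different scales (e.g., meters vs. dollars)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

# z-score transform so distance calculations are not dominated by one unit
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```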