SMOTE

Oversampling the minority class with synthetic data.

In machine learning, training data is often imbalanced when the goal is to detect rare events, such as fraud, rare diseases, or natural disasters.

The percentage of frauds can be much lower than that of normal transactions, say 1%.

When training a model on imbalanced data, things get difficult because the usual measurement indicators don't work well.

For example,

  accuracy = (TP + TN) / (TP + FP + TN + FN)

  Because the number of positives is very small (1% in the fraud example), the accuracy is already 99% if the model simply predicts everything as negative.
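To make the accuracy trap concrete, here's a tiny illustration (the data is made up: 1,000 transactions with 1% fraud):

```python
import numpy as np

# Hypothetical data: 1000 transactions, 10 of which (1%) are fraud
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)   # model predicts "normal" for everything

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.99 -- looks great, yet not a single fraud is caught
```

This is why metrics like precision, recall, or F1 on the minority class are more informative here than raw accuracy.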

Also, because the positive examples are so few, it's hard for the model to learn from them. The majority class over-dominates the training data.

One way to deal with this is to under-sample the majority class so it's not too dominant, and over-sample the minority class so it contributes more to the learned rules.

However, simple under-sampling / over-sampling usually doesn't work well.

A slightly better way (probably only slightly better) is to oversample the minority class by creating synthetic data. This is what SMOTE (Synthetic Minority Over-sampling Technique) does.

Instead of just repeating the same data sample, it finds the nearest neighbours of a sample and inserts a new synthetic sample between it and a neighbour.

For example, suppose we need to oversample the minority class by 200%.

1. For each sample in the minority class:

      1.1 Find its 10 nearest neighbours (can also be 5 nearest neighbours, etc.).

      1.2 Randomly choose 2 of the 10 neighbours (so as to oversample 200%; choose 3 if you need 300%).

      1.3 For each of the chosen neighbours:

          1.3.1 Pick a random position between the sample and the neighbour.

                This is done by taking a random fraction (0 to 100%) of the distance between the sample and the neighbour, for each feature dimension.

          1.3.2 That random position is the new synthetic sample.

So essentially, for each minority sample it chooses a few of the nearest neighbours (how many depends on the oversampling rate) and inserts a synthetic sample between the sample and each chosen neighbour.

The synthetic sample is placed at a random position between the original sample and the neighbour.
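The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name, signature, and defaults are my own, and it follows the per-dimension random fraction described above:

```python
import numpy as np

def smote_oversample(X_min, n_percent=200, k=5, rng=None):
    """Minimal SMOTE sketch (hypothetical helper, not from any library).

    X_min     : minority-class samples, shape (n, d)
    n_percent : oversampling rate in percent (200 -> 2 synthetic per sample)
    k         : number of nearest neighbours to consider
    """
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    per_sample = n_percent // 100
    synthetic = []
    for i in range(n):
        # 1.1 distances from sample i to every other minority sample
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        dists[i] = np.inf                     # exclude the sample itself
        neighbours = np.argsort(dists)[:k]    # k nearest neighbours
        # 1.2 randomly pick `per_sample` of the k neighbours
        chosen = rng.choice(neighbours, size=per_sample, replace=False)
        for j in chosen:
            # 1.3 random fraction of the distance, per feature dimension
            gap = rng.random(d)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

With 6 minority samples and `n_percent=200`, this returns 12 synthetic samples, each lying between an original sample and one of its neighbours.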

That's all there is to SMOTE. Simple, but it sometimes works.

One caveat: when a feature is binary (0 / 1), would the interpolation work at all? A value of 0.5 means nothing.
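The problem is easy to see directly: interpolating between two binary vectors produces fractional values that aren't valid categories (the vectors here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1, 0, 1])          # two samples with binary features
b = np.array([0, 0, 1])

gap = rng.random(3)              # random fraction per feature dimension
synthetic = a + gap * (b - a)
print(synthetic)                 # first entry is fractional -- not a valid 0/1 value
```

Where the two samples agree, the synthetic value stays valid; where they differ, it lands strictly between 0 and 1. For data with categorical features, imbalanced-learn provides variants such as SMOTENC that treat categorical columns differently instead of interpolating them.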

from imblearn.over_sampling import SMOTE

# `ratio`, `kind`, and `fit_sample` belong to an old imbalanced-learn API;
# current versions use `sampling_strategy` and `fit_resample` instead.
smote = SMOTE(sampling_strategy='minority', k_neighbors=5)

# fit_resample returns the full resampled dataset (original + synthetic rows),
# not just the extra samples
X_resampled, y_resampled = smote.fit_resample(X, y)