MATLAB - SMOTE and Variant Implementations

SMOTE (Synthetic Minority Over-Sampling Technique)

by Manohar

 

29 Oct 2012

The SMOTE function takes feature vectors with dimension (r,n) and the target class with dimension (r,1) as input.

Description

The SMOTE (Synthetic Minority Over-Sampling Technique) function takes feature vectors with dimension (r,n) and the target class with dimension (r,1) as input, and returns final_features vectors with dimension (r',n) and the target class with dimension (r',1) as output.

Implementation based on:
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002 (arXiv:1106.1813).
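
To illustrate the interpolation step the description refers to, here is a minimal MATLAB sketch of SMOTE. It is not the submission's actual code; the function name, the k and N parameters, and the use of knnsearch (Statistics Toolbox) are assumptions made for the example.

    % Minimal SMOTE sketch: for each minority example, create N synthetic
    % points by interpolating toward a randomly chosen one of its k nearest
    % minority-class neighbors. All names here are illustrative.
    function [final_features, final_mark] = smote_sketch(features, mark, k, N)
        minority = features(mark == 1, :);           % assumes minority label 1
        m = size(minority, 1);
        nbr = knnsearch(minority, minority, 'K', k + 1);
        nbr = nbr(:, 2:end);                         % drop each point's self-match
        synthetic = zeros(m * N, size(features, 2));
        s = 0;
        for i = 1:m
            for j = 1:N
                s = s + 1;
                nn = minority(nbr(i, randi(k)), :);  % random nearest neighbor
                gap = rand;                          % interpolation factor in [0,1]
                synthetic(s, :) = minority(i, :) + gap * (nn - minority(i, :));
            end
        end
        final_features = [features; synthetic];
        final_mark = [mark; ones(m * N, 1)];
    end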

Acknowledgements

This file inspired Safe-Level SMOTE and ADASYN (improves class balance, extension of SMOTE).

MATLAB release: MATLAB 7.11 (R2010b)


ADASYN (improves class balance, extension of SMOTE)

by Dominic Siedhoff

 

17 Apr 2015 (Updated 23 Apr 2015)

ADASYN algorithm to reduce class imbalance by synthesizing minority class examples

File Information
Description

This submission implements the ADASYN (Adaptive Synthetic Sampling) algorithm as proposed in the following paper: 
H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning," Proc. IEEE International Joint Conference on Neural Networks (IJCNN), pp. 1322-1328, 2008.
The purpose of the ADASYN algorithm is to improve class balance by synthetically creating new examples from the minority class via linear interpolation between existing minority class examples. This approach by itself is known as SMOTE (Synthetic Minority Over-sampling Technique). ADASYN is an extension of SMOTE that creates more examples in the vicinity of the boundary between the two classes than in the interior of the minority class.
A demo script producing the title figure of this submission is provided.
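
As a rough sketch of the adaptive weighting ADASYN adds on top of SMOTE, consider the following MATLAB fragment. It is not the submission's code; the function name, the beta parameter, and the reliance on knnsearch are assumptions for illustration.

    % ADASYN sketch: decide how many synthetic points each minority example
    % gets, proportional to how many of its k nearest neighbors (in the whole
    % data set) belong to the majority class, then interpolate as in SMOTE.
    function synthetic = adasyn_sketch(Xmin, Xmaj, k, beta)
        G = round((size(Xmaj, 1) - size(Xmin, 1)) * beta); % total to create
        X = [Xmin; Xmaj];
        nbr = knnsearch(X, Xmin, 'K', k + 1);
        nbr = nbr(:, 2:end);                 % drop each point's self-match
        r = mean(nbr > size(Xmin, 1), 2);    % fraction of majority neighbors
        if sum(r) == 0, r = ones(size(r)); end
        g = round(r / sum(r) * G);           % per-example generation counts
        nbrMin = knnsearch(Xmin, Xmin, 'K', k + 1);
        nbrMin = nbrMin(:, 2:end);
        synthetic = zeros(sum(g), size(Xmin, 2));
        s = 0;
        for i = 1:size(Xmin, 1)
            for j = 1:g(i)
                s = s + 1;
                nn = Xmin(nbrMin(i, randi(k)), :);
                synthetic(s, :) = Xmin(i, :) + rand * (nn - Xmin(i, :));
            end
        end
    end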

Acknowledgements

SMOTEBoost and SMOTE (Synthetic Minority Over-Sampling Technique) inspired this file.

Required Products: Statistics and Machine Learning Toolbox
MATLAB release: MATLAB 7.13 (R2011b)
MATLAB Search Path: /


SMOTEBoost

by Barnan Das

 

26 Jun 2012

Implementation of the SMOTEBoost algorithm, used to handle the class imbalance problem in data.

File Information
Description

This code implements SMOTEBoost, an algorithm for handling the class imbalance problem in data with discrete class labels. It uses a combination of SMOTE and the standard boosting procedure AdaBoost to better model the minority class, providing the learner not only with the minority class examples that were misclassified in the previous boosting iteration but also with a broader representation of those instances (achieved by SMOTE). Since boosting algorithms give equal weight to all misclassified examples and sample from a pool of data that predominantly consists of the majority class, subsequent sampling of the training set is still skewed towards the majority class. Thus, to reduce the bias inherent in the learning procedure due to class imbalance and to increase the sampling weights of the minority class, SMOTE is applied at each round of boosting. This increases the number of minority class samples available to the learner and focuses the distribution on these cases at each boosting round. In addition to maximizing the margin for the skewed class dataset, this procedure also increases the diversity among the classifiers in the ensemble, because at each iteration a different set of synthetic samples is produced.

For more detail on the theoretical description of the algorithm, please refer to the following paper:
N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: Improving Prediction of the Minority Class in Boosting," Proc. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2003.

The current implementation of SMOTEBoost was done independently by the author for research purposes. To let users plug in a variety of weak learners for boosting, an interface to the Weka API is provided. Currently, four Weka algorithms can be used as the weak learner: J48, SMO, IBk, and Logistic.
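
The overall loop looks roughly like the following MATLAB sketch. It is only an outline of the idea: the actual submission calls Weka learners through its Java interface, whereas this sketch substitutes fitctree as a stand-in weak learner and reuses the hypothetical smote_sketch function from the SMOTE section above.

    % SMOTEBoost-style loop (illustrative outline, not the submission's code).
    function [learners, alpha] = smoteboost_sketch(X, y, T, k, N)
        n = size(X, 1);
        w = ones(n, 1) / n;                        % boosting sample weights
        learners = cell(T, 1);
        alpha = zeros(T, 1);
        for t = 1:T
            % oversample the minority class before fitting this round's learner
            [Xs, ys] = smote_sketch(X, y, k, N);
            learners{t} = fitctree(Xs, ys);        % stand-in weak learner
            pred = predict(learners{t}, X);
            err = sum(w .* (pred ~= y));           % weighted training error
            err = min(max(err, eps), 1 - eps);     % guard the log below
            alpha(t) = 0.5 * log((1 - err) / err);
            % misclassified examples gain weight, correct ones lose weight
            w = w .* exp(alpha(t) * (2 * (pred ~= y) - 1));
            w = w / sum(w);
        end
    end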

Acknowledgements

This file inspired ADASYN (improves class balance, extension of SMOTE).

Required Products: MATLAB
MATLAB release: MATLAB 7.12 (R2011a)
Other requirements: JDK 6 or above


Algorithms for imbalanced multi-class classification in MATLAB?

Answer by Ilya on 13 Oct 2012
 Accepted answer

I described approaches for learning on imbalanced data here: http://www.mathworks.com/matlabcentral/answers/11549-leraning-classification-with-most-training-samples-in-one-category This advice is applicable to any number of classes.

If you have the Statistics Toolbox in R2012b, I recommend the RUSBoost algorithm available from the fitensemble function. It is described here http://www.mathworks.com/help/stats/ensemble-methods.html#btfwpd3 and an example is shown here http://www.mathworks.com/help/stats/ensemble-methods.html#btgw1m1

  4 Comments

Ilya on 14 Oct 2012

I'll make sure to pass your joy to the doc writer who worked on that page.

RUSBoost undersamples the majority class(es) for every weak learner in the ensemble (most usually a decision tree). For example, if the majority class has 10 times as many observations as the minority class, it is undersampled to 1/10 of its size. If the ensemble has, say, 100 trees, every observation in the majority class is used 100/10 = 10 times by the ensemble on average. Every observation in the minority class is used 100 times, once for every tree.

The MATLAB implementation follows the paper by Seiffert et al. If you are not certain about a specific detail, post your question to Answers or call our Tech Support.

Take a look at the doc for the fitensemble function: http://www.mathworks.com/help/stats/fitensemble.html If you scroll down the somewhat lengthy list of input arguments, you will come to the description of the RatioToSmallest parameter. By default, fitensemble counts the number of observations in the smallest class and samples that many observations from every class.
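
For a concrete picture, a minimal call might look like this (the data here is made up for the example; the name-value pairs are the documented fitensemble ones):

    % Hypothetical imbalanced 3-class data
    X = [randn(1000, 2); randn(100, 2) + 2; randn(20, 2) - 2];
    Y = [ones(1000, 1); 2 * ones(100, 1); 3 * ones(20, 1)];

    % RUSBoost ensemble of 100 trees; every class is sampled at the size
    % of the smallest class ('RatioToSmallest', [1 1 1], the default)
    ens = fitensemble(X, Y, 'RUSBoost', 100, 'Tree', ...
                      'LearnRate', 0.1, 'RatioToSmallest', [1 1 1]);

    % Stratified (by default) 5-fold cross-validation of the ensemble
    cv = crossval(ens, 'KFold', 5);
    loss = kfoldLoss(cv)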

Assigning a large misclassification cost to a class tells fitensemble that misclassifying this class is penalized more heavily than misclassifying other classes, nothing less and nothing more. This shifts the decision boundaries away from this class toward the other classes, so fewer observations of this class and more observations of the other classes are misclassified.

It's OK to skew your data, making it not representative of the real world, if it gives you a better confusion matrix for the classes that you care to classify correctly. If you assign a uniform prior, the accuracy for the rare classes will likely improve and the accuracy for the popular classes will likely go down.

The page for fitensemble describes cross-validation parameters you can pass to this function. In addition, every object returned by fitensemble has a crossval method. For classification, cross-validation is stratified by default.

I typed 'cross-validate ensemble' in the online doc search box, and the 2nd hit was this page http://www.mathworks.com/help/stats/classificationensemble.crossval.html There is a short example at the bottom. Does this suffice?

Carlos Andrade on 14 Oct 2012

Hi Ilya,

Yes, those were all great answers; thanks for covering all my questions. Please forward the compliment, it is well deserved. I learn a lot from these answers about things that are sometimes overcomplicated in my textbooks.

I did not know cross-validation in classification was stratified by default; this just makes me happier :-)

I have one more question and one last concern.

The last question is about the paper you pointed me to. The paper itself refers to binary imbalanced problems. To my understanding, you and also the documentation of the method suggest it can be used with either 2 classes or more. From your comment, I understood that this is what some authors have been calling a one-versus-all approach. What I mean by one-versus-all is that the algorithm creates a weak classifier that only sees two classes: one is the positive class, and all the remaining classes are considered negative and grouped as a single class. So we would have k classifiers, where k is the number of classes in the dataset I want to predict. The final class label would be judged based on the agreement of all the weak classifiers of the ensemble. Is that so? I just want to make sure I am following how MATLAB extended binary classification to multi-class classification, and whether it is an approach I have already seen (but only in theory).

I have one last concern, regarding licensing; since it relates to this problem I will post it here, I hope that is not a problem:

I have a package associated with my institution (Stevens Institute of Technology), which is currently on MATLAB R2012a. What is my best option for obtaining this algorithm? Is it possible to buy only a toolbox and plug it into my institution's MATLAB R2012a, or, since this is a student version, must I buy a completely separate R2012b installation? In any case, what minimum licenses would I need to be able to run RUSBoost? MATLAB R2012b plus the Statistics Toolbox? And lastly, is it possible to run a trial version of this algorithm to see how it behaves with our datasets, if requested by an institution or professor from academia?

Thank you,

Carlos

Ilya on 14 Oct 2012

RUSBoost uses the AdaBoost.M2 algorithm underneath. This is a multiclass algorithm proposed by Freund and Schapire; it is not reducible to a one-vs-all strategy. I don't remember a published reference off the top of my head, but a Google search finds this: http://users.eecs.northwestern.edu/~yingwu/teaching/EECS510/Reading/Freund_ICML96.pdf An observation is assigned to the class with the largest score.
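
In code terms, that last point corresponds to the documented two-output form of predict (ens is a fitensemble model as in the earlier example; Xnew is hypothetical test data):

    Xnew = randn(5, 2);                   % hypothetical test data
    [label, score] = predict(ens, Xnew);  % score: one column per class
    [~, best] = max(score, [], 2);        % label matches the max-score class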

You need the Statistics Toolbox in R2012b. For licensing and trial questions, please call our customer support.


Walter Roberson
Answer by Walter Roberson on 13 Oct 2012

Usually multi-class problems are handled by doing pairwise discrimination: class 1 vs. everything else, to pull out class 1; then take the "everything else" and run it against class 2 to get class 2 and a new "everything else"; and so on.

You can find the algorithms for multi-class SVM (e.g.), but the papers warn that it is computationally very expensive even just for 3 classes.
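
A minimal sketch of that peel-one-class-off-at-a-time strategy might look like the following (fitcsvm here is a modern stand-in binary learner, not something Walter specified; all names and data are illustrative):

    % Hypothetical 3-class data
    X = [randn(50, 2); randn(50, 2) + 3; randn(50, 2) - 3];
    Y = [ones(50, 1); 2 * ones(50, 1); 3 * ones(50, 1)];

    % Train one binary classifier per class, removing each class from the
    % pool once it has been "pulled out"
    classes = unique(Y);
    models = cell(numel(classes) - 1, 1);
    Xrem = X; Yrem = Y;
    for c = 1:numel(classes) - 1
        yb = Yrem == classes(c);            % current class vs. everything else
        models{c} = fitcsvm(Xrem, double(yb));
        Xrem = Xrem(~yb, :);                % keep only "everything else"
        Yrem = Yrem(~yb);
    end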

  1 Comment

Carlos Andrade on 13 Oct 2012

Hi Walter,

Thank you for your reply. By pairwise, are you referring to what they call the one-versus-all approach? I found some papers on it, especially on doing this together with AdaBoost and ensemble methods, but I only found one implementation, in R. That implementation requires splitting the data, whereas I found MATLAB's stratified k-fold to be more appropriate for validation in such a case. Could you point out any MATLAB implementation of this that already takes the ensemble method into account in the algorithm? The only ones I have found so far do not address it as a multi-class problem.

Thank you,

Carlos

