Quora - In classification, how do you handle an unbalanced training set?

22 Answers
Sergey Feldman
machine learning PhD & consultant @ www.data-cowboys.com

the ones listed above/below are great! here are a few more:

1) let's say you have L times more data from the abundant class than from the rare class. for stochastic gradient descent, take L separate gradient steps each time you encounter training data from the rare class.
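a rough sketch of (1), using scikit-learn's SGDClassifier.partial_fit on a toy synthetic stream (the data, the value of L, and all parameters here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Toy imbalanced stream: class 1 is rare, roughly 1-in-10, so L = 9.
X = rng.randn(500, 5)
y = (rng.rand(500) < 0.1).astype(int)
X[y == 1] += 2.0  # shift the rare class so it is learnable
L = 9

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])
for xi, yi in zip(X, y):
    # take L separate gradient steps whenever the sample is from the rare class
    steps = L if yi == 1 else 1
    for _ in range(steps):
        clf.partial_fit(xi.reshape(1, -1), [yi], classes=classes)
```

the repeated partial_fit calls are equivalent to weighting the rare-class loss by L, without ever materializing duplicated data.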

2) divide the more abundant class into L distinct clusters. then train L predictors, where each predictor is trained on only one of the distinct clusters, but on all of the data from the rare class. to be clear, the data from the rare class is used in the training of all L predictors. finally, use model averaging for the L learned predictors as your final predictor.
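idea (2) might look something like this, with KMeans and logistic regression as stand-ins for the clusterer and base predictor (toy data; L, the clusterer, and the model are all assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Toy data: 300 abundant (class 0) vs 30 rare (class 1) samples, so L = 10.
X_abun = rng.randn(300, 2)
X_rare = rng.randn(30, 2) + 3.0
L = 10

# Split the abundant class into L clusters.
km = KMeans(n_clusters=L, n_init=10, random_state=0).fit(X_abun)

# Train L predictors: each sees one abundant cluster plus ALL rare data.
models = []
for c in range(L):
    Xc = np.vstack([X_abun[km.labels_ == c], X_rare])
    yc = np.concatenate([np.zeros((km.labels_ == c).sum()), np.ones(len(X_rare))])
    models.append(LogisticRegression().fit(Xc, yc))

def predict_avg(Xq):
    # model averaging: mean of the L predicted probabilities
    p = np.mean([m.predict_proba(Xq)[:, 1] for m in models], axis=0)
    return (p > 0.5).astype(int)
```

each of the L training sets is roughly balanced, and the averaging recovers a single predictor at the end.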

3) this is similar to Kripa's number (2), but a little different. let N be the number of samples in the rare class. cluster the abundant class into N clusters (agglomerative clustering may be best here), and use the resulting cluster medoids as the training data for the abundant class. to be clear, you throw out the original training data from the abundant class and use the medoids instead. voila, now your classes are balanced!
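a sketch of (3), using KMeans for convenience instead of agglomerative clustering, and taking the abundant-class point nearest each centroid as the medoid (all data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X_abun = rng.randn(500, 2)        # abundant class
X_rare = rng.randn(20, 2) + 3.0   # rare class, N = 20 samples

N = len(X_rare)
km = KMeans(n_clusters=N, n_init=10, random_state=0).fit(X_abun)

# Medoid of each cluster: the actual abundant-class point closest to its
# centroid.  The rest of the abundant data is thrown away.
medoids = np.array([
    X_abun[km.labels_ == c][
        np.argmin(np.linalg.norm(
            X_abun[km.labels_ == c] - km.cluster_centers_[c], axis=1))
    ]
    for c in range(N)
])

X_bal = np.vstack([medoids, X_rare])
y_bal = np.concatenate([np.zeros(N), np.ones(N)])
```

unlike centroids, medoids are real data points, which matters if your features are not continuous.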

4) whatever method you use will help in some ways, but hurt in others. to mitigate that, you could train a separate model using all of the methods listed on this page, and then perform model averaging over all of them!

5) A recent ICML paper (similar to Kripa's (1)) shows that corrupting training examples with noise from known distributions can actually improve performance. The paper isn't totally relevant to the problem of unbalanced classes because they add the data implicitly with math (that is, the dataset size remains unchanged). But I think the table of corrupting distributions in the paper is useful if you want to implement your own surrogate-data code.

More details than you need: imho, the most interesting of the corrupting distributions is the blankout distribution, where you just zero out a random subset of features. Why is it interesting? Because you are helping your classifier be sturdier/hardier by giving it variations of your data that have essentially missing features. So it has to learn to classify correctly even in adverse conditions.
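a blankout corruption like the one described might be sketched as follows (the corruption rate p=0.3 and the toy data are arbitrary choices):

```python
import numpy as np

rng = np.random.RandomState(0)

def blankout(X, p, rng):
    """Zero out each feature of each sample independently with probability p."""
    mask = rng.rand(*X.shape) >= p
    return X * mask

X = rng.randn(100, 10)
# Augment the training set with corrupted copies: the classifier now also
# sees versions of the data with essentially missing features.
X_aug = np.vstack([X, blankout(X, p=0.3, rng=rng)])
```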

A related idea is dropout in neural networks, where random hidden units are removed (zeroed-out) during training. This forces the NN to, again, be hardier than it would be otherwise. See here for a recent treatment: http://www.stanford.edu/~sidaw/c...

Here’s a nice package that does a lot of this stuff and is compatible with the scikit-learn API: scikit-learn-contrib/imbalanced-learn

Kripa Chettiar
NLP and ML practitioner; Personalization @ Amazon - Music Recommendations
You could do either of the following:

  1. Generate synthetic data using SMOTE or SMOTEBoost. [http://wiki.pentaho.com/display/... ] This should give you good results in most cases.
  2. Decompose your larger class into a smaller number of other classes. This is tricky and totally dependent on the kind of data you have. For something like 30 instances of A vs 4000 instances of B, you could decompose the 4000 instances of B into 1000 instances of B1 and 3000 instances of unknown. Effectively you are reducing the difference in the number of instances between A and B1.
  3. A medley of 1 and 2 could also work.
  4. In the worst case, use a One-Class Classifier. Here you treat the smaller class as outliers and confidently learn the decision boundary for ONLY the larger class. Weka provides a one-class classifier library. [http://webcache.googleuserconten...
  5. Get more data!
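For a feel of what SMOTE-style generation in point 1 does, here is a minimal nearest-neighbor interpolation sketch (not the full SMOTE algorithm; the data and parameters are toy assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    each sample and a randomly chosen one of its k nearest minority neighbors."""
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the point itself
    out = []
    for _ in range(n_new):
        i = rng.randint(len(X_min))
        j = idx[i, rng.randint(1, k + 1)]  # a random true neighbor
        lam = rng.rand()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.RandomState(1).randn(30, 4)  # 30 minority samples
X_synth = smote_like(X_min, n_new=100)
```

since each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.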
Roar Nybø
Using ML for diagnostics in offshore drilling
Written Sep 4, 2011 · Upvoted by Abhinav Maurya, PhD Student Researcher (Machine Learning, Data Mining) at Carnegie Mellon
This was also discussed in What's a good approach to binary classification when the target rate is minimal?, where Charles H Martin and I both recommended resampling the unbalanced training set into not one balanced set, but several. Running an ensemble of classifiers on these sets can produce a much better result than a single classifier alone.
Dayvid Victor
Machine Learning Researcher | PhD candidate in Computer Science | Data Scientist
Updated Jan 22 · Upvoted by Shehroz Khan, ML Researcher, Postdoc @U of Toronto

There are a few notes and suggestions:

  1. Regular classification rate (classification accuracy) isn't a good metric, because if you correctly classify only the instances of the majority class (the class with many samples), this metric still gives you a high rate.
    The Area Under the ROC Curve (AUC) is a good metric for evaluating classifiers on such datasets.
  2. You can increase the number of minority class samples by:
    1. Resampling: bootstrapping samples of the minority class.
    2. Oversampling: generating new samples of the minority class; for this, I'd recommend using SMOTE, SPIDER or any variant. You can also use Prototype Generation (PG) to generate new samples of the minority class - there are specific PG techniques for imbalanced datasets, such as ASGP and EASGP.
  3. You can reduce the number of majority class samples by:
    1. Random Undersampling.
    2. Prototype Selection (PS) to reduce the imbalance level, such as One-Sided Selection (OSS). Or you can use Tomek Links, Edited Nearest Neighbors (ENN) and others, but only to remove the majority class outliers.
  4. In your K-fold validation, try to keep the same proportion between the classes. If the number of instances of the minority class is too low, you can reduce K until there are enough.
  5. Use Multiple Classifier Systems (MCS):
    1. Ensemble learning has been proposed as an interesting solution for learning from imbalanced data.

Also, be careful with the techniques/algorithms you will use. In prototype generation, for example, there are techniques that have a high performance on regular datasets, but if you use them with unbalanced datasets, they will misclassify most (or all) instances of the minority class.

In cases like that, search for a similar algorithm, adapted to handle unbalanced datasets, or adapt it yourself. If you get good results, you can even publish it (I've done that).
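Points 1 and 4 above together might look like this in scikit-learn: stratified folds that preserve the class proportions, scored with AUC rather than accuracy (the data here is a toy synthetic set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# 5% minority class: the trivial "always majority" model gets 95% accuracy,
# but its AUC is only 0.5, which is why AUC is the better yardstick here.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

aucs = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
```

StratifiedKFold keeps the same class proportion in every fold, so each test fold contains minority samples even at heavy imbalance.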

Ian Vala
Senior Data Scientist working in Silicon Valley. Harvard grad.

Haha, you know what's funny? You get 90% accuracy for your model and you are like “awesome!” until you find out, well, 90% of the data was all in one class, lol.

This is actually a very good interview question, and what you are referring to is called “imbalanced data”. It is a very common problem when you get a real dataset. For example, you get cancer patient data and are told to predict whether a person has cancer or not. Great! Making the world a better place, and making a 6-figure salary! You are excited and get the dataset: it's 98% no cancer and 2% cancer.


Lo and behold, thankfully there are some solutions for this:

  1. Resample differently. Over-sample from your minority class and under-sample from your majority class, so you get a more balanced dataset.
  2. Try metrics other than correct vs. wrong prediction, such as the Confusion Matrix or the ROC curve. Accuracy can be broken down into sensitivity and specificity, and models can be chosen based on the balance between them.
  3. Use Penalized Models, like penalized-SVM and penalized-LDA. They put additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model towards paying more attention to the minority class.
  4. Try Anomaly Detection techniques and the models often used there, though that would probably only be necessary if your data were even more imbalanced.
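A minimal sketch of point 3 with scikit-learn's SVC, where class_weight scales the misclassification penalty per class (the toy data and the 'balanced' setting are just for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# 190 majority vs 10 minority samples.
X = np.vstack([rng.randn(190, 2), rng.randn(10, 2) + 1.5])
y = np.array([0] * 190 + [1] * 10)

# class_weight multiplies the penalty parameter C per class;
# 'balanced' uses weights inversely proportional to class frequencies,
# so mistakes on the minority class cost more during training.
plain = SVC().fit(X, y)
penalized = SVC(class_weight="balanced").fit(X, y)
```

the penalized model typically trades some majority-class accuracy for better minority-class recall.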

Still not satisfied? Here’s the book on this: Imbalanced Learning: Foundations, Algorithms, and Applications: Haibo He, Yunqian Ma: 9781118074626: Amazon.com: Books

Yevgeny Popkov
Data Scientist @ CQuotient
These two alternative approaches are typically used:
1) Quick&dirty approach: balance the data set by either under-sampling the majority class(es) or over-sampling the minority class(es).
2) A preferred approach: use cost-sensitive learning by applying class-specific weights in your loss function (smaller weights for the majority class cases and larger weights for the minority class cases). For starters, the weights can be set to be inversely proportional to the fraction of cases of the corresponding class. In the case of binary classification, the weight of one of the classes can be further tuned using resampling to squeeze even more predictive power out of the model.
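In scikit-learn, such inverse-frequency weights can be computed directly; a small sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # 90/10 imbalance
X = np.random.RandomState(0).randn(100, 3)
X[y == 1] += 2.0

# 'balanced' weights are inversely proportional to class frequency:
# n_samples / (n_classes * count(class)) -> {0: 100/180, 1: 100/20}
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
clf = LogisticRegression(class_weight={0: w[0], 1: w[1]}).fit(X, y)
```

passing class_weight="balanced" straight to the estimator does the same thing; computing the weights explicitly makes them available for further tuning, as suggested above.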
Muktabh Mayank
Data Scientist, CoFounder @ ParallelDots, BITSian for life, love new technolog...
Here is a recent paper which addresses the same problem. It uses a method similar to SVM.
I am working on an implementation and will open-source it soon. Page on hal.inria.fr

There is a nice method to cope with an unbalanced data set that has a theoretical justification.
The method is based on the boosting algorithm Robert E. Schapire presented in "The strength of weak learnability" (Machine Learning, 5(2):197–227, 1990. http://rob.schapire.net/papers/s... ).

In this paper Schapire presented a boosting algorithm based upon combining triplets of weak learners recursively. Incidentally, this was the first boosting algorithm.

We can use the first step of the algorithm (even without the recursion) to cope with the lack of balance.

The algorithm trains the first learner, L1, on the original data set.
The second learner, L2, is trained on a set on which L1 has a 50% chance of being correct (obtained by sampling from the original distribution).
The third learner, L3, is trained on the cases on which L1 and L2 disagree.
As output, return the majority vote of the three classifiers.
See the paper to see why it improves the classification.

Now for the application of the method to an imbalanced set.
Assume the concept is binary and the majority of the samples are classified as true.

Let L1 always return true.
L2 is trained on cases where L1 has a 50% chance of being right. Since L1 always predicts true, L2 is trained on a balanced data set.
L3 is trained on the cases where L1 and L2 disagree, hence, where L2 predicts false.
The ensemble predicts by majority vote, hence it predicts false only when both L2 and L3 predict false.

I used this method in practice many times and it is very useful.
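A rough sketch of this scheme on synthetic data, with decision trees standing in as the weak learners L2 and L3 (all modeling choices here are assumptions; L1 is the constant "true" classifier and never needs fitting):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Majority class is "true" (1); minority is "false" (0).
X = np.vstack([rng.randn(900, 2) + 1.0, rng.randn(100, 2) - 1.0])
y = np.array([1] * 900 + [0] * 100)

# L1 always predicts true, so it is right exactly on majority samples.
# L2: trained on a balanced sample, where L1 is right ~50% of the time.
idx_min = np.where(y == 0)[0]
idx_maj = rng.choice(np.where(y == 1)[0], size=len(idx_min), replace=False)
bal = np.concatenate([idx_min, idx_maj])
L2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[bal], y[bal])

# L3: trained on the cases where L1 and L2 disagree, i.e. where L2 says false.
dis = L2.predict(X) == 0
L3 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[dis], y[dis])

def predict(Xq):
    # Majority vote of (always-true, L2, L3):
    # false only when both L2 and L3 predict false.
    p2, p3 = L2.predict(Xq), L3.predict(Xq)
    return np.where((p2 == 0) & (p3 == 0), 0, 1)
```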

Kaushik Kasi
(Data Science && Bitcoin) Enthusiast
Here are some options
  • Replicate data points for the lagging class and add those to the training data (might increase bias)
  • Artificially generate more data points by manually tweaking features. This is sometimes done with systems involving images, where an existing image can be distorted to create a new image. You run the risk of training the model on (artificial) samples that aren't representative of the test samples the model would encounter in the real world.
  • Treat the multi-class classifier as a binary classifier for the larger sample. Define the larger class as positive and the smaller class as negative. This trains your model distinctly on what the larger class looks like and, in theory, classifies the smaller class as "negatives". Your performance will depend on how many features your samples have and how tolerant your model is to overfitting.
Here is my opinion on this. Please take it with a grain of salt, since the answer might differ with different applications.
  • Gotcha #1: Do not use accuracy alone as a metric. Otherwise, you could get 99% accuracy with everything classified as the majority class, which would not mean anything. Precision & recall with respect to each class might be a better choice.
  • If you are more interested in the classification of the minority class, you can use a cost-sensitive classifier (http://weka.wikispaces.com/CostS...) through which you can state the cost of misclassifying the different classes. E.g., misclassifying the minority might be considered costlier.
  • You might want to boost the number of minority class training examples by artificially creating new samples from the existing ones.
  • Simplest of all, you could also just resample the set to have a proportional number of samples in both classes, if that is an option.
Shehroz Khan
ML Researcher, Postdoc @U of Toronto

To handle skewed data, you can employ different strategies, such as

  • Over sampling the minority class or under sampling the majority class
  • Cost sensitive classification
  • One-class Classification

You can read my detailed answer to a similar question here - Shehroz Khan's answer to I have an imbalanced dataset with two classes. Would it be considered OK if I oversample the minority class and also change the costs of misclassification on the training set to create the model?

Feng Qi (奇峰)
Software Engineer, Quora
Written Nov 19, 2014 · Upvoted by Lili Jiang, Data Scientist at Quora
1. assign higher weights to training data from the rare classes
2. up-sample the rare class
3. sometimes highly imbalanced data can still give good training results; I think this makes sense for some metrics, for example AUC.
Yiyao Liu
improving every day

In this article: Practical Guide to deal with Imbalanced Classification Problems in R, four methods are covered with example cases, which are clear and helpful:

  1. Undersampling
  2. Oversampling
  3. Synthetic Data Generation
  4. Cost Sensitive Learning
Adding to what Krishnakumar said, and specifically the third point of generating artificial or synthetic samples of the minority class to maintain balance, check out this paper on SMOTE (Synthetic Minority Oversampling TEchnique) - http://arxiv.org/pdf/1106.1813.pdf. It uses a nearest-neighbor approach to generate synthetic samples of the minority class. In all probability, there will be a lot of noise in the artificial samples generated.

I did have quite a bit of success in using this approach to classify credit card transactions as fraudulent.
Aayush
Data Science Intern

I’d recommend three ways to solve the problem, each has (basically) been derived from Chapter 16: Remedies for Severe Class Imbalance of Applied Predictive Modeling by Max Kuhn and Kjell Johnson.

  1. Random Forests w/ SMOTE Boosting: Use a hybrid SMOTE that undersamples the majority class and generates synthetic samples for the minority class by adjustable percentages. Select these percentages depending on the distribution of your response variable in the training set. Feed this data to your RF model. Always cross-validate/perform grid-search to find the best parameter settings for your RFs.
  2. XGBoost w/ hyper-parameter optimisation: Again, cross-validate or perform grid-search to find the best parameter settings for the model. I found this post extremely useful in my case. Additionally, xgboost offers parameters to balance positive and negative weights using scale_pos_weight. See the parameter documentation for a complete list.
  3. Support Vector Machines w/ Cost Sensitive Training:
    - SVMs allow for some degree of mis-classification of data and then repeatedly optimizes a loss function to find the required “best fit” model.
    - It controls the complexity using a cost function that increases the penalty if samples are on the incorrect side of the current class boundary. 
    - For class imbalances, unequal costs for each class can adjust the parameters to increase or decrease the sensitivity of the model to particular classes.

I have used methods one and two and have been able to obtain an AUC of over 0.8 on the test data-set, and the data-set I was working on had very serious class imbalance with the minority class making up as little as 0.26% of the data-set.

Some important advice that I received and I’d like to give to anyone dealing with severely skewed classes is that:

  • Understand your data. Know what your requirements are and what type of classification matters more to you and which classification are you willing to trade-off.
  • Do not use % accuracy as a metric, use AUC, Sensitivity-Specificity and Cohen’s Kappa scores instead. The fundamental problem with using accuracy as a metric is that your model can simply classify everything into the majority class and give you a very high accuracy which is definitely one of the biggest “gotchas”, as mentioned in the question.
  • Run a lot of tests on multiple models. Intuition can take you a long way in data-science - if your gut tells you that an ensemble of classifiers will give you the best results, go ahead and try it.

On a closing note, I’d like to say here that XGBoost when tuned correctly has rarely ever disappointed anyone, but that shouldn’t stop you from experimenting!

Todd Jamison
Imagery Scientist, develop machine learning algorithms for earth observation

In our software, we use an accuracy metric that balances the Producer’s and User’s accuracy and applies a larger penalty for higher levels of error. It can accommodate any number of classes, not just binary problems. Our accuracy function equation is:

Accuracy = 1 - (1/(2N)) * sum_n [ sqrt( (1/Mn) * sum_i (1 - pni)^2 ) + sqrt( (1/Rn) * sum_i (1 - rni)^2 ) ], where:
N = the number of classes,

Mn = the number of data points in each class,

Rn = the number of data points assigned to the class by the solution,

pni = the classification results for the ith training sample in class n, and

rni = the classification results for the ith training sample assigned to class n.

There are two items within the primary summation. The first item is related to the Producer’s Accuracy and the second is related to the User’s Accuracy. In this formula, we calculate the Producer’s and User’s error rates for each class instead of the accuracy rates by inverting the results values (i.e., (1-pni) and (1-rni) respectively). The error rates in the range 0 to 1 are root-mean-squared, in order to emphasize larger error rates. The error metric is then converted back to an accuracy metric by subtracting the result from 1. If you are using it as a cost function, you don’t do the final subtraction.

This metric should result in solutions that have a more balanced level of accuracy across classes and between Producer and User perspectives.

If you are looking to use a Support Vector Machine for classification, you may consider the Twin SVM, which is suited to unbalanced datasets.
Twin Support Vector Machines for Pattern Classification
The objective is to find two separating hyperplanes, each of which is closer to the points of one class and farther from the points of the other. Predictions on test points are computed based on the distances from the two hyperplanes.
Nigel Legg
Discovering meaning in your data.
This is actually something I'm thinking about right now. I have a data file of 49,000 records that need to be pushed through the classifier. We have two classifications, Yes and No. From experience with a smaller data file, we will get >99% No and <1% Yes, but the Yes classification is the important one. So I have manually classified 200 records for the training set (giving 2x Yes, 198x No), and this is currently queued to be classified.
I'm assuming that this will give some odd results - there aren't enough examples of the 'Yes' classification. I think a better training set would be a more balanced one, in spite of the imbalance in the test set. Thus a better approach would be to hunt out the 'Yes' records to ensure that there are >50 classified in the training set. The danger with this, however, is that you could end up classifying all the 'Yes' records by hand, leaving nothing for the classifier to do. ;-)
Alex Gilgur
Data Scientist; IT Analyst; occasional lecturer at UC Berkeley's MIDS program.

I think you are referring to an imbalanced, rather than skewed, dataset. Regardless, it is a big problem in classification, because as you pointed out, data are rarely balanced. One way to solve it is, when training your classifier, to sample separately from the positive and the negative sets and combine them 9:1 when done: it's all about conditional probabilities.

Abhishek Ghose
Data Scientist @[24]7
There are a bunch of techniques to do this. I had mentioned some of them in this answer: Abhishek Ghose's answer to What's the most efficient classification algorithm for unbalanced data sets? And what pre-processing could be done to optimize the score?
Have a look at down-/up-sampling methods and hybrid ones like SMOTE.

You can also play with the threshold (default=0.5) in order to reach a specific precision or specificity requirement, but it's more artificial.
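A quick sketch of that threshold tuning on toy data: lowering the threshold below the 0.5 default can only increase minority-class recall, at the cost of precision:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Sweep the decision threshold downward and track minority-class recall.
recalls = []
for t in (0.5, 0.3, 0.1):
    pred = (proba >= t).astype(int)
    recalls.append((pred[y == 1] == 1).mean())
```

in practice the threshold should be chosen on a validation set, not the training data as in this sketch.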

NB: if your classes are not well balanced, don't choose accuracy as your evaluation metric (e.g. with Y={0: 90%, 1: 10%}, the silly predictor Y_hat=0 gets an accuracy of 90%); one could prefer the AUC (Area Under the Curve).

A good read on this topic is: Learning from Imbalanced Classes

There's an excellent paper [1] on this problem from 2009 that presents a number of solutions, many of which have already been mentioned. Some of the techniques are very simple, some are rather complicated. Some have been implemented in various places, others have not. There's way too much detail in the paper to fully summarize it here, but I can list the different techniques and algorithms mentioned, which can aid your googling or motivate you to read the paper. :) This is still a massive info-dump, so ignore at your leisure.

The first set of solutions are sampling based, which may or may not help, depending on the problem. Some of these were mentioned by trimeta, ashwhat, and ieeaaauuuuooooo. The different sampling methods mentioned are:
* Random oversampling and undersampling
* Informed undersampling (EasyEnsemble and BalanceCascade)
* Synthetic sampling with data generation (SMOTE)
* Adaptive Synthetic Sampling (Borderline-SMOTE and ADA-SYN)
* Sampling with data cleaning techniques. "Some representative work in this area includes the OSS method [42], the condensed nearest neighbor rule and Tomek Links (CNN+Tomek Links) integration method [22], the neighborhood cleaning rule (NCL) [36] based on the edited nearest neighbor (ENN) rule—which removes examples that differ from two of its three nearest neighbors—and the integrations of SMOTE with ENN (SMOTE+ENN) and SMOTE with Tomek links (SMOTE+Tomek)."
* Cluster-based sampling method (cluster-based oversampling - CBO)
* Integration of sampling and boosting (SMOTEBoost, DataBoost-IM)
* Over / undersampling with jittering (JOUS-Boost)

The next set of solutions are cost sensitive methods, mentioned by trimeta and ashwat. "There are many different ways of implementing cost-sensitive learning, but, in general, the majority of techniques fall under three categories. The first class of techniques apply misclassification costs to the data set as a form of dataspace weighting; these techniques are essentially cost-sensitive bootstrap sampling approaches where misclassification costs are used to select the best training distribution for induction. The second class applies cost-minimizing techniques to the combination schemes of ensemble methods; this class consists of various Metatechniques where standard learning algorithms are integrated with ensemble methods to develop cost-sensitive classifiers. ... The last class of techniques incorporates cost-sensitive functions or features directly into classification paradigms to essentially “fit” the cost-sensitive framework into these classifiers." The techniques mentioned are:
* Cost-Sensitive Dataspace Weighting with Adaptive Boosting (AdaC1, AdaC2, and AdaC3)
* Cost-Sensitive Decision Trees "In regards to decision trees, cost-sensitive fitting can take three forms: first, cost-sensitive adjustments can be applied to the decision threshold; second, cost-sensitive considerations can be given to the split criteria at each node; and lastly, cost-sensitive pruning schemes can be applied to the tree."
* Cost-Sensitive Neural Networks
* Cost-Sensitive Bayesian Classifiers
* Cost-Sensitive SVMs

The next technique is Kernel-Based Methods and Active Learning Methods for Imbalanced Learning. I don't think anyone has mentioned these yet, although the kernel based methods are in some sense cost-based learning.
* Kernel-Based Learning Framework
* Integration of Kernel Methods with Sampling Methods (SMOTE with Different Costs (SDCs), ensembles of over/undersampled SVMs, Granular Support Vector Machines—Repetitive Undersampling (GSVM-RU))
* Kernel Modification Methods for Imbalanced Learning
* Active Learning Methods for Imbalanced Learning (LASVM, simple active learning heuristic (SALH))

Additional Methods for Imbalanced Learning.
* One Class SVMs
* Autoassociator (or Autoencoder) Method
* The Mahalanobis-Taguchi System (MTS)

Assessment Metrics For Imbalanced Learning - These are metrics to evaluate performance on imbalanced datasets that are superior to accuracy. trimeta, strayadvice, sculler, and duckandcover mention some of these.
* Singular Assessment Metrics (Precision, recall, F-Measure, G-mean)
* Receiver Operating Characteristic (ROC) Curves
* Precision-Recall (PR) Curves "Although ROC curves provide powerful methods to visualize performance evaluation, they also have their own limitations. In the case of highly skewed data sets, it is observed that the ROC curve may provide an overly optimistic view of an algorithm’s performance. Under such situations, the PR curves can provide a more informative representation of performance assessment"
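Both curve summaries are available in scikit-learn; a short sketch on toy skewed data (average_precision_score summarizes the PR curve the way roc_auc_score summarizes the ROC curve):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score

# Heavily skewed toy problem: 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

roc_auc = roc_auc_score(y, scores)           # can look optimistic under heavy skew
pr_auc = average_precision_score(y, scores)  # area under the PR curve
```

comparing the two numbers on a skewed problem like this illustrates the paper's point: the ROC summary can stay high while the PR summary reveals how hard the minority class actually is.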

  1. Haibo He, Edwardo A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, pp. 1263-1284, September, 2009