Reddit - Classification when 80% of my training set is of one class. (self.MachineLearning)

This problem is called class imbalance, and occurs in a number of different machine learning domains. As jhinka states, bagging and boosting can be used to improve classification accuracy, although they are not specifically designed to deal with imbalanced data (they're for hard-to-classify data in general). The same applies to Random Forest, which was JeradDS's suggestion.

ashwhat and (to some extent) ieeaaauuuuooooo get at the real solution, though: sampling. Both oversampling (adding instances to the minority class) and undersampling (removing instances from the majority class) can be employed, and either can be done randomly or in a directed fashion (e.g., the SMOTE algorithm, which oversamples by generating new minority-class instances rather than simply duplicating existing ones, or Wilson's Editing, which removes majority-class instances that lie too close to the majority-minority class boundary). These can also be applied in conjunction with one another, or with the aforementioned approaches for improving weak classifiers.
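As a minimal pure-Python sketch of the random variants (function names are my own; in practice you'd reach for a library like imbalanced-learn):

```python
import random

def random_oversample(data, labels, minority_label, seed=0):
    """Duplicate random minority-class instances until classes are balanced."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(data, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(data, labels) if y != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = minority + majority + extra
    rng.shuffle(combined)
    xs, ys = zip(*combined)
    return list(xs), list(ys)

def random_undersample(data, labels, minority_label, seed=0):
    """Discard random majority-class instances until classes are balanced."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(data, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(data, labels) if y != minority_label]
    kept = rng.sample(majority, len(minority))
    combined = minority + kept
    rng.shuffle(combined)
    xs, ys = zip(*combined)
    return list(xs), list(ys)
```

Oversampling keeps all your data but risks overfitting to the duplicated points; undersampling throws information away but trains faster.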

ashwhat also mentioned cost-based learning, in which the costs of false negatives and false positives differ. By training the learner to minimize overall cost rather than misclassification rate, you give it an incentive to produce more true positives. (I'm assuming here that the minority class is the positive class.) Alternatively, you can manually shift the decision threshold to weight your algorithm in favor of the minority class. Both of these must be manually specified, however, and the proper costs and/or threshold values are not always immediately obvious.
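The threshold-shifting idea fits in a couple of lines (a sketch, assuming your learner outputs a score or probability for the positive/minority class):

```python
def predict_with_threshold(scores, threshold):
    """Label an instance positive (minority class) when its score meets the
    threshold; lowering it below the default 0.5 trades extra false
    positives for more minority-class detections."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.9, 0.45, 0.3, 0.6, 0.2]
default = predict_with_threshold(scores, 0.5)   # [1, 0, 0, 1, 0]
lowered = predict_with_threshold(scores, 0.3)   # [1, 1, 1, 1, 0]
```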

One final note about AUC, the Area Under the ROC Curve. While many performance metrics assume the default decision threshold, this is inappropriate for imbalanced data. Often, the default threshold will simply classify everything as majority-class, since that gives the highest overall accuracy. If you want to see the performance of a learner on imbalanced data, you need to use the AUC, which gives performance across the whole range of decision thresholds. To be honest, when you say you've got a 79% success rate on your learner, I'd bet your learner is just saying "label everything as the majority class." If you used AUC, you wouldn't get misleading results like this.
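To make that concrete, here's a toy demonstration (pure Python; the `auc` helper uses the Mann-Whitney view of AUC, i.e., the probability a random positive is scored above a random negative, with ties counting half):

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0] * 79 + [1] * 21          # 79% majority class
lazy_scores = [0.5] * 100             # learner that can't tell classes apart
lazy_preds = [0] * 100                # ...so it labels everything as majority
print(accuracy(y_true, lazy_preds))   # 0.79 -- looks respectable
print(auc(y_true, lazy_scores))       # 0.5  -- no better than a coin flip
```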

EDIT: Adding a TL;DR

TL;DR: If 79% of your instances are in the majority class, and your model is giving you 79% accuracy, there's a very good chance your model is just labeling everything as "majority-class." Consider the AUC of your model to see its real performance. And use Weka's CostSensitiveClassifier learner, under the "meta" classifiers tree, to build your model with different error weights. Note that the "Cost-Sensitive Evaluation" checkbox under the "More Options..." button from the "Test Options" area is a trap; don't use that. Only use the actual CostSensitiveClassifier, and then set the classifier you really want under the settings for that.

[–]datapraxis 7 points 4 years ago* 

There's an excellent paper [1] on this problem from 2009 that presents a number of solutions, many of which have already been mentioned. Some of the techniques are very simple, some are rather complicated. Some have been implemented in various places, others have not. There's way too much detail in the paper to fully summarize it here, but I can list the different techniques and algorithms mentioned, which may aid your googling, or motivate you to read the paper. :) This is still a massive info-dump, so ignore at your leisure.

The first set of solutions are sampling based, which may or may not help, depending on the problem. Some of these were mentioned by trimeta, ashwhat, and ieeaaauuuuooooo. The different sampling methods mentioned are:
* Random oversampling and undersampling
* Informed undersampling (EasyEnsemble and BalanceCascade)
* Synthetic sampling with data generation (SMOTE)
* Adaptive Synthetic Sampling (Borderline-SMOTE and ADA-SYN)
* Sampling with data cleaning techniques. "Some representative work in this area includes the OSS method [42], the condensed nearest neighbor rule and Tomek Links (CNN+Tomek Links) integration method [22], the neighborhood cleaning rule (NCL) [36] based on the edited nearest neighbor (ENN) rule—which removes examples that differ from two of its three nearest neighbors, and the integrations of SMOTE with ENN (SMOTE+ENN) and SMOTE with Tomek links (SMOTE + Tomek)."
* Cluster-based sampling method (cluster-based oversampling - CBO)
* Integration of sampling and boosting (SMOTEBoost, DataBoost-IM)
* Over / undersampling with jittering (JOUS-Boost)
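To give a flavor of the synthetic-sampling entries above, here's a rough SMOTE-style sketch (plain Python, brute-force nearest neighbors; `smote_like` is my own simplification, not the paper's algorithm): each synthetic point is a random interpolation between a minority instance and one of its k nearest minority neighbors.

```python
import random

def smote_like(minority_points, n_new, k=3, seed=0):
    """Generate n_new synthetic minority instances by interpolating
    between existing minority points and their k nearest neighbors."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority_points)
        neighbors = sorted((p for p in minority_points if p is not x),
                           key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # how far along the segment to place the point
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic
```

Because new points lie between existing minority instances, the minority region gets denser without exact duplicates; Borderline-SMOTE and ADASYN refine *which* instances get interpolated.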

The next set of solutions are cost sensitive methods, mentioned by trimeta and ashwhat. "There are many different ways of implementing cost-sensitive learning, but, in general, the majority of techniques fall under three categories. The first class of techniques apply misclassification costs to the data set as a form of dataspace weighting; these techniques are essentially cost-sensitive bootstrap sampling approaches where misclassification costs are used to select the best training distribution for induction. The second class applies cost-minimizing techniques to the combination schemes of ensemble methods; this class consists of various Metatechniques where standard learning algorithms are integrated with ensemble methods to develop cost-sensitive classifiers. ... The last class of techniques incorporates cost-sensitive functions or features directly into classification paradigms to essentially “fit” the cost-sensitive framework into these classifiers." The techniques mentioned are:
* Cost-Sensitive Dataspace Weighting with Adaptive Boosting (AdaC1, AdaC2, and AdaC3)
* Cost-Sensitive Decision Trees "In regards to decision trees, cost-sensitive fitting can take three forms: first, cost-sensitive adjustments can be applied to the decision threshold; second, cost-sensitive considerations can be given to the split criteria at each node; and lastly, cost-sensitive pruning schemes can be applied to the tree."
* Cost-Sensitive Neural Networks
* Cost-Sensitive Bayesian Classifiers
* Cost-Sensitive SVMs
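The common thread in all of these is that the objective being minimized is total cost rather than error count. A tiny sketch of that objective (the cost values here are made-up examples; in practice they come from the domain, e.g., how expensive a missed fraud case really is):

```python
def cost_weighted_error(y_true, y_pred, cost_fn, cost_fp):
    """Total misclassification cost: false negatives (missed minority
    instances) are typically far more expensive than false positives."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += cost_fn   # missed a minority instance
        elif t == 0 and p == 1:
            cost += cost_fp   # false alarm on a majority instance
    return cost

y_true = [0] * 8 + [1] * 2
lazy = [0] * 10                              # "everything is majority"
eager = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]       # catches both positives, 3 FPs
print(cost_weighted_error(y_true, lazy, cost_fn=10, cost_fp=1))   # 20.0
print(cost_weighted_error(y_true, eager, cost_fn=10, cost_fp=1))  # 3.0
```

Under plain accuracy the lazy classifier wins (80% vs. 70%); under asymmetric costs it loses badly, which is exactly the incentive shift cost-sensitive learning exploits.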

The next technique is Kernel-Based Methods and Active Learning Methods for Imbalanced Learning. I don't think anyone has mentioned these yet, although the kernel based methods are in some sense cost-based learning.
* Kernel-Based Learning Framework
* Integration of Kernel Methods with Sampling Methods (SMOTE with Different Costs (SDCs), ensembles of over/undersampled SVMs, Granular Support Vector Machines—Repetitive Undersampling (GSVM-RU))
* Kernel Modification Methods for Imbalanced Learning
* Active Learning Methods for Imbalanced Learning (LASVM, simple active learning heuristic (SALH))

Additional Methods for Imbalanced Learning.
* One Class SVMs
* Autoassociator (or Autoencoder) Method
* The Mahalanobis-Taguchi System (MTS)

Assessment Metrics For Imbalanced Learning - These are metrics to evaluate performance on imbalanced datasets that are superior to accuracy. trimeta, strayadvice, sculler, and duckandcover mention some of these.
* Singular Assessment Metrics (Precision, recall, F-Measure, G-mean)
* Receiver Operating Characteristic (ROC) Curves
* Precision-Recall (PR) Curves "Although ROC curves provide powerful methods to visualize performance evaluation, they also have their own limitations. In the case of highly skewed data sets, it is observed that the ROC curve may provide an overly optimistic view of an algorithm’s performance. Under such situations, the PR curves can provide a more informative representation of performance assessment"
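The singular metrics all fall out of the confusion matrix; a quick sketch (the helper name is mine):

```python
def imbalance_metrics(y_true, y_pred):
    """Precision, recall, F-measure, and G-mean from confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # a.k.a. sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true negative rate
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = (recall * specificity) ** 0.5
    return {"precision": precision, "recall": recall,
            "f1": f1, "g_mean": g_mean}
```

Note that a classifier which labels everything majority-class scores 0 on recall, F1, and G-mean no matter how high its accuracy is, which is precisely why these metrics are preferred here.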

  1. Haibo He and Edwardo A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, pp. 1263–1284, September 2009.

TL;DR: Go read the paper.

EDIT: Formatting, Added link to paper, Added TL;DR

[–]adamashton[S] 1 point 4 years ago 

Thanks, I'm gonna go ahead and read that paper too.

[–]jhinka 3 points 4 years ago 

So, this is a common problem in many machine learning tasks, e.g., credit card fraud detection. To build robust classifiers, people use two main strategies: bagging and boosting. Bagging just says: randomly pick a small balanced sample (say, 2 positive and 2 negative instances) and train a classifier. Now repeat this a large number of times and average your weight vector (or matrix), and boom! you have a robust classifier, provided the statistics of the data generalize from the samples that you have. Remember to keep your training and validation data sets independent.
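That recipe, as a sketch (the toy `centroid_difference` learner and both function names are my own stand-ins; any learner that returns a weight vector would slot in):

```python
import random

def balanced_bag(data, labels, per_class, n_bags, train_fn, seed=0):
    """Repeatedly draw a balanced subset (equal positives and negatives),
    train a model on each, and average the resulting weight vectors."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(data, labels) if y == 1]
    neg = [x for x, y in zip(data, labels) if y == 0]
    averaged = None
    for _ in range(n_bags):
        bag_pos = [rng.choice(pos) for _ in range(per_class)]
        bag_neg = [rng.choice(neg) for _ in range(per_class)]
        w = train_fn(bag_pos, bag_neg)
        averaged = list(w) if averaged is None else [a + wi for a, wi
                                                     in zip(averaged, w)]
    return [a / n_bags for a in averaged]

def centroid_difference(bag_pos, bag_neg):
    """Toy learner: weight vector pointing from the negative-class
    centroid toward the positive-class centroid."""
    dim = len(bag_pos[0])
    cp = [sum(x[i] for x in bag_pos) / len(bag_pos) for i in range(dim)]
    cn = [sum(x[i] for x in bag_neg) / len(bag_neg) for i in range(dim)]
    return [cp[i] - cn[i] for i in range(dim)]
```

The balance comes from forcing each bag to contain equal numbers of each class, so the minority class never gets drowned out within a bag.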

Hope this helps :)

[–]adamashton[S] 2 points 4 years ago 

I'm still unsure about bagging. Why does taking multiple averages of the subset of data do any better than taking an average of the whole data set?

[–]michiexile 2 points 4 years ago 

My guess is that it's because you take balanced subsets.

[–]jhinka 1 point 4 years ago 

Well, the idea is that you don't want to overfit your classifier. This is more likely to happen when you have unbalanced datasets (depending on your choice of classifier). Bagging helps prevent that by taking bite-sized chunks of your data and fitting your classifier to each.

That said, I think SVMs are not very prone to overfitting; you should play around to see what your classifier does.

[–][deleted] 1 point 4 years ago 

You can also use Random Forest to improve the modeling technique.

[–][deleted] 4 years ago 


    [–]duckandcover 1 point 4 years ago* 

    I'm on a phone so my main advice is to look at wiki but in a nutshell, given a classifier that outputs continuous values, take classifier thresholds for some sampling intervals (e.g. 200 evenly spaced points) between its min and max outputs (i.e. across all the test data) and compute the pD (TPR) and pFA (FPR) for each one. Choose the one that gives you the operating point you want and use that threshold at run time.

    Given the threshold sampling points are sorted, you can then go pair by pair and use the trapezoid rule to give a decent integration estimate; i.e. the Area Under the Curve.
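That procedure, sketched in plain Python (200 evenly spaced thresholds as suggested; function names are my own):

```python
def roc_points(y_true, scores, n_thresholds=200):
    """Sweep evenly spaced thresholds between the min and max score and
    record the (FPR, TPR) operating point at each."""
    lo, hi = min(scores), max(scores)
    pos = sum(y_true)
    neg = len(y_true) - pos
    pts = []
    for i in range(n_thresholds + 1):
        thr = lo + (hi - lo) * i / n_thresholds
        tp = sum(s >= thr and y == 1 for y, s in zip(y_true, scores))
        fp = sum(s >= thr and y == 0 for y, s in zip(y_true, scores))
        pts.append((fp / neg, tp / pos))
    return sorted(pts)

def trapezoid_auc(pts):
    """Trapezoid rule over consecutive (FPR, TPR) pairs."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

Any point on the returned curve is a usable run-time threshold; the AUC summarizes the whole curve in one number.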

    [–]ashwhat 2 points 4 years ago 

    You can use the technique of oversampling to generate new datapoints for y = 1 within your dataset. This is better than removing majority (y=0) datapoints (undersampling), as you won't suffer from information loss!

    Check out ADASYN or SMOTE (google for the papers).

    Also, you could try cost-based learning, which essentially differentiates the costs of mis-classifying the minority class.

    [–]duckandcover 1 point 4 years ago 

    If the number of observations in your smaller class is sufficient you can get away with a large margin classifier ( eg svm, boosted trees) and a roc curve. Cross validation will give you a measure of over fit but if your data is truly unable to represent the population (in particular because whole implicit subclasses of data are missing) that's going to be a problem

    [–]ieeaaauuuuooooo 1 point 4 years ago 

    maybe removing some instances of the big class in the training set, in order to get a balanced set

    [–][deleted] 1 point 4 years ago 

    I'm 100% new to machine learning and am a bit confused about the question but Recursive partitioning and regression trees seem to be a good idea here.

    [–]SCombinator 1 point 4 years ago 

    I'm guessing your classifier just chooses y=0 as its model.

    [–][deleted] 1 point 4 years ago 

    Andrew Ng discusses skewed datasets in the Coursera class. The F1 score might be a better judge of how well your algorithm performs than classification error alone. I think u/trimeta has summed it up pretty well. The AUC is similar in spirit to the F1 score and incorporates similar elements.

    I would probably use a cost-based learning method, weighting the costs separately on a trial-and-error basis till I'm happy with the F1 score.
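That trial-and-error loop can be sketched like so (here the candidate decision thresholds stand in for cost ratios, since shifting the threshold has the same effect on the confusion matrix; function names are my own):

```python
def f1(y_true, y_pred):
    """F1 score straight from the confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold_by_f1(y_true, scores, candidates):
    """Try each candidate threshold and keep the one with the best F1."""
    return max(candidates,
               key=lambda t: f1(y_true, [1 if s >= t else 0 for s in scores]))
```

In a real setup you'd run this on a held-out validation set, not the training data, so the chosen threshold isn't tuned to noise.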