Data Mining for Incomplete Data

Publications on Data Mining for Incomplete Data

Using Classifier-Based Nominal Imputation to Improve Machine Learning (The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'11), Shenzhen, China, May 24-27, 2011)
(X. Su, R. Greiner, T. M. Khoshgoftaar, and A. Napolitano)
 
Abstract: Many learning algorithms perform poorly when the training data are incomplete. One standard approach first imputes the missing values, then gives the completed data to the learning algorithm; however, imputation is especially problematic when the features are nominal. This work presents "classifier-based nominal imputation" (CNI), an easy-to-implement and effective nominal imputation technique that views imputation as classification: for each feature, it learns a classifier that maps the other features of an instance to a predicted value of that feature, then uses that classifier to predict the feature's missing values. Our empirical results show that learners whose incomplete training data are preprocessed with CNI, using support vector machine or decision tree learners as imputers, achieve significantly higher predictive accuracy than learners that (1) use no preprocessing, (2) use baseline imputation techniques, or (3) use the CNI preprocessor with other classification algorithms. The improvement is especially apparent when the base learner is instance-based. CNI is also helpful for other base learners, such as naïve Bayes and decision trees, on incomplete nominal data.
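The per-feature loop described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch: a toy Hamming-distance 1-nearest-neighbour classifier stands in for the support vector machine and decision tree imputers studied in the paper, and the names `MISSING`, `nn_classify`, and `cni_impute` are my own, not from the paper.

```python
# Sketch of classifier-based nominal imputation (CNI): for each feature,
# train a classifier on the rows where that feature is observed, then use
# it to predict the feature's missing values.
MISSING = None

def nn_classify(train_rows, train_labels, query):
    """Predict a label by Hamming-distance 1-NN over nominal features
    (a stand-in for the SVM / decision-tree learners in the paper)."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    best = min(range(len(train_rows)), key=lambda i: dist(train_rows[i], query))
    return train_labels[best]

def cni_impute(data):
    """Fill in missing nominal values, one feature at a time."""
    data = [list(row) for row in data]
    for f in range(len(data[0])):
        # Rows where feature f is observed become the training set ...
        observed = [r for r in data if r[f] is not MISSING]
        if not observed or len(observed) == len(data):
            continue
        train_x = [[v for j, v in enumerate(r) if j != f] for r in observed]
        train_y = [r[f] for r in observed]
        # ... and the learned classifier predicts f for the remaining rows.
        for r in data:
            if r[f] is MISSING:
                query = [v for j, v in enumerate(r) if j != f]
                r[f] = nn_classify(train_x, train_y, query)
    return data
```

Note that because rows are updated in place, values imputed for earlier features are available when later features are imputed.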

____________________________ 

VipBoost: a More Accurate Boosting Algorithm, the 22nd Conference of Florida Artificial Intelligence Research Society (FLAIRS’09), Sanibel Island, FL, USA, May 2009 (AAAI Press).
(X. Su, T. M. Khoshgoftaar, and R. Greiner)

Abstract: Boosting is a well-known method for improving the accuracy of many learning algorithms. In this paper, we propose a novel boosting algorithm, VipBoost (voting on boosting classifications from imputed learning sets), which first generates multiple incomplete datasets from the original dataset by randomly removing a small percentage of observed attribute values, then uses an imputer to fill in the missing values. It then applies AdaBoost, with some base learner, to each of the imputed learning sets, producing multiple classifiers. The prediction on a new test case is the most frequent classification among these classifiers. Our empirical results show that VipBoost produces very effective classifiers that significantly improve accuracy for unstable base learners and some stable learners, especially when the initial dataset is incomplete.
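The final aggregation step shared by VipBoost and the VCI family below is a plurality vote over the classifiers trained on the imputed learning sets. A minimal sketch, assuming each ensemble member is a callable mapping an instance to a class label (the paper's members are full AdaBoost ensembles; plain functions stand in here):

```python
# Plurality voting: each classifier trained on one imputed learning set
# casts a vote, and the most frequent class wins.
from collections import Counter

def plurality_predict(classifiers, instance):
    """Return the most frequent classification among the ensemble members."""
    votes = Counter(clf(instance) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

For example, three members voting "yes", "no", "yes" yield the final prediction "yes".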

____________________________ 

Making an accurate classifier ensemble by voting on classifications from imputed learning sets (International Journal of Information and Decision Sciences (IJIDS), 2009)
(X. Su, T. M. Khoshgoftaar, and R. Greiner)

Abstract: Ensemble methods often produce effective classifiers by learning a set of base classifiers from a diverse collection of training sets. In this paper, we present a system, voting on classifications from imputed learning sets (VCI), that produces those diverse training sets by randomly removing a small percentage of attribute values from the original training set and then using an imputation technique to replace them. VCI then runs a learning algorithm on each of these imputed training sets to produce a set of base classifiers; the final prediction on a novel instance is the plurality classification produced by these classifiers. We investigate various imputation techniques here, including the state-of-the-art Bayesian multiple imputation (BMI) and expectation maximisation (EM). Our empirical results show that VCI predictors, especially those using BMI and EM as imputers, significantly improve classification accuracy over conventional classifiers, especially on datasets that are originally incomplete; moreover, VCI significantly outperforms bagging predictors and imputation-helped machine learners.

____________________________ 

Voting on Classifications from Imputed Learning Sets (IEEE International Conference on Information Reuse and Integration (IRI'08), Las Vegas, NV, July 2008)
(X. Su, T. M. Khoshgoftaar, and X. Zhu)

Abstract: We propose VCI (voting on classifications from imputed learning sets) predictors, which generate multiple incomplete learning sets from a complete dataset by randomly deleting values with a small MCAR (missing completely at random) missing ratio, and then apply an imputation technique to fill in the missing values before giving the imputed data to a machine learner. The final prediction of a class is the result of voting on the classifications from the imputed learning sets. Our empirical results show that VCI predictors significantly improve the classification performance on complete data, and perform better than Bagging predictors on binary class data.
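The MCAR corruption step that generates the incomplete learning sets can be sketched as follows: each observed value is deleted independently with the given missing ratio, regardless of the value itself or of any other value (which is what "missing completely at random" means). The `MISSING` sentinel and the seeding are illustrative choices, not from the paper.

```python
# Sketch of MCAR missing-value injection with a fixed missing ratio.
import random

MISSING = None

def inject_mcar(data, missing_ratio, seed=0):
    """Return a copy of `data` with each value deleted independently
    with probability `missing_ratio`."""
    rng = random.Random(seed)
    return [[MISSING if rng.random() < missing_ratio else v for v in row]
            for row in data]
```

VCI calls this several times (with different seeds) on one complete dataset to obtain the multiple incomplete learning sets that are then imputed.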

____________________________ 

Using Imputation Techniques to Help Learn Accurate Classifiers (IEEE International Conference on Tools with Artificial Intelligence (ICTAI'08), Dayton, Ohio, USA, Nov. 2008)
(X. Su, T. M. Khoshgoftaar, and R. Greiner)

Abstract: It is difficult to learn good classifiers when training data is missing attribute values. Conventional techniques for dealing with such omissions, such as mean imputation, generally do not significantly improve the performance of the resulting classifier. We propose imputation-helped classifiers, which use accurate imputation techniques, such as Bayesian multiple imputation (BMI), predictive mean matching (PMM), and expectation maximization (EM), as preprocessors for conventional machine learning algorithms. Our empirical results show that EM-helped and BMI-helped classifiers work effectively when the data is "missing completely at random", generally improving predictive performance over most of the original machine-learned classifiers we investigated.
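For contrast with BMI, PMM, and EM, the conventional mean-imputation baseline mentioned above is trivial: each missing numeric value is replaced by the mean of the observed values in the same column. A minimal sketch (the function name and `MISSING` sentinel are illustrative); BMI, PMM, and EM are far more involved, but they slot into the same "impute, then learn" pattern:

```python
# Mean imputation: fill each missing value with its column's observed mean.
MISSING = None

def mean_impute(data):
    """Return a copy of `data` with missing values replaced column-wise."""
    cols = range(len(data[0]))
    means = []
    for c in cols:
        observed = [row[c] for row in data if row[c] is not MISSING]
        means.append(sum(observed) / len(observed))
    return [[means[c] if row[c] is MISSING else row[c] for c in cols]
            for row in data]
```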
 

____________________________ 

VoB Predictors: Voting on Bagging Classifications (The 19th IEEE International Conference on Pattern Recognition (ICPR'08), Tampa, FL, Dec. 2008)
(X. Su, T. M. Khoshgoftaar, and X. Zhu)

Abstract: Bagging predictors rely on bootstrap sampling to maintain a set of diverse base classifiers that constitute the classifier ensemble, where diversity among the base classifiers is ensured by a random sampling (with replacement) process on the original data. In this paper, we propose a bootstrap sampling process based on random missing-value corruption, where the objective is to enhance the diversity of the learning sets through random missing-value injection, so that the base classifiers form an accurate classifier ensemble. Our VoB (voting on bagging classifications) predictors first generate multiple incomplete datasets from a complete base dataset by randomly injecting missing values at a small missing ratio, then apply a bagging predictor trained on each of the incomplete datasets to give classifications. The final prediction of a class is the result of voting on these classifications. Our empirical results show that VoB predictors significantly improve classification performance on complete data and perform better than bagging predictors.
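The bootstrap sampling that standard bagging relies on, and that VoB augments with missing-value injection, can be sketched as drawing a learning set of the same size as the original data, with replacement. The function name and seeding are illustrative:

```python
# Sketch of the bootstrap sampling step underlying bagging: draw
# len(data) rows from `data` uniformly at random, with replacement.
import random

def bootstrap_sample(data, seed=0):
    """Return one bootstrap replicate of `data`."""
    rng = random.Random(seed)
    return [data[rng.randrange(len(data))] for _ in range(len(data))]
```

Because rows are drawn with replacement, some rows appear several times in a replicate while others are left out, which is what makes the base classifiers diverse.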
 
