Xiaoyuan Su's Publications on Collaborative Filtering (download most papers here)
PhD dissertation topic: Collaborative Filtering Using Machine Learning and Statistical Techniques (defended Nov 7, 2008, download slides)

Papers:

1. A Survey of Collaborative Filtering Techniques (Advances in Artificial Intelligence, 2009, download paper)
OVERVIEW OF COLLABORATIVE FILTERING TECHNIQUES
Collaborative Filtering for Multiclass Data Using Belief Nets Algorithms (18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Washington D.C., USA, Nov 13-15, 2006, slides). X. Su, T.M. Khoshgoftaar

Imputation-Boosted Collaborative Filtering Using Machine Learning Classifiers (ACM Symposium on Applied Computing, March 2008). X. Su, T. Khoshgoftaar, X. Zhu, R. Greiner
Abstract: As data sparsity remains a significant challenge for collaborative filtering (CF), we conjecture that predicted ratings based on imputed data may be more accurate than those based on the originally very sparse rating data. In this paper, we propose a framework of imputation-boosted collaborative filtering (IBCF), which first uses an imputation technique, or perhaps a machine-learned classifier, to fill in the sparse user-item rating matrix, then runs a traditional Pearson correlation-based CF algorithm on this matrix to predict a novel rating. Empirical results show that IBCF using machine learning classifiers can improve the predictive accuracy of CF tasks. In particular, IBCF using a classifier that deals well with missing data, such as naïve Bayes, can outperform content-boosted CF (a representative hybrid CF algorithm) and IBCF using PMM (predictive mean matching, a state-of-the-art imputation technique), without using external content information.
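The first stage of the IBCF pipeline described in the abstract can be sketched as follows. This is a minimal illustration only: plain item-mean imputation stands in for the machine-learned imputers (naïve Bayes, PMM, etc.) studied in the paper, and the filled matrix would then feed a conventional Pearson correlation-based CF.

```python
def impute_item_means(matrix):
    """Replace each missing (None) rating with its item's observed mean.

    Stands in for the imputation stage of IBCF; the real framework plugs
    in an imputation technique or machine-learned classifier here.
    """
    n_users, n_items = len(matrix), len(matrix[0])
    filled = [row[:] for row in matrix]
    for j in range(n_items):
        observed = [matrix[i][j] for i in range(n_users)
                    if matrix[i][j] is not None]
        col_mean = sum(observed) / len(observed) if observed else 0.0
        for i in range(n_users):
            if filled[i][j] is None:
                filled[i][j] = col_mean
    return filled
```

For example, `impute_item_means([[5, None], [3, 2]])` fills the single missing entry with the item-2 mean, giving `[[5, 2.0], [3, 2]]`.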
A Mixture Imputation-Boosted Collaborative Filter (Florida AI Research Symposium (FLAIRS), May 2008). X. Su, T. Khoshgoftaar, R. Greiner
Abstract: Recommendation systems suggest products to users. Collaborative filtering (CF) systems, which base those recommendations on a database of previous ratings by various users on various products, have proven to be very effective. Since this database is typically very sparse, we consider first imputing the missing values, then making predictions based on that completed dataset. In this paper, we apply several standard imputation techniques within the framework of imputation-boosted collaborative filtering (IBCF). Each technique passes its imputed rating data to a traditional Pearson correlation-based CF algorithm, which uses that information to produce CF predictions. We also propose a novel mixture IBCF algorithm, IBCF-NBM, that uses either naïve Bayes or mean imputation, depending on the sparsity of the original CF rating dataset. Our empirical results show that IBCFs are fairly accurate on CF tasks, and that IBCF-NBM significantly outperforms a representative hybrid CF system, the content-boosted CF algorithm, as well as other IBCFs that use standard imputation techniques.
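The IBCF-NBM selection rule can be sketched as below. Both the 0.6 sparsity threshold and the direction of the switch (naïve Bayes for sparser data) are illustrative assumptions here, not values taken from the paper.

```python
def sparsity(matrix):
    """Fraction of missing (None) entries in the user-item rating matrix."""
    cells = [v for row in matrix for v in row]
    return sum(v is None for v in cells) / len(cells)

def choose_imputer(matrix, threshold=0.6):
    """Mixture rule: pick the imputation technique by data sparsity.

    The threshold and the assignment of imputers to regimes are
    hypothetical illustrations of the IBCF-NBM idea.
    """
    return "naive_bayes" if sparsity(matrix) > threshold else "mean"
```

For a mostly-observed matrix this returns `"mean"`; once most entries are missing it switches to `"naive_bayes"`.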
Hybrid Collaborative Filtering Algorithms with Mixture Expertise (IEEE/WIC/ACM International Conference on Web Intelligence, 2007, slides). X. Su, R. Greiner, T.M. Khoshgoftaar, and X. Zhu

Send your comments to: xsu at fau dot edu

Experimental Results
Table 1: MAE scores of the CF algorithms on the 10 (943 × 60) datasets
Over all 10 datasets, statistical significance (p-values of one-sided paired t-tests):

JMCF < content-boosted CF (p < 0.027)
SMCF < content-boosted CF (p < 0.0026)
content-boosted CF < Pearson CF (p < 0.04)
JMCF < Pearson CF (p < 0.003)
SMCF < Pearson CF (p < 0.0023)

Improvement over Pearson CF in terms of average MAE:

content-boosted CF: 3.01%
JMCF: 4.26%
SMCF: 8.50%
We evaluate the performance of the CF algorithms on the datasets by their average MAE values, then determine statistical significance using the p-values of a one-sided paired t-test, a common practice.
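The paired t statistic behind those p-values can be computed as below, on made-up per-dataset MAE pairs (converting the statistic to a one-sided p-value then requires a t-distribution table or a statistics library):

```python
import math

def paired_t(a, b):
    """Paired t statistic for per-dataset scores a vs. b (df = n - 1).

    A strongly negative value supports the one-sided hypothesis that
    algorithm a's MAE is lower than algorithm b's.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Made-up per-dataset MAE scores for two algorithms over 10 datasets.
mae_a = [0.78, 0.80, 0.79, 0.81, 0.77, 0.79, 0.80, 0.78, 0.82, 0.79]
mae_b = [0.85, 0.88, 0.84, 0.87, 0.86, 0.85, 0.89, 0.84, 0.88, 0.86]
t_stat = paired_t(mae_a, mae_b)
```

Here `t_stat` is strongly negative, consistent with a significant "a < b" result.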
Because dense datasets have more observed values and yield better CF performance, they receive larger weight than sparse datasets when an overall MAE is calculated over all predictions pooled together; such weighting would produce a better (lower) MAE score than the one reported in the paper.
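A small numeric illustration of that weighting effect, with made-up error counts:

```python
# Made-up numbers: a dense dataset contributes 90 predictions with MAE 0.5,
# a sparse one contributes 10 predictions with MAE 1.0. Pooling all
# predictions weights the dense dataset more heavily than averaging the
# two per-dataset MAEs, lowering the overall score.
errors_dense = [0.5] * 90
errors_sparse = [1.0] * 10

per_dataset_avg = (sum(errors_dense) / 90 + sum(errors_sparse) / 10) / 2
pooled_mae = (sum(errors_dense) + sum(errors_sparse)) / 100

print(per_dataset_avg, pooled_mae)  # 0.75 0.55
```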
Related Work
The survey paper "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions" (G. Adomavicius and A. Tuzhilin, IEEE TKDE 17(6), pp. 734-749, 2005) cites four papers that compare hybrid CF with other recommender systems: [1] M. Balabanovic and Y. Shoham, "Fab: Content-Based, Collaborative Recommendation," Comm. ACM, vol. 40, no. 3, pp. 66-72, 1997. They worked on data from 11 users on more than 400 items, used a statistical measure, and used NDPM as the metric.

User content information (age, gender, occupation, zip code):

Age  Gender  Occupation     Zip
24   M       technician     85
53   F       other          94
23   M       writer         32
24   M       technician     43
33   F       other          15
42   M       executive      98
57   M       administrator  91
36   M       administrator  05
29   M       student        01
53   M       lawyer         09

Rating matrix (rows = users in the order above, columns = items 1-10, "-" = missing):

5  5  5  -  -  -  5  -  4  5
3  5  -  1  4  3  4  4  -  -
2  -  4  2  -  2  -  2  -  -
5  -  -  5  -  4  -  5  -  -
-  5  5  -  -  -  4  -  4  5
2  5  -  2  2  -  4  -  -  4
4  5  3  1  4  4  -  4  5  5
5  -  4  3  -  -  -  -  -  5
-  -  -  4  5  -  -  -  -  -
-  5  -  3  4  -  4  -  -  4
Given the above content information and rating matrix, the steps to obtain predictions for the observed values using the sequential mixture CF (SMCF) algorithm are:

1) Use the following training and testing data for the first column of the rating matrix with the content predictor, and make predictions for the testing data using TAN-ELR.
Age  Gender  Occupation     Zip  Item-1 rating (0 = missing, to be predicted)
24   M       technician     85   5
53   F       other          94   3
23   M       writer         32   2
24   M       technician     43   5
42   M       executive      98   2
57   M       administrator  91   4
36   M       administrator  05   5
33   F       other          15   0
29   M       student        01   0
53   M       lawyer         09   0
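Step 1 in miniature: train a classifier on the user attributes of the users who rated item 1, then classify the users whose item-1 rating is missing. TAN-ELR is not reproduced here; a simple categorical naive Bayes with Laplace smoothing stands in as the content predictor, and discretizing the ages into decades is an illustrative choice.

```python
import math
from collections import Counter

def nb_predict(train, x):
    """Categorical naive Bayes with Laplace smoothing (stand-in for TAN-ELR)."""
    labels = [y for _, y in train]
    classes = sorted(set(labels))
    prior = Counter(labels)
    best, best_score = None, float("-inf")
    for c in classes:
        # log prior with add-one smoothing
        score = math.log((prior[c] + 1) / (len(train) + len(classes)))
        rows_c = [attrs for attrs, y in train if y == c]
        for j, value in enumerate(x):
            hits = sum(attrs[j] == value for attrs in rows_c)
            score += math.log((hits + 1) / (len(rows_c) + 2))
        if score > best_score:
            best, best_score = c, score
    return best

# Users who rated item 1 form the training data (attributes -> rating).
train = [(("20s", "M", "technician",    "85"), 5),
         (("50s", "F", "other",         "94"), 3),
         (("20s", "M", "writer",        "32"), 2),
         (("20s", "M", "technician",    "43"), 5),
         (("40s", "M", "executive",     "98"), 2),
         (("50s", "M", "administrator", "91"), 4),
         (("30s", "M", "administrator", "05"), 5)]
# The three users with a 0 (missing) item-1 rating are the testing data.
test = [("30s", "F", "other",   "15"),
        ("20s", "M", "student", "01"),
        ("50s", "M", "lawyer",  "09")]
preds = [nb_predict(train, x) for x in test]
```

Each prediction falls in the 1-5 rating scale; the actual TAN-ELR predictions in the worked example will differ from this stand-in.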
2) Repeat the above step for each column in turn to obtain predictions for all of the missing values in the original rating matrix.
3) Replace the missing values with their respective predicted values to form a pseudo rating matrix:
5  5  5  4  3  4  5  5  4  5
3  5  5  1  4  3  4  4  4  4
2  4  4  2  3  2  4  2  3  5
5  4  4  5  3  4  5  5  4  4
5  5  5  3  4  4  4  4  4  5
2  5  4  2  2  4  4  4  4  4
4  5  3  1  4  4  4  4  5  5
5  5  4  3  5  4  4  4  4  5
4  5  4  4  5  4  4  4  3  4
3  5  4  3  4  5  4  5  3  4
4) Use the Pearson correlation-based CF algorithm in an all-but-one scenario to predict a rating for each value in the pseudo rating matrix:
4  5  4  4  4  4  4  4  4  5
4  4  4  3  4  4  4  4  4  4
4  4  4  3  4  3  4  3  3  4
4  4  4  4  4  4  4  4  4  4
4  4  4  4  4  4  4  4  4  4
4  4  4  3  3  3  3  3  3  4
4  5  4  4  4  4  4  4  4  4
4  5  4  4  4  4  4  4  5  4
5  4  4  4  4  4  4  4  5  4
4  4  4  3  4  4  4  4  4  4
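Step 4's all-but-one Pearson prediction can be sketched as below: a minimal version of the standard deviation-from-mean Pearson CF formula, with the target entry held out of the active user's profile.

```python
def pearson(u, v):
    """Pearson correlation between two equal-length rating lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) *
           sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den if den else 0.0

def all_but_one_predict(matrix, user, item):
    """Predict matrix[user][item] from the other users' rows of a complete
    (pseudo) rating matrix, hiding the target entry from the active user."""
    base = [r for j, r in enumerate(matrix[user]) if j != item]  # held out
    mean_u = sum(base) / len(base)
    num = den = 0.0
    for i, row in enumerate(matrix):
        if i == user:
            continue
        w = pearson(base, [r for j, r in enumerate(row) if j != item])
        num += w * (row[item] - sum(row) / len(row))
        den += abs(w)
    return mean_u + num / den if den else mean_u
```

On a toy 3 × 3 matrix such as `[[5, 4, 5], [4, 4, 4], [2, 1, 2]]`, predicting entry (0, 0) blends user 0's held-out mean with the correlated neighbor's deviation from its own mean.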
The steps to get predictions using the joint mixture CF (JMCF) algorithm:

1) For the first column, use TAN-ELR as a CF predictor: each user who rated item 1 becomes a training instance, with the ratings of items 2-10 as attributes and the rating of item 1 as the class label ("-" = missing):

Ratings of items 2-10 (attributes)   Item 1 (class)
5  5  -  -  -  5  -  4  5            5
5  -  1  4  3  4  4  -  -            3
-  4  2  -  2  -  2  -  -            2
-  -  5  -  4  -  5  -  -            5
5  -  2  2  -  4  -  -  4            2
5  3  1  4  4  -  4  5  5            4
-  4  3  -  -  -  -  -  5            5
2) Repeat the above step for each column in turn, getting predictions for all of the originally observed values (for evaluation purposes we only make predictions for the observed values, since their true values are available). These are the TAN-ELR CF predictions:
4  5  5  -  -  -  5  -  4  5
4  4  -  3  4  4  4  4  -  -
3  -  4  2  -  4  -  4  -  -
4  -  -  3  -  5  -  3  -  -
-  3  5  -  -  -  5  -  4  5
3  3  -  3  3  -  3  -  -  4
5  5  4  3  4  4  -  4  4  5
4  -  5  3  -  -  -  -  -  5
-  -  -  4  4  -  -  -  -  -
-  4  -  3  5  -  4  -  -  5
3) Use a similar strategy for the content-based predictor, except that the user information serves as the attribute values. Run TAN-ELR on the following generated data, with 5-fold cross-validation, to get the predictions:
Age  Gender  Occupation     Zip  Item-1 rating
24   M       technician     85   5
53   F       other          94   3
23   M       writer         32   2
24   M       technician     43   5
42   M       executive      98   2
57   M       administrator  91   4
36   M       administrator  05   5
4) Repeat the above step for each column of the matrix in turn, and make predictions for all of the observed values with this content-based predictor:
4  5  5  -  -  -  5  -  5  5
4  5  -  3  5  4  4  3  -  -
3  -  2  4  -  4  -  2  -  -
5  -  -  5  -  4  -  5  -  -
-  5  4  -  -  -  5  -  4  5
3  5  -  2  4  -  4  -  -  5
5  5  4  4  4  4  -  5  5  5
5  -  4  4  -  -  -  -  -  5
-  -  -  4  5  -  -  -  -  -
-  5  -  3  4  -  5  -  -  5
5) Make predictions directly with the Pearson correlation-based CF algorithm, for the observed values only:
4  4  5  -  -  -  4  -  4  5
4  4  -  4  4  3  4  4  -  -
3  -  3  3  -  3  -  3  -  -
4  -  -  5  -  5  -  5  -  -
-  4  4  -  -  -  4  -  4  4
3  3  -  3  3  -  4  -  -  4
4  4  4  4  4  4  -  4  4  4
4  -  5  4  -  -  -  -  -  5
-  -  -  5  4  -  -  -  -  -
-  5  -  4  4  -  4  -  -  4
6) Vote the final predictions with a joint mixture voter over the predictions from the three predictors above (weighted-average voter weights: 3 for the TAN-ELR CF predictor, and 2 each for the content-based predictor and Pearson CF), then calculate the MAE for this algorithm:
4  5  5  -  -  -  5  -  4  5
4  4  -  3  4  4  4  4  -  -
3  -  3  3  -  4  -  3  -  -
4  -  -  4  -  5  -  4  -  -
-  4  4  -  -  -  5  -  4  5
3  4  -  3  3  -  4  -  -  4
5  5  4  4  4  4  -  4  4  5
4  -  5  4  -  -  -  -  -  5
-  -  -  4  4  -  -  -  -  -
-  5  -  3  4  -  4  -  -  5
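Step 6's weighted vote and MAE computation in miniature, using user 1's first three observed entries (true ratings 5, 5, 5) and the corresponding values read off the three prediction matrices above:

```python
def vote(tan_elr, content, pearson_cf, weights=(3, 2, 2)):
    """Weighted-average joint mixture vote over the three predictions."""
    w1, w2, w3 = weights
    return (w1 * tan_elr + w2 * content + w3 * pearson_cf) / sum(weights)

def mae(predictions, truths):
    """Mean absolute error of the predictions against the true ratings."""
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)

# (TAN-ELR CF, content-based, Pearson CF) predictions for user 1's first
# three observed entries, taken from the matrices above.
triples = [(4, 4, 4), (5, 5, 4), (5, 5, 5)]
truths = [5, 5, 5]
voted = [vote(*t) for t in triples]
print([round(v, 2) for v in voted], round(mae(voted, truths), 3))
# → [4.0, 4.71, 5.0] 0.429
```

Over the full matrix the MAE is of course taken over every observed entry, not just these three.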
Imputed Neighborhood-Based Collaborative Filtering (IEEE/WIC/ACM International Conference on Web Intelligence, Sydney, Australia, Dec 2008).
X. Su, T. Khoshgoftaar, R. Greiner.
Abstract: Collaborative filtering (CF) is one of the most effective types of recommender systems. As data sparsity remains a significant challenge for CF, we observe that making predictions based on imputed data often improves performance on very sparse rating data. In this paper, we propose two imputed neighborhood-based collaborative filtering (INCF) algorithms: imputed nearest neighborhood CF (INNCF) and imputed densest neighborhood CF (IDNCF), which first impute the user rating data using an imputation technique, then apply a traditional Pearson correlation-based CF algorithm to the imputed data of the most similar neighbors or the densest neighbors to make CF predictions for a specific user. We investigate the use of an extension of Bayesian multiple imputation (eBMI) and mean imputation (MEI) in these INCF algorithms, and compare them with the commonly used neighborhood-based CF, Pearson correlation-based CF, as well as a densest neighborhood-based CF. Our empirical results show that IDNCF using eBMI significantly outperforms its rivals and takes less time to make its best predictions.
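The densest-neighbor selection behind IDNCF can be sketched as follows. The neighborhood size k and the use of raw observed-rating counts are illustrative assumptions; the selected rows, after imputation, would then feed a Pearson correlation-based CF.

```python
def densest_neighbors(matrix, user, k=2):
    """Indices of the k other users with the most observed (non-None) ratings.

    Sketch of the "densest neighborhood" idea: rank candidate neighbors by
    how many ratings they actually have, rather than by similarity alone.
    """
    counts = [(sum(v is not None for v in row), i)
              for i, row in enumerate(matrix) if i != user]
    counts.sort(reverse=True)  # densest first
    return [i for _, i in counts[:k]]
```

For example, with rows having 2, 3, 1, and 2 observed ratings, the two densest neighbors of user 0 are users 1 and 3.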