Collaborative Filtering

 

Xiaoyuan Su's Publications on Collaborative Filtering (download most papers here)

 

PhD dissertation topic: Collaborative Filtering Using Machine Learning and Statistical Techniques (defended on Nov 7th, 2008, download slides)
 

Papers:

1. A Survey of Collaborative Filtering Techniques (Advances in Artificial Intelligence, 2009, download paper)
2-3. Collaborative Filtering Using Bayesian Networks (ICTAI 2006, IJAIT 2008)
4-6. Imputation-Boosted Collaborative Filters (ACM SAC 2008, FLAIRS 2008, and a journal paper in submission)
7. Hybrid Collaborative Filtering Algorithms with Mixture Expertise (Web Intelligence 2007)
8. Imputed Neighborhood Based Collaborative Filtering (Web Intelligence 2008)

 

 

OVERVIEW OF COLLABORATIVE FILTERING TECHNIQUES

 

Memory-based CF

Representative techniques:
* neighbor-based CF (item-based / user-based CF algorithms with Pearson / vector cosine correlation)
* item-based / user-based top-N recommendations

Main advantages:
* easy implementation
* new data can be added easily and incrementally
* need not consider the content of the items being recommended
* scale well with co-rated items

Main shortcomings:
* are dependent on human ratings
* performance decreases when data are sparse
* cannot make recommendations for new users and new items
* have limited scalability for large datasets

Model-based CF

Representative techniques:
* Bayesian belief nets CF
* clustering CF
* MDP-based CF
* latent semantic CF
* sparse factor analysis
* CF using dimensionality reduction techniques, e.g., SVD, PCA

Main advantages:
* better address the sparsity, scalability, and other problems
* improve prediction performance
* give an intuitive rationale for recommendations

Main shortcomings:
* expensive model-building
* a trade-off between prediction performance and scalability
* loss of useful information in dimensionality reduction techniques

Hybrid recommenders

Representative techniques:
* content-based CF recommenders, e.g., Fab
* content-boosted CF
* hybrid CF combining memory-based and model-based CF algorithms, e.g., Personality Diagnosis

Main advantages:
* overcome the limitations of CF and of content-based or other recommenders
* improve prediction performance
* overcome CF problems such as sparsity and gray sheep

Main shortcomings:
* increased complexity and expense of implementation
* need external information that is usually not available
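
To make the memory-based category concrete, here is a minimal Python sketch of a user-based CF predictor with Pearson correlation, in the spirit of the neighbor-based algorithms above; the toy matrix and the zero-similarity fallbacks are illustrative assumptions, not the exact implementation evaluated in the papers on this page.

```python
import numpy as np

def pearson_sim(u, v):
    """Pearson correlation over the items co-rated by users u and v."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    if mask.sum() < 2:
        return 0.0
    a, b = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def predict(R, user, item):
    """Predict R[user, item] from the mean-offset ratings of the other users,
    weighted by their Pearson similarity to the active user."""
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, item]):
            continue
        w = pearson_sim(R[user], R[other])
        num += w * (R[other, item] - np.nanmean(R[other]))
        den += abs(w)
    base = np.nanmean(R[user])
    return base + num / den if den > 0 else base

# Toy 4-user x 5-item matrix (NaN = missing); values are illustrative only.
R = np.array([[5, 4, np.nan, 1, 2],
              [4, 5, 3, 1, np.nan],
              [1, np.nan, 2, 5, 4],
              [np.nan, 4, 4, 1, 2]])
print(round(predict(R, user=0, item=2), 2))
```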

 

 

Collaborative Filtering for Multi-class Data Using Belief Nets Algorithms (18th IEEE International Conference on Tools with Artificial Intelligence, Washington D.C., USA, Nov 13-15, 2006, slides). X. Su, T.M. Khoshgoftaar

Abstract: As one of the most successful approaches to recommender systems, collaborative filtering (CF) algorithms must cope with high sparsity and demanding scalability requirements, among other challenges. Bayesian belief nets (BNs), one of the most frequently used classifiers, can be applied to CF tasks. Previous work applying BNs to CF tasks mainly focused on binary-class data and used simple or basic Bayesian classifiers. In this work, we apply advanced BN models to CF tasks instead of simple ones, and work on real-world multi-class CF data instead of synthetic binary-class data. Empirical results show that, with their ability to deal with incomplete data, extended logistic regression on naïve Bayes and on tree augmented naïve Bayes (NB-ELR and TAN-ELR) models consistently perform better than the state-of-the-art Pearson correlation-based CF algorithm. In addition, the ELR-optimized BN CF models are robust in their ability to make predictions, while the robustness of the Pearson correlation-based CF algorithm degrades as the sparseness of the data increases.
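
The NB-ELR and TAN-ELR models in this paper use discriminative (ELR) parameter learning, which is beyond a short example; as a simplified stand-in, the sketch below shows the basic idea of Bayesian-classifier CF: treat one item's rating as the class and the other items' ratings as (possibly missing) features. Plain naive Bayes with Laplace smoothing and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def nb_predict(train, x_new, classes=(1, 2, 3, 4, 5), alpha=1.0):
    """Predict a rating class with plain naive Bayes and Laplace smoothing.
    train: list of (features, cls) pairs; None in a feature vector marks a
    missing rating, which naive Bayes handles by simply skipping it."""
    prior = Counter(c for _, c in train)
    best, best_lp = None, float("-inf")
    for c in classes:
        rows_c = [f for f, cls in train if cls == c]
        lp = math.log((prior[c] + alpha) / (len(train) + alpha * len(classes)))
        for j, v in enumerate(x_new):
            if v is None:
                continue  # skip missing features
            hits = sum(1 for f in rows_c if f[j] == v)
            lp += math.log((hits + alpha) / (len(rows_c) + alpha * len(classes)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Each user's rating of the target item is the class; the other items'
# ratings are the (possibly missing) features. Toy data for illustration:
train = [((5, 4, None), 5), ((4, 4, 3), 4), ((1, None, 2), 1), ((2, 1, 2), 2)]
print(nb_predict(train, (5, None, 3)))
```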

Imputation-Boosted Collaborative Filtering Using Machine Learning Classifiers (ACM Symposium on Applied Computing, March 2008). X. Su, T. Khoshgoftaar, X. Zhu, R. Greiner

Abstract: As data sparsity remains a significant challenge for collaborative filtering (CF), we conjecture that predicted ratings based on imputed data may be more accurate than those based on the originally very sparse rating data. In this paper, we propose a framework of imputation-boosted collaborative filtering (IBCF), which first uses an imputation technique, or perhaps a machine-learned classifier, to fill in the sparse user-item rating matrix, and then runs a traditional Pearson correlation-based CF algorithm on this matrix to predict novel ratings. Empirical results show that IBCF using machine learning classifiers can improve the predictive accuracy of CF tasks. In particular, IBCF using a classifier that deals well with missing data, such as naïve Bayes, can outperform content-boosted CF (a representative hybrid CF algorithm) and IBCF using PMM (predictive mean matching, a state-of-the-art imputation technique), without using external content information.
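
A minimal sketch of the two-stage IBCF idea: stage one fills in the sparse matrix, stage two runs a standard Pearson correlation-based CF algorithm on the dense result. Item-mean imputation is used here only as a simple stand-in for the machine-learned imputers studied in the paper.

```python
import numpy as np

def impute_item_mean(R):
    """IBCF stage 1: fill each missing cell, here with its item's mean rating
    (a simple stand-in for the classifiers/PMM imputers in the paper)."""
    filled = R.copy()
    col_means = np.nanmean(R, axis=0)
    idx = np.where(np.isnan(filled))
    filled[idx] = np.take(col_means, idx[1])
    return filled

# Stage 2 would run a traditional Pearson correlation-based CF algorithm on
# the dense matrix `filled` (see the memory-based CF sketch above).
R = np.array([[5.0, np.nan, 3.0],
              [4.0, 2.0, np.nan],
              [np.nan, 1.0, 4.0]])
print(impute_item_mean(R))
```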
 

 

 

 

A Mixture Imputation-Boosted Collaborative Filter (Florida AI Research Symposium (FLAIRS), May 2008). X. Su, T. Khoshgoftaar, R. Greiner.

Abstract: Recommendation systems suggest products to users. Collaborative filtering (CF) systems, which base those recommendations on a database of previous ratings by various users and products, have proven to be very effective. Since this database is typically very sparse, we consider first imputing the missing values, and then making predictions based on that completed dataset. In this paper, we apply several standard imputation techniques within the framework of imputation-boosted collaborative filtering (IBCF). Each technique passes its imputed rating data to a traditional Pearson correlation-based CF algorithm, which uses that information to produce CF predictions. We also propose a novel mixture IBCF algorithm, IBCF-NBM, that uses either naïve Bayes or mean imputation, depending on the sparsity of the original CF rating dataset. Our empirical results show that IBCFs are fairly accurate on CF tasks, and that IBCF-NBM significantly outperforms a representative hybrid CF system, the content-boosted CF algorithm, as well as other IBCFs that use standard imputation techniques.
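
The IBCF-NBM mixture chooses its imputer from the sparsity of the rating data; the sketch below shows only that switch. The 90% threshold and the assignment of which imputer handles which sparsity regime are illustrative assumptions, since the abstract does not state them.

```python
import numpy as np

def choose_imputer(R, threshold=0.90):
    """IBCF-NBM switch sketch: pick the imputation technique based on how
    sparse the rating matrix is (NaN = missing). Threshold and direction
    are assumed placeholders, not values from the paper."""
    sparsity = np.isnan(R).mean()
    return "mean imputation" if sparsity > threshold else "naive Bayes imputation"
```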


 

X. Su, R. Greiner, T.M. Khoshgoftaar, and X. Zhu

Hybrid Collaborative Filtering Algorithms with Mixture Expertise (slides)

IEEE/WIC/ACM Web Intelligence 2007

Send your comments to: xsu at fau dot edu

Experimental Results

missing rate %   Pearson CF   Model-based CF   content predictor   CBCF     JMCF     SMCF
63.75            0.6901       0.7592           0.8055              0.6974   0.6818   0.6820
68.24            0.6976       0.7670           0.8203              0.7033   0.6885   0.6883
76.50            0.7108       0.7800           0.8178              0.7091   0.6981   0.6932
76.74            0.7325       0.8084           0.8359              0.7244   0.7221   0.7088
85.30            0.7723       0.8296           0.8479              0.7680   0.7433   0.7155
87.55            0.7895       0.8458           0.8664              0.7822   0.7538   0.7303
91.54            0.8166       0.8657           0.8952              0.7910   0.7797   0.7416
94.64            0.8937       0.8921           0.9178              0.8705   0.8135   0.7785
95.59            0.8858       0.8437           0.8669              0.8014   0.7836   0.7335
96.14            0.9803       0.9450           1.0174              0.8818   0.8786   0.8200
overall          0.7407       0.8000           0.8378              0.7344   0.7188   0.7062

Table 1: MAE scores of the CF algorithms on the 10 (943 * 60) datasets

  

Over all 10 datasets...

Statistical significance in terms of p-values of one-sided paired t-tests:

    JMCF < content-boosted CF (p<0.027)

    SMCF < content-boosted CF (p<0.0026)

    content-boosted CF  <  Pearson CF (p<0.04)

    JMCF < Pearson CF (p<0.003)

    SMCF < Pearson CF (p<0.0023)

Improvement over Pearson CF in terms of average MAE:

    content-boosted CF < Pearson CF (3.01%)

    JMCF < Pearson CF (4.26%)

    SMCF  < Pearson CF (8.50%)

 

We evaluate the performance of the CF algorithms on the datasets by their average MAE values, and then determine statistical significance using p-values of one-sided paired t-tests, as is common practice.

 

Because dense datasets have more observed values and yield better CF performance, they receive larger weight than sparse ones when an overall MAE is computed across all ratings; doing so would therefore produce a better (lower) overall MAE score than the one reported in the paper.
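
For concreteness, this is how the comparison above can be reproduced with SciPy, pairing the per-dataset MAE scores from Table 1 by dataset; halving the two-sided p-value is the standard way to obtain a one-sided paired t-test.

```python
import numpy as np
from scipy import stats

# Per-dataset MAE scores over the 10 datasets, paired by dataset
# (the SMCF and Pearson CF columns of Table 1).
smcf    = [0.6820, 0.6883, 0.6932, 0.7088, 0.7155,
           0.7303, 0.7416, 0.7785, 0.7335, 0.8200]
pearson = [0.6901, 0.6976, 0.7108, 0.7325, 0.7723,
           0.7895, 0.8166, 0.8937, 0.8858, 0.9803]

print("unweighted average MAE: SMCF", np.mean(smcf), "Pearson", np.mean(pearson))

# One-sided paired t-test: is SMCF's MAE significantly lower than Pearson's?
t, p_two = stats.ttest_rel(smcf, pearson)
p_one = p_two / 2 if t < 0 else 1 - p_two / 2
print(f"t = {t:.3f}, one-sided p = {p_one:.4f}")
```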

 

Related Work

 

The survey paper "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions" (G. Adomavicius and A. Tuzhilin, TKDE 17(6), pp. 734-749, 2005) cites four papers that compare hybrid CF with other recommender systems:

 

[1] M. Balabanovic and Y. Shoham, “Fab: Content-Based, Collaborative Recommendation,” Comm. ACM, vol. 40, no. 3, pp. 66-72, 1997.

---- They worked on data from 11 users on more than 400 items, and used the statistical measure NDPM as the evaluation metric.


[2] P. Melville, R.J. Mooney, and R. Nagarajan, “Content-Boosted Collaborative Filtering for Improved Recommendations,” Proc. 18th Nat'l Conf. Artificial Intelligence, 2002.

---- They worked on a selected portion of the EachMovie data containing 1,000 users, and evaluated 25% of the ratings as test sets; the test sets contain 500 users and 55 items. (Each of our datasets has more users and items than theirs.)

---- They computed the MAE on the test set for each user, and then averaged over the set of test users. They also assessed statistical significance with paired t-tests.


[3] M. Pazzani, “A Framework for Collaborative, Content-Based, and Demographic Filtering,” Artificial Intelligence Rev., pp. 393-408, Dec. 1999.

---- They worked on data collected from a group of 44 users on 58 webpages, and used average precision as the evaluation metric.


[4] I. Soboroff and C. Nicholas, “Combining Content and Collaboration in Text Filtering,” Proc. Int'l Joint Conf. Artificial Intelligence Workshop: Machine Learning for Information Filtering, Aug. 1999.

---- They worked on a test collection of 1,400 documents and 225 scored queries for text filtering, and used recall and average precision as the evaluation metrics.

 

 

Data Preparation

CF results obtained on data from real-world experiments are more desirable than results on artificial data. However, well-known collaborative filtering databases from real-world experiments are too big for most standard CF algorithms. For example, the Netflix Prize data has 17,770 movies, 480,189 users, and 100,480,507 ratings, and the MovieLens data has 100,000 ratings for 1,682 movies by 943 users.

 

A natural solution is to divide and conquer. Working on the MovieLens data, we first rank the movies according to the number of users who have rated them. Thus we have

 

user#(m_1) ≥  user#(m_2) ≥ … ≥ user#(m_600)

 

where user#(m_i) is the number of users that have rated the movie m_i.

 

We then divide the original MovieLens data into 10 subsets, each of which has all 943 users and 60 movies (note that we do not use all of the movies).

 

Data_1 has all 943 users and movies {m_1, ..., m_60}
Data_2 has all 943 users and movies {m_61, ..., m_120}
...
Data_10 has all 943 users and movies {m_541, ..., m_600}
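
A minimal sketch of this divide-and-conquer split, assuming the ratings arrive as (user, movie, rating) triples (the actual MovieLens u.data file is tab-separated and also carries a timestamp):

```python
from collections import Counter

def split_movielens(ratings, n_subsets=10, movies_per_subset=60):
    """Rank movies by how many users rated them, then carve the ratings into
    n_subsets disjoint movie groups of movies_per_subset movies each.
    `ratings` is an iterable of (user_id, movie_id, rating) triples."""
    ratings = list(ratings)
    counts = Counter(m for _, m, _ in ratings)
    # user#(m_1) >= user#(m_2) >= ... ; ties are broken arbitrarily here.
    ranked = [m for m, _ in counts.most_common()]
    subsets = []
    for k in range(n_subsets):
        group = set(ranked[k * movies_per_subset:(k + 1) * movies_per_subset])
        subsets.append([(u, m, r) for u, m, r in ratings if m in group])
    return subsets
```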

 

 

A Case Study

To illustrate our hybrid CF algorithms with mixture expertise more intuitively, we provide this simple case study. The data and related results are taken from the first dataset, and its CF results, used in our previous experiments; here the content dataset is the user information associated with the pure rating matrix. The content dataset includes the attributes age, sex, occupation, and postcode, and we keep only the first two digits of each postcode.

                

age  sex  occupation     postcode
24   M    technician     85
53   F    other          94
23   M    writer         32
24   M    technician     43
33   F    other          15
42   M    executive      98
57   M    administrator  91
36   M    administrator  05
29   M    student        01
53   M    lawyer         09

The 10 x 10 rating matrix (rows = users, columns = items; "-" marks a missing rating):

5  5  5  -  -  -  5  -  4  5
3  5  -  1  4  3  4  4  -  -
2  -  4  2  -  2  -  2  -  -
5  -  -  5  -  4  -  5  -  -
-  5  5  -  -  -  4  -  4  5
2  5  -  2  2  -  4  -  -  4
4  5  3  1  4  4  -  4  5  5
5  -  4  3  -  -  -  -  -  5
-  -  -  4  5  -  -  -  -  -
-  5  -  3  4  -  4  -  -  4

Given the above content information and rating matrix, the steps to obtain predictions for the observed values using the sequential mixture CF (SMCF) algorithm are:

1) Use the following training data and testing data for the first column of the rating matrix with the content predictor, and make predictions for the testing data using TAN-ELR.

  

Training data (age, sex, occupation, postcode; class = the observed rating in the first column of the rating matrix):

24  M  technician     85  5
53  F  other          94  3
23  M  writer         32  2
24  M  technician     43  5
42  M  executive      98  2
57  M  administrator  91  4
36  M  administrator  05  5

Testing data (class unknown, marked 0):

33  F  other    15  0
29  M  student  01  0
53  M  lawyer   09  0

 

2) Repeat the above step for each column in turn to get predictions for all of the missing values in the original rating matrix.

3) Replace missing values with their respective predicted values to form a pseudo rating matrix.

The pseudo rating matrix:

5  5  5  4  3  4  5  5  4  5
3  5  5  1  4  3  4  4  4  4
2  4  4  2  3  2  4  2  3  5
5  4  4  5  3  4  5  5  4  4
5  5  5  3  4  4  4  4  4  5
2  5  4  2  2  4  4  4  4  4
4  5  3  1  4  4  4  4  5  5
5  5  4  3  5  4  4  4  4  5
4  5  4  4  5  4  4  4  3  4
3  5  4  3  4  5  4  5  3  4

4) Use the Pearson correlation-based CF algorithm in the all-but-one scenario to predict a rating for each value in the pseudo rating matrix.

Pearson correlation-based CF predictions on the pseudo rating matrix:

4  5  4  4  4  4  4  4  4  5
4  4  4  3  4  4  4  4  4  4
4  4  4  3  4  3  4  3  3  4
4  4  4  4  4  4  4  4  4  4
4  4  4  4  4  4  4  4  4  4
4  4  4  3  3  3  3  3  3  4
4  5  4  4  4  4  4  4  4  4
4  5  4  4  4  4  4  4  4  5
4  5  4  4  4  4  4  4  4  5
4  4  4  3  4  4  4  4  4  4

 
5) Calculate the mean absolute error (MAE) for this SMCF algorithm, using only the originally observed values, for evaluation purposes.
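
A small sketch of that evaluation step, assuming the original matrix marks missing ratings with NaN:

```python
import numpy as np

def mae_on_observed(pred, orig):
    """Step-5 evaluation: MAE computed only over the cells that were observed
    in the original rating matrix (NaN marks a missing rating in `orig`)."""
    pred, orig = np.asarray(pred, float), np.asarray(orig, float)
    mask = ~np.isnan(orig)
    return np.abs(pred[mask] - orig[mask]).mean()

# With `orig` as the case-study rating matrix and `pred` as the step-4
# matrix, this averages |prediction - rating| over the observed cells only.
```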

 

The steps to get predictions using the joint mixture CF (JMCF) algorithm are:

1) Use the following data for the TAN-ELR classifier, with 5-fold cross validation (each row lists a user's ratings on items 2-10 as attribute values; the last column is the class, i.e., the user's observed rating on item 1; users with a missing value in that column are removed). Get predictions with TAN-ELR.

 

Training data (ratings on items 2-10 as attributes, "-" = missing; the last column, after "|", is the class):

5  5  -  -  -  5  -  4  5  |  5
5  -  1  4  3  4  4  -  -  |  3
-  4  2  -  2  -  2  -  -  |  2
-  -  5  -  4  -  5  -  -  |  5
5  -  2  2  -  4  -  -  4  |  2
5  3  1  4  4  -  4  5  5  |  4
-  4  3  -  -  -  -  -  5  |  5

2) Repeat the above step for each column in turn, getting predictions for all of the originally observed values (for evaluation purposes we only make predictions for observed values, since their true values are available) as the TAN-ELR CF predictions.

TAN-ELR CF predictions (for the originally observed values only):

4  5  5  -  -  -  5  -  4  5
4  4  -  3  4  4  4  4  -  -
3  -  4  2  -  4  -  4  -  -
4  -  -  3  -  5  -  3  -  -
-  3  5  -  -  -  5  -  4  5
3  3  -  3  3  -  3  -  -  4
5  5  4  3  4  4  -  4  4  5
4  -  5  3  -  -  -  -  -  5
-  -  -  4  4  -  -  -  -  -
-  4  -  3  5  -  4  -  -  5

3) Use a similar strategy for the content-based predictor, except that the user information serves as the attribute values. Run TAN-ELR on the following generated data, with 5-fold cross validation, to get the predictions.

Training data for the content-based predictor (age, sex, occupation, postcode; class = the observed rating in the current column):

24  M  technician     85  5
53  F  other          94  3
23  M  writer         32  2
24  M  technician     43  5
42  M  executive      98  2
57  M  administrator  91  4
36  M  administrator  05  5

4) Repeat the above step for each column in the matrix in turn, and then make predictions for all observed values with this content-based predictor.

Content-based predictor predictions (observed values only):

4  5  5  -  -  -  5  -  5  5
4  5  -  3  5  4  4  3  -  -
3  -  2  4  -  4  -  2  -  -
5  -  -  5  -  4  -  5  -  -
-  5  4  -  -  -  5  -  4  5
3  5  -  2  4  -  4  -  -  5
5  5  4  4  4  4  -  5  5  5
5  -  4  4  -  -  -  -  -  5
-  -  -  4  5  -  -  -  -  -
-  5  -  3  4  -  5  -  -  5

5) Make predictions directly from the Pearson correlation-based CF algorithm, just for the observed values.

Pearson correlation-based CF predictions (observed values only):

4  4  5  -  -  -  4  -  4  5
4  4  -  4  4  3  4  4  -  -
3  -  3  3  -  3  -  3  -  -
4  -  -  5  -  5  -  5  -  -
-  4  4  -  -  -  4  -  4  4
3  3  -  3  3  -  4  -  -  4
4  4  4  4  4  4  -  4  4  4
4  -  5  4  -  -  -  -  -  5
-  -  -  5  4  -  -  -  -  -
-  5  -  4  4  -  4  -  -  4

6) Produce the final predictions with a joint mixture voter over the predictions from the above three parties (weighted-average voter weights: 3 for TAN-ELR CF, and 2 each for the content-based predictor and Pearson CF), and calculate the MAE for this algorithm.

Final JMCF predictions (observed values only):

4  5  5  -  -  -  5  -  4  5
4  4  -  3  4  4  4  4  -  -
3  -  3  3  -  4  -  3  -  -
4  -  -  4  -  5  -  4  -  -
-  4  4  -  -  -  5  -  4  5
3  4  -  3  3  -  4  -  -  4
5  5  4  4  4  4  -  4  4  5
4  -  5  4  -  -  -  -  -  5
-  -  -  4  4  -  -  -  -  -
-  5  -  3  4  -  4  -  -  5
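
A sketch of the step-6 voter; rounding the weighted average to the nearest integer rating is an assumption, but it reproduces the matrices above, e.g. TAN-ELR 5, content 5, and Pearson 4 in the first row gives (3*5 + 2*5 + 2*4) / 7 ≈ 4.71, which appears as 5 in the final matrix.

```python
def jmcf_vote(tan_elr, content, pearson, w=(3, 2, 2)):
    """Step-6 joint mixture voter: weighted average of the three predictors
    (weights 3 for TAN-ELR CF, 2 each for the content-based predictor and
    Pearson CF); rounding to an integer rating is an assumption here."""
    s = (w[0] * tan_elr + w[1] * content + w[2] * pearson) / sum(w)
    return round(s)

# Second cell of row 1 in the three prediction matrices above: 5, 5, 4.
print(jmcf_vote(5, 5, 4))  # -> 5, matching the final JMCF matrix
```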


 

Imputed Neighborhood Based Collaborative Filtering (IEEE/WIC/ACM International Conference on Web Intelligence, Sydney, Australia, Dec 2008).

X. Su, T. Khoshgoftaar, R. Greiner.

Abstract: Collaborative filtering (CF) is one of the most effective types of recommender systems. As data sparsity remains a significant challenge for CF, we observe that making predictions based on imputed data often improves performance on very sparse rating data. In this paper, we propose two imputed neighborhood based collaborative filtering (INCF) algorithms: imputed nearest neighborhood CF (INN-CF) and imputed densest neighborhood CF (IDN-CF). Each first imputes the user rating data using an imputation technique, and then applies a traditional Pearson correlation-based CF algorithm to the imputed data of the most similar neighbors or the densest neighbors to make CF predictions for a specific user. We investigated using an extension of Bayesian multiple imputation (eBMI) and mean imputation (MEI) in these INCF algorithms, and compared them with the commonly used neighborhood based CF, Pearson correlation-based CF, as well as a densest neighborhood based CF. Our empirical results show that IDN-CF using eBMI significantly outperforms its rivals and takes less time to make its best predictions.
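
A minimal sketch of the neighbor-selection idea behind IDN-CF, assuming "densest" means the users with the most observed ratings (the abstract does not define the measure precisely); the selected users' imputed rows would then feed a Pearson correlation-based CF predictor.

```python
import numpy as np

def densest_neighbors(R, k=10):
    """IDN-CF neighbor-selection sketch: pick the k users with the most
    observed ratings, i.e. the densest rows of the matrix (NaN = missing).
    The count-based density measure is an assumption for illustration."""
    density = (~np.isnan(R)).sum(axis=1)
    return np.argsort(density)[::-1][:k]
```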

 
