Presenters:
Trong Dinh Thac Do and Longbing Cao
Abstract:
With the explosion of data on the Internet, in social networks, finance, and e-commerce, modeling large and sparse datasets is in high demand yet challenging. Traditional methods struggle to handle these real-life datasets because of the intensive mathematical computation they require. Although several statistical methods have been proposed for handling large amounts of data, they remain inefficient on sparse data because they perform computation over the whole dataset, while in many real-life applications a large proportion of the data is missing (i.e., the data is sparse). In recommender systems, for example, 98.8% of the entries of the Netflix rating matrix are missing. In this tutorial, we summarize various statistical methods that are effective and efficient in handling large and sparse datasets.
On the other hand, combining multiple sources of observable data (e.g., users' ratings on items in recommender systems, user friendships or item relations, and user/item metadata) helps to address cold-start problems, where we have no or limited prior knowledge about a specific user or item. Accordingly, we focus on introducing our series of designs for tackling these challenges on large, sparse, and multi-source data. These designs perform computation only on the non-missing data and combine it with efficient Bayesian inference methods. We will show that in-depth knowledge of statistical methods for large, sparse, and multi-source data creates new opportunities, directions, and means for learning and analyzing complex, practical machine learning problems.
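To make the core idea concrete, below is a minimal Python sketch (our own illustration, not the exact algorithms covered in the tutorial) of factorizing a sparse rating matrix by iterating only over the observed entries, so the cost scales with the number of ratings rather than with the full matrix size. All function names and hyperparameters here are illustrative assumptions.

# Minimal sketch: matrix factorization by SGD over observed entries only.
# Missing cells are never touched, so the per-epoch cost is O(#ratings * k),
# not O(n_users * n_items * k). Names and hyperparameters are illustrative.
import numpy as np

def factorize_observed(rows, cols, vals, n_users, n_items,
                       k=8, lr=0.01, reg=0.05, epochs=30, seed=0):
    """Fit user/item latent factors from (row, col, value) triples only."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
    V = 0.1 * rng.standard_normal((n_items, k))   # item latent factors
    for _ in range(epochs):
        for u, i, r in zip(rows, cols, vals):     # loop skips all missing cells
            err = r - U[u] @ V[i]                 # prediction error on one rating
            Uu = U[u].copy()                      # keep old value for V's update
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * Uu - reg * V[i])
    return U, V

# Toy usage: a 4-user x 5-item matrix with only 6 of 20 entries observed.
rows = np.array([0, 0, 1, 2, 3, 3])
cols = np.array([0, 2, 1, 4, 0, 3])
vals = np.array([5.0, 3.0, 4.0, 2.0, 1.0, 4.0])
U, V = factorize_observed(rows, cols, vals, n_users=4, n_items=5)
print("Predicted rating for user 1, item 4:", U[1] @ V[4])

The same principle of restricting computation to observed entries carries over to the Bayesian treatments discussed in the tutorial, where point estimates are replaced by posterior inference over the latent factors.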
Slides: