Title
Statistical Machine Learning: Big, Multi-source and Sparse Data with Complex Relations and Dynamics.
Presenters
Trong Dinh Thac Do, Longbing Cao and Jinjin Guo
Goal of Tutorial
The goal of this tutorial is to equip both academic and industry audiences with a comprehensive understanding of, and relevant techniques for, applying advanced statistical methods to complex, practical machine learning problems involving large, sparse and multi-source data, while learning the complex relations and dynamics in the data. We will present a systematic review of the most recent statistical machine learning techniques and their use in real-life applications such as collaborative filtering, network analysis, text analysis, and count data analysis. We will also point to opportunities in other applications with large, sparse and dynamic data. The tutorial is expected to interest a large share of the AAAI audience, including engineers and researchers in machine learning, data mining, recommender systems, network analysis, document analysis, and natural language processing.
After this tutorial, attendees will have foundational knowledge of statistical learning methods as well as machine learning methods. They will know how to apply statistics and probability theory (namely Bayesian graphical models, Bayesian inference, and parametric/nonparametric methods) to machine learning models (e.g., collaborative filtering, network analysis, and document analysis). They will also learn practical approaches to building sophisticated methods that handle real-life large, sparse and multi-source datasets with complex relations across several applications.
Tutorial Outline
With the explosion of data on the Internet, social networks, finance, and e-commerce websites, modeling large and sparse datasets while exploring the complex relations and dynamics inside the data is in high demand yet challenging. Traditional methods struggle with these real-life datasets because of the intensive mathematical computation required. Although several statistical methods have been proposed to handle large amounts of data, they remain inefficient for sparse data because they perform computation on the whole dataset, whereas in many real-life applications a large proportion of the data is missing (i.e., the data is sparse). In recommender systems, for example, the Netflix data has 98.8% of the matrix entries missing. In this tutorial, we summarize various statistical methods that are effective and efficient in handling large and sparse datasets.
In addition, combining observable data from multiple sources (e.g., users’ ratings on items in recommender systems, user friendships or item relations, and user/item metadata) and learning the complex relations within and between those sources, together with the dynamics of the data, helps deal with cold-start problems, where we have little or no prior knowledge about a specific element. Accordingly, we focus on introducing our series of designs for tackling these challenges on large, sparse, and multi-source data. These designs perform computation only on the non-missing data and are combined with efficient Bayesian inference methods. We will show that in-depth knowledge of statistical methods for large, sparse and multi-source data (taking into account the complex relations within and between multiple sources and the dynamics of the data) creates new opportunities, directions, and means for learning and analyzing complex and practical machine learning problems.
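To make the idea of computing only on non-missing data concrete, here is a minimal Python sketch. It is not one of the presenters' models, which rely on Bayesian inference; plain regularized matrix factorization trained with stochastic gradient descent is swapped in as a simple stand-in. The function names, the synthetic data, and all hyperparameters are assumptions chosen for illustration. The point is that each pass costs time proportional to the number of observed entries rather than to the full matrix size.

```python
import numpy as np

def observed_sgd_epoch(rows, cols, vals, U, V, lr=0.02, reg=0.01):
    """One SGD pass that visits only the observed (row, col, value) triples."""
    for i, j, r in zip(rows, cols, vals):
        u_i = U[i].copy()                    # keep the pre-update user factors
        err = r - u_i @ V[j]                 # residual on this observed entry
        U[i] += lr * (err * V[j] - reg * u_i)
        V[j] += lr * (err * u_i - reg * V[j])

def observed_rmse(rows, cols, vals, U, V):
    """RMSE over the observed entries only; missing cells are never touched."""
    pred = np.einsum("ik,ik->i", U[rows], V[cols])
    return float(np.sqrt(np.mean((vals - pred) ** 2)))

# Hypothetical synthetic data: a 200 x 300 matrix with about 5% of cells observed.
rng = np.random.default_rng(0)
n_users, n_items, k, n_obs = 200, 300, 3, 3000
true_U = rng.normal(size=(n_users, k))
true_V = rng.normal(size=(n_items, k))
rows = rng.integers(0, n_users, size=n_obs)  # repeated cells are harmless here
cols = rng.integers(0, n_items, size=n_obs)
vals = np.einsum("ik,ik->i", true_U[rows], true_V[cols])
vals = vals + 0.1 * rng.normal(size=n_obs)

# Small random initialization of the latent factors (an arbitrary choice).
U = 0.3 * rng.normal(size=(n_users, k))
V = 0.3 * rng.normal(size=(n_items, k))

rmse_before = observed_rmse(rows, cols, vals, U, V)
for _ in range(40):
    observed_sgd_epoch(rows, cols, vals, U, V)
rmse_after = observed_rmse(rows, cols, vals, U, V)
print(f"observed-entry RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
```

Storing the data as coordinate triples means the dense 200 x 300 matrix is never materialized, which is what keeps this style of computation feasible when almost all entries are missing.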
Prerequisite knowledge
Anyone interested in combining large and sparse datasets in varied domains, for example, social networks, Internet log file analysis, e-commerce, and finance, will find this tutorial helpful. Although we will briefly introduce the statistical techniques, people who are familiar with statistics and Bayesian theory will find it easier to understand the algorithms and case studies introduced in this tutorial.
Further, since this tutorial mainly focuses on data analytics and machine learning techniques, it will be most beneficial for people with basic knowledge of data mining and machine learning. It will also be more accessible to people with a background in the applications covered in this tutorial, such as collaborative filtering, network analysis, document analysis, and count data analysis.
Content
Slides
Will be further updated