AAAI 2020

Title

Statistical Machine Learning: Big, Multi-source and Sparse Data with Complex Relations and Dynamics.

Presenters

Trong Dinh Thac Do, Longbing Cao and Jinjin Guo

Goal of Tutorial

The goal of this tutorial is to give both academic and practitioner audiences a comprehensive understanding of, and relevant techniques for, applying advanced statistical methods to complex, practical machine learning problems involving large, sparse and multi-source data, while learning the complex relations and dynamics in the data. We will present a systematic review of the most recent statistical machine learning techniques and their use in real-life applications such as collaborative filtering, network analysis, text analysis, and count data analysis. We also point to opportunities in other applications with large, sparse and dynamic data. We expect the tutorial to interest a broad range of AAAI attendees, including engineers and researchers in machine learning, data mining, recommender systems, network analysis, document analysis, and natural language processing.

After this tutorial, attendees will have preliminary knowledge of statistical learning methods as well as machine learning methods. They will know how to apply statistics and probability theory (i.e., Bayesian graphical models, Bayesian inference, and parametric/nonparametric methods) to machine learning problems (e.g., collaborative filtering, network analysis, and document analysis). They will also learn practical approaches to building sophisticated methods for handling real-life large, sparse and multi-source datasets with complex relations across several applications.

Tutorial Outline

With the explosion of data on the Internet, social networks, finance, and e-commerce websites, modeling large and sparse datasets while exploring the complex relations and dynamics inside the data is in high demand yet challenging. Traditional methods struggle with these real-life datasets because of the intensive computation required. Although several statistical methods have been proposed to handle large amounts of data, they remain inefficient for sparse data because they perform computation on the whole dataset. Yet in many real-life applications, a large share of the data is missing (i.e., the data are sparse). For example, in recommender systems, the Netflix data has 98.8% of its matrix entries missing. In this tutorial, we summarize various statistical methods that are effective and efficient in handling large and sparse datasets. In addition, combining multiple sources of observable data (e.g., users' ratings on items in recommender systems, user friendship or item relations, and user/item metadata) and learning the complex relations within and between these sources, as well as the dynamics of the data, helps with cold-start problems, where we have little or no prior knowledge about a specific element. Accordingly, we focus on introducing our series of designs for tackling these challenges on large, sparse, and multi-source data. These designs perform computation only on non-missing data and are combined with efficient Bayesian inference methods. We will show that in-depth knowledge of statistical methods for large, sparse and multi-source data (taking into account the complex relations within and between multiple sources and the dynamics of the data) creates new opportunities, directions, and means for learning and analyzing complex and practical machine learning problems.
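To make the idea of computing only on non-missing data concrete, here is a minimal sketch that is not taken from the tutorial material. It assumes a plain latent-factor model with a squared-error objective, and every name in it (`observed_squared_error`, `U`, `V`, the synthetic ratings) is hypothetical. The point is only that the cost scales with the number of observed entries rather than with the full matrix size.

```python
import numpy as np

def observed_squared_error(rows, cols, vals, U, V):
    """Sum of squared errors over the observed entries only.

    rows, cols, vals describe the observed ratings as coordinate triples;
    U (users x k) and V (items x k) are hypothetical latent factor matrices.
    """
    # One dot product per observed entry; the full users x items
    # prediction matrix is never formed.
    preds = np.einsum("nk,nk->n", U[rows], V[cols])
    return float(np.sum((vals - preds) ** 2))

rng = np.random.default_rng(0)
n_users, n_items, k = 1000, 500, 8
U = rng.normal(size=(n_users, k))
V = rng.normal(size=(n_items, k))

# Synthetic data: 6000 observed cells out of 500,000 (1.2% observed,
# i.e. 98.8% missing). The values are made up for illustration.
n_obs = 6000
flat = rng.choice(n_users * n_items, size=n_obs, replace=False)
rows, cols = np.divmod(flat, n_items)
vals = rng.integers(1, 6, size=n_obs).astype(float)

sse = observed_squared_error(rows, cols, vals, U, V)
```

In this sketch the work is proportional to `n_obs * k`, whereas a dense evaluation would touch all `n_users * n_items` cells. The models covered in the tutorial may use different likelihoods and inference schemes; only the sparsity argument is illustrated here.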

Prerequisite knowledge

Anyone interested in working with large and sparse datasets from varied domains, for example social networks, internet log file analysis, e-commerce, and finance, will find this tutorial helpful. Although we will briefly introduce the statistical techniques, attendees familiar with statistics and Bayesian theory will find it easier to follow the algorithms and case studies presented.

Further, since this tutorial focuses mainly on data analytics and machine learning techniques, it will be most beneficial for people with basic knowledge of data mining and machine learning. It will also be more accessible to people with some background in the applications covered, such as collaborative filtering, network analysis, document analysis, and count data analysis.

Content

Bayesian Probabilistic Models: preliminaries on Bayesian probabilistic models such as Bayesian networks and some basic distributions are given;
Large Scale Bayesian Inference: we review some techniques for inference in Bayesian probabilistic models, such as Markov Chain Monte Carlo (MCMC) and Variational Inference (VI), and show how these techniques scale with large and sparse data;
Parametric versus Nonparametric: Bayesian nonparametrics is one of the most important techniques for reducing computational time, which makes it applicable to large datasets. We provide a basic introduction to the Dirichlet process, the Gaussian process, and latent feature models;
Overview of Statistical Models for Large and Sparse Data: this part reviews some statistical models for capturing large and sparse data;
Combination of Multiple Sources of Data: we briefly introduce some techniques for combining multiple sources of data such as rating data on recommender systems, user/item networks, item relations, and user/item metadata;
Static vs Dynamic: we discuss how to capture the complex relations inside both static and dynamic data using statistical models;
Statistical Models for Collaborative Filtering: statistical algorithms and case studies are provided for capturing large, sparse and multi-source datasets in Collaborative Filtering (CF);
Statistical Models for Relational Network Analysis: statistical algorithms and case studies are provided for capturing large, sparse and multi-source datasets in Network Analysis;
Statistical Models for Document Analysis: statistical algorithms and case studies are provided for capturing large, sparse and multi-source datasets in Document Analysis;
Statistical Models for Count Data: statistical algorithms and case studies are provided for capturing large, sparse and multi-source datasets for analysis of count data;
Challenges and Prospects: open issues and potential directions are discussed for large, sparse, and multi-source data, together with other applications.
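
As a small illustration of the Dirichlet process named in the Parametric versus Nonparametric item, the sketch below samples from its Chinese restaurant process representation. It is not part of the tutorial material; the function name `sample_crp` and the concentration value `alpha=1.0` are assumptions chosen for the example. It shows how the number of clusters is not fixed in advance but grows with the data.

```python
import numpy as np

def sample_crp(n_customers, alpha, rng):
    """Draw table assignments from a Chinese restaurant process.

    Assumes n_customers >= 1 and concentration alpha > 0.
    """
    assignments = [0]  # the first customer opens table 0
    counts = [1]       # number of customers seated at each table
    for _ in range(1, n_customers):
        # An existing table k is chosen with weight counts[k];
        # a new table is opened with weight alpha.
        weights = np.array(counts + [alpha], dtype=float)
        table = int(rng.choice(len(weights), p=weights / weights.sum()))
        if table == len(counts):
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

rng = np.random.default_rng(0)
assignments, counts = sample_crp(200, alpha=1.0, rng=rng)
print(len(counts), "tables for", len(assignments), "customers")
```

Larger values of `alpha` tend to open more tables, which is the sense in which the model complexity adapts to the data rather than being set as a fixed parameter.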

Slides

Will be further updated

AAAI-20_Tutorial.pdf
