A Submodular Optimization Framework for Data, Feature and Topic Summarization

ECAI 2020 Tutorial

Rishabh Iyer and Ganesh Ramakrishnan


This tutorial will address several applications of data selection and summarization, including image and video summarization, data subset selection, feature selection, selection problems over graphs and rules, and data selection in active learning and meta-learning scenarios. We shall view all these problems through the lens of combinatorial optimization, and in particular submodular optimization, which has received a lot of attention in machine learning, robotics, game theory and social networks.

We shall study a rich class of submodular functions and corresponding optimization and learning algorithms, and in particular study their utility in a) modeling aspects such as representation, coverage, and diversity for summarization problems, b) modeling the information content of data and feature subsets for data subset selection, feature selection and active learning, and c) defining a class of sparsity-inducing norms via combinatorial functions for feature and rule selection.
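As a concrete illustration of the kind of function studied here, the sketch below (not taken from the tutorial itself; all names and the toy similarity matrix are illustrative) implements the facility-location function, a classic submodular model of representation, and checks its defining diminishing-returns property: the marginal gain of adding an element can only shrink as the context set grows.

```python
import numpy as np

def facility_location(sim, S):
    """f(S) = sum_i max_{j in S} sim[i, j], with f(empty set) = 0.

    sim is an (n x n) non-negative similarity matrix; S is a set of
    column indices. f(S) measures how well S "represents" every item.
    """
    if not S:
        return 0.0
    return float(np.sum(np.max(sim[:, sorted(S)], axis=1)))

def marginal_gain(sim, S, v):
    """Gain of adding element v to the set S."""
    return facility_location(sim, S | {v}) - facility_location(sim, S)

if __name__ == "__main__":
    # Toy data: 6 random points; similarity = max distance minus distance.
    rng = np.random.default_rng(0)
    X = rng.random((6, 2))
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    sim = d.max() - d
    A, B = {0}, {0, 1, 2}  # A is a subset of B
    # Submodularity: gain of adding v=4 to the smaller set is at least
    # as large as the gain of adding it to the larger set.
    assert marginal_gain(sim, A, 4) >= marginal_gain(sim, B, 4) - 1e-9
```

The same diminishing-returns inequality holds for any non-negative similarity matrix and any nested pair of sets, which is what makes facility location a natural score for selecting representative summaries.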

Motivation and Outline of this Tutorial

Data from multiple modalities, in the form of images, videos, live streams, unstructured text and knowledge graphs, has been growing at an unprecedented rate in the last few years. While this massive data is a blessing to data science, helping improve predictive accuracy, it is also a curse, since humans are unable to consume it all. Machine-generated videos (from drones, dash-cams, body-cams, security cameras, GoPros, etc.) as well as textual feeds (news sites, social media, etc.) are being produced at a rate far higher than we as humans can process, and the majority of this data is plagued with redundancy. Given this data explosion, machine learning techniques which automatically understand, organize and categorize this data and its features are of utmost importance. These summarization problems can best be described as obtaining a highlight of the most critical and important events in a set of items. The set of items could be (i) a collection of videos, images, text documents or a dataset, (ii) a hierarchy of concepts, or (iii) a lattice of rules or features. The highlight could (a) help a user obtain a quick glimpse of the data set in a short amount of time, (b) help identify subsets of features or rules that are characteristic of the data, and (c) help identify subsets of concepts or labels from a (potentially large, hierarchically organized) set of labels. Below are specific examples.

  1. Create a summary containing the highlights (best moments/diverse set of events) in an image collection, video or large text corpus.

  2. Create data summaries for training machine learning models. Datasets today are growing rapidly, creating the need for large, expensive GPU clusters and leading to longer experimental turnaround times. Moreover, labeling these large datasets is becoming increasingly expensive and time consuming. The "highlight" here refers to the most critical instances in the dataset, selected to reduce training time and labeling effort.

  3. Identifying feature/rule subsets for better generalization. The features/rules could come from a potentially exponentially large space, which is bound to be fraught with redundancies. Some features could be overly specific, leading to overfitting during training, whereas others could be over-general. Further, complex features (especially features in higher-order logic) could be expensive to compute, and the model can be made more run-time efficient by using them sparingly.

  4. Several real-world machine learning applications involve hierarchy-based categorization of topics for a set of objects. Objects could be a set of documents for text classification, a set of genes in functional genomics, or a set of images in computer vision. One can often define a natural topic hierarchy to categorize these objects. For example, in text and image classification problems, each document or image is assigned a hierarchy of labels. Moreover, many of these applications naturally come with an existing topic hierarchy over the entire set of objects. Given such a hierarchy and a subset of objects, we present approaches to the problem of finding the subset of DAG-structured topics induced by that subset of objects. We also present approaches to discovering such topic clusters in the absence of a topic hierarchy.

Motivated by these applications, we shall explore a unified combinatorial approach to modeling the above problems. In particular, we shall study: a) properties and examples of submodular functions and their modeling aspects, b) the role of submodular functions in summarization, why they are a natural fit, and the problem of learning submodular functions for summarization, c) data selection for a rich class of machine learning models and its relation to submodularity, along with extensions of data selection to semi-supervised, active and meta learning, d) submodularity in feature selection, and e) a rich class of sparsity-inducing norms defined via combinatorial functions and how they can be leveraged for feature, rule and attribute selection.
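A recurring workhorse behind these applications is maximizing a monotone submodular function under a cardinality budget. The sketch below (illustrative, not taken from the tutorial; the toy coverage function and all names are assumptions) shows the standard greedy algorithm, which for monotone submodular objectives is known to achieve at least a (1 - 1/e) fraction of the optimal value (Nemhauser et al., 1978).

```python
def greedy_maximize(f, ground_set, k):
    """Greedily pick k elements, each time adding the one with the
    largest marginal gain f(S + v) - f(S)."""
    S = set()
    for _ in range(k):
        best_v, best_gain = None, float("-inf")
        for v in ground_set - S:
            gain = f(S | {v}) - f(S)
            if gain > best_gain:
                best_v, best_gain = v, gain
        S.add(best_v)
    return S

if __name__ == "__main__":
    # Toy coverage objective: each element "covers" some items, and
    # f(S) counts the distinct items covered -- monotone and submodular.
    coverage = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 2, 3}}
    f = lambda S: len(set().union(*(coverage[v] for v in S))) if S else 0
    summary = greedy_maximize(f, set(coverage), k=2)
    # Greedy first picks 3 (gain 3), then 2 (gain 1), covering all 4 items.
    assert f(summary) == 4
```

In practice one would replace the toy coverage objective with a summarization model such as facility location, and use lazy/accelerated greedy variants to scale to large ground sets.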

Topics Covered in this Tutorial

This tutorial shall cover the following topics in detail, with this tentative structure:

  1. Motivation for the problems of data summarization (15 mins)

  2. Combinatorial Models for Summarization (30 mins)

  3. Submodular Optimization Algorithms (30 mins)

  4. Learning Submodular Functions (15 mins)

  5. Data Subset Selection and Active Learning (30 mins)

  6. Feature Selection and Model Compression (20 mins)

  7. Video/Image Collection/Text Summarization (40 mins)

Bio and Speaker Details

Bio of Rishabh Iyer: Rishabh Iyer is currently an Assistant Professor at the University of Texas at Dallas. Prior to this, he was a Research Scientist at Microsoft, where he worked on several problems around online learning, contextual bandits, reinforcement learning, etc. He finished his postdoc and Ph.D. at the University of Washington, Seattle. His work has received best paper awards at ICML and NIPS in 2013. He also won the Microsoft Ph.D. Fellowship, a Facebook Ph.D. Fellowship, and the Yang Outstanding Doctoral Student Award from the University of Washington. He completed his B.Tech. at IIT Bombay in 2011, and has been a visitor at Microsoft Research, Redmond and Simon Fraser University. He has worked on several aspects of machine learning, including discrete and convex optimization, deep learning, video/image summarization, data subset selection, active learning and online learning. He has applied his work in several domains including search advertisement, computer vision, text classification and speech.

Website: https://sites.google.com/view/rishabhiyer/

Email: rishabh.iyer@utdallas.edu

Bio of Prof. Ganesh Ramakrishnan: Ganesh is currently serving as a Professor at the Department of Computer Science and Engineering (CSE), IIT Bombay. After completing his B.Tech. in the year 2000 at the Department of CSE, IIT Bombay, he decided to remain in India and contribute. He completed his Ph.D., also at the Department of CSE, IIT Bombay, worked at IBM India Research Labs, and thereafter rejoined IIT Bombay as a faculty member of the Department of CSE in 2009. His areas of research include human-assisted machine learning, symbolic representation in machine learning, and computational linguistics in Indian languages. He has received awards such as the IBM Faculty Award, the Microsoft Research Award, the Yahoo! Research Award, and the IIT Bombay Impactful Research Award. He has also held the J.R. Isaac Chair at IIT Bombay. Ganesh is very passionate about boosting the AI research ecosystem in India; toward that end, research by him, his students and his collaborators has resulted in a couple of startups which he continues to mentor.

Website: https://www.cse.iitb.ac.in/~ganesh/

Email: ganesh@cse.iitb.ac.in