Declarative Formulations for Machine Learning and Data Mining
Ian Davidson, Department of Computer Science, University of California/Davis, USA
Machine learning and data mining have a long history dating back to the 1950s. Over the decades the underlying mathematical formulations of the problems and their solutions have evolved. In the beginning most approaches were heuristic in nature, then were formalized as probabilistic parameter estimation and later as mathematical programming. In this talk I'll describe the history of different ML/DM formulations albeit briefly. I'll then talk about new formulations using constraint programming, SAT and Satisfiability Modulo Theory (SMT) with a careful emphasis on why they are even needed. To give the talk depth and provide some examples I will focus on unsupervised or weakly supervised learning mostly but touch upon other areas. I'll provide applications and examples with joint work with Google, Yahoo! and collaborators in Neuroscience.
Structuring temporal sparse data with application to opinion mining in the social media
Julien Velcin, ERIC, University Lyon 2
All the messages posted on the social media reflect only partially user's opinions. To gather those traces disseminated throughout the Web, evolutionary clustering techniques look eminently promising. In this talk, I will present two possible probabilistic models that address this issue. Our proposals extend the classic multinomial mixture for dealing with the temporal dimensions so that we can capture opinions over time. I will illustrate the two models with recent experiments performed within the ImagiWeb project that aims at studying the image (representation) of entities populating the social Web. I will more specifically use the image of two French politicians during the last presidential elections as a case study.
Discovery multiple clusterings
Marcilio de Souto, Guillaume Cleuziou, LIFO, University of Orléans
Currently, more and more data are collected from multiple sources or represented by multiple views (e.g., text, video, images, biological data, among others). In this context, clustering techniques are often required to be able to provide several possibilities for analyzing the data. As a consequence, in recent years, the interdisciplinary research topic on Multiple Clusterings has drawn significant attention of the data mining community. For example, there has been an increasing interest in discovering multiple clustering solutions from very high dimensional and complex databases. Likewise, there have been a great interest in searching for a consensus among clusterings representing different views of the data. In this talk, we will approach problems in areas such as cluster ensemble, multi-objective clustering and multi-view clustering.
A unified declarative framework for constrained clustering
Thi-Bich-Hanh Dao, Khanh-Chuong Duong, Willy Lesaint, Christel Vrain, LIFO, University of Orléans
In the last decade, several works have shown the interest of declarative frameworks for Constrained Data Mining. In this context, we have proposed a framework based on Constraint Programming for relational constrained clustering. This framework offers combinations of different kinds of user-defined constraints and a choice among different optimization criteria. The proposed optimization criteria are: minimizing the maximal cluster diameter, maximizing the minimal split between clusters or minimizing the within-cluster sum of dissimilarities. In this talk, we present a twofold extension of the framework. First, we extend the framework to integrate the most popular optimization criterion in relational clustering, which is minimizing the within-cluster sum of squares. We develop a global constraint exploiting this criterion and empirically show that our approach outperforms the existent exact approach based on Integer Linear Programming and Column Generation and allowing the integration of user-constraints. Second, we extend the framework to unify conceptual and relational constrained clustering. The unified framework offers the possibility of taking into account optimization criterion and user constraints issued from either conceptual or relational clustering.
A multiplex-network based approach for clustering ensemble selection
Parisa Rastin, Rushed Kanawati, LIPN, University Paris 13
Performance of cluster ensemble approaches is now known to be tightly related to both quality and diversity of input base clusterings. Cluster ensemble selection (CES) refers to the process of filtering the raw set of base clusterings in order to select a subset of high quality and diverse clusterings. Most of existing CES approaches apply one index for measuring the quality and another for evaluating the diversity of clusterings. Moreover the number of clusterings to select is usually given as an input to the CES function. In this work we propose a new CES approach that allow taking into account an ensemble of quality and diversity indexes. In addition, the proposed approach computes automatically the number of clusterings to return. The basic idea is to define a multiplex network over the provided set of base clusterings. Each slice in the multiplex network is obtained by defining a proximity-graph over the set of base clusterings using a given clustering dissimilarity index. A community detec- tion algorithm is applied to the obtained multiplex network. We then rank clusterings in each community applying an ensemble- ranking approach using different (internal) clustering quality indexes. From each community we select the base clustering ranked at the top. First experiments on benchmark datasets shows the effectiveness of the proposed CES approach.
Graph enhancement via clustering: application to Gene Regulatory Network inference
Aurélie Pirayre, Camille Couprie, Laurent Duval, Jean-Christophe Pesquet, IFP Energies nouvelles - LIGM, University Paris East
In this work, we introduce a methodology to couple network inference and clustering in a Gene Regulatory Network (GRN) context. Such networks are useful to visualize gene interactions occurring in living organisms: some genes regulate the expression of others, structuring the network into modules where they play a central role. Our approach consists in jointly inferring the graph and performing a clustering using the graph-Laplacian-based random walker algorithm. We validate our approach on the DREAM4 dataset, showing significant improvement over state-of-the-art GRN inference methods.
Toward an efficient one pass clustering algoirthm
Nicolas Labroche, LI, University of Tours
This presentation describes novel experimental results for the Bounded One Pass (BOP) clustering algorithm. BOP relies on a cluster race mechanism paired with bounds or heuristic criterion to minimize the number of comparisons for the assignment of new data to existing clusters. Until now, experiments have mainly shown to which extent each bound or heuristic can accelerate the clustering process in an ideal case, without considering the propagation of previous assignment errors. We describe here realistic experiments where the clustering quality of the algorithm is evaluated on several real data sets with Rand index measures and with errors propagation. New results show that Student bound leads to the best trade-off between efficiency and clustering quality.