Clustering

This toolbox is dedicated to the (probabilistic) clustering/classification of multivariate data.

The toolbox works with two qualitatively distinct types of data: binary data, handled with mixtures of binomials (MoB), and continuous data, handled with mixtures of Gaussians (MoG).

Example of an estimated partition (clustering/classification) of multivariate (here, 2D) data.

  • Download and install

The toolbox can be downloaded here.

It requires at least MATLAB 7.1 to run.

Add the root folder to your MATLAB path, as well as the directory '/sampling'.
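
For example, assuming the toolbox archive was unzipped into '/home/me/matlab/clustering' (a hypothetical location: adapt it to your own setup), this amounts to:

  toolboxRoot = '/home/me/matlab/clustering';  % hypothetical install location
  addpath(toolboxRoot);                        % root folder (MoG/MoB inversion routines)
  addpath(fullfile(toolboxRoot,'sampling'));   % sampling routines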

  • Getting started

Any Bayesian data analysis relies upon a generative model, i.e. a probabilistic description of the mechanisms by which observed data are generated. This toolbox has been developed to deal with a certain class of generative models, namely mixtures of Gaussian (MoG) and binomial (MoB) densities, which are described below.

The root folder contains the functions that deal with MoG and MoB data (see below), and the folder '/sampling' contains functions for drawing samples from the most common (exponential family) probability density functions (NB: these are freely available on the internet).

Run 'demo_GMM' in the MATLAB command window.

This will generate a data matrix, which is structured into a given number of clusters.

The clustering algorithm will then try to disclose these clusters from the data.

A graphical output summarises the results, in comparison to the simulated data:

2D projection of the simulated data (onto the first two eigenmodes).

First-order moments of the Gaussian components and data labels (simulated and estimated).

The script 'demo_BMM' runs a similar demo with binary data.

Editing these scripts might prove useful to understand the I/O of the code :)

Simulated data (the y-axis depicts the data dimensions)

First-order moments of the binomial components and data labels (simulated and estimated).

  • The MoG and MoB probabilistic models of data

The MoG model has been widely used in the machine learning and statistics communities. It assumes that the observed data y is partitioned into K classes or subsets of points, each of which is well described by a multivariate Gaussian density. More formally, the MoG generative model assumes that n label vectors are sampled from a multinomial density, whose class frequencies are themselves drawn from a Dirichlet density. These label vectors have dimension K and a single non-zero entry, which indicates the class each data point belongs to. These so-called labels then switch on and off a set of K multivariate Gaussian densities with different first- and second-order moments, from which the observed data themselves are sampled. If these densities are sufficiently different, then the overall dataset exhibits a clustering structure when projected onto an appropriate subspace.

NB: any density over continuous data can be approximated by a MoG, given a sufficient number of components or classes.
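
For intuition, here is a minimal MATLAB sketch of this generative process (an illustration only, not toolbox code; for simplicity, the class frequencies are fixed here rather than drawn from a Dirichlet density, and the covariances are set to identity):

  n = 200; K = 3; p = 2;               % number of points, classes and data dimensions
  pr = [0.3 0.3 0.4];                  % class frequencies
  mu = 4*randn(p,K);                   % first-order moments of the K Gaussian components
  labels = zeros(K,n);                 % 1-of-K label vectors
  y = zeros(p,n);                      % observed data
  for i = 1:n
      k = find(rand < cumsum(pr), 1);  % sample the label of point i
      labels(k,i) = 1;
      y(:,i) = mu(:,k) + randn(p,1);   % sample point i from the k-th Gaussian
  end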

The function 'VBEM_GM.m' inverts the MoG model using a variational Bayesian scheme. Given the data y and the maximum number of classes K, it estimates the most plausible number of classes, the labels, and the first- and second-order moments of each Gaussian density. In addition, it returns the model evidence, which can be useful for model comparison purposes.
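
A call might look like the following sketch (NB: the argument names and outputs below are assumptions made for illustration; check the header of 'VBEM_GM.m' for its actual interface):

  K = 8;                                  % maximum number of classes
  [estLabels, theta, F] = VBEM_GM(y, K);  % assumed outputs: label probabilities, component moments, model evidence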

Now suppose you design an experiment in which n subjects are asked a number of yes/no questions from a questionnaire. Suppose these subjects are grouped into K categories, which are defined in terms of how likely their members are to answer 'yes' to each of the questions. You have just defined a MoB model, which you can use to disclose the categories of subjects from their profiles of responses to the questionnaire. This is what the function 'MixtureOfBinomials.m' does, using either a Gibbs sampling algorithm or a variational Bayesian algorithm. In both cases, the function returns the data labels, the first-order moment profile of each category and the model evidence. NB: in the MoB case, the inversion schemes cannot automatically eliminate unnecessary classes from the model. Therefore, (Bayesian) model comparison is mandatory to estimate the number of classes in the data, if it is unknown to the user.
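
As above, a minimal MATLAB sketch of the corresponding generative process for the questionnaire example (an illustration only, not toolbox code) could read:

  n = 100; K = 2; q = 10;              % subjects, categories, questions
  g = rand(q,K);                       % probability of a 'yes' answer to each question, per category
  c = (rand(1,n) < 0.5) + 1;           % category of each subject (two equiprobable categories here)
  y = zeros(q,n);                      % binary response profiles
  for i = 1:n
      y(:,i) = rand(q,1) < g(:,c(i));  % sample the yes/no answers of subject i
  end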


  • Controlling the clustering/classification accuracy

The variational Bayesian schemes that are used to finesse the inversion of both the MoG and the MoB generative models do not require knowledge of the data labels to learn the structure of the models (i.e. the sufficient statistics of the components). This means one is not forced to use so-called train/test strategies, which are ubiquitous in standard supervised classification approaches. However, there is no harm in measuring the accuracy of the model inversion by using the conditional predictive density of a new data point whose label is known. This can be done in a systematic fashion, in order to derive an independent measure of generalisation error.
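
For instance, when the true labels of some (simulated or held-out) data points are known, the classification accuracy can be quantified along the following lines (a sketch: 'estLabels' and 'labels' are assumed to be K-by-n matrices of estimated and true label probabilities, and class indices may have to be permuted so that estimated and true classes match):

  [tmp, kEst]  = max(estLabels, [], 1);  % hard assignment to the most probable class
  [tmp, kTrue] = max(labels, [], 1);
  accuracy = mean(kEst == kTrue);        % fraction of correctly classified points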

As with most classifiers, these schemes are highly sensitive to the so-called 'feature space', i.e. the user-specified properties of the raw experimental measures. For example, working with time series in the time domain or in the frequency domain might not give similar results in terms of the optimal partition. It is 'good' to use a feature space that is as low-dimensional as possible, without discarding the necessary discriminant dimensions. In brief, the art is in choosing the feature space appropriately, i.e. in accordance with expert domain-specific knowledge.

As a rule of thumb, choose a feature space of dimension approximately equal to the expected number of classes.

NB: the function 'PCA_MoG.m' can be used to eyeball the structure of the data, when projected onto its first three eigenvectors. This can also be used to initialise the above VB clustering algorithms. Finally, it can be used to reduce the dimensionality of the data before sending it to the classifier.
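
If one prefers to stay with standard MATLAB, a similar eigen-projection can be obtained as follows (a sketch, assuming the data matrix y is of dimension p-by-n with p >= 3):

  yc = y - repmat(mean(y,2), 1, size(y,2));       % centre the data
  [U, S] = svd(yc, 'econ');                       % eigenmodes of the centred data
  yProj = U(:,1:3)' * yc;                         % projection onto the first three eigenvectors
  plot3(yProj(1,:), yProj(2,:), yProj(3,:), '.')  % 3D scatter plot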