Expectation-Maximization (EM) with Gaussian Mixture Models (GMM) can also be used to cluster data. A dataset with any number of features can be represented by a combination of Gaussian distributions, and the number of Gaussian distributions in the mixture equals the number of clusters in the dataset.
Expectation-Maximization is used to estimate the means, variances, and mixing weights of the Gaussian mixture. In the process, the algorithm computes the probability that each data point belongs to each Gaussian distribution. Since each Gaussian distribution represents a cluster, EM therefore provides the probability that a data point belongs to a given cluster.
The Bayesian version of the EM-GMM places a posterior probability distribution over the mixture weights and uses it to infer the effective number of components from the data. The number of components actually used by the model may therefore be less than the maximum value input by the user: the weights of components not needed to explain the data are driven to nearly 0.
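A minimal sketch of this behaviour using scikit-learn's BayesianGaussianMixture is shown below; the synthetic data and parameter values are illustrative only, not taken from the project. The model is allowed up to 10 components, but the unneeded ones end up with mixing weights close to zero.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Two well-separated blobs, but the model is allowed up to 10 components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(200, 2))])

bgmm = BayesianGaussianMixture(
    n_components=10,                                     # upper bound set by the user
    weight_concentration_prior_type="dirichlet_process",  # lets the data prune components
    max_iter=1000,
).fit(X)

# Components that are not needed receive weights close to 0,
# so the number of "effective" clusters is inferred from the data.
print(np.round(bgmm.weights_, 3))
print("effective components:", np.sum(bgmm.weights_ > 0.01))
```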
The workflow for this portion of the project draws heavily from the following:
https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/
https://scikit-learn.org/stable/modules/mixture.html#bgmm
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html
Caller script that submits a job for each file
The job submission arguments are the Python script, the dataset file, and a counter that is used to access the appropriate elements of arrays within the Python script.
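A minimal sketch of such a caller is shown below. The script name, dataset filenames, and job-script name are placeholders, not the project's actual files, and the loop assumes sbatch is available on the path. It submits one Slurm job per dataset file, forwarding the Python script, the dataset file, and the counter.

```python
import subprocess

# Placeholder names; the real analysis script, job script, and datasets differ.
python_script = "em_bgmm.py"
datasets = ["dataset_1.csv", "dataset_2.csv", "dataset_3.csv", "dataset_4.csv"]

# Submit one Slurm job per dataset file. The arguments forwarded to the
# job script are the Python script, the dataset file, and a counter that
# the Python script uses to index its own arrays.
for counter, dataset in enumerate(datasets):
    subprocess.run(
        ["sbatch", "run_em_bgmm.sbatch", python_script, dataset, str(counter)],
        check=True,
    )
```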
SBATCH set-up
Import packages
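The project's exact imports are not reproduced here; the snippets under this and the following headings form one illustrative script, assumed rather than copied from the project. A typical set of imports for this workflow would be:

```python
import sys

import numpy as np
import matplotlib
matplotlib.use("Agg")            # headless back end for batch (Slurm) jobs
import matplotlib.pyplot as plt
from sklearn.mixture import BayesianGaussianMixture
```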
Setup inputs and lists
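Continuing the sketch above, the command-line arguments described for the caller could be consumed like this; the list contents, array names, and file format (comma-separated numeric data) are assumptions.

```python
# Command-line arguments: the dataset file and the counter passed by the caller.
data_file = sys.argv[1]
idx = int(sys.argv[2])

# Per-dataset settings indexed by the counter (placeholder values).
plot_titles = ["dataset 1", "dataset 2", "dataset 3", "dataset 4"]
max_components = [10, 10, 10, 10]

# Load the dataset (assumed here to be a comma-separated numeric file).
X = np.loadtxt(data_file, delimiter=",")
```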
Function for plotting results of EM-BGMM
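A sketch of a plotting helper, continuing the hypothetical script above; the styling, axis labels, and output handling are assumptions rather than the project's code.

```python
def plot_clusters(X, labels, title, out_file):
    """Scatter the first two features, coloured by cluster label."""
    fig, ax = plt.subplots()
    scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, s=10, cmap="viridis")
    ax.set_xlabel("feature 1")
    ax.set_ylabel("feature 2")
    ax.set_title(title)
    fig.colorbar(scatter, ax=ax, label="cluster")
    fig.savefig(out_file, dpi=150)
    plt.close(fig)
```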
Develop EM-BGMM model
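A minimal version of the model-fitting step, with illustrative parameter values; the actual settings used in the project may differ.

```python
# Allow more components than expected; the Dirichlet-process prior on the
# mixture weights drives the weights of unneeded components toward zero.
bgmm = BayesianGaussianMixture(
    n_components=max_components[idx],
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
)
bgmm.fit(X)

# Hard cluster assignment for each data point (most probable component).
labels = bgmm.predict(X)
```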
Plot data points labeled by EM-BGMM model
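Using the plotting helper sketched above; the output filename is a placeholder.

```python
plot_clusters(X, labels, plot_titles[idx], f"bgmm_clusters_{idx}.png")
```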
Examine probability of data given the EM-BGMM model
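The per-point posterior probabilities of each component can be summarised as an average and a standard deviation per component. The sketch below shows one way to compute these quantities; the project's actual plotting code is not reproduced.

```python
# Posterior probability of each component for every data point
# (an n_samples x n_components array).
resp = bgmm.predict_proba(X)

# Average posterior probability and its standard deviation per component;
# components with near-zero averages are effectively unused.
mean_post = resp.mean(axis=0)
std_post = resp.std(axis=0)

for k, (m, s) in enumerate(zip(mean_post, std_post)):
    print(f"component {k}: mean posterior = {m:.3f}, std = {s:.3f}")
```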
The EM-BGMM model indicates that each of the datasets contains one cluster.
The average posterior probability of each component of the EM-BGMM model given the data, along with the associated standard deviations, is shown in the plots below.
Plots (one pair per dataset): average posterior probability of each component, and standard deviation of the posterior probabilities.