Expectation-Maximization (EM) with Gaussian Mixture Models (GMM) can also be used to cluster data. A dataset with any number of features can be represented by a combination of Gaussian distributions, and the number of Gaussian distributions in the mixture equals the number of clusters in the dataset.
Expectation-Maximization is used to estimate the means, variances, and mixing weights of the Gaussian mixture. In the process, the algorithm computes the probability that each data point belongs to each Gaussian distribution. Since each Gaussian distribution represents a cluster, EM therefore provides the probability that a data point belongs to a given cluster.
The Bayesian version of the EM-GMM places a posterior probability distribution over the mixture weights and uses it to infer the effective number of components from the data. The number of components actually used by the model may therefore be less than the maximum value input by the user: the weights of components not needed to explain the data are driven to nearly 0.
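A minimal sketch of this behaviour using scikit-learn's BayesianGaussianMixture is shown below; the synthetic data and parameter values are illustrative only, not taken from the project. The model is allowed up to 10 components, but the unneeded ones end up with mixing weights close to zero.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Two well-separated blobs, but the model is allowed up to 10 components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(200, 2))])

bgmm = BayesianGaussianMixture(
    n_components=10,                                     # upper bound set by the user
    weight_concentration_prior_type="dirichlet_process",  # lets the data prune components
    max_iter=1000,
).fit(X)

# Components that are not needed receive weights close to 0,
# so the number of "effective" clusters is inferred from the data.
print(np.round(bgmm.weights_, 3))
print("effective components:", np.sum(bgmm.weights_ > 0.01))
```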
The workflow for this portion of the project draws heavily from the following:
https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/
https://scikit-learn.org/stable/modules/mixture.html#bgmm
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html
Caller script that submits a job for each file
The job submission arguments are the Python script, the dataset file, and a counter that is used to access the appropriate elements of arrays within the Python script.
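A minimal sketch of such a caller is shown below. The script name, dataset filenames, and job-script name are placeholders, not the project's actual files, and the loop assumes sbatch is available on the path. It submits one Slurm job per dataset file, forwarding the Python script, the dataset file, and the counter.

```python
import subprocess

# Placeholder names; the real analysis script, job script, and datasets differ.
python_script = "em_bgmm.py"
datasets = ["dataset_1.csv", "dataset_2.csv", "dataset_3.csv", "dataset_4.csv"]

# Submit one Slurm job per dataset file. The arguments forwarded to the
# job script are the Python script, the dataset file, and a counter that
# the Python script uses to index its own arrays.
for counter, dataset in enumerate(datasets):
    subprocess.run(
        ["sbatch", "run_em_bgmm.sbatch", python_script, dataset, str(counter)],
        check=True,
    )
```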
SBATCH set-up
Import packages
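The project's exact imports are not reproduced here; the snippets under this and the following headings form one illustrative script, assumed rather than copied from the project. A typical set of imports for this workflow would be:

```python
import sys

import numpy as np
import matplotlib
matplotlib.use("Agg")            # headless back end for batch (Slurm) jobs
import matplotlib.pyplot as plt
from sklearn.mixture import BayesianGaussianMixture
```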
Setup inputs and lists
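Continuing the sketch above, the command-line arguments described for the caller could be consumed like this; the list contents, array names, and file format (comma-separated numeric data) are assumptions.

```python
# Command-line arguments: the dataset file and the counter passed by the caller.
data_file = sys.argv[1]
idx = int(sys.argv[2])

# Per-dataset settings indexed by the counter (placeholder values).
plot_titles = ["dataset 1", "dataset 2", "dataset 3", "dataset 4"]
max_components = [10, 10, 10, 10]

# Load the dataset (assumed here to be a comma-separated numeric file).
X = np.loadtxt(data_file, delimiter=",")
```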
Function for plotting results of EM-BGMM
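A sketch of a plotting helper, continuing the hypothetical script above; the styling, axis labels, and output handling are assumptions rather than the project's code.

```python
def plot_clusters(X, labels, title, out_file):
    """Scatter the first two features, coloured by cluster label."""
    fig, ax = plt.subplots()
    scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, s=10, cmap="viridis")
    ax.set_xlabel("feature 1")
    ax.set_ylabel("feature 2")
    ax.set_title(title)
    fig.colorbar(scatter, ax=ax, label="cluster")
    fig.savefig(out_file, dpi=150)
    plt.close(fig)
```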
Develop EM-BGMM model
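A minimal version of the model-fitting step, with illustrative parameter values; the actual settings used in the project may differ.

```python
# Allow more components than expected; the Dirichlet-process prior on the
# mixture weights drives the weights of unneeded components toward zero.
bgmm = BayesianGaussianMixture(
    n_components=max_components[idx],
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
)
bgmm.fit(X)

# Hard cluster assignment for each data point (most probable component).
labels = bgmm.predict(X)
```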
Plot data points labeled by EM-BGMM model
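Using the plotting helper sketched above; the output filename is a placeholder.

```python
plot_clusters(X, labels, plot_titles[idx], f"bgmm_clusters_{idx}.png")
```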
Examine probability of data given the EM-BGMM model
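The per-point posterior probabilities of each component can be summarised as an average and a standard deviation per component. The sketch below shows one way to compute these quantities; the project's actual plotting code is not reproduced.

```python
# Posterior probability of each component for every data point
# (an n_samples x n_components array).
resp = bgmm.predict_proba(X)

# Average posterior probability and its standard deviation per component;
# components with near-zero averages are effectively unused.
mean_post = resp.mean(axis=0)
std_post = resp.std(axis=0)

for k, (m, s) in enumerate(zip(mean_post, std_post)):
    print(f"component {k}: mean posterior = {m:.3f}, std = {s:.3f}")
```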
The EM-BGMM model indicates that each of the datasets contains one cluster.
The average posterior probability of each component of the EM-BGMM model given the data, along with the associated standard deviations, is shown in the plots below.
Plots (one pair per dataset): average posterior probability of each component, and standard deviation of the posterior probabilities.