Nonparametric clustering for spatio-temporal datasets

Clustering algorithms attempt the identification of distinct subgroups within heterogeneous data and are commonly utilised as an exploratory tool. The definition of a cluster is dependent on the relevant dataset and associated constraints; clustering methods seek to determine homogeneous subgroups that each correspond to a distinct set of characteristics.

These nonparametric clustering methods were developed to detect spatially contiguous clusters in a grid-style graph network. The user-generated datasets were assumed to be recorded over static sources. The determined spatially contiguous clusters also retain underlying differences in temporal patterns.

Method 1: A nonparametric Bayesian approach to clustering spatio-temporal datasets

Method 2: A functional distributional approach for spatial clustering

For complete description of these methods, please see theses.gla.ac.uk/40957/.

Method 1:

NetCRP: Network Chinese Restaurant Process

This flexible Bayesian model employs a network dependent Chinese restaurant process (NetCRP) to place a prior over the geographical constraints posed by a graph-based network. The NetCRP is a special case of the distance dependent Chinese restaurant process (ddCRP) that was first introduced by Blei and Frazier (2011); the NetCRP is modified to account for data that poses spatial constraints. The NetCRP seeks to cluster data such that adjacent or neighbouring regions in a spatial structure are more likely to belong to the same cluster. The NetCRP introduces a large number of singletons within the spatial structure and modifies the NetCRP to enable the researcher to restrict the number of clusters in the graph. It is also reasonable to assume that individual vertices within a cluster are spatially correlated to adjacent vertices. In order to fully account for spatial correlation within a cluster structure, the model utilises a type of the conditional auto-regressive (CAR) model. The model also accounts for temporal dependencies using a first order auto-regressive (AR-1) model.

In this mean-based flexible Bayesian model, the data is assumed to follow a Gaussian distribution and utilise Kronecker product identities within the definition of the spatio-temporal precision matrix to improve the computational efficiency. The model utilises a Metropolis within Gibbs sampler to fully explore all possible partition structures within the network and infer the relevant parameters of the spatio-temporal precision matrix. The flexible Bayesian method is also applicable to map-based spatial structures and we describe the model in this context as well.

Method 2:

A functional distributional approach using agglomerative hierarchical clustering

This algorithm is implemented within an agglomerative hierarchical clustering framework as a two-stage process. The method is based on a measure of distance that utilises estimated cumulative distribution functions over the data and this unique distance is both functional and distributional. This notion of distance utilises the differences in densities to identify distinct clusters in the graph, rather than raw recorded observations.