Software

Nonlinear Dimensionality Reduction for Clustering 

See the GitHub repository of our most recent published article:

https://github.com/usersotiris/nonlinearclustering/blob/master/README.md

Sotiris Tasoulis, Nicos G. Pavlidis, Teemu Roos, Nonlinear dimensionality reduction for clustering, Pattern Recognition, Volume 107, 2020, 107508, ISSN 0031-3203, https://doi.org/10.1016/j.patcog.2020.107508

Clustering High Dimensional Data

While data clustering has a long history and a large amount of research has been devoted to the development of numerous clustering techniques, significant challenges still remain. One of the most important is high data dimensionality. A particular class of clustering algorithms has been very successful in dealing with such datasets by utilising information from projections of the data onto a lower dimensional subspace.

[1] S.K. Tasoulis, D.K. Tasoulis, and V.P. Plagianakos, "Enhancing Principal Direction Divisive Clustering", Pattern Recognition, Volume 43, Issue 10, October 2010.

dePDDP clustering 

IDX = dePDDP(data)

IDX = dePDDP(data,param1)

IDX = dePDDP(data) partitions the points in the n-by-p data matrix data into an automatically determined number of clusters. Rows of data correspond to points, columns correspond to variables. dePDDP returns an n-by-1 vector IDX containing the cluster index of each point. IDX = dePDDP(data,param1) lets you specify an optional parameter value: a multiplier of the kernel density estimation bandwidth.

The dePDDP algorithm requires minimal user-defined parameters and has the desirable feature of providing an approximation of the number of clusters present in the data. The algorithm is based on a splitting criterion which suggests that splitting at the minimiser of the estimated density of the data projected onto the first principal component is the best way to avoid splitting clusters. The cluster selection criterion and the termination criterion are guided by the same idea.
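For intuition, here is a minimal MATLAB sketch of that splitting step, not the toolbox code: it assumes a Gaussian kernel density estimate via ksdensity, and the helper name depddp_split_sketch and the choice of the deepest local minimum are illustrative.

% A minimal sketch of the dePDDP splitting step, not the toolbox code:
% project the data onto the first principal component, estimate the density
% of the projections with ksdensity, and split at the deepest local minimum.
function [left, right, splitPoint] = depddp_split_sketch(data)
    Xc = data - mean(data, 1);                 % centre the data
    [~, ~, V] = svd(Xc, 'econ');               % right singular vectors = principal directions
    proj = Xc * V(:, 1);                       % projections onto the first principal component
    [f, xi] = ksdensity(proj);                 % Gaussian kernel density estimate on a grid
    isMin = [false, f(2:end-1) < f(1:end-2) & f(2:end-1) < f(3:end), false];
    if ~any(isMin)                             % no interior minimiser: do not split this cluster
        left = []; right = []; splitPoint = NaN;
        return
    end
    minima = xi(isMin);
    [~, kk] = min(f(isMin));                   % deepest minimum of the estimated density
    splitPoint = minima(kk);
    left  = find(proj <= splitPoint);          % indices of the two child clusters
    right = find(proj >  splitPoint);
end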

The following creates two clusters from separated random data:

X = [randn(100,2)+ones(100,2); randn(100,2)-ones(100,2)];
idx = dePDDP(X);
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
legend('Cluster 1','Cluster 2')
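
If the default kernel bandwidth over- or under-smooths the projected density, the optional second argument described above can be used to scale it. The multiplier below is only an illustrative value, not a recommended setting:

idx = dePDDP(X, 0.5);   % assumed usage: 0.5 scales the default KDE bandwidth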


Manifold Visualization via Short Walks

Visualizing low-dimensional non-linear manifolds underlying high-dimensional data is a challenging data analysis problem. Different manifold visualization methods can be characterized by the associated definitions of proximity between high-dimensional data points and the score functions that lead to different low-dimensional embeddings, each preserving different features in the data. The geodesic distance is a popular and well-justified metric, but it is very hard to approximate reliably from finite samples, especially between far-apart points. We propose a new method called Minimap. The basic idea is to approximate local geodesic distances by shortest paths along a neighborhood graph, with an additional penalizing factor based on the number of steps in the path. Embedding the resulting metric with Sammon mapping further emphasizes local structure at the expense of long distances, which tend to be less reliable. Experiments on real-world benchmarks suggest that Minimap can robustly visualize manifold structures.
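The following is only a rough MATLAB sketch of that idea, not the published Minimap implementation: the function name minimap_sketch, the k-nearest-neighbour graph construction, the uniform per-step penalty lambda, and the crude handling of disconnected components are all illustrative assumptions. It relies on knnsearch, graph/distances, and mdscale from the standard MATLAB toolboxes.

% A rough sketch of the Minimap idea, not the authors' released code:
% approximate geodesic distances by shortest paths on a k-nearest-neighbour
% graph whose edge weights carry a per-step penalty, then embed the resulting
% distances in 2-D with Sammon mapping.
function Y = minimap_sketch(X, k, lambda)
    n = size(X, 1);
    [nbr, d] = knnsearch(X, X, 'K', k+1);      % nearest neighbours (column 1 is the point itself)
    src = repmat((1:n)', 1, k);
    dst = nbr(:, 2:end);
    w   = d(:, 2:end) + lambda;                % Euclidean edge length plus a fixed step penalty
    G   = simplify(graph(src(:), dst(:), w(:), n), 'min');   % undirected neighbourhood graph
    D   = distances(G);                        % all-pairs shortest-path lengths
    D(isinf(D)) = max(D(~isinf(D)));           % crude fix for disconnected components
    Y   = mdscale(D, 2, 'Criterion', 'sammon');  % Sammon mapping of the penalized geodesics
end                                            % (off-diagonal distances must be positive for 'sammon')

An illustrative call, with k = 10 and lambda = 0.1 chosen arbitrarily, might look like Y = minimap_sketch(data, 10, 0.1); scatter(Y(:,1), Y(:,2), 10, labels, 'filled').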

Visualization example of the COIL-20 dataset:

DISCLAIMER:

The code provided on this page comes without any warranty whatsoever. Use it at your own risk. All the other standard disclaimers also apply. This webpage has been created as a public service to the data mining/machine learning community, to encourage reproducible research. If you use this material, please reference it as [1].