Projects

PhD

  • Large Scale Kernel Clustering

    • As the size of digitally available data increases, more efficient methods to organize this data need to be developed. Clustering is a commonly used tool to analyse such data and find groups of similar points. Most large-scale clustering methods are based on Euclidean geometry and do not perform very well on real data sets. Kernel-based clustering methods, on the other hand, incorporate non-linear similarity functions in the clustering process, and are therefore able to capture the non-linear structure of the clusters in the data set. However, kernel clustering algorithms have high computational and memory complexity. The aim of this project is to develop efficient techniques to perform kernel clustering.

    • Approximate Kernel k-means is an efficient variant of the kernel k-means algorithm. It reduces the complexity of kernel k-means by restricting the cluster centers to lie in the subspace spanned by a small subset of the data points. The complexity of this algorithm is linear in the number of data points, and its clustering accuracy is comparable to that of the full kernel k-means algorithm. A simplified sketch of the idea follows the figure below.

Illustration of the approximate kernel k-means algorithm
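A minimal sketch of this idea in Python, assuming an RBF kernel and uniformly sampled landmark points (both are illustrative choices, not necessarily the ones used in the project):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel between the rows of A and B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def approx_kernel_kmeans(X, k, m=100, gamma=1.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.choice(n, size=min(m, n), replace=False)               # sampled landmark points
    K_B = rbf_kernel(X, X[S], gamma)                               # n x m kernel block
    K_hat = rbf_kernel(X[S], X[S], gamma) + 1e-6 * np.eye(len(S))  # m x m block (regularized)
    K_hat_inv = np.linalg.inv(K_hat)

    # Each cluster center is a combination of the landmarks: c_j = sum_i alpha[i, j] * phi(x_S[i]).
    alpha = K_hat_inv @ K_B[rng.choice(n, k, replace=False)].T
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Squared distance to each center, dropping the constant K(x_i, x_i) term.
        quad = np.einsum('mk,mk->k', alpha, K_hat @ alpha)
        new_labels = (quad[None, :] - 2.0 * K_B @ alpha).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(k):
            members = labels == c
            if members.any():
                # Project the cluster mean onto the span of the landmarks.
                alpha[:, c] = K_hat_inv @ K_B[members].mean(axis=0)
    return labels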

    • Random Fourier Features allow us to leverage the efficiency of linear learning algorithms while achieving the accuracy of kernel-based learning algorithms. The data points are projected into a low-dimensional space spanned by basis vectors obtained from the Fourier transform of the given shift-invariant kernel similarity function. A linear clustering algorithm such as k-means is then executed on the transformed data. A short sketch of this pipeline follows the figure below.

Illustration of clustering using Random Fourier feature maps: a two-dimensional dataset of 500 points containing two clusters, the data projected on to the subspace spanned by the Fourier basis, and the clusters generated using the Random Fourier maps
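A minimal sketch of clustering with random Fourier features, assuming an RBF kernel exp(-gamma * ||x - y||^2) and using scikit-learn's KMeans as the linear clustering step (illustrative choices):

import numpy as np
from sklearn.cluster import KMeans

def random_fourier_features(X, n_features=128, gamma=1.0, seed=0):
    # Map X (n x d) to a space where inner products approximate the RBF kernel.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))   # frequencies from the kernel's Fourier transform
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)                 # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Example: 500 two-dimensional points forming two ring-shaped clusters,
# clustered linearly after the random Fourier mapping.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 500)
radii = np.where(rng.random(500) < 0.5, 1.0, 3.0)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)] + 0.1 * rng.normal(size=(500, 2))

Z = random_fourier_features(X, n_features=256, gamma=0.5)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)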

Sample clusters from the Tiny data set obtained using the approximate kernel clustering algorithms

  • Stream Kernel Clustering

    • Approximate Stream Kernel k-means is a kernel-based clustering algorithm for streaming data, which is generated continuously. Such data is often unbounded, so it cannot all be stored in memory and each point can be accessed at most once. Batch kernel-based clustering algorithms cannot be applied to streaming data, because they assume all the data is available for computing the kernel matrix. The approximate stream kernel k-means algorithm samples the streaming data based on the "importance" of each data point, constructs an approximate kernel matrix, and uses this approximate kernel matrix to perform clustering in linear time. Data streams such as the Twitter stream can be clustered with high accuracy at speeds as high as 8 MBps. A simplified sketch of the sampling step follows the figures below.

A sample cluster from the Twitter stream obtained using the approximate stream kernel k-means algorithm, representing the hashtag #HTML

Trends of different hashtags over the month of January 2015 in the Twitter stream obtained using the approximate stream kernel k-means algorithm
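A highly simplified sketch of the importance-sampling step, assuming an RBF kernel, a fixed landmark budget, and an importance score based on how poorly a new point is represented by the current landmarks; the actual importance measure and update rules of the algorithm differ:

import numpy as np

def rbf(x, B, gamma=1.0):
    # RBF similarity between a single point x and the rows of B
    return np.exp(-gamma * np.sum((B - x) ** 2, axis=1))

def sample_landmarks(stream, budget=200, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    landmarks = []
    for x in stream:                        # each point is seen exactly once
        if len(landmarks) < 10:             # bootstrap with the first few points
            landmarks.append(x)
            continue
        # A point that is poorly covered by the current landmarks is more "important".
        importance = 1.0 - rbf(x, np.asarray(landmarks), gamma).max()
        if rng.random() < importance:
            landmarks.append(x)
            if len(landmarks) > budget:     # keep the memory footprint bounded
                landmarks.pop(rng.integers(len(landmarks)))
    return np.asarray(landmarks)

The sampled landmarks would then be used to form the approximate kernel matrix and cluster the stream in the same way as in the approximate kernel k-means sketch above.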

  • Document and Image Categorization

    • Document and image data sets, containing millions of high-dimensional points, usually belong to a large number of clusters. Finding clusters in such data sets is computationally expensive using kernel-based clustering techniques. The sparse kernel k-means algorithm is an online kernel-based clustering algorithm, specifically designed to cluster large repositories of documents and images. It samples the data points based on their importance in the data set, and constructs a nearest-neighbor graph using the sampled points. The affinity matrix of this graph is clustered in time linear in the number of data points and dimensions, and logarithmic in the number of clusters. The sparse kernel k-means algorithm is able to cluster large data sets (such as the Tiny image data set and the YouTube data set) into thousands of clusters in less than an hour, with minimal resources.
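A rough sketch of the graph-construction step, assuming the points have already been importance-sampled, using an RBF affinity and scikit-learn's SpectralClustering as a stand-in for the algorithm's own affinity-matrix clustering:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

def cluster_sampled_points(X_sampled, n_clusters=100, n_neighbors=10, gamma=1.0):
    # Sparse graph holding the distance from each sampled point to its nearest neighbors.
    graph = kneighbors_graph(X_sampled, n_neighbors=n_neighbors, mode='distance', include_self=False)
    graph.data = np.exp(-gamma * graph.data ** 2)     # convert distances to RBF affinities
    affinity = 0.5 * (graph + graph.T)                # symmetrize the sparse affinity matrix
    model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed', assign_labels='kmeans')
    return model.fit_predict(affinity)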

  • Image Deblurring and Denoising

    • Space-variant convolutions are useful in modeling complex image transformations. The problem of deblurring and denoising digital images can be modeled as a space-variant deconvolution problem. Structured optimization techniques such as the Fast Iterative Shrinkage/Thresholding Algorithm (FISTA), the Alternating Direction Method of Multipliers (ADMM) and the Split Augmented Lagrangian Shrinkage Algorithm (SALSA) can be used to solve the deconvolution problem. We studied the efficiency and accuracy of these algorithms for image deblurring and denoising. A small FISTA sketch follows the figure below.

Illustration of image deblurring using structured optimization techniques: the original image, the blurred and noisy image, and the results obtained using FISTA, ADMM and SALSA
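A minimal sketch of FISTA applied to a small deconvolution problem, using a 1-D signal, a space-invariant Gaussian blur matrix and an l1 penalty as stand-ins for the space-variant image setting studied in the project:

import numpy as np

def gaussian_blur_matrix(n, sigma=2.0):
    # Dense n x n convolution matrix for a Gaussian blur of a 1-D signal.
    idx = np.arange(n)
    A = np.exp(-(idx[:, None] - idx[None, :]) ** 2 / (2 * sigma ** 2))
    return A / A.sum(axis=1, keepdims=True)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(A, y, lam=0.01, n_iter=200):
    # Solve min_x 0.5 * ||A x - y||^2 + lam * ||x||_1 with FISTA.
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the data-term gradient
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_new = soft_threshold(z - (A.T @ (A @ z - y)) / L, lam / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# Example: recover a sparse spike train from its blurred, noisy observation.
rng = np.random.default_rng(0)
n = 200
x_true = np.zeros(n)
x_true[rng.choice(n, 10, replace=False)] = rng.uniform(1, 3, 10)
A = gaussian_blur_matrix(n)
y = A @ x_true + 0.01 * rng.normal(size=n)
x_hat = fista(A, y, lam=0.02)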

Master of Engineering

  • Linear Time SVM

    • This was my Master's major project under the guidance of Prof. M. Narasimha Murty. The objective of this project was to reduce the running-time complexity of Support Vector Machines. SVMs achieve good generalization performance by minimizing the empirical risk as well as the VC dimension of the discriminant. However, the training time complexity of the SVM is cubic in the number of examples, and the training sets in most data mining applications are very large, making standard SVM training impractical. In this project, we first proposed an efficient variant of the k-means clustering algorithm called the "Two-level k-means algorithm", and then devised a scheme to integrate this algorithm into the SVM learning procedure. A simplified illustration of the integration idea follows the results below.

    • Important results:

      • The proposed two-level k-means algorithm is faster than conventional k-means algorithms such as Lloyd's and MacQueen's algorithms.

      • The clustering error of the proposed two-level k-means algorithm is bounded by twice the clustering error of the optimal clustering.

      • Integrating two-level k-means into the SVM makes the training time complexity of the SVM linear in the number of examples and reduces its training time significantly. In addition to the SVM, we also integrated this algorithm into the k-NN classifier and observed a considerable reduction in learning time.

      • Analysis of the two-level k-means algorithm leads to a relationship between the number of clusters and the expected size of each cluster in a dataset. This relationship provides a unifying framework for the two types of partitional clustering algorithms: those that take the number of clusters as input (for example, k-means and k-medoids) and those that take a distance threshold as input (for example, DBSCAN and Leader).

    • A paper based on this project was published in the Pattern Recognition journal.
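A simplified illustration of the general idea of speeding up SVM training with clustering, using plain k-means and scikit-learn in place of the two-level k-means algorithm and the project's exact integration scheme:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def svm_on_cluster_representatives(X, y, clusters_per_class=50, C=1.0):
    # Train an SVM on per-class cluster centers instead of all training points.
    reps, rep_labels, rep_weights = [], [], []
    for label in np.unique(y):
        Xc = X[y == label]
        k = min(clusters_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
        reps.append(km.cluster_centers_)
        rep_labels.append(np.full(k, label))
        rep_weights.append(np.bincount(km.labels_, minlength=k))   # weight centers by cluster size
    clf = SVC(C=C, kernel='rbf')
    clf.fit(np.vstack(reps), np.concatenate(rep_labels),
            sample_weight=np.concatenate(rep_weights))
    return clf

Here the SVM only sees the cluster representatives, so its expensive training step operates on a much smaller set of points than the original training set.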

  • Pattern Synthesis for Support Vector Machines

    • When the training set size is small, the classification algorithm does not have sufficient examples to learn from, which leads to poor generalization performance. Pattern Synthesis involves generating linearly independent examples from the available training examples. Under Prof. Narasimha Murty's guidance, I developed a Pattern Synthesis scheme for Support Vector Machines as a mini-project for the "Topics in Pattern Recognition" course. The scheme involved identifying the support vectors in the initial training set and applying transformations such as scaling and perturbation to them to generate more examples. A small sketch of this idea appears below.

    • A chapter based on this work was published in the Encyclopedia of Data Warehousing and Mining, 2009.
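A small sketch of the support-vector perturbation idea, using scikit-learn's SVC; the jitter scale and the particular transformations are illustrative assumptions:

import numpy as np
from sklearn.svm import SVC

def synthesize_from_support_vectors(X, y, n_copies=3, scale_range=(0.95, 1.05), noise=0.01, seed=0):
    # Augment a small training set by scaling and perturbing its support vectors.
    rng = np.random.default_rng(seed)
    sv_idx = SVC(kernel='linear').fit(X, y).support_       # indices of the support vectors
    X_new, y_new = [X], [y]
    for _ in range(n_copies):
        scales = rng.uniform(*scale_range, size=(len(sv_idx), 1))
        jitter = noise * rng.normal(size=X[sv_idx].shape)
        X_new.append(X[sv_idx] * scales + jitter)           # scaled and perturbed copies
        y_new.append(y[sv_idx])
    return np.vstack(X_new), np.concatenate(y_new)

The augmented training set can then be used to retrain the SVM.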

  • VIPANI - An Electronic Marketplace

    • This was a mini-project for the Electronic Commerce course. Vipani is a configurable electronic marketplace for selling and buying products and services. Potential sellers and buyers can use the application to list their products and services, and conduct and participate in auctions online. Various auction mechanisms, such as the sealed-bid auction and the Dutch auction, were implemented.
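Purely as an illustration of what an auction mechanism looks like in code (this is not VIPANI's implementation), a sealed-bid winner determination could be sketched as:

from dataclasses import dataclass

@dataclass
class Bid:
    bidder: str
    amount: float

def sealed_bid_winner(bids, second_price=False):
    # Return the winning bidder and the price they pay; with second_price=True
    # this becomes a Vickrey auction, where the winner pays the second-highest bid.
    ranked = sorted(bids, key=lambda b: b.amount, reverse=True)
    winner = ranked[0]
    price = ranked[1].amount if second_price and len(ranked) > 1 else winner.amount
    return winner.bidder, price

bids = [Bid("alice", 120.0), Bid("bob", 150.0), Bid("carol", 100.0)]
print(sealed_bid_winner(bids, second_price=True))   # ('bob', 120.0)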

Bachelor of Engineering

  • Horde Project: Collaborative Browsing Framework for Dynamic Online Communities of Practice

    • The objective of the project was to design an efficient system for collaborative browsing using a neural network-based information retrieval model. The system identifies users with common interests based on their browsing patterns and provides an environment for them to communicate with each other and browse collaboratively. This was my major project for my B.E. We showcased it at MSAPP 2006 (Microsoft Academics Projects Program), where it was ranked among the top 10 projects that year. Further project details available here.

  • Canvas – A 2D Graphics Editor

    • I worked on this project during my 6th semester for the Computer Graphics course. The objective was to develop a device-independent graphics editor package that provides facilities for image editing and has an attractive user interface with online help. Tools for drawing various objects such as lines, rectangles, squares, circles, ellipses, spirals and polygons were implemented, along with file loading and saving, transformations on selected areas, fill styles, line styles and color filling.

  • Linux Shell

    • As a mini-project for the Systems Software course in my 5th semester, I developed a customized UNIX shell. Various internal commands and shell features such as aliases, input and output redirection, command piping, background job execution, metacharacters, shell variables and shell scripting were implemented.

  • Parallel programming through MPICH

    • This was part of my work as a research assistant under Prof. Nitin Pujari. MPICH is an implementation of the Message Passing Interface (MPI), which provides a set of functions for inter-process communication and parallel programming. We used these functions in a program that generates the permutations of a given set of numbers in parallel, reducing the running time considerably. A small sketch of the idea is shown below.
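A small sketch of the same idea using mpi4py, the Python bindings over an MPI implementation such as MPICH (the original work used the C interface; splitting the work by first element is an illustrative choice):

from itertools import permutations
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

numbers = list(range(8))

# Each rank generates the permutations whose first element it owns,
# so the work is divided roughly evenly across the processes.
local = [(first,) + p
         for first in numbers[rank::size]
         for p in permutations([n for n in numbers if n != first])]

counts = comm.gather(len(local), root=0)
if rank == 0:
    print("total permutations:", sum(counts))   # 8! = 40320

Run with, for example, mpiexec -n 4 python permutations_mpi.py.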

  • Web page generator “makewebpage”

    • As part of the Total Student Development Program conducted during my 3rd semester at my undergraduate institution, PES Institute of Technology, I developed a Bash-based script to generate custom web pages. It accepts personal information, picture files and other custom parameters such as background color, font size and color from the terminal or from a file containing records of personal information, and generates individual web pages.