1. Bulk IncrementalDBSCAN :
The DBSCSAN algorithm is a popular density based clustering algorithm for data mining tasks,especially in spatial databases. As a research project in BITS under Prof. Navneet Goyal and Prof. Poonam Goyal, I looked at an application of this algorithm in a data warehousing set up. We considered a data warehouse that apart from storing the data points, also maintains their clustering information for analytical purposes. As is typical in such a scenario, new data points arrive periodically. These new points need to be added to the older set of clusters. We found the the typical approach to do this was the IncrementalDBSCAN which adds one point at a time to the old clusters.
However, we realized that sometimes it might be interesting to identify patterns in these new data points separately before merging them to the old data points. This could be useful in identifying any new patterns that are emerging in the data. With this intent, I worked in a team of two to develop and algorithm for this task without introducing any significant overheads.
The overall procedure can be summarized thus:
1. Cluster the new data points using DBSCAN algorithm.
2. Merge the two sets of clusters ( those consisting of the new points and the older clusters ) to get the final clustering.
The challenge was to achieve the second step as efficiently as possible. For this, we realized the the definition of DBSCAN clusters is such that two merge two (sets of) clusters one needs to look at only a subset of points. We called these Intersection points .The second step was thus bifurcated into two subtasks
1. Find the intersection points between two sets of clusters
2. Process these points to obtain the final clustering.
We used the properties of R*-trees to devise an algorithm finding the intersection points. Further, we proved that the points obtained by this algorithm contained sufficient information for the merging of the clusters. We then tested the algorithm on various generated and real data sets. The results showed that for few intersection points, our algorithm out-performed IncrementalDBSCAN in the time taken. Even for very high (close to 90% of the new points) number of intersection points, our algorithm did better than IncrementalDBSCAN.
The work was accepted as a poster presentation at the 4th International Conference for Contemporary Computing. The proceedings can be found here.
The Bulk insertion algorithm is only half useful unless we work out algorithms for Bulk deletion. The necessity for this arises as in many applications the older data points need to be ignored and removed from the warehouse as they are less representative of the recent patterns.
2. Can we predict bugs ?
I worked on this project as a research assistant at INSEAD business school under Prof. Jurgen Mihm and Prof. Manuel Sosa. Bugs, as any software developer will tell you, tend to take up most of the development time. And often the reason for bugs is poor design at the start. This project tried to quantify "poor design" in terms the fraction of mutually interdependent components in a system ("system cyclicality") . We found that there was a statistically significant between the cyclicality and the number of bugs reported a software. The Empirical study was done on a number of Open source software. My work comprised mainly of implementing the code for calculation of various control variables that were used in the regression analysis.
3. Parallel solver for NNLS problem
This was my internship project at the Max Planck Institute for Biological Cybernetics. I worked in the Department of Empirical Inference under Mr. Suvrit Sra. My job was to write a parallel implementation of an algorithm for the Non-Negative Least Squares problem developed by Suvrit and his co-authors. I used PETSc library for distributed matrix operations.
4. Document image translation for South Indian languages.
I recently concluded a 3 months long internship at IIIT,Hyderabad under Prof. C.V. Jawahar.
OCR for English an similar scripts is pretty much a solved problem. Indian scripts present a few more difficult problems which makes the character recognition approach somewhat cumbersome for them. We instead try a different approach where we view it as a Sequence transcription problem. We try to solve this problem by means of BLSTM neural networks. This approach not only simplifies a number of problems for Indian languages but also allows us to exploit the power of BLSTM networks of using context for recognition of characters. We achieve high (>95%) accuracy in all four south Indian languages (Tamil, Telugu, Malayalam, Kannada).
5. Speaker Recognition using Convolutional Neural Nets (details to be updated)
6. Joint Entity-Event Co-reference via Spectral Co-clustering (details to be updated)
7. Automatic Alert System: Improving Information Management on Construction Projects (details to be updated)