Open Source and Personal Projects

Google Summer of Code 2014: CPython IDLE Improvements Project

Worked on extending test coverage, creating a non-buildbot test framework for IDLE, adding line numbering with breakpoints, and creating a generic API for plugging third-party code checkers into IDLE.

Cachediff - A tool for localized cache analysis

Cachediff is a tool for comparing the cache performance of two versions of the same C/C++ program that differ from each other by a small diff/delta. This is useful to students, educators and professionals. Cachediff presents the user with both a localized and a global view of the cache and its statistics. It uses cache simulation based on instruction/memory tracing during execution. It can be extended to support n versions of the same program.
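The idea of trace-driven cache simulation can be illustrated with a minimal sketch. This is not Cachediff's actual code: the function name, the cache geometry (number of sets, ways, line size), the LRU replacement policy and the two address traces are all assumptions made for the example.

```python
from collections import OrderedDict


def simulate_cache(trace, num_sets=1, ways=2, line_size=64):
    """Replay byte addresses through an LRU set-associative cache.

    Returns (hits, misses). The geometry defaults are illustrative only.
    """
    # One ordered dict per set; insertion order tracks recency of use.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        line = addr // line_size      # cache line containing the address
        index = line % num_sets       # set the line maps to
        tag = line // num_sets        # identifies the line within its set
        cache_set = sets[index]
        if tag in cache_set:
            hits += 1
            cache_set.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
    return hits, misses


# Hypothetical traces for two versions of a program touching the same data
# in a different order.
trace_v1 = [0, 64, 128, 0, 64, 128]
trace_v2 = [0, 0, 64, 64, 128, 128]

for name, trace in (("v1", trace_v1), ("v2", trace_v2)):
    hits, misses = simulate_cache(trace)
    print(name, "hits:", hits, "misses:", misses)
```

Running the same simulator over the traces of both versions and comparing the counts gives the global view; attributing each access to the source line that issued it would give the localized one.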


Linux USB device driver for a joystick/game controller

A Linux device driver for a generic 3-axis, 4-button game controller, accompanied by a user-space demo tool. I wrote this some time ago to understand how to write a device driver/module that runs in kernel space on Linux. It includes a custom USB driver and uses a netlink socket for kernel-space to user-space communication. It can handle multiple devices with the same vendor/product ID and supports multiple listeners per device by using netlink multicast.

FIRE2015 Shared Task on Source code plagiarism detection

A system to detect source code plagiarism across programming languages (C and Java). Uses vector space modelling, similarity measures, and normalization. For the shared task, it achieved a precision of 100% and a recall of 74%.
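As a rough illustration of vector space modelling with a similarity measure, the sketch below turns each source file into a term-frequency vector and compares two vectors by cosine similarity. It is not the submitted system: the tokenizer, the function names and the choice of cosine similarity are assumptions for the example.

```python
import math
import re
from collections import Counter


def term_vector(source):
    """Map source text to a bag of lowercased word and punctuation tokens."""
    return Counter(re.findall(r"\w+|[^\w\s]", source.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical C and Java fragments computing the same sum.
c_code = "int s = 0; for (int i = 0; i < n; i++) { s += a[i]; }"
java_code = "int sum = 0; for (int i = 0; i < n; i++) { sum += arr[i]; }"

print(cosine_similarity(term_vector(c_code), term_vector(java_code)))
```

Dividing by the vector norms is itself a form of length normalization, so a long file and a short one can still score as similar when their token distributions match.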

FIRE2014 Shared Task on Transliterated Search in collaboration with I.R.S.I. and Microsoft Research

Classifies Indian language words written in Roman script as Indian language or English, and transliterates the Indian language words into their Devanagari script equivalents. Built using Python, scikit-learn and NLTK.


IR-based source code similarity detection

This project proposes an approach to software code plagiarism detection based on information retrieval and representative query generation. Current approaches, which are mainly based on exhaustive pairwise text similarity comparisons, are not scalable, nor do they fully consider the semantic/syntactic structure of the code. We aim to overcome this by formulating a representative query for each document in the dataset. The queries so generated would capture the semantic and structural aspects of the source program. Finding similarity in code then reduces to finding the probability that a query was generated by a document. We would then rank the classes in decreasing order of probability. We will also compare and evaluate our results against existing methods.
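The step of "finding the probability that a query was generated by a document" resembles the standard query likelihood model from information retrieval. The sketch below scores documents that way, using Jelinek-Mercer smoothing. The smoothing method, the weight `lam`, the function names and the toy token lists are all assumptions for illustration, not the project's actual design.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, coll_counts, coll_len, lam=0.5):
    """log P(query | doc) under a smoothed unigram language model."""
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc) if doc else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll  # Jelinek-Mercer interpolation
        if p == 0.0:
            return float("-inf")  # term unseen anywhere in the collection
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Return document names in decreasing order of query likelihood."""
    coll_counts = Counter(t for tokens in docs.values() for t in tokens)
    coll_len = sum(coll_counts.values())
    scores = {
        name: query_log_likelihood(query, tokens, coll_counts, coll_len, lam)
        for name, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical tokenized documents and a representative query.
docs = {
    "a.c": ["for", "i", "sum", "+="],
    "b.c": ["while", "read", "print"],
}
print(rank(["for", "sum"], docs))
```

Because each query is scored against per-document statistics rather than compared with every other document in full, this formulation avoids the exhaustive pairwise comparison described above.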
