Projects

Academic

[2020] Semi-Supervised Granular Classification Framework for Resource Constrained Short-texts:

During disasters, large volumes of short texts are generated that contain crucial situational information. Proper extraction and identification of this information can be useful for various rescue and relief operations. A few specific types of infrequent situational information can be critical; however, obtaining labels for these resource-constrained classes is challenging as well as expensive, so supervised methods have limited usability in such scenarios. To overcome this challenge, we propose a semi-supervised learning framework that exploits abundantly available unlabelled data through self-learning. The proposed framework improves classifier performance on resource-constrained classes by selectively incorporating highly confident samples from the unlabelled data for self-learning. Incremental incorporation of unlabelled data, as and when it becomes available, suits ongoing disaster mitigation. Experiments on three disaster-related datasets show that this improvement also increases overall performance over a standard supervised approach.
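The core self-learning loop can be sketched as follows (a minimal illustration using scikit-learn, not the exact framework from the paper; the function name `self_train`, the threshold value, and the toy numeric features are hypothetical):

```python
# Illustrative self-training loop: iteratively add high-confidence
# pseudo-labelled samples from the unlabelled pool to the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=5):
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab)
    X_unlab = np.asarray(X_unlab, dtype=float)
    clf = LogisticRegression()
    for _ in range(max_rounds):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        conf = clf.predict_proba(X_unlab).max(axis=1)
        keep = conf >= threshold            # only highly confident samples
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, clf.predict(X_unlab[keep])])
        X_unlab = X_unlab[~keep]            # remaining pool for next round
    return clf
```

The same loop admits incremental use: newly arriving unlabelled batches can simply be appended to the pool between rounds.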


[2018] Class Specific TF-IDF Boosting for Short-text Classification:

Proper formulation of features plays an important role in short-text classification because very little text is available. In the literature, Term Frequency - Inverse Document Frequency (TF-IDF) is commonly used to create feature vectors for such tasks. However, TF-IDF does not utilize the class information available in supervised learning. For classification problems, if terms that strongly distinguish between classes can be identified, more weight can be given to those terms during the feature construction phase, which may improve classifier performance. We apply this proposed feature construction method to classify tweets, posted during different disaster scenarios, based on the actionable information they may contain. We used three benchmark datasets containing tweets posted during the Nepal and Italy earthquakes. Improved classifier performance on such tasks may help rescue and relief operations. Experimental results show that the proposed method obtains better classification performance on these benchmark datasets.
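One way to realize the boosting idea is sketched below (a hypothetical illustration, not the exact scheme from the paper): each term's TF-IDF weight is scaled by how unevenly that term is distributed across the classes, so class-discriminating terms get more weight.

```python
# Hypothetical class-aware TF-IDF boosting: terms concentrated in one
# class receive a larger multiplicative boost than terms spread evenly.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def class_boosted_tfidf(docs, labels):
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs).toarray()
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # mean TF-IDF weight of every term within each class
    class_means = np.vstack([X[labels == c].mean(axis=0) for c in classes])
    # ratio of max class mean to overall class mean: ~1 for evenly
    # distributed terms, larger for class-specific terms
    boost = class_means.max(axis=0) / (class_means.mean(axis=0) + 1e-9)
    return X * boost, vec, boost
```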


[2017] Using Social Media for Classifying Actionable Insights in Disaster Scenario:

Micro-blogging sites are an important source of real-time situational information during disasters such as earthquakes, hurricanes, wildfires and floods. Such disasters cause misery in the lives of affected people, and timely identification of the steps needed to help them can mitigate that misery to a large extent. In this paper, we focus on the problem of automated classification of disaster-related tweets into a set of predefined categories, such as resource availability, resource requirement and infrastructure damage. Properly annotating the tweets with this class information can help in the timely determination of the steps needed to address the concerns of people in the affected areas. Depending on the information category, different feature sets might be useful for properly identifying posts belonging to that category. In this work, we define multiple feature sets and use them with various supervised classification algorithms from the literature to study the effectiveness of our approach in annotating tweets with their appropriate information categories.


[2011] Design Heuristic Approach to Message Assignment and Interface Device Minimization Problem:

Nowadays, a high-end car generally has close to 70 ECUs connected through CAN or FlexRay networks. These cars mainly follow a Federated Architecture, a pattern that allows interoperability and information sharing between semi-autonomous, de-centrally organized systems and applications. Here, all the sensors and actuators are connected either to the ECUs or directly to the bus, which is costly. The next generation of designs, however, is evolving towards an Integrated Architecture, where all sensors and actuators are connected to a specific type of interface device, the interface devices are connected to the bus, and the bus is in turn connected to the ECUs. This architecture requires fewer ECUs, lowering the overall cost compared to the Federated Architecture.

My task in this project was to design a system that can efficiently map a system designed with the functional model to the Integrated Architecture model. This mapping step must consider the various available options for both the functional and architectural models.


[2011] Flickr Crawler:

The aim of this assignment was to build a web crawler that crawls the Flickr website (www.flickr.com), stores the retrieved data in a database and performs various types of operations on it. The crawler was implemented using Python, MySQL and the Flickr API.


[2008] Online Auction System (B.Tech Project):

We, a group of four, implemented an Online Auction System that can handle a large amount of user data, based on the Customer-to-Customer business model: users could put items online, along with item details, for other users to place bids on, and each item was granted to the highest bidder. We used PHP with MySQL to implement this project.

Industry

[-] Automated Root Cause Detection from Log Files:

Being on the Customer Focus Team sometimes calls for a lot of log analysis. We started a project to automate the log analysis by designing a system that uses Machine Learning and a Finite State Machine (FSM) to figure out possible erroneous log entries and their locations. The system leverages in-house known-good training logs to build an FSM based on the possible preceding valid log statements; the erroneous customer log is then validated against the FSM using the Needleman-Wunsch algorithm for global alignment of two sequences. This approach was effective for approximately 1 in 10 erroneous logs. We implemented the system using Python 3.
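The global-alignment step can be sketched as a textbook Needleman-Wunsch score over two event sequences (a simplified illustration; the real system aligns FSM-derived log statements, and the scoring values here are arbitrary):

```python
# Minimal Needleman-Wunsch global alignment score between two sequences
# (e.g. an expected log-event sequence vs. an observed one).
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # match / substitution
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[n][m]
```

A low alignment score between the observed log and the FSM-expected sequence flags the region where the logs diverge.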


[-] Securing the NetBackup Client Daemon (BPCD):

The NetBackup Client Daemon (BPCD) is the main orchestrator for various types of incoming NetBackup server requests and runs as a daemon on every NetBackup client machine. Before our feature, BPCD had a security vulnerability: it could be leveraged to access files that are neither necessary nor relevant to NetBackup. We designed and implemented a feature that acts as a filter, providing access only to legitimate NetBackup servers and only to NetBackup-specific files, restricting all other access. We also provided a way to override the permissions for other folders if required, but only through a specific configuration change. We mainly used C and C++: C as the interface to legacy code and C++ for the other parts.


[-] TestEngine development for NetBackup Client Daemon (BPCD) Feature:

The NetBackup Client Daemon (BPCD) runs on all NetBackup client systems, and since our feature restricts various types of calls depending on certain parameters, a defect could lead to complete sanity failure. The change therefore needed careful testing to ensure that the basic features kept working. My job was to handle the integration and component testing of the feature so that nothing breaks on the customer side. I worked on the integration testing using Perl and the TestEngine framework, and on the component testing using C++. To ensure functionality and sanity, the integration testing required almost 650 files to be modified properly; this vast number of files mainly reflects the number of scenarios BPCD affects.


[-] Delayed Cataloging for Microsoft Exchange Server GRT Restores:

NetBackup’s GRT (Granular Recovery Technology) provides customers the capability to restore individual email items from a backup image. However, fetching a particular mail item from an image takes a lot of time, as the whole image has to be processed to locate that item. We designed and implemented a feature that preprocesses the image at backup time itself, eliminating this processing time at restore so that individual email items can be restored faster. We implemented this feature using C++ and the EWS (Exchange Web Services) API.


[-] TestEngine development for Exchange:

TestEngine is a framework written in Perl to automate various manual tests. I implemented the part of TestEngine responsible for interacting with NetBackup to perform backup and restore of MS Exchange, with validation.

Personal

[2016] Research Paper Title Renamer:

Often, when we download research papers, the file names do not contain the title, which makes it confusing to find the most important ones after downloading a lot of papers. This Python script finds the title in the metadata and renames all the PDF files in a given folder by their titles. If the title is not in the metadata, it reads the paper and tries to extract the title from the text. Although it does not work every time, it is quite useful in most cases.
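The metadata-based renaming can be sketched roughly as below (assuming the `pypdf` library; the original script may have used a different PDF library, and `title_to_filename` is a hypothetical helper for sanitizing titles into file names):

```python
# Sketch: rename each PDF in a folder after the title in its metadata.
import re
from pathlib import Path

def title_to_filename(title, max_len=120):
    """Turn a paper title into a filesystem-safe PDF file name."""
    safe = re.sub(r'[\\/:*?"<>|]+', " ", title)   # drop invalid characters
    safe = re.sub(r"\s+", " ", safe).strip()      # collapse whitespace
    return safe[:max_len] + ".pdf"

def rename_pdfs(folder):
    from pypdf import PdfReader  # assumed dependency
    for pdf in Path(folder).glob("*.pdf"):
        meta = PdfReader(pdf).metadata
        title = meta.title if meta else None
        if title:  # fall back to text extraction when metadata is empty
            pdf.rename(pdf.with_name(title_to_filename(title)))
```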


[2014] Dropbox Duplicate Checker:

Dropbox is an online cloud storage service provider, and online storage costs money. If duplicate files are stored on the Dropbox server, Dropbox does not notify users about the space they waste. I worked with a friend on a tool that connects to the Dropbox server, searches for duplicate files online (neither the Dropbox client nor the files need to be on the local system for it to work) and reports them to the user. We used Python with the Dropbox API, along with hashing techniques.
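The duplicate-detection idea reduces to grouping files by a content hash (a generic sketch; in the real tool the hashes came from the Dropbox API, while here they are passed in as precomputed values):

```python
# Group (path, content_hash) pairs by hash and report groups with more
# than one member -- those are the duplicate files.
from collections import defaultdict

def find_duplicates(entries):
    """entries: iterable of (path, content_hash) pairs."""
    groups = defaultdict(list)
    for path, digest in entries:
        groups[digest].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```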


[2011] AccuRate:

The goal of this project was to build an automated movie rating system requiring no user intervention. We used twitter.com as the base platform to collect user sentiment about movies, and a bag-of-words approach to calculate the ratings. For a set of 20 movies, our ratings were within ±0.5 of the IMDB ratings.
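The bag-of-words scoring can be sketched as below (a toy illustration; the actual word lists and scoring were more involved):

```python
# Toy bag-of-words rating: count sentiment words across tweets and map
# the positive fraction onto a 0-10 scale (word lists are illustrative).
POSITIVE = {"great", "awesome", "loved", "brilliant"}
NEGATIVE = {"boring", "awful", "hated", "terrible"}

def rate_movie(tweets):
    pos = neg = 0
    for tweet in tweets:
        for word in tweet.lower().split():
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    total = pos + neg
    return 5.0 if total == 0 else 10.0 * pos / total  # neutral if no hits
```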