Research

Feature Engineering

Feature engineering is an important step in the data science pipeline. There are two usual approaches to this task: fully manual and fully automated. In the manual approach, a data scientist or human analyst applies their domain knowledge and goes through several iterations to arrive at the best set of features for the ML model. This is a slow and tedious process, but it yields features that are easy for humans to comprehend. In the fully automated approach, the analyst feeds the data to a software system and gets back a suggested list of engineered features. This is much faster, but it can produce features that are not very intuitive, even though they might work well for prediction. My current research is to develop a semi-automated method that combines human expertise with an automated system, with the aim of producing an intuitive set of engineered features much faster than the manual approach.

Feature Selection

Feature selection is an important step in the data science pipeline, and it is critical to develop efficient algorithms for this step. Mutual Information (MI) is one of the important measures used for feature selection: attributes are ranked in descending order of their MI score, and the top-k attributes are retained. The goal of this work is to develop a new measure that effectively approximates the top-k attributes without actually calculating MI. Because the new measure is faster to compute than MI itself, feature selection has a better runtime.
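
For reference, the baseline MI ranking described above can be sketched as follows. This is only an illustration of the standard top-k procedure, not the new measure. It assumes that every attribute is discrete and that MI is computed between each attribute and a target label; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi


def top_k_by_mi(columns, target, k):
    """Rank attributes by MI with the target and keep the k highest.

    columns: dict mapping attribute name -> list of discrete values
    (assumed layout, chosen for this sketch only).
    """
    scores = {name: mutual_information(values, target)
              for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data, made up for illustration.
target = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {
    "constant": [1, 1, 1, 1, 1, 1, 1, 1],   # carries no information
    "weak":     [0, 1, 0, 1, 0, 1, 0, 1],   # partly agrees with the target
    "copy":     [0, 0, 1, 1, 0, 1, 0, 1],   # identical to the target
}

selected, scores = top_k_by_mi(columns, target, k=2)
```

Every attribute is scored in full before the ranking is cut at k. That full scoring pass is the cost a cheaper approximate measure would aim to avoid.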