Research

My research interests lie in the application of computer vision, machine learning, and multimodal data analytics to real-world problems, including but not limited to smart rooms, human behavior understanding, social signal processing, human-centered computing, medical imaging, and online recommender systems.

In my current appointment as a Research Engineer/Scientist at Stanford University, my research focuses on developing predictive models that leverage imaging and non-imaging information for early cancer detection, aggressiveness classification, and treatment planning. Using labels derived from pathology, we train machine learning models that detect and identify subtypes of cancer on radiology images using multimodal co-learning strategies. I work with a highly interdisciplinary team of computer scientists, engineers, radiologists, pathologists, and oncologists.
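
As a rough sketch of the core label flow only (the module names, dimensions, and class choices below are illustrative placeholders, not the actual pipeline, and the full multimodal co-learning setup is far more involved), a radiology encoder can be supervised directly by pathology-derived labels:

```python
# Minimal, hypothetical sketch: a radiology image model trained with
# pathology-derived labels. Everything here is illustrative, not production code.
import torch
import torch.nn as nn

class RadiologyEncoder(nn.Module):
    def __init__(self, in_channels=1, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

encoder = RadiologyEncoder()
classifier = nn.Linear(128, 3)           # e.g., benign / indolent / aggressive (toy classes)
criterion = nn.CrossEntropyLoss()

mri_patch = torch.randn(8, 1, 64, 64)    # toy batch of radiology patches
path_label = torch.randint(0, 3, (8,))   # labels derived from pathology, per patch

logits = classifier(encoder(mri_patch))  # imaging model predicts pathology-confirmed labels
loss = criterion(logits, path_label)
loss.backward()
```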

My graduate research has been an interdisciplinary fusion of Artificial Intelligence (AI) with social psychology, focusing on the development of unobtrusive sensing and analysis of human behavior in task-based group interactions. Studying human behavior in groups is an inherently difficult problem for a number of reasons, including (1) the multimodal nature of human behavior, (2) behavioral changes occurring at extremely fine spatial and temporal resolution, and (3) the need for unobtrusive sensing techniques for recording behavioral data in unconstrained natural environments. In my Ph.D. work, I addressed several of these challenges and developed computational models to explain group performance, perceived leadership, and contribution, using unobtrusive ceiling-mounted distance sensors. I work in close collaboration with social psychology, communication, and network science researchers (Dr. Brooke F. Welles and Dr. Christoph Riedl) from Northeastern University (NEU), and natural language processing researchers (Dr. Heng Ji) from RPI. Most of my graduate education has been funded by the National Science Foundation (NSF), and I am affiliated with the Engineering Research Center (ERC) for Lighting Enabled Systems and Applications (LESA).

The work of our research group in the NSF ERC for Lighting Enabled Systems and Applications (LESA) was featured in this video made by the Illuminating Engineering Society.

Radiology-pathology fusion for automated prostate cancer detection

Development of privacy-preserving occupant-aware smart spaces

Video cameras are often not the right choice for the development of occupant-aware spaces where the privacy of the occupants is critical (e.g., smart homes, office spaces, hospital rooms, restrooms). Commonly used occupancy sensors such as passive infrared and ultrasound are not highly accurate and do not provide a higher-level understanding of the exact location and pose of the occupants. We investigated the use of sparse arrays of ceiling-mounted, single-pixel time-of-flight (ToF) sensors. These sensors preserve the privacy of the occupants because they only return the range to a set of hit points, and people appear as blobs in the depth map of the room formed from these distance measurements. We used both real-world experiments and 3D computer graphics simulations to test the performance of our tracking and coarse body pose estimation algorithms using these extremely low-resolution overhead depth sensors.
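
As a toy illustration of the kind of processing these sensors enable (the grid size, thresholds, and helper function below are invented for the example and are not the algorithms we deployed), occupant blobs can be pulled out of a low-resolution overhead depth map with simple connected-component analysis:

```python
# Toy illustration: locating occupant "blobs" in a low-resolution overhead
# depth map built from single-pixel ToF range readings. All numbers are made up.
import numpy as np
from scipy import ndimage

def find_occupants(depth_map, floor_distance, min_height=0.8, min_pixels=2):
    """Return centroids (row, col) of regions that rise above the floor."""
    height_above_floor = floor_distance - depth_map   # sensors measure range downward
    mask = height_above_floor > min_height            # keep anything taller than ~0.8 m
    labels, n = ndimage.label(mask)                   # connected components = blobs
    centroids = []
    for i in range(1, n + 1):
        blob = labels == i
        if blob.sum() >= min_pixels:
            centroids.append(ndimage.center_of_mass(blob))
    return centroids

# Simulated 8x8 grid of ceiling sensors in a room with a 2.8 m ceiling.
depth = np.full((8, 8), 2.8)
depth[3:5, 4:6] = 1.1                                 # a standing person under four sensors
print(find_occupants(depth, floor_distance=2.8))      # -> [(3.5, 4.5)]
```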

Sensor-fusion for understanding body orientation, head pose and visual focus of attention

The relative seated body orientation of participants in a group discussion can convey rich non-verbal information about attentiveness and mutual engagement. In this project, we focused on understanding relative seated body orientation using a fusion of low-resolution (25 × 20 pixel) overhead ToF sensors and lapel microphones. We used compressed sensing and Bayesian estimation techniques to develop sensor-fusion algorithms that classify a seated person blob into one of eight orientation directions.
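
The sketch below conveys only the fusion step in a heavily simplified form (the compressed-sensing front end and the actual likelihood models are omitted, and all numbers are invented): a depth-based likelihood over the eight orientation classes is combined with an audio-derived prior, and the MAP class is taken.

```python
# Hedged sketch of the Bayesian fusion idea over 8 discrete seated orientations.
import numpy as np

N_CLASSES = 8  # seated orientations spaced 45 degrees apart

def fuse_orientation(depth_likelihood, audio_prior):
    """Posterior over the orientation classes and its argmax (MAP estimate)."""
    posterior = depth_likelihood * audio_prior
    posterior /= posterior.sum()
    return posterior, int(np.argmax(posterior))

# Toy numbers: depth evidence favors class 2; audio weakly favors classes 1-3
# (e.g., the person is likely facing the current speaker).
depth_likelihood = np.array([0.02, 0.10, 0.55, 0.15, 0.05, 0.05, 0.04, 0.04])
audio_prior      = np.array([0.05, 0.20, 0.25, 0.20, 0.10, 0.08, 0.06, 0.06])

posterior, map_class = fuse_orientation(depth_likelihood, audio_prior)
print(map_class, posterior.round(3))
```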

We used higher-resolution overhead ToF sensors that enabled the creation of accurate 3D point clouds for head pose estimation, by robustly fitting ellipsoids to the segmented 3D heads of the occupants.
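
A minimal sketch of the geometric idea, assuming a clean, already-segmented head point cloud (the robust fitting used in the actual work is not shown): approximate the head by the ellipsoid implied by the covariance eigendecomposition and read a coarse orientation estimate off its principal axes.

```python
# Minimal sketch, not the robust fitting procedure from the actual work.
import numpy as np

def fit_head_ellipsoid(points):
    """Return (center, axis_lengths, axis_directions) of a PCA ellipsoid."""
    center = points.mean(axis=0)
    cov = np.cov((points - center).T)
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # largest axis first
    axis_lengths = 2.0 * np.sqrt(eigvals[order])   # 2-sigma extent along each axis
    axis_directions = eigvecs[:, order]            # columns are unit axis vectors
    return center, axis_lengths, axis_directions

# Toy "head": anisotropic Gaussian blob with head-like radii (metres), 1.6 m above floor.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3)) * [0.08, 0.10, 0.12] + [0.0, 0.0, 1.6]
center, lengths, directions = fit_head_ellipsoid(points)
print(center.round(2), lengths.round(2), directions[:, 0].round(2))
```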

Using the head pose and the synchronized speaker identification information, we developed sensor-fusion algorithms to estimate the visual focus of attention (VFOA) of meeting participants at a very fine temporal resolution.
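
The following is only a schematic of the idea (the seat positions, scoring, and speaker bonus are invented for illustration): each participant's VFOA is assigned to the target whose direction best agrees with the estimated head pose, with the synchronized speaker identity nudging the decision toward the current speaker.

```python
# Illustrative sketch only: VFOA target selection from head pose plus speaker ID.
import numpy as np

def estimate_vfoa(observer_xy, head_dir, targets_xy, speaker_id=None, bonus=0.1):
    """Return the index of the most likely VFOA target among targets_xy."""
    scores = []
    for idx, target in enumerate(targets_xy):
        to_target = np.asarray(target, dtype=float) - np.asarray(observer_xy, dtype=float)
        to_target = to_target / np.linalg.norm(to_target)
        score = float(np.dot(head_dir, to_target))   # cosine of the gaze error
        if idx == speaker_id:
            score += bonus                           # people tend to look at the speaker
        scores.append(score)
    return int(np.argmax(scores))

# Toy meeting: observer at the origin, three other seats, head roughly toward seat 1.
head_dir = np.array([0.9, 0.44])
head_dir /= np.linalg.norm(head_dir)
print(estimate_vfoa((0, 0), head_dir, [(1, -1), (1, 0.5), (-1, 1)], speaker_id=2))
```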

Multi-modal understanding of perceived emergent leadership and contribution from unobtrusive sensors

We collected a dataset of task-based group interactions using the overhead ToF sensors and lapel microphones, in which the participants performed the Lunar Survival Task. Using the location, body orientation, and VFOA estimated from the overhead depth measurements, and fusing them with non-verbal and verbal metrics derived from the speech signal, we studied their correlations with perceived leadership and contribution and developed computational models to predict emergent leaders and major contributors.
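
A schematic example of the modeling step, using entirely synthetic data and hypothetical feature names (speaking time, received visual attention, group addressing) rather than our actual dataset or feature set:

```python
# Schematic example: predict a perceived-leadership label from fused features.
# The data and the relationship between features and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200
speaking_time    = rng.uniform(0, 1, n)   # hypothetical speech-derived metric
visual_attention = rng.uniform(0, 1, n)   # hypothetical share of others' VFOA received
group_addressing = rng.uniform(0, 1, n)   # hypothetical frequency of addressing the group
X = np.column_stack([speaking_time, visual_attention, group_addressing])

# Synthetic "perceived leader" labels loosely tied to the features.
y = ((0.5 * speaking_time + 0.3 * visual_attention + 0.2 * group_addressing
      + rng.normal(0, 0.1, n)) > 0.5).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_.round(2), round(model.score(X, y), 2))
```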

Interactive, context-aware, intelligent agent for assisting online customers in visual browsing of large catalogs using multi-modal dialog 

Although not directly related to my Ph.D. research, I had the opportunity to work as a research intern at IBM Research Labs, India during the summer of 2017, where I worked with Dr. Vikas C. Raykar on the development of an interactive, context-aware agent that leverages multi-modal dialog to help online customers visually browse large online catalogs. We formulated the problem of "showing the k best products to a user," given the dialog context so far, as sampling from a Gaussian Mixture Model in a high-dimensional joint multi-modal embedding space that embeds both text and image queries. While my doctoral research focused on physically collocated systems, this research experience provided insights into the development of deep-learning-based AI agents for online human-computer interaction.
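
As a toy rendering of that formulation (the embedding dimensions, component count, and data below are made up, and the real system used learned multimodal embeddings), one can fit a Gaussian mixture to the context embeddings, sample candidate points, and surface the nearest catalog items:

```python
# Toy illustration: sample candidate points from a GMM fit in a joint embedding
# space, then return the catalog items nearest to those samples as the "k best".
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
catalog = rng.normal(size=(500, 16))         # pretend joint text+image embeddings

# Embeddings of the dialog context so far (e.g., the user's text and image queries).
context = catalog[:20] + rng.normal(0, 0.1, size=(20, 16))

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(context)
samples, _ = gmm.sample(8)                   # k = 8 candidate points in embedding space

# Nearest catalog item to each sampled point (duplicates dropped, order kept).
nearest = np.argmin(np.linalg.norm(catalog[None] - samples[:, None], axis=-1), axis=1)
top_k = list(dict.fromkeys(nearest.tolist()))
print(top_k)
```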