Dissertation Research

Human-Inspired Topological Representations for Robust Perception in Unstructured Environments

Deep learning models that achieve remarkable performance on perception-related problems in seen environments often suffer dramatic drops in performance when tested in unseen environments. Therefore, in this work, funded in part by Amazon Science, we are developing methods for robust object detection using topologically persistent features.

Recognition in unseen environments

In our 2D shape-based recognition framework, we extract two kinds of features, namely, the sparse persistence image (PI) and the amplitude, by applying persistent homology to multi-directional height function-based filtrations of the cubical complexes representing the object segmentation maps obtained from RGB images. The features are then used to train a fully connected network for recognition. Unlike the state-of-the-art Faster R-CNN and its domain adaptive counterpart, Domain Adaptive Faster R-CNN, the recognition performance of our models, which are trained using images from a living room, remains relatively unaffected on images from an unseen warehouse. (See more)
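
As a rough illustration of this kind of pipeline, the sketch below uses the giotto-tda library (an assumption made for illustration; the exact implementation, filtration directions, and feature parameters of the published method may differ) to compute height filtrations, persistence images, and amplitudes from toy binary segmentation maps.

```python
import numpy as np
from gtda.images import HeightFiltration
from gtda.homology import CubicalPersistence
from gtda.diagrams import PersistenceImage, Amplitude

# Toy binary object segmentation maps of shape (n_samples, H, W).
masks = np.zeros((4, 64, 64), dtype=bool)
masks[:, 16:48, 20:44] = True

features = []
# A few filtration directions standing in for the multi-directional height functions.
for direction in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
    filtration = HeightFiltration(direction=direction).fit_transform(masks)
    diagrams = CubicalPersistence(homology_dimensions=(0, 1)).fit_transform(filtration)
    # Persistence-image and amplitude features per diagram.
    pi = PersistenceImage(n_bins=16).fit_transform(diagrams)
    amp = Amplitude(metric="wasserstein").fit_transform(diagrams)
    features.append(np.concatenate([pi.reshape(len(masks), -1), amp], axis=1))

X = np.concatenate(features, axis=1)  # feature vectors for a fully connected classifier
print(X.shape)
```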

Recognition in cluttered unseen environments

The above work showcases the potential of topological features but faces challenges with varying camera poses, emphasizing the necessity for 3D shape-based features. I introduced TOPS, a new descriptor for point clouds derived from depth images, and THOR, a recognition framework inspired by human reasoning. TOPS captures the detailed shape of objects in a permutation-invariant manner while ensuring similarity between the descriptors of occluded objects and those of the corresponding unoccluded objects. THOR uses this similarity to facilitate object unity-based recognition using standard classifiers, eliminating the need for training data that exhaustively represents all occlusion scenarios. Specifically, a slicing-based descriptor function is designed to compute topological features from filtrations of simplicial complexes using persistent homology, which together form the TOPS descriptor. (See more)
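
The sketch below gives a rough flavor of a slicing-based topological descriptor for point clouds, using the ripser library for Vietoris-Rips persistence; the slice axis, per-slice summaries, and vectorization here are illustrative assumptions and not the exact TOPS construction.

```python
import numpy as np
from ripser import ripser

def sliced_topological_descriptor(points, n_slices=5, axis=2):
    """Slice a point cloud along one axis and summarize each slice with
    persistence statistics (a toy stand-in for a slicing-based descriptor)."""
    lo, hi = points[:, axis].min(), points[:, axis].max()
    edges = np.linspace(lo, hi, n_slices + 1)
    descriptor = []
    for i in range(n_slices):
        mask = (points[:, axis] >= edges[i]) & (points[:, axis] <= edges[i + 1])
        slice_pts = points[mask]
        if len(slice_pts) < 3:
            descriptor.append(np.zeros(2))
            continue
        # 1-dimensional persistence of the Vietoris-Rips filtration on the slice.
        dgm_h1 = ripser(slice_pts, maxdim=1)["dgms"][1]
        lifetimes = dgm_h1[:, 1] - dgm_h1[:, 0] if len(dgm_h1) else np.zeros(1)
        descriptor.append(np.array([lifetimes.max(), lifetimes.sum()]))
    return np.concatenate(descriptor)

cloud = np.random.rand(500, 3)  # stand-in for a point cloud computed from a depth image
print(sliced_topological_descriptor(cloud).shape)
```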

Other Research Projects

Bird’s Eye View Generation from Surround-View Fisheye Camera Images for Automated Driving

Wide-angle fisheye lens cameras are often the sensor of choice to visually capture a car's surroundings. Obtaining a bird’s-eye-view (BEV) representation from multiple such car-mounted cameras is of tremendous value for advanced driver-assistance systems and autonomous vehicles; the clear presentation of the location and scale of objects in a BEV is helpful for downstream tasks such as lane keeping and object tracking. However, BEV generation in such cases is more challenging than typical image stitching owing to the strong visual distortion produced by the fisheye lenses and the violation of the flat earth assumption during homography estimation. As a research intern at Volvo Cars, I proposed generating the BEV images and the corresponding segmentation maps using homography-informed spatial-transformer networks that address the distortion by spatially manipulating the feature maps (patent application filed). In a subsequent Volvo Cars-sponsored project at the University of Washington, I developed an attention-based network for obtaining BEV segmentation and height maps, which incorporates the camera parameters in its design to overcome the challenges due to the flat earth assumption.
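
To make the idea of homography-informed spatial manipulation of feature maps concrete, here is a minimal PyTorch sketch that warps a feature map with a fixed 3x3 homography via grid sampling; the learned spatial-transformer components, fisheye distortion handling, and camera geometry of the actual networks are not shown, and the homography below is made up.

```python
import torch
import torch.nn.functional as F

def warp_features_with_homography(feat, H_mat):
    """Warp a feature map (B, C, H, W) with a 3x3 homography defined in
    normalized [-1, 1] coordinates (an illustrative building block for a
    homography-informed spatial transformer)."""
    b, c, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    ones = torch.ones_like(xs)
    coords = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)   # (H*W, 3)
    # Map output-pixel coordinates back into the source feature map.
    src = coords @ torch.inverse(H_mat).T
    src = src[:, :2] / src[:, 2:3].clamp(min=1e-8)
    grid = src.reshape(1, h, w, 2).expand(b, -1, -1, -1)
    return F.grid_sample(feat, grid, align_corners=True)

feat = torch.randn(1, 8, 32, 64)                 # toy multi-channel feature map
H_mat = torch.tensor([[1.0, 0.2, 0.0],           # made-up homography for illustration
                      [0.0, 1.0, 0.1],
                      [0.0, 0.1, 1.0]])
print(warp_features_with_homography(feat, H_mat).shape)
```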

Non-Destructive Evaluation of Composite Structures for Wrinkle Detection

Ultrasound testing is often used during non-destructive evaluation (NDE) of parts to detect manufacturing defects such as wrinkles. As part of Boeing's data science effort at UW through the Boeing Advanced Research Center (BARC), we developed a novel deep learning-based method for automated detection of wrinkle defects in parts made of composite materials. We use Faster R-CNN to locate wrinkles in ultrasound scans of composite parts, thereby replacing manual inspection. Additionally, we use standard image processing techniques to determine certain geometric and physical parameters of the detected wrinkles to enable further classification according to their severity.
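
A condensed sketch of the two stages, assuming torchvision's Faster R-CNN and OpenCV: a detector with a single "wrinkle" class is fine-tuned, and simple geometric parameters are then measured inside a detected box. The class setup, threshold, and geometry rules below are placeholders rather than the actual pipeline.

```python
import cv2
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Stage 1: Faster R-CNN with a two-class head (background + wrinkle).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
# ... fine-tune on annotated ultrasound scans (training loop omitted) ...

# Stage 2: simple geometric parameters of a detected wrinkle region.
def wrinkle_geometry(scan_gray, box, thresh=128):
    """Estimate length, width, and orientation of a wrinkle inside a detected
    bounding box using a min-area rectangle fit (illustrative only)."""
    x1, y1, x2, y2 = [int(v) for v in box]
    roi = scan_gray[y1:y2, x1:x2]
    _, binary = cv2.threshold(roi, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    (_, _), (w, h), angle = cv2.minAreaRect(largest)
    return {"length": max(w, h), "width": min(w, h), "angle": angle}
```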

Tow End Detection in Automated Fiber Placement using Deep Learning 

Composite materials, comprising carbon-fiber reinforced polymers, are used in manufacturing aircraft structures due to their combination of high strength and low density. The manufacturing process is, however, challenging in terms of quality control, which makes accurate in-process inspection of composite parts particularly important. Therefore, we developed a method for detecting tow ends of laid fibers from grayscale images using semantic segmentation. We use a modified U-Net structure, trained using pseudo-labeling, for semantic segmentation of the images. Then, a series of post-processing steps is performed to extract tow ends from the segmentation maps. A presentation on this work was awarded the best presentation at the 2019 Boeing Technical Excellence Conference. (See more)
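
A minimal PyTorch sketch of a pseudo-labeling step of the kind described above: a partially trained segmentation network labels unannotated images, and only high-confidence pixels are kept for the next training round. The confidence threshold and ignore-index are illustrative assumptions.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, images, conf_thresh=0.9, ignore_index=255):
    """Generate per-pixel pseudo-labels for unannotated images.
    Pixels below the confidence threshold are marked ignore_index so the
    segmentation loss skips them (illustrative pseudo-labeling step)."""
    model.eval()
    logits = model(images)                 # segmentation logits (B, n_classes, H, W)
    probs = torch.softmax(logits, dim=1)
    conf, labels = probs.max(dim=1)        # per-pixel confidence and class index
    labels[conf < conf_thresh] = ignore_index
    return labels

# Usage sketch:
#   pseudo = make_pseudo_labels(unet, unlabeled_batch)
#   loss = torch.nn.functional.cross_entropy(unet(unlabeled_batch), pseudo,
#                                            ignore_index=255)
```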

Learning-based Semantic Segmentation for Position Detection of Microscale Objects

Accurate estimation of the positions and shapes of microscale objects is crucial for automated imaging-guided manipulation using a non-contact technique such as optical tweezers. Perception methods that use traditional computer vision algorithms tend to fail when the manipulation environments are crowded. In this work, we present a deep learning model for semantic segmentation of the images representing such environments. Our model successfully performs segmentation with a high mean Intersection over Union (IoU) score of 0.91. (See more)
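
For reference, here is a small sketch of how a mean Intersection over Union score like the one reported above can be computed from predicted and ground-truth label maps; this is a generic implementation, not the exact evaluation code used in the work.

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean Intersection over Union across classes for integer label maps."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, (64, 64))    # toy predicted segmentation map
target = np.random.randint(0, 3, (64, 64))  # toy ground-truth map
print(mean_iou(pred, target, n_classes=3))
```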

Ergonomic Risk Prediction of Indoor Object Manipulation (IOM) Actions using Deep Learning

Automated real-time prediction of the ergonomic risks of manipulating objects is a key unsolved challenge in developing effective human-robot collaboration systems. We cast it as a problem of action segmentation from RGB-D camera videos. Spatial features are first learned from the video frames using a deep convolutional model and are then fed sequentially to temporal convolutional networks to semantically segment the frames into a hierarchy of actions. Every action is labeled as ergonomically safe, requiring monitoring, or needing immediate attention based on automated computation of the Rapid Entire Body Assessment (REBA) score. We also collected a new dataset, the UW-IOM dataset, comprising twenty individuals picking up and placing objects of varying weights to and from cabinet and table locations at various heights. Results show very high F1 overlap scores of 87-94% between the ground truth and predicted frame labels for videos lasting over two minutes and consisting of a large number of actions. (See more)
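
The sketch below shows a tiny dilated temporal convolutional network in PyTorch that maps per-frame spatial features to per-frame action logits; the feature dimension, number of layers, and hierarchical output structure are illustrative assumptions, not the architecture used in the work.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One dilated temporal convolution block with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                       # x: (batch, channels, time)
        return self.relu(x + self.conv(x))

class TinyTCN(nn.Module):
    """Maps per-frame spatial features to per-frame action logits."""
    def __init__(self, feat_dim=128, n_actions=10, n_layers=4):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, 64, kernel_size=1)
        self.blocks = nn.Sequential(*[TemporalBlock(64, 2 ** i)
                                      for i in range(n_layers)])
        self.head = nn.Conv1d(64, n_actions, kernel_size=1)

    def forward(self, frame_features):          # (batch, feat_dim, time)
        return self.head(self.blocks(self.proj(frame_features)))

logits = TinyTCN()(torch.randn(1, 128, 300))    # 300 video frames
print(logits.shape)                             # (1, n_actions, 300)
```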

Image Processing-based Position Detection of Microscale Objects

Automated optical tweezers-based robotic manipulation of microscale objects requires real-time visual perception for estimating the states, i.e., positions and orientations, of the objects. Such visual perception is particularly challenging in heterogeneous environments comprising mixtures of biological and colloidal objects, such as cells and microspheres, when the popular imaging modality of low-contrast bright field microscopy is used. Therefore, as an undergraduate intern advised by Prof. Ashis Banerjee, I investigated the performance of algorithms such as SURF and MSER, combined with other image processing techniques, for detection of microscale objects in different media. Apart from developing perception methods, I also deployed the methods on a host system interfaced with a camera and optical tweezers for real-time control of microspheres. (See more)
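
As a small illustration, the OpenCV sketch below detects blob-like regions in a synthetic low-contrast frame using MSER; SURF would additionally require the opencv-contrib build with non-free modules enabled, and the actual pipeline combined such detectors with further image processing steps not shown here.

```python
import cv2
import numpy as np

# Synthetic bright-field-like frame: dim background with a few brighter blobs.
frame = np.full((240, 320), 90, dtype=np.uint8)
cv2.circle(frame, (100, 120), 12, 160, -1)
cv2.circle(frame, (220, 80), 10, 150, -1)

# MSER picks out stable blob-like regions such as microspheres.
mser = cv2.MSER_create()
regions, _ = mser.detectRegions(frame)
centers = [tuple(np.mean(r, axis=0).astype(int)) for r in regions]
print(f"{len(regions)} candidate regions, centers: {centers[:5]}")
```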

Enhancement of Images from Dark Environments Using Learning-based Approaches

An image captured in a dark environment usually has some ambient illumination, but the image looks dark and noisy. At the same time, the use of a flash can introduce unwanted artifacts such as sharp shadows at silhouettes, red eyes, and non-uniform brightness in the image. We propose a new framework to enhance images captured in dark environments by combining the best features from a flash and a no-flash image. We use a sparse and redundant dictionary learning-based approach to denoise the no-flash image. A weighted least squares framework is used to transfer sharp details from the flash image into the no-flash image. We show that our approach is simple and generates better images than the state-of-the-art flash/no-flash fusion method. (See more)
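
A rough sketch of the patch-based dictionary denoising step, using scikit-learn's MiniBatchDictionaryLearning; the patch size, sparsity level, and training setup are illustrative assumptions, and the weighted least squares detail-transfer step is omitted here.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

def dictionary_denoise(noisy, patch_size=(7, 7), n_atoms=64):
    """Denoise a grayscale image with a patch dictionary learned from the image
    itself (a stand-in for the sparse dictionary-based denoising step)."""
    patches = extract_patches_2d(noisy, patch_size)
    flat = patches.reshape(len(patches), -1)
    mean = flat.mean(axis=1, keepdims=True)
    flat = flat - mean                          # learn on zero-mean patches
    dico = MiniBatchDictionaryLearning(n_components=n_atoms,
                                       transform_algorithm="omp",
                                       transform_n_nonzero_coefs=2,
                                       random_state=0)
    codes = dico.fit(flat).transform(flat)      # sparse codes over learned atoms
    recon = codes @ dico.components_ + mean
    return reconstruct_from_patches_2d(recon.reshape(patches.shape), noisy.shape)

noisy = np.clip(0.4 + 0.1 * np.random.randn(64, 64), 0, 1)  # toy dark, noisy image
print(dictionary_denoise(noisy).shape)
```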