Object recognition is an essential component of the visual perception tasks that help robots build a semantic-level understanding of their environment. Although deep learning methods achieve extraordinary recognition performance in previously seen environments, their sensitivity to environmental variations makes them insufficient for deployment in complex and continually changing environments. To realize the goal of long-term autonomy in robots, we need perception methods that go beyond statistical correlation. Therefore, my dissertation focuses on developing robust object recognition methods that combine topological techniques with human-like reasoning mechanisms.
Wide-angle fisheye lens cameras are often the sensor of choice for visually capturing a car's surroundings. Obtaining a bird's-eye-view (BEV) representation from multiple such car-mounted cameras is of tremendous value for advanced driver-assistance systems and autonomous vehicles; the clear presentation of the location and scale of objects in a BEV is helpful for downstream tasks such as lane keeping and object tracking. However, BEV generation in such cases is more challenging than typical image stitching owing to the strong visual distortion produced by the fisheye lenses and the violation of the flat-earth assumption during homography estimation. As a research intern at Volvo Cars, I proposed generating the BEV images and the corresponding segmentation maps using homography-informed spatial transformer networks that address the distortion by spatially manipulating the feature maps (patent application filed). In a subsequent Volvo Cars-sponsored project at the University of Washington, we developed an attention-based network for obtaining BEV segmentation and height maps, which incorporates the camera parameters in its design to overcome the challenges posed by the flat-earth assumption.
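For illustration, below is a minimal sketch of the core warping step in a homography-informed spatial transformer: a fixed 3x3 homography (assumed here to map BEV-plane pixel coordinates to camera image coordinates) drives the sampling grid that pulls camera feature maps onto the BEV plane. The function name `warp_to_bev` and all shapes are illustrative and not taken from the patented design.

```python
# Sketch: homography-driven sampling of camera features onto a BEV grid (PyTorch).
import torch
import torch.nn.functional as F

def warp_to_bev(feat, H, bev_hw):
    """Warp camera feature maps (B, C, Hc, Wc) onto a BEV grid using homographies H (B, 3, 3)."""
    B, C, Hc, Wc = feat.shape
    Hb, Wb = bev_hw
    # Pixel-centre coordinates of the target BEV grid.
    ys, xs = torch.meshgrid(torch.arange(Hb, dtype=feat.dtype),
                            torch.arange(Wb, dtype=feat.dtype), indexing="ij")
    ones = torch.ones_like(xs)
    bev_pts = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)   # (Hb*Wb, 3)
    # Map BEV points into source image coordinates for each batch element.
    src = (H @ bev_pts.T.unsqueeze(0)).transpose(1, 2)             # (B, Hb*Wb, 3)
    src = src[..., :2] / src[..., 2:].clamp(min=1e-8)
    # Normalise to [-1, 1] as required by grid_sample.
    x = 2.0 * src[..., 0] / (Wc - 1) - 1.0
    y = 2.0 * src[..., 1] / (Hc - 1) - 1.0
    grid = torch.stack([x, y], dim=-1).reshape(B, Hb, Wb, 2)
    return F.grid_sample(feat, grid, align_corners=True)

# Example: identity homography on a dummy feature map.
feat = torch.rand(1, 64, 128, 256)
H = torch.eye(3).unsqueeze(0)
bev = warp_to_bev(feat, H, bev_hw=(200, 200))   # (1, 64, 200, 200)
```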
Ultrasound testing is often used during non-destructive evaluation (NDE) of parts to detect manufacturing defects such as wrinkles. As part of Boeing's data science effort at UW through the Boeing Advanced Research Center (BARC), we developed a novel deep learning-based method for automated detection of wrinkle defects in parts made of composite materials. We used Faster R-CNN to locate wrinkles in ultrasound scans of composite parts, thereby replacing manual inspection. Additionally, we used standard image processing techniques to estimate geometric and physical parameters of the detected wrinkles, enabling further classification according to their severity.
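As a rough illustration of the detection setup, the sketch below fine-tunes torchvision's off-the-shelf Faster R-CNN for a single "wrinkle" class; the backbone, data pipeline, and hyperparameters are assumptions, not the configuration used in the BARC project.

```python
# Sketch: adapting torchvision's Faster R-CNN to detect one defect class.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2  # background + wrinkle
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# One illustrative training step: ultrasound scans as CHW float tensors,
# targets as dicts with "boxes" (x1, y1, x2, y2) and integer "labels".
images = [torch.rand(3, 512, 512)]
targets = [{"boxes": torch.tensor([[100., 120., 220., 180.]]),
            "labels": torch.tensor([1])}]
model.train()
losses = model(images, targets)   # dict of RPN and ROI-head losses
sum(losses.values()).backward()
```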
Composite materials, such as carbon-fiber-reinforced polymers, are used in manufacturing aircraft structures due to their combination of high strength and low density. The manufacturing process is, however, challenging in terms of quality control, which makes accurate in-process inspection of composite parts particularly important. Therefore, we developed a method for detecting the tow ends of laid fibers from grayscale images using semantic segmentation. We use a modified U-Net, trained with pseudo-labeling, to segment the images, and then apply a series of post-processing steps to extract the tow ends from the segmentation maps. A presentation on this work won the best presentation award at the 2019 Boeing Technical Excellence Conference.
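The sketch below shows a compact U-Net-style encoder-decoder for binary segmentation of grayscale images; the channel widths, the `TinyUNet` name, and the omission of the pseudo-labeling schedule are all simplifications of the modified architecture described above.

```python
# Sketch: small U-Net-style network producing per-pixel tow-end logits.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = conv_block(1, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, 1, 1)   # one logit per pixel

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)

logits = TinyUNet()(torch.rand(1, 1, 256, 256))   # (1, 1, 256, 256) segmentation logits
```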
Accurate estimation of the positions and shapes of microscale objects is crucial for automated imaging-guided manipulation using a non-contact technique such as optical tweezers. Perception methods that use traditional computer vision algorithms tend to fail when the manipulation environments are crowded. In this work, we present a deep learning model for semantic segmentation of images of such environments. Our model successfully performs segmentation with a high mean Intersection over Union (IoU) score of 0.91.
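For reference, mean IoU as reported above can be computed from integer label maps as in the sketch below; this is a generic metric implementation, not the project's evaluation code.

```python
# Sketch: mean Intersection-over-Union over C classes from integer label maps.
import numpy as np

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```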
Automated real-time prediction of the ergonomic risks of manipulating objects is a key unsolved challenge in developing effective human-robot collaboration systems. We cast it as a problem of action segmentation from RGB-D camera videos. Spatial features are first learned from the video frames using a deep convolutional model and then fed sequentially to temporal convolutional networks, which semantically segment the frames into a hierarchy of actions. Every action is labeled as ergonomically safe, requiring monitoring, or needing immediate attention based on automated computation of the Rapid Entire Body Assessment (REBA) score. We also collected a new dataset, the UW-IOM dataset, comprising twenty individuals picking up and placing objects of varying weights to and from cabinet and table locations at various heights. Results show high F1 overlap scores of 87-94% between the ground-truth and predicted frame labels for videos lasting over two minutes and consisting of a large number of actions.
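A minimal sketch of the temporal stage is shown below: a stack of dilated 1D convolutions maps per-frame CNN features to per-frame action logits. The layer sizes, the number of action classes, and the `TemporalConvNet` name are illustrative rather than the exact architecture used in the paper.

```python
# Sketch: dilated temporal convolutions over per-frame features (B, T, D).
import torch
import torch.nn as nn

class TemporalConvNet(nn.Module):
    def __init__(self, feat_dim=2048, hidden=64, num_actions=17, num_layers=4):
        super().__init__()
        layers, cin = [], feat_dim
        for i in range(num_layers):
            # Dilation doubles each layer to grow the temporal receptive field.
            layers += [nn.Conv1d(cin, hidden, kernel_size=3,
                                 padding=2 ** i, dilation=2 ** i),
                       nn.ReLU(inplace=True)]
            cin = hidden
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Conv1d(hidden, num_actions, kernel_size=1)

    def forward(self, frame_feats):                     # (B, T, D)
        x = frame_feats.transpose(1, 2)                 # (B, D, T) for Conv1d
        return self.head(self.tcn(x)).transpose(1, 2)   # (B, T, num_actions)

logits = TemporalConvNet()(torch.rand(2, 300, 2048))    # 300 frames per clip
```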
Automated optical tweezers-based robotic manipulation of microscale objects requires real-time visual perception for estimating the states, i.e., positions and orientations, of the objects. Such visual perception is particularly challenging in heterogeneous environments comprising mixtures of biological and colloidal objects, such as cells and microspheres, when the popular imaging modality of low-contrast bright-field microscopy is used. Therefore, as an undergraduate intern advised by Prof. Ashis Banerjee, I investigated the performance of algorithms such as SURF and MSER, combined with other image processing techniques, for detecting microscale objects in different media. Apart from developing perception methods, I also deployed them on a host system interfaced with a camera and optical tweezers for real-time control of microspheres.
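As an example of this classical pipeline, the sketch below runs OpenCV's MSER detector on a bright-field frame and fits enclosing circles as rough position estimates; the file names and detector parameters are illustrative and would need tuning per medium.

```python
# Sketch: MSER-based blob detection on a bright-field microscopy frame.
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # illustrative input path
mser = cv2.MSER_create(5, 60, 14400)                    # delta, min_area, max_area
regions, _ = mser.detectRegions(gray)

# Fit enclosing circles as rough position/size estimates for microspheres or cells.
detections = [cv2.minEnclosingCircle(r.reshape(-1, 1, 2)) for r in regions]
for (cx, cy), radius in detections:
    cv2.circle(gray, (int(cx), int(cy)), int(radius), 255, 1)
cv2.imwrite("detections.png", gray)
```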
An image captured in a dark environment usually has some ambient illumination, but it looks dark and noisy. At the same time, using a flash can introduce unwanted artifacts such as sharp shadows at silhouettes, red eyes, and non-uniform brightness. We propose a new framework to enhance images captured in dark environments by combining the best features from a flash and a no-flash image. We use a sparse and redundant dictionary learning-based approach to denoise the no-flash image, and a weighted least squares framework to transfer sharp details from the flash image into the no-flash image. We show that our approach is simple and generates better images than the state-of-the-art flash/no-flash fusion method.
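A rough sketch of the denoising stage is given below, using scikit-learn's MiniBatchDictionaryLearning on overlapping patches as a stand-in for the sparse and redundant dictionary approach; the patch size, number of atoms, and sparsity weight are assumptions, and the weighted least squares detail transfer is not shown.

```python
# Sketch: patch-based dictionary-learning denoising of the no-flash image.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

def denoise(noisy, patch_size=(7, 7), n_atoms=128):
    patches = extract_patches_2d(noisy, patch_size)
    data = patches.reshape(patches.shape[0], -1)
    mean = data.mean(axis=1, keepdims=True)
    data = data - mean                                    # remove per-patch DC component
    dico = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=1.0)
    code = dico.fit_transform(data)                       # sparse codes over learned atoms
    recon = code @ dico.components_ + mean                # reconstruct and restore means
    recon = recon.reshape(patches.shape)
    return reconstruct_from_patches_2d(recon, noisy.shape)  # average overlapping patches
```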