Computer Vision (CS6476)

MS Robotics, Georgia Tech

Duration: January 2022 - April 2022

Professor: James Hays

This course is an introductory-level graduate course in computer vision. The class covered several CV algorithms and how to apply them to solve practical problems, including detecting and matching features to stitch panoramas (Harris corners, SIFT, and RANSAC), estimating the motion of objects across images (optical flow), tracking subjects in videos (particle filters), and deep learning for scene recognition and semantic segmentation. Assignments generally consisted of implementing an algorithm and then experimenting with its parameters to reach a target accuracy. Most assignments were written in Python using the NumPy and PyTorch libraries. When I took the course, it consisted of the following assignments (a brief overview is given below):

Convolution and Hybrid Images:

This assignment aims to write an image filtering function and use it to create hybrid images, following a simplified version of the SIGGRAPH 2006 paper by Oliva, Torralba, and Schyns. Hybrid images are static images whose interpretation changes with viewing distance. By blending the high-frequency portion of one image with the low-frequency portion of another, you get a hybrid image that reads differently at different distances. The implementation is straightforward, using the NumPy and PyTorch libraries for image filtering.
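The core idea fits in a few lines. The sketch below is a minimal illustration, not the course's actual `my_imfilter` code: it assumes float images in [0, 1] and uses `scipy.ndimage.gaussian_filter` as a stand-in for the hand-written filtering function, with a hypothetical `cutoff_sigma` that would be tuned per image pair.

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # stand-in for the assignment's own filtering code

def hybrid_image(img_a, img_b, cutoff_sigma=7.0):
    """Blend the low frequencies of img_a with the high frequencies of img_b.

    img_a, img_b: float arrays in [0, 1], same shape, (H, W) or (H, W, C).
    cutoff_sigma: std-dev of the Gaussian low-pass filter (tuned per image pair).
    """
    # Blur only along the spatial axes; leave any color channel untouched.
    sigma = (cutoff_sigma, cutoff_sigma) + (0,) * (img_a.ndim - 2)

    # Low-pass: keep only the coarse structure of image A.
    low = gaussian_filter(img_a, sigma)

    # High-pass: subtract image B's low frequencies from itself.
    high = img_b - gaussian_filter(img_b, sigma)

    # The hybrid image is the sum, clipped back to a valid intensity range.
    return np.clip(low + high, 0.0, 1.0)
```

Viewed up close, the high-frequency content dominates; from far away, only the low-pass component of the first image remains visible.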

SIFT Local Feature Matching:

The goal of this assignment was to create a local feature-matching algorithm using techniques described in Szeliski, chapter 7.1. The pipeline is a simplified version of the famous SIFT pipeline and is intended for instance-level matching, i.e., multiple views of the same physical scene. For this project, I implemented two versions of the local feature descriptor along with the other two stages of the matching pipeline: detecting interest points and matching feature vectors. One descriptor is a simple normalized patch; the other is a SIFT-like feature.
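As a rough sketch of the simpler descriptor and the matching stage (assuming interest points away from the image border, and using function names of my own rather than the course starter code), the normalized-patch descriptor and the nearest-neighbor distance ratio test can be written as:

```python
import numpy as np

def normalized_patch_descriptors(image, xs, ys, patch_size=16):
    """Simplest descriptor: a unit-normalized grayscale patch around each interest point."""
    half = patch_size // 2
    descriptors = []
    for x, y in zip(xs, ys):
        patch = image[y - half:y + half, x - half:x + half].astype(np.float64)
        vec = patch.ravel()
        vec = vec / (np.linalg.norm(vec) + 1e-8)      # unit norm for some illumination invariance
        descriptors.append(vec)
    return np.array(descriptors)

def match_features(desc_a, desc_b, ratio_thresh=0.8):
    """Nearest-neighbor distance ratio (NNDR) matching, as in Lowe's SIFT paper."""
    # Pairwise Euclidean distances between all descriptor pairs.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    matches, confidences = [], []
    for i in range(dists.shape[0]):
        order = np.argsort(dists[i])
        nearest, second = dists[i, order[0]], dists[i, order[1]]
        ratio = nearest / (second + 1e-8)
        if ratio < ratio_thresh:                       # discard ambiguous matches
            matches.append((i, order[0]))
            confidences.append(1.0 - ratio)
    return np.array(matches), np.array(confidences)
```

The ratio test is what makes the pipeline usable in practice: a match is kept only when the best candidate is clearly better than the second best.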

Camera Calibration and Fundamental Matrix Estimation with RANSAC:

This project introduces camera and scene geometry. Specifically, I estimated the camera projection matrix, which maps 3D world coordinates to image coordinates, and the fundamental matrix, which relates points in one image to epipolar lines in the other. Both can be estimated from point correspondences: the projection matrix (camera calibration) from corresponding 3D and 2D points, and the fundamental matrix from corresponding 2D points across two images. I started by estimating the projection matrix and the fundamental matrix for a scene with ground-truth correspondences, and then moved on to estimating the fundamental matrix from point correspondences obtained with SIFT.
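The sketch below shows the general idea of the fundamental-matrix part, not the exact assignment code: an eight-point estimate wrapped in a RANSAC loop. Hartley coordinate normalization is omitted for brevity, and the error threshold is a hypothetical value that would depend on how the points are scaled.

```python
import numpy as np

def estimate_fundamental_matrix(pts_a, pts_b):
    """Eight-point algorithm: pts_a, pts_b are (N, 2) corresponding points, N >= 8."""
    N = pts_a.shape[0]
    # Each correspondence x_a <-> x_b gives one row of A f = 0, where x_b^T F x_a = 0
    # and f stacks the 9 entries of F.
    A = np.zeros((N, 9))
    for i in range(N):
        xa, ya = pts_a[i]
        xb, yb = pts_b[i]
        A[i] = [xb * xa, xb * ya, xb, yb * xa, yb * ya, yb, xa, ya, 1.0]
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt

def ransac_fundamental_matrix(pts_a, pts_b, iters=2000, thresh=0.05):
    """RANSAC: repeatedly fit F to random minimal samples and keep the largest inlier set."""
    best_F, best_inliers = None, np.zeros(len(pts_a), dtype=bool)
    ones = np.ones((len(pts_a), 1))
    ha, hb = np.hstack([pts_a, ones]), np.hstack([pts_b, ones])   # homogeneous coordinates
    for _ in range(iters):
        idx = np.random.choice(len(pts_a), 8, replace=False)
        F = estimate_fundamental_matrix(pts_a[idx], pts_b[idx])
        # Algebraic error |x_b^T F x_a| for every correspondence.
        errors = np.abs(np.sum(hb * (ha @ F.T), axis=1))
        inliers = errors < thresh
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    return best_F, best_inliers
```

RANSAC is what makes this work on SIFT correspondences, since a sizable fraction of the putative matches are outliers that would ruin a direct least-squares fit.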

Scene Recognition with Deep Learning:

In this project, I designed and trained deep convolutional networks for scene recognition. In Part 1, I trained a simple network from scratch. In Part 2, I added a few modifications on top of the base architecture from Part 1 to raise recognition accuracy to roughly 55%. In Part 3, I fine-tuned a pre-trained deep network to achieve more than 80% accuracy on the task, using a pre-trained ResNet that was not originally trained to recognize scenes at all. Finally, in Part 4, I explored multi-label prediction of scene attributes.
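The Part 3 idea in miniature: load an ImageNet-pretrained ResNet from torchvision, swap the final fully connected layer for a head sized to the scene categories, and train only that head. This is a hedged sketch rather than the assignment code; the 15-class count and the training hyperparameters are assumptions, and older torchvision versions use `pretrained=True` instead of the `weights=` argument.

```python
import torch
import torch.nn as nn
import torchvision

NUM_CLASSES = 15  # assumed: one output per scene category

# ImageNet-pretrained ResNet-18 (torchvision >= 0.13 API).
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so only the new classifier head adapts to the scene dataset.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer; its parameters are trainable by default.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

def train_one_epoch(loader, device="cpu"):
    """One pass over a DataLoader yielding (images, labels) batches."""
    model.train().to(device)
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

Because the pretrained backbone already encodes general-purpose visual features, even this small amount of training moves accuracy well past what the from-scratch network reaches.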

Semantic Segmentation with Deep Learning:

In this project, I designed and trained deep convolutional networks for semantic segmentation. I implemented the PSPNet architecture, which uses a ResNet backbone with dilated convolutions to increase the receptive field, and aggregates context over different portions of the image with a "Pyramid Pooling Module" (PPM). The dataset for this assignment is CamVid, a small dataset of 701 images for self-driving perception. I used the PSPNet trained on CamVid as my pre-trained model and then trained it on the KITTI road segmentation dataset: after implementing the model and optimizer, I fine-tuned this pre-trained network on KITTI.
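The Pyramid Pooling Module is the distinctive piece of PSPNet, and its structure is easy to sketch. The PyTorch module below is an illustrative approximation rather than the exact assignment implementation: the channel count, bin sizes (1, 2, 3, 6), and layer choices follow the original PSPNet paper, but details such as dropout and the classifier head are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Pool the backbone feature map at several grid sizes, project each pooled map
    with a 1x1 conv, upsample back, and concatenate with the original features."""

    def __init__(self, in_channels=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_channels = in_channels // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),            # pool to bin_size x bin_size
                nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled branch back to the input resolution and concatenate it
        # with the original features, so context is aggregated at multiple scales.
        pyramids = [x] + [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(pyramids, dim=1)

# Example: features from a dilated ResNet backbone, batch of 2, 2048 channels, 28x28 map.
features = torch.randn(2, 2048, 28, 28)
ppm_out = PyramidPoolingModule()(features)   # shape (2, 2048 + 4*512, 28, 28) = (2, 4096, 28, 28)
```

The concatenated output then feeds a small convolutional classifier that predicts a per-pixel label map, which is upsampled to the input image resolution.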