Guide: Prof. Ganesh Ramakrishnan Course: CS 725: Foundations of Machine Learning
Pedestrian detection is an essential task in any video surveillance system. Given an image, we would like to detect the regions where pedestrians are present. This is challenging because of variations in human pose, appearance and clothing. We therefore need a robust feature set that can distinguish pedestrians independently of their background.
This project implements pedestrian detection using two methods:
1. The first method extracts features from an image using the Histogram of Oriented Gradients (HOG) descriptor and then classifies them using a linear Support Vector Machine (SVM); a minimal sketch of this pipeline follows the list.
2. To make detection invariant to bounding boxes of different scales, a Region Proposal based Convolutional Neural Network (R-CNN) was developed. This consisted of two steps: selective search to generate region proposals, and a ResNet-based CNN architecture for feature extraction and classification.
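Below is a minimal sketch of the first method (HOG features plus a linear SVM in a sliding-window detector). The window size, HOG parameters and score threshold are assumptions for illustration, not the exact settings used in the project, and grayscale input patches are assumed.

```python
# Sketch of a HOG + linear SVM pedestrian detector (illustrative settings).
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

WINDOW = (128, 64)  # (height, width) of a detection window; assumed

def hog_features(patch):
    """Compute a HOG descriptor for a single grayscale patch."""
    patch = resize(patch, WINDOW, anti_aliasing=True)
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_detector(pos_patches, neg_patches):
    """Fit a linear SVM on HOG descriptors of pedestrian / background patches."""
    X = np.array([hog_features(p) for p in list(pos_patches) + list(neg_patches)])
    y = np.array([1] * len(pos_patches) + [0] * len(neg_patches))
    clf = LinearSVC(C=0.01)
    clf.fit(X, y)
    return clf

def detect(image, clf, step=16, threshold=0.5):
    """Slide a fixed-size window over the image and keep high-scoring locations."""
    detections = []
    H, W = image.shape[:2]
    for top in range(0, H - WINDOW[0], step):
        for left in range(0, W - WINDOW[1], step):
            patch = image[top:top + WINDOW[0], left:left + WINDOW[1]]
            score = clf.decision_function([hog_features(patch)])[0]
            if score > threshold:
                detections.append((left, top, WINDOW[1], WINDOW[0], score))
    return detections
```

In practice the image is scanned at several scales and overlapping detections are merged with non-maximum suppression, which is what motivates the region-proposal approach in the second method.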
Guide: Prof. Rajbabu Velmurugan Course: EE 750: DSP System Design and Implementation
Existing machine learning algorithms are typically implemented on expensive, high-powered Graphics Processing Units (GPUs), which offer high computational capability. Such implementations are generally not feasible for cost-effective applications with strict power and computational constraints. Digital Signal Processors (DSPs) achieve high computation speed through efficient implementation of mathematical operations and are energy efficient. The main objective of the project is to take an image of handwritten text and digitize it by recognizing the characters written on it. Key contributions towards the project were:
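As an illustration of the recognition pipeline described above (binarize the page, segment candidate characters, classify each crop), here is a generic host-side sketch; it is not the DSP implementation, the area threshold is an assumption, and `classify_glyph` is a hypothetical placeholder for whatever character classifier is used.

```python
# Generic handwritten-text digitization sketch: binarize -> segment -> classify.
import cv2

def segment_characters(gray):
    """Binarize the page and return bounding boxes of candidate characters."""
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, n):            # label 0 is the background
        x, y, w, h, area = stats[i]
        if area > 20:                # drop speckle noise; threshold is assumed
            boxes.append((x, y, w, h))
    return binary, sorted(boxes)     # left-to-right order (single line assumed)

def digitize(gray, classify_glyph):
    """Recognize every segmented character and join the results into a string."""
    binary, boxes = segment_characters(gray)
    return "".join(classify_glyph(binary[y:y + h, x:x + w])
                   for x, y, w, h in boxes)
```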
Guide: Prof. Preethi Jyothi Course: CS 753: Automatic Speech Recognition
The speaker verification problem refers to verifying the claimed identity of a speaker from their voice characteristics. In this project we developed a technique to solve the speaker verification problem using a 3D-CNN in a text-independent setting. Any speaker verification system consists of three phases:
• Development Phase: In the development phase, a background model is generated by classifying a large number of speakers at the utterance level. The aim here is to create a speaker representation that is sufficiently distinctive. Here, we trained a 3D-CNN model to perform the classification task.
• Enrollment Phase: In the enrollment phase, a speaker-specific model is developed for each new speaker with the help of the background model. The output of the (N-1)th fully connected layer of the trained 3D-CNN is used as a fixed feature extractor to generate the speaker-specific model during enrollment.
• Evaluation Phase: In the evaluation phase, the identity of an unknown speaker is verified against the previously generated speaker models. A test speaker's utterance is passed through the trained 3D-CNN and its feature is extracted. The distance between this feature and each enrolled speaker's feature is computed, and the speaker with the smallest distance in the dictionary is identified as the test speaker; a minimal sketch of this scoring step follows the list.
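The sketch below illustrates the enrollment and evaluation steps, assuming `extract_embedding` wraps the trained 3D-CNN truncated at its (N-1)th fully connected layer; the choice of cosine distance and of averaging enrollment embeddings per speaker are assumptions for illustration.

```python
# Sketch of distance-based speaker verification over 3D-CNN embeddings.
import numpy as np

def extract_embedding(model, utterance):
    """Placeholder: forward the utterance through the truncated 3D-CNN."""
    return model(utterance)  # returns a 1-D speaker embedding

def enroll(model, speaker_utterances):
    """Build the speaker dictionary: one averaged embedding per enrolled speaker."""
    return {spk: np.mean([extract_embedding(model, u) for u in utts], axis=0)
            for spk, utts in speaker_utterances.items()}

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def verify(model, test_utterance, enrolled):
    """Return the enrolled speaker whose embedding is closest to the test utterance."""
    test_emb = extract_embedding(model, test_utterance)
    distances = {spk: cosine_distance(test_emb, emb)
                 for spk, emb in enrolled.items()}
    return min(distances, key=distances.get), distances
```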
Key contributions towards the project were:
Guide: Prof. Rajbabu Velmurugan Research & Development Project
Visual speech recognition (also known as lipreading) is a field that is becoming increasingly important. It has emerged as a natural complement to speech-based recognition systems and can facilitate transcription even in noisy environments. Perfect lipreading is still a challenge because of the variations in lip articulation that occur while producing a particular utterance; moreover, many phonetically similar words have similar lip movements. This encouraged us to develop a system that harnesses cues not only from the target word but also from the contextual words with which the target word generally co-occurs. Key contributions towards the project were:
Guide: Prof. Shabbir Merchant Course: EE 610: Image Processing
Detection of text in images finds important application in content-based search. It is also a key step in Optical Character Recognition. In this project, we describe a method to detect, localize and extract horizontally aligned text in images, which usually appears in the form of on-screen text and subtitles in TV advertisements, news channels and movies. Key contributions towards the project were:
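For illustration, one common classical pipeline for horizontally aligned text is sketched below; it is not necessarily the project's method, and the kernel sizes and thresholds are assumptions. The idea is to emphasise strong character strokes with a morphological gradient, dilate horizontally so characters of a line merge, and keep wide, line-like regions.

```python
# Sketch of a morphological horizontal-text localizer (illustrative parameters).
import cv2

def detect_horizontal_text(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    grad = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT,
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
    _, binary = cv2.threshold(grad, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Horizontal closing joins neighbouring characters into text lines.
    joined = cv2.morphologyEx(binary, cv2.MORPH_CLOSE,
                              cv2.getStructuringElement(cv2.MORPH_RECT, (15, 1)))
    contours, _ = cv2.findContours(joined, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 2 * h and w > 30:     # keep wide, short (line-like) regions
            boxes.append((x, y, w, h))
    return boxes
```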
Guide: Prof. Vikram M. Gadre Course: EE 678: Wavelets
Ear biometrics are gaining importance in surveillance applications at a rapid pace. Ear images are now considered unique for person identification, but ear images captured by surveillance cameras are typically of poor resolution. Wavelets are known to capture the incremental information added when moving from one resolution to a higher one. In this project we aim to super-resolve low-quality ear images by using a deep convolutional neural network to predict the wavelet coefficients of the high-resolution image. Key contributions towards the project were:
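A minimal sketch of the idea described above follows: a small CNN predicts the four level-1 wavelet sub-bands (LL, LH, HL, HH) of the high-resolution ear image from the low-resolution input, and an inverse DWT reconstructs the 2x-larger image. The network depth, channel counts and the Haar wavelet are assumptions, not the project's architecture.

```python
# Sketch of wavelet-domain super-resolution: CNN predicts sub-bands, iDWT rebuilds.
import torch
import torch.nn as nn
import pywt

class WaveletSRNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 4, 3, padding=1),   # 4 output maps: LL, LH, HL, HH
        )

    def forward(self, lr_image):
        return self.body(lr_image)

def super_resolve(model, lr_image):
    """Predict sub-bands for one (1, H, W) low-res tensor and invert the DWT."""
    with torch.no_grad():
        bands = model(lr_image.unsqueeze(0)).squeeze(0).cpu().numpy()
    ll, lh, hl, hh = bands
    return pywt.idwt2((ll, (lh, hl, hh)), wavelet='haar')
```

The network would be trained by comparing the predicted sub-bands against the true wavelet decomposition of the high-resolution ground-truth image.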