Guide: Prof. Ganesh Ramakrishnan Course: CS 725: Foundations of Machine Learning
Pedestrian detection is an essential task in any video surveillance system. Given an image, we would like to detect the regions where pedestrians are present. This is challenging because of variations in human pose, appearance and clothing. We therefore need a robust feature set that can distinguish pedestrians independently of their background.
This project implements pedestrian detection using two methods:
1. The first method extracts features from an image using the Histogram of Oriented Gradients (HOG) descriptor and then classifies them using a linear Support Vector Machine (SVM); a minimal sketch of this pipeline follows the list.
2. To make detection invariant to bounding boxes of different scales, a Region Proposal based Convolutional Neural Network (R-CNN) was developed. This consisted of two steps: selective search to generate region proposals, and a ResNet-based CNN architecture for feature extraction and classification.
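Below is a minimal sketch of the first method (HOG features plus a linear SVM in a sliding-window detector). The window size, HOG parameters and score threshold are assumptions for illustration, not the exact settings used in the project, and grayscale input patches are assumed.

```python
# Sketch of a HOG + linear SVM pedestrian detector (illustrative settings).
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

WINDOW = (128, 64)  # (height, width) of a detection window; assumed

def hog_features(patch):
    """Compute a HOG descriptor for a single grayscale patch."""
    patch = resize(patch, WINDOW, anti_aliasing=True)
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_detector(pos_patches, neg_patches):
    """Fit a linear SVM on HOG descriptors of pedestrian / background patches."""
    X = np.array([hog_features(p) for p in list(pos_patches) + list(neg_patches)])
    y = np.array([1] * len(pos_patches) + [0] * len(neg_patches))
    clf = LinearSVC(C=0.01)
    clf.fit(X, y)
    return clf

def detect(image, clf, step=16, threshold=0.5):
    """Slide a fixed-size window over the image and keep high-scoring locations."""
    detections = []
    H, W = image.shape[:2]
    for top in range(0, H - WINDOW[0], step):
        for left in range(0, W - WINDOW[1], step):
            patch = image[top:top + WINDOW[0], left:left + WINDOW[1]]
            score = clf.decision_function([hog_features(patch)])[0]
            if score > threshold:
                detections.append((left, top, WINDOW[1], WINDOW[0], score))
    return detections
```

In practice the image is scanned at several scales and overlapping detections are merged with non-maximum suppression, which is what motivates the region-proposal approach in the second method.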
Guide: Prof. Rajbabu Velmurugan Course: EE 750: DSP System Design and Implementation
Existing machine learning algorithms are typically implemented on expensive, high-powered Graphics Processing Units (GPUs), which offer high computational capability. Such implementations are generally not feasible for cost-effective applications with strict power and computational constraints. Digital Signal Processors (DSPs) achieve high computation speed through efficient implementation of mathematical operations and are energy efficient. The main objective of the project is to take an image of handwritten text and digitize it by recognizing the characters written on it. Key contributions towards the project were:
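As an illustration of the recognition pipeline described above (binarize the page, segment candidate characters, classify each crop), here is a generic host-side sketch; it is not the DSP implementation, the area threshold is an assumption, and `classify_glyph` is a hypothetical placeholder for whatever character classifier is used.

```python
# Generic handwritten-text digitization sketch: binarize -> segment -> classify.
import cv2

def segment_characters(gray):
    """Binarize the page and return bounding boxes of candidate characters."""
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, n):            # label 0 is the background
        x, y, w, h, area = stats[i]
        if area > 20:                # drop speckle noise; threshold is assumed
            boxes.append((x, y, w, h))
    return binary, sorted(boxes)     # left-to-right order (single line assumed)

def digitize(gray, classify_glyph):
    """Recognize every segmented character and join the results into a string."""
    binary, boxes = segment_characters(gray)
    return "".join(classify_glyph(binary[y:y + h, x:x + w])
                   for x, y, w, h in boxes)
```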
Guide: Prof. Preethi Jyothi Course: CS 753: Automatic Speech Recognition
The speaker verification problem refers to verifying the claimed identity of a speaker from their voice characteristics. In this project we developed a technique to solve the speaker verification problem using a 3D-CNN in a text-independent setting. Any speaker verification system consists of three phases:
• Development Phase: In the development phase, a background model is generated by classifying a large number of speakers at the utterance level. The aim here is to create a speaker representation that is sufficiently distinctive. Here, we trained a 3D-CNN model to perform the classification task.
• Enrollment Phase: In the enrollment phase, a speaker-specific model is developed for each new speaker with the help of the background model. The output of the (N-1)th fully connected layer of the trained 3D-CNN is used as a fixed feature extractor to generate the speaker-specific model during enrollment.
• Evaluation Phase: In the evaluation phase, the identity of an unknown speaker is verified against the previously generated speaker models. A test speaker's utterance is passed through the trained 3D-CNN and its feature is extracted. The distance between this feature and each enrolled speaker's feature is computed, and the speaker with the smallest distance in the dictionary is identified as the test speaker; a minimal sketch of this scoring step follows the list.
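The sketch below illustrates the enrollment and evaluation steps, assuming `extract_embedding` wraps the trained 3D-CNN truncated at its (N-1)th fully connected layer; the choice of cosine distance and of averaging enrollment embeddings per speaker are assumptions for illustration.

```python
# Sketch of distance-based speaker verification over 3D-CNN embeddings.
import numpy as np

def extract_embedding(model, utterance):
    """Placeholder: forward the utterance through the truncated 3D-CNN."""
    return model(utterance)  # returns a 1-D speaker embedding

def enroll(model, speaker_utterances):
    """Build the speaker dictionary: one averaged embedding per enrolled speaker."""
    return {spk: np.mean([extract_embedding(model, u) for u in utts], axis=0)
            for spk, utts in speaker_utterances.items()}

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def verify(model, test_utterance, enrolled):
    """Return the enrolled speaker whose embedding is closest to the test utterance."""
    test_emb = extract_embedding(model, test_utterance)
    distances = {spk: cosine_distance(test_emb, emb)
                 for spk, emb in enrolled.items()}
    return min(distances, key=distances.get), distances
```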
Key contributions towards the project were:
Guide: Prof. Rajbabu Velmurugan Research & Development Project
Visual speech recognition (also known as lipreading) is a field that is becoming increasingly important. It has emerged as a natural complement to speech-based recognition systems and can facilitate transcription even in noisy environments. Perfect lipreading is still a challenge because of the variations in lip articulation that occur while producing a particular utterance; moreover, many phonetically similar words have similar lip movements. This encouraged us to develop a system that harnesses cues not only from the target word but also from the contextual words with which the target word generally co-occurs. Key contributions towards the project were:
Guide: Prof. Shabbir Merchant Course: EE 610: Image Processing
Detection of text in images finds important application in content-based search. It is also a key step in Optical Character Recognition. In this project, we describe a method to detect, localize and extract horizontally aligned text in images, which usually appears in the form of on-screen text and subtitles in TV advertisements, news channels and movies. Key contributions towards the project were:
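For illustration, one common classical pipeline for horizontally aligned text is sketched below; it is not necessarily the project's method, and the kernel sizes and thresholds are assumptions. The idea is to emphasise strong character strokes with a morphological gradient, dilate horizontally so characters of a line merge, and keep wide, line-like regions.

```python
# Sketch of a morphological horizontal-text localizer (illustrative parameters).
import cv2

def detect_horizontal_text(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    grad = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT,
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
    _, binary = cv2.threshold(grad, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Horizontal closing joins neighbouring characters into text lines.
    joined = cv2.morphologyEx(binary, cv2.MORPH_CLOSE,
                              cv2.getStructuringElement(cv2.MORPH_RECT, (15, 1)))
    contours, _ = cv2.findContours(joined, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 2 * h and w > 30:     # keep wide, short (line-like) regions
            boxes.append((x, y, w, h))
    return boxes
```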
Guide: Prof. Vikram M. Gadre Course: EE 678: Wavelets
Ear biometrics are gaining importance in surveillance applications at a rapid pace. Ear images are now considered unique for person identification, but ear images captured by surveillance cameras are typically of poor resolution. Wavelets are known to capture the incremental information added when moving from one resolution to a higher one. In this project we aim to super-resolve low-quality ear images by using a deep convolutional neural network to predict the wavelet coefficients of the high-resolution image. Key contributions towards the project were:
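A minimal sketch of the idea described above follows: a small CNN predicts the four level-1 wavelet sub-bands (LL, LH, HL, HH) of the high-resolution ear image from the low-resolution input, and an inverse DWT reconstructs the 2x-larger image. The network depth, channel counts and the Haar wavelet are assumptions, not the project's architecture.

```python
# Sketch of wavelet-domain super-resolution: CNN predicts sub-bands, iDWT rebuilds.
import torch
import torch.nn as nn
import pywt

class WaveletSRNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 4, 3, padding=1),   # 4 output maps: LL, LH, HL, HH
        )

    def forward(self, lr_image):
        return self.body(lr_image)

def super_resolve(model, lr_image):
    """Predict sub-bands for one (1, H, W) low-res tensor and invert the DWT."""
    with torch.no_grad():
        bands = model(lr_image.unsqueeze(0)).squeeze(0).cpu().numpy()
    ll, lh, hl, hh = bands
    return pywt.idwt2((ll, (lh, hl, hh)), wavelet='haar')
```

The network would be trained by comparing the predicted sub-bands against the true wavelet decomposition of the high-resolution ground-truth image.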