Interaction between human beings and computers will be more natural if computers are able to perceive and respond to non-verbal human communication such as emotions. Facial emotion recognition will therefore be vitally important in designing future multi-cultural visual communication systems. We tried several types of classifiers, including a Fisher Face classifier, an SVM, AdaBoost, and transfer learning from a pre-trained CNN, on features relevant to emotion recognition extracted from static images in standard emotion recognition datasets. We used a variety of datasets to train the model and achieved an overall training accuracy of 89.7% and a test accuracy of 54.6%.
Facial expressions are important cues for non-verbal communication among human beings, which works only because humans are able to recognize emotions quite accurately and efficiently. An automatic facial emotion recognition system is therefore an important component of human-machine interaction. Beyond its commercial uses, such a system could also incorporate cues from the biological system, and the resulting model could be used to develop further insights into the cognitive processing of our brain.
Emotion recognition has widespread applications in domains such as medicine, e-learning, monitoring, entertainment, law, marketing, and security surveillance. Emotion can be recognized through a variety of means such as voice intonation, body language, and more complex methods such as electroencephalography. However, the easier and more practical method is to examine facial expressions. Five types of human emotions have been shown to be universally recognizable across different cultures: anger, fear, happiness, sadness, and surprise. Interestingly, even for complex expressions where a mixture of emotions could be used as descriptors, cross-cultural agreement is still observed.
The task of emotion recognition is particularly difficult for two reasons: 1) there is no large database of training images, and 2) classifying emotion can be difficult depending on whether the input image is static or a transition frame into a facial expression. The latter issue is particularly challenging for real-time detection, where facial expressions vary dynamically.
Most applications of emotion recognition examine static images of facial expressions. We investigate the application of convolutional neural networks (CNNs) to emotion recognition in real time with a video input stream. Given the computational requirements and complexity of a CNN, optimizing the network for efficient frame-by-frame classification is necessary. In addition, accounting for variations in lighting and subject position in a non-laboratory environment is challenging. We have developed a system for detecting human emotions in different scenes, angles, and lighting conditions in real time. The result is a novel application in which an emotion-indicating emoji is superimposed over the subject's face, as shown in the Qualitative Results section. The flow diagram below shows the general steps involved in emotion recognition.
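As an illustration of the overlay step described above, the following is a minimal sketch (not the project's actual code) of how an emotion-indicating emoji could be alpha-blended over a detected face with OpenCV; the emoji file path and its RGBA format are assumptions.

```python
import cv2
import numpy as np

def overlay_emoji(frame, bbox, emoji_path):
    """Superimpose an emoji (RGBA PNG) over a detected face region.

    frame      : BGR webcam frame (numpy array)
    bbox       : (x, y, w, h) face bounding box from the detector
    emoji_path : path to an emoji image for the predicted emotion
                 (file names/locations are application-specific assumptions)
    """
    x, y, w, h = bbox
    emoji = cv2.imread(emoji_path, cv2.IMREAD_UNCHANGED)   # keep alpha channel
    emoji = cv2.resize(emoji, (w, h))
    alpha = emoji[:, :, 3:4].astype(np.float32) / 255.0    # per-pixel opacity
    face_roi = frame[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * emoji[:, :, :3] + (1.0 - alpha) * face_roi
    frame[y:y + h, x:x + w] = blended.astype(np.uint8)
    return frame
```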
The first step of emotion recognition is to detect faces in the input images. This is done using the Viola-Jones algorithm, which returns a list of bounding boxes of potential face regions on which further processing is done to extract features relevant to the different facial emotions. We used the existing implementation in the OpenCV library, which detects faces reliably under controlled conditions of lighting and a stable, frontal posture in front of the camera.
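A minimal sketch of this detection step using OpenCV's bundled frontal-face Haar cascade; the image file name and the scaleFactor/minNeighbors values are illustrative rather than the project's exact settings.

```python
import cv2

# Frontal-face Haar cascade shipped with OpenCV (path via cv2.data in recent
# opencv-python builds; adjust if your installation differs).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("subject.jpg")                       # any test image with a face
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Returns a list of (x, y, w, h) bounding boxes; scaleFactor and minNeighbors
# trade off detection rate against false positives.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```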
After the initial bounding box for a face is detected, some preprocessing is needed to obtain clean facial images with normalized intensity and uniform size and shape. To make the recognition system more robust and easier to design, face alignment is performed to normalize the scale and orientation of these patches.
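A sketch of this preprocessing, assuming a grayscale frame and a detected bounding box. It covers intensity normalization and resizing only; geometric alignment (e.g. rotating by the eye line) is omitted, and the 48x48 output size is an illustrative choice.

```python
import cv2

def preprocess_face(gray_frame, bbox, size=(48, 48)):
    """Crop the detected face, normalize intensity, and resize to a fixed shape.

    The output size should match whatever the downstream feature extractor
    or classifier expects; 48x48 is only an example.
    """
    x, y, w, h = bbox
    face = gray_frame[y:y + h, x:x + w]
    face = cv2.equalizeHist(face)          # histogram equalization for lighting
    face = cv2.resize(face, size, interpolation=cv2.INTER_AREA)
    return face
```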
Different types of feature extraction techniques were tried out, as explained below:
Eigen Face Method
The Eigen Face method is based on linearly projecting the image space onto a low-dimensional feature space. The face features extracted by PCA reduce the dimensionality of the input space. Variations between images of the same subject due to changes in pose, orientation, etc. can be quite high. Therefore, to achieve a high recognition rate, structural information from face images of the same subject is also considered during classification. This is realized by identifying sub-clusters corresponding to each subject separately using a clustering algorithm.
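A sketch of eigenface-style feature extraction using scikit-learn's PCA as a stand-in; the number of components is illustrative, and the sub-clustering step mentioned above is not shown.

```python
from sklearn.decomposition import PCA

# X_train / X_test: faces flattened to row vectors, shape (n_samples, height*width).
def eigenface_features(X_train, X_test, n_components=50):
    pca = PCA(n_components=n_components, whiten=True)
    train_feats = pca.fit_transform(X_train)   # learn eigenfaces on the training set
    test_feats = pca.transform(X_test)         # project test faces onto them
    return train_feats, test_feats, pca.components_   # components_ are the eigenfaces
```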
Fisher Face Method
This method is based on Fisher's linear discriminant and produces well-separated classes in a low-dimensional subspace, even under severe variation in lighting and facial expression. Its advantage over plain PCA is that it projects each face onto a low-dimensional subspace such that the projection maximizes the scatter between different facial expressions while minimizing the scatter within the same expression class.
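A sketch of Fisherface-style projection using scikit-learn (PCA followed by Fisher's linear discriminant) as a stand-in for the implementation used in the project; the PCA dimensionality is illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# Fisherfaces are commonly computed as PCA followed by Fisher's linear
# discriminant; with C emotion classes, LDA yields at most C-1 dimensions.
def fisherface_features(X_train, y_train, X_test, n_pca=50):
    model = make_pipeline(PCA(n_components=n_pca),
                          LinearDiscriminantAnalysis())
    train_feats = model.fit_transform(X_train, y_train)
    test_feats = model.transform(X_test)
    return train_feats, test_feats
```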
Haar Feature Method
This method extracts features using the change in contrast between adjacent groups of pixels rather than the intensity values of individual pixels. The contrast variances between pixel groups are used to determine relative light and dark areas. Detecting facial features such as the mouth, eyes, and nose requires that Haar classifier cascades first be trained, which in turn requires implementing the gentle AdaBoost and Haar feature algorithms. We used the OpenCV Haar cascade classifiers to extract the eyes and mouth from the preprocessed faces detected in the previous step, since most emotions are centered around these two facial features.
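A sketch of this step with OpenCV cascades; the eye cascade ships with the core package, while the mouth cascade file name is an assumption (such cascades are typically distributed with opencv_contrib).

```python
import cv2

eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")
# A mouth cascade (e.g. haarcascade_mcs_mouth.xml) is not part of the core
# OpenCV data; the local file name below is an assumption.
mouth_cascade = cv2.CascadeClassifier("haarcascade_mcs_mouth.xml")

def extract_eye_mouth_regions(face_gray):
    """Return bounding boxes of eyes and mouth inside a preprocessed face patch."""
    eyes = eye_cascade.detectMultiScale(face_gray, 1.1, 5)
    # Restrict the mouth search to the lower half of the face to reduce
    # false positives.
    lower = face_gray[face_gray.shape[0] // 2:, :]
    mouths = mouth_cascade.detectMultiScale(lower, 1.1, 11)
    return eyes, mouths
```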
Different types of classifiers were used to train the model on the extracted features:
Support Vector Machine (SVM) Classifier
The features extracted using the above methods were used to train an SVM classifier. However, these features were not rich enough to model the specific differences between the various emotions, so the accuracy achieved with the SVM classifier was not sufficient.
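A sketch of this classifier, assuming feature matrices train_feats/test_feats and label vectors produced by the extraction step above; the kernel and C value are illustrative and would normally be tuned by cross-validation.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# train_feats / train_labels: extracted facial features and their emotion labels.
svm_clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, probability=True))
svm_clf.fit(train_feats, train_labels)
predictions = svm_clf.predict(test_feats)
```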
AdaBoost Classifier
The weak learner used in the AdaBoost classifier is a decision tree. We used 30 such weak learners, each a decision tree of depth 3. Together they form a boosted classifier in which each weak learner adapts to the misclassifications of its predecessors. This gave better performance than the SVM, but the performance on the test set could still be improved.
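A sketch of this configuration with scikit-learn, matching the 30 depth-3 decision trees described above; train_feats/train_labels are assumed from the earlier steps.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 30 depth-3 decision trees as weak learners.
# (Older scikit-learn versions name the first argument base_estimator.)
ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=30,
    learning_rate=1.0)
ada_clf.fit(train_feats, train_labels)
ada_predictions = ada_clf.predict(test_feats)
```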
CNN Based Classifier
We applied transfer learning to our facial emotion dataset using a pre-trained VGG_S convolutional neural network. We built a cascaded classifier that takes the outputs of this CNN and passes them to an SVM classifier, which gives a better fit to our dataset. The cascaded SVM ensures that the outputs are restricted to the five emotions we consider, and since the pre-trained CNN outputs are used only as a rough measure of each feature, the cascade custom-fits the model to our training dataset.
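A rough sketch of this cascade using the Caffe Python interface (the project cites Caffe); the deploy/weights file names, the choice of the fc7 blob as the feature layer, and the train_faces/train_labels variables are assumptions, and mean subtraction is omitted for brevity.

```python
import caffe
import cv2
import numpy as np
from sklearn.svm import SVC

# Load the pre-trained VGG_S network (file names here are placeholders).
net = caffe.Net("VGG_S_deploy.prototxt", "VGG_S.caffemodel", caffe.TEST)

def cnn_features(face_bgr):
    """Pass one preprocessed face through the network and return deep features."""
    _, _, h, w = net.blobs["data"].data.shape          # expected input size
    face = cv2.resize(face_bgr, (w, h)).astype(np.float32)
    face = face.transpose(2, 0, 1)                     # HWC -> CHW, as Caffe expects
    net.blobs["data"].data[...] = face
    net.forward()
    # The penultimate fully connected layer (fc7) is assumed as the feature blob.
    return net.blobs["fc7"].data[0].copy()

# Cascade: deep features from the CNN train a five-class SVM on our emotion data.
X_train = np.vstack([cnn_features(f) for f in train_faces])
svm_head = SVC(kernel="linear", probability=True).fit(X_train, train_labels)
```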
This method showed the best results compared to the previous models. However, we observed that some of the incorrect classifications from the CNN model were predicted correctly by the AdaBoost model. A final ensemble of models was therefore designed for the classification task, containing the cascaded CNN+SVM model, the AdaBoost decision tree, and other variants of boosted decision trees such as a GradientBoost classifier. The final output of the ensemble is decided by a weighted average of the predictions from each model. This retains the accuracy of the CNN while improving on the cases where the boosting algorithms are the better classifiers.
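A sketch of the weighted-average step, assuming each model exposes class-probability outputs over the same five emotions; the example weights are illustrative and would be tuned on a validation set.

```python
import numpy as np

def ensemble_predict(prob_list, weights):
    """Weighted average of class-probability matrices, each of shape (n_samples, 5)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                  # normalize the trust weights
    avg = sum(w * p for w, p in zip(weights, prob_list))
    return np.argmax(avg, axis=1)                      # index of the winning emotion

# Example usage (weights are illustrative):
# final = ensemble_predict([cnn_svm_probs, adaboost_probs, gradboost_probs],
#                          weights=[0.6, 0.2, 0.2])
```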
To develop a working model of real-time emotion recognition, we used two publicly available datasets: the Extended Cohn-Kanade dataset (CK+), which contains emotions expressed by 100 subjects for a total of 500 images, and the Japanese Female Facial Expression (JAFFE) database, which accounts for a total of 400 images. We also developed our own (home-brewed) database consisting of images from five individuals. All of these images contain one of the five primary emotions we intend to classify in this project: anger, fear, happiness, sadness, and surprise. We subsequently applied image equalization techniques from the OpenCV library to account for variations in lighting conditions and subject position in the final implementation.
Initially we applied traditional machine learning classifiers such as SVM, Fisher Face, and AdaBoost to the facial features extracted using the techniques above. The training accuracies for these classifiers were around 80-85%, but the test accuracies, calculated on a dataset taken from the Kaggle Facial Expression Recognition Challenge, varied around 30-40%. The following figures show the confusion matrices for these classifiers.
Classification Accuracy of Fisher Face Classifier = 30.1%
Classification Accuracy of SVM Classifier = 33.97%
Classification Accuracy of AdaBoost Classifier = 39.59%
To improve the classification accuracy, we applied transfer learning with the pre-trained VGG_S convolutional neural network on the CK+ dataset as well as our own dataset. The test accuracies from this classification strategy were significantly better than those of the previous classifiers, and we were able to achieve a classification accuracy of around 55% on the test dataset.
Classification Accuracy of CNN Classifier = 54.28%
The chart below shows the comparison of test accuracy of different classifiers on different emotions.
Although we observed confusion among similar categories such as Fear and Anger, we obtained good results given the right lighting conditions and face posture in front of the web camera.
The figure above shows an Angry face correctly labelled by the classifier as Angry.
The figure above shows a Happy face correctly labelled by the classifier as Happy.
The figure above shows a Sad face correctly labelled by the classifier as Sad.
The figure above shows a Surprise face correctly labelled by the classifier as Surprise.
Next we show cases where the emotions were not correctly classified. In the first case, the input expression is fear, but it is somewhat ambiguous even to human observers, and the classifier labels the emotion as anger.
In this case, the actual emotion is Surprise but the classifier predicted Fear. This is expected because the differences between the two emotions are subtle.
The misclassifications observed in the previous two cases can be attributed to a combination of factors. The first is the lighting conditions, which create confusion between emotions. The second is that classifying emotions discretely is inherently challenging: the emotions human beings express are generally not discrete but a combination of emotions. Modeling a classifier under the assumption that each emotion occurs discretely therefore requires careful analysis of the specific facial features that can support such a task accurately.
The objective of this project was to implement a facial emotion recognition classifier and then use it in a real-time application that displays the emotion of the face in front of the camera using emojis. Comparing the classification performance of the different classifiers, our custom-trained VGG_S network, combined with the face detector provided by OpenCV, performed very well in determining one of the five expressions (anger, fear, happiness, sadness, surprise). While we achieved a successful implementation, significant improvements can be made by addressing several key issues. First, a much larger dataset should be built to improve the model's generality; it should include the variations in how emotions are expressed across different geographies. While we achieved over 89% accuracy in laboratory conditions (perfect lighting, camera at eye level, subject facing the camera with an exaggerated expression), any deviation from these conditions caused the accuracy to fall significantly. In particular, there was considerable confusion between the anger and fear emotions, which is justified to some extent because the two expressions closely resemble each other in most subjects.
After trying different types of classifiers, we observed that emotion recognition is a difficult and sensitive task. It is highly dependent on the type of facial expression and the geographical region to which the people in the dataset belong, since people from different regions have varied facial structures and express emotions in somewhat different ways. The model is also very sensitive to lighting conditions and needs near-laboratory conditions for a high degree of accuracy: under an experimental setup with the face at the correct angle and level with the camera and good lighting, our model gives accurate predictions. Dataset size is another limitation; with a large dataset covering a wide variety of faces and conditions, we could train the model to make better predictions on new facial images.
Since emotions are the primary form of non-verbal communication among human beings, any future advancement in artificial intelligence demands effective exchange of this information between humans and computers. Understanding facial emotions and interpreting them to make smart decisions is an active area of research in computer vision.
The fundamental idea underlying this work is to detect emotions from human faces. This could further be used in various applications such as surveillance cameras, emotion-based music recommender systems, smart mood-based ambience control, feedback systems, and many more.
1. Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I. (2010). The Extended Cohn-Kanade Dataset (CK+): A complete expression dataset for action unit and emotion-specified expression. Proceedings of the Third International Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB 2010), San Francisco, USA, 94-101.
2. Gil Levi and Tal Hassner (2015). Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns. Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), Seattle, Nov. 2015.
3. Kaggle Challenges in Representation Learning: Facial Expression Recognition Challenge. https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data
4. Seyed Mehdi Lajevardi and Zahir M. Hussain (2009). Local Feature Extraction Methods for Facial Expression Recognition. European Signal Processing Conference (EUSIPCO 2009).
5. Kanade, T., Cohn, J. F., Tian, Y. (2000). Comprehensive database for facial expression analysis. Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), Grenoble, France, 46-53.
6. Matthew Turk and Alex Pentland. Eigenfaces for Recognition. Vision and Modeling Group, The Media Laboratory, Massachusetts Institute of Technology.
7. Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, Darrell, Trevor. Caffe: Convolutional Architecture for Fast Feature Embedding.