Abstract - We present a vision-based method that recognizes human emotion directly from monocular modern dance image sequences in real time. The method exploits only the visual information within the image sequences and does not require cumbersome attachments such as sensors, which makes it easy to use and human-friendly.
As shown in the figure, our approach is divided into three parts.
To extract information from the given frames, we must first segment them. That is, we remove needless information and keep only the information concerning the desired object in each frame. In our work, we remove the background because it carries no useful information. Next, we extract meaningful features from the preprocessed frames. In general, more than one camera is needed to capture movement in three dimensions. However, we use only one camera and adopt the rectangle surrounding the moving object; that is, we track the movement of the rectangle instead of the object itself. This makes it impossible to track subtle motions, but it is well suited to extracting the elements of movement (space, time, energy) on which Laban's theory is based. In our experiments, we extract features such as the size of the box, the coordinates of the centroid, etc.
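As a minimal sketch of this preprocessing stage, the following Python/OpenCV code removes the background and extracts the bounding box and centroid of the largest moving region in each frame. The paper does not specify its segmentation algorithm, so the MOG2 background subtractor is used here only as a stand-in:

import cv2
import numpy as np

# Background subtractor as a stand-in for the paper's (unspecified) segmentation.
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def extract_box_features(frame):
    """Return (x, y, w, h, cx, cy) of the largest moving region, or None."""
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    dancer = max(contours, key=cv2.contourArea)   # assume the dancer is the largest blob
    x, y, w, h = cv2.boundingRect(dancer)         # the rectangle surrounding the object
    m = cv2.moments(dancer)
    if m["m00"] == 0:
        return None
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # silhouette centroid
    return (x, y, w, h, cx, cy)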
Without contour information, we could not discriminate between similar but contextually different human motions. We therefore introduced the number of dominant points on the boundary of the silhouette as a new feature. We use the Teh-Chin algorithm to detect the dominant points because it has shown reliable results even when the object is dynamically scaled or deformed.
With only a bounding box, we cannot discriminate between the two motions shown in the figure. In the right image, a slight motion of the dancer's left leg gives rise to new dominant points, which allows the two motions to be distinguished.
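OpenCV exposes the Teh-Chin chain approximation through the CHAIN_APPROX_TC89 flags of findContours, so a rough way to count dominant points on the silhouette boundary is sketched below, assuming a binary silhouette mask as input:

import cv2

def count_dominant_points(silhouette_mask):
    """Count dominant points on the outer silhouette boundary.

    The TC89_KCOS flag applies the Teh-Chin chain approximation, so the
    vertices it keeps serve as an estimate of the dominant points.
    """
    contours, _ = cv2.findContours(silhouette_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_TC89_KCOS)
    if not contours:
        return 0
    boundary = max(contours, key=cv2.contourArea)  # outer silhouette contour
    return len(boundary)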
We use PCA to capture the stochastic characteristics of the features. First, we apply SVD (Singular Value Decomposition) to the matrix containing the extracted features and obtain a contribution measure by the LMS (Least Mean Square) method. Using this measure, we calculate the principal values from the features.
Here, a denotes the vector of feature values and α the contribution measure.
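A minimal sketch of this step follows. The paper's LMS-based contribution measure is not specified, so the normalized singular values are used here only as a stand-in for α:

import numpy as np

def principal_values(A, k=3):
    """Project feature vectors onto their top-k principal directions.

    A : (n_frames, n_features) matrix of extracted features a.
    Returns the principal values and a contribution measure alpha
    (approximated here by normalized singular values; the paper derives
    alpha with an LMS fit, which is not reproduced in this sketch).
    """
    A_centered = A - A.mean(axis=0)            # remove the mean of each feature
    U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
    alpha = s / s.sum()                        # stand-in contribution measure
    principal = A_centered @ Vt[:k].T          # principal values of each frame
    return principal, alpha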
We use a TDMLP (Time-Delay Multi-Layer Perceptron) to classify the features. It is an inherently nonlinear classification scheme and exploits the time dependency between features, which makes it well suited to classifying dynamic data such as motion.
A set of features and its time-delayed copies are used as the input vectors.
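A minimal sketch of this classifier is given below. It approximates the TDMLP with scikit-learn's MLPClassifier applied to a tapped delay line of features; the delay depth and hidden layer size are assumptions, not values from the paper:

import numpy as np
from sklearn.neural_network import MLPClassifier

def tapped_delay_inputs(features, delay=4):
    """Concatenate each frame's features with its `delay` predecessors.

    features : (n_frames, n_features) array; frame t yields the input
    [f(t), f(t-1), ..., f(t-delay)], so the MLP sees the time dependency.
    """
    n, d = features.shape
    rows = [np.concatenate([features[t - i] for i in range(delay + 1)])
            for t in range(delay, n)]
    return np.asarray(rows)

# Hypothetical usage: X holds per-frame features, y the emotion label per frame.
# X_delayed = tapped_delay_inputs(X, delay=4)
# clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000)
# clf.fit(X_delayed, y[4:])   # labels aligned with the delayed inputs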
We obtained a recognition rate above 70% on sequences outside the training set. This compares favorably with human performance: people cannot recognize another person's emotion with that precision when they watch only his or her natural motion. In practice, we confirmed through a subjective evaluation with college students that people achieve a correctness of approximately 50~60%.