Articulab

Carnegie Mellon University

Thin Slicing for Annotations and Gaze Detection in SCIPR

SCIPR stands for Sensing Curiosity in Play and Responding. We worked on developing a way to quantify curiosity and on building a machine learning model for it.

More details about the project can be found at its home page: http://articulab.hcii.cs.cmu.edu/projects/scipr/

Abstract

There has been a lot of research showing the advantages of thin slicing a longer video to annotate low-level behaviours such as happiness, sadness, anger and excitement. We found no research studying the effect of thin slicing on higher-level behaviours such as curiosity or trust. During my internship, we studied the thin slicing approach for annotating curiosity in young children engaged in a group activity. Slices of various lengths were experimented with. This blog discusses the statistical conclusions drawn from agreement and reliability among raters, the various approaches used for these experiments, and feature extraction for the machine learning algorithm.

Introduction

I was an intern in the Articulab, CMU during the summer of 2016. I worked closely with Dr. Zhen Bai, Tanmay Sinha and Prof. Justine Cassell on the SCIPR - Sensing Curiosity in Play and Responding project. The project aims to develop a computational model of curiosity and then a machine learning algorithm to predict the curiosity of young children involved in a group activity.

My overall research experience was very exciting and interesting. I got to see, perform and be part of some extraordinary research that I hadn't seen at my home institution in India. Learning new methods of working and increasing my efficiency are the two most important things I have taken back from my internship. My project involved machine learning and the development of an online tool, so I learnt a lot related to my main field of interest. Apart from that, I learnt managing and collaborative skills for large projects. In the past semester at BITS, I continued a few habits and methods I picked up at CMU for my semester-long project, and the results were very good. All I would like to say to new interns is: just make the most of it while you're there. Justine and the team are very helpful and frank people.

Apart from the high-quality research work, the culture and atmosphere in the Articulab are just amazing! The people strive to give their best and help each other a lot. I think this is what helps the lab prosper as a team rather than as a set of individual achievements. For my research interest of machine learning, I learnt about a lot of new techniques such as time series analysis and statistical measures like Krippendorff's Alpha and Cronbach's Alpha, and I have used them to get better results in my projects at BITS Pilani.

Thin Slicing

Thin slicing refers to looking at only a small period of an interaction to observe certain features and draw conclusions. These slices are usually very short compared to the actual interaction. Thin slicing is used to observe human emotions, behaviours and attitudes while people are interacting. By looking at a thin slice, one needs to make a very quick decision about the observable feature. One main aspect of thin slicing is that it removes the contextual information that would be available when watching the complete interaction. Studies have shown that brief observations, or judgements made from thin slices, are similar to or better than those made by watching the complete interaction with much more contextual information. Thin slicing therefore helps us understand how people make decisions without context, based on behavioural patterns, eye gaze, head and face position, hand gestures, etc.; these features tend to be ignored when contextual information is present. The length of the thin slice is an extremely important parameter, as it is used heavily when aggregating results and drawing overall conclusions.

Thin Slicing for SCIPR

There are groups of 3 or 4 children building a Rube Goldberg Machine (RGM). The complete activity is recorded on video, and this video is then thin sliced and analysed. In each video you can see 3 children at one time, with the centre child being the focus; the other two children provide some social information about the environment. So a total of 4 videos are recorded for each study. Each video is thin-sliced into slices of 10 seconds each.

The aim is to study the following after analysing the thin slices:

  • Behavioural patterns when children are curious
  • Cause of curiosity in the children
  • Social influence of curiosity

By social influence, we mean determining how the activities or behaviour of one child relate to the curiosity of the other children in the group. This is why multiple children are visible in the video rather than just a single child.
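As a concrete illustration of the slicing step described above, here is a minimal sketch that cuts a study video into consecutive 10-second slices with ffmpeg. The tool choice and file names are assumptions for illustration, not the lab's actual pipeline.

```python
# Sketch: cut one study video into consecutive 10-second slices with ffmpeg.
# "study_video.mp4" and the output name pattern are hypothetical.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "study_video.mp4",
    "-c", "copy",                # no re-encoding; cuts land on keyframes, so
                                 # slices are approximately 10 s (drop "-c copy"
                                 # to re-encode and get exact cuts)
    "-f", "segment",
    "-segment_time", "10",       # slice length in seconds
    "-reset_timestamps", "1",
    "slice_%03d.mp4",
], check=True)
```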

To achieve the above aims, the curiosity of the child first needs to be quantified in each 10-second slice. This serves as the ground truth for all future models of curiosity. Each 10-second slice is therefore annotated for curiosity on a scale of 0 to 4, with 0 being not curious and 4 being extremely curious. A confidence value, i.e. how confident the rater is about his/her curiosity rating, is also recorded. Since there are a lot of slices to annotate, crowdsourcing is the preferred method.

Crowdsourcing has previously been used to analyse human emotions, behaviours, etc., but there is no existing literature on using crowdsourcing to annotate and label human curiosity. We took a novel approach in this regard and ran several pilot studies of curiosity annotation via crowdsourcing. These studies gave us good answers to questions such as the ideal thin-slice length for curiosity annotation of child behaviour and whether crowdsourcing is feasible for such studies.

Methods

We crowdsourced the annotations and used 3 statistical measures of agreement and consistency among raters:

  • Krippendorff’s Alpha (inter-rater reliability)
  • Intraclass Correlation Coefficient (ICC)
  • Cronbach's Alpha (internal consistency across raters)

Various scripts were written to automate parts of the results analysis and the statistical computations.
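As an illustration of how these three measures can be computed (a hedged sketch, not the lab's actual scripts; it assumes the `krippendorff` and `pingouin` Python packages and a hypothetical ratings matrix):

```python
# Hypothetical example: 5 raters (columns) rating 6 slices (rows) on the 0-4 scale.
import numpy as np
import pandas as pd
import krippendorff          # pip install krippendorff
import pingouin as pg        # pip install pingouin

ratings = np.array([
    [2, 3, 2, 2, 3],
    [0, 1, 0, 1, 0],
    [4, 4, 3, 4, 4],
    [1, 2, 1, 1, 2],
    [3, 3, 4, 3, 3],
    [2, 2, 2, 3, 2],
])

# Krippendorff's alpha expects raters as rows and units (slices) as columns.
alpha = krippendorff.alpha(reliability_data=ratings.T,
                           level_of_measurement="ordinal")

# ICC via pingouin expects a long-format table: one row per (slice, rater) pair.
long = pd.DataFrame(
    [(s, r, ratings[s, r]) for s in range(ratings.shape[0])
                           for r in range(ratings.shape[1])],
    columns=["slice", "rater", "rating"])
icc = pg.intraclass_corr(data=long, targets="slice",
                         raters="rater", ratings="rating")

# Cronbach's alpha treats the raters as "items" measuring the same construct.
cronbach, _ = pg.cronbach_alpha(data=pd.DataFrame(ratings))

print(f"Krippendorff's alpha: {alpha:.3f}")
print(icc[["Type", "ICC"]])
print(f"Cronbach's alpha: {cronbach:.3f}")
```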

Method 1

From the dataset we chose 5 participants. 30 non-overlapping slices were made for each participant: 10 slices each of duration 10 seconds, 20 seconds and 30 seconds. In total we had 50 slices. These slices were rated by 5 individual raters, who were other lab interns. Every rater rated every slice, so we had a complete matrix of ratings.

A novel approach, which we did not find in any of the existing literature, was to take subsets of 3 or more raters from the full set of raters and calculate the statistical measures for each subset. Doing this gives us the best-agreeing subset of raters, whose ratings can then be used to aggregate the overall result. We required at least 3 raters in any subset because fewer than 3 are too few to compute any agreement measure. By subset we mean that, out of an n*p table of n videos rated by p raters, we compute Krippendorff's alpha for every subset of 3 or more raters (refer to the file irr.csv) and choose the best subset based on these values. The subset table has size n*m, where n is the number of videos and m is the size of the subset.
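A minimal sketch of this subset approach, assuming a complete videos-by-raters matrix as in Method 1 and the `krippendorff` package; the function name is hypothetical:

```python
# Sketch: evaluate Krippendorff's alpha for every subset of 3 or more raters
# and keep the best-agreeing subset. Assumes a complete n_videos x n_raters matrix.
from itertools import combinations
import numpy as np
import krippendorff

def best_rater_subset(ratings, min_raters=3):
    """ratings: n_videos x n_raters array; returns (best subset, its alpha)."""
    n_raters = ratings.shape[1]
    best_subset, best_alpha = None, -np.inf
    for size in range(min_raters, n_raters + 1):
        for subset in combinations(range(n_raters), size):
            # krippendorff.alpha expects raters as rows, units as columns
            a = krippendorff.alpha(
                reliability_data=ratings[:, list(subset)].T,
                level_of_measurement="ordinal")
            if a > best_alpha:
                best_subset, best_alpha = subset, a
    return best_subset, best_alpha
```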

After applying the statistical measures, we found the inter-rater reliability among all 5 raters and among the subsets of 3 or more raters as below:

To check whether the RGM is a good activity for eliciting curiosity, we also identified the slices with high curiosity ratings.

Method 2

Here we did actual crowdsourcing rather than annotation by lab interns. The dataset contained 30 slices of the same participant, each of 10 seconds duration. Different, non-overlapping slices were used for the two techniques described below, and 5 raters annotated each slice. We deployed two techniques for collecting the annotations:

  • Technique 1: Here one task is one slice, so every slice has 5 ratings. Since the ratings are crowdsourced, not every rater has rated every slice, so we can only calculate agreement and consistency for the complete set of ratings. We cannot apply the subset approach suggested above to this type of dataset, because removing one column just removes one rating from each video, not necessarily the rating by the same rater.
  • Technique 2: Here one task is ten slices, and again every slice has 5 ratings. There are 3 groups: slices 1-10, slices 11-20 and slices 21-30. Within each group there are 50 ratings, such that 5 raters have rated 10 slices each. So, unlike in Technique 1, we can apply the subset approach to each group individually. These subset results can then be aggregated to find the best results for the overall dataset.

Below are the detailed results of both techniques.

Technique 1

The inter-rater reliability stood at 0.0366. The ICC was 0.0368.

Technique 2

The IRR and ICC for each group are mentioned below. The subset analysis follows.

The inter-rater reliability for the subset analysis is as below:

OpenFace for detecting facial features

We used OpenFace to automate gaze detection from video clips. OpenFace is a recently developed open-source toolkit for extracting facial behaviour. It can detect facial landmarks, estimate head pose, recognise facial action units and estimate eye gaze, and it runs at a decent speed, so it can also be used for real-time feature extraction. The GitHub link of the project is https://github.com/TadasBaltrusaitis/OpenFace/wiki. For the gaze detection module, only the gaze estimation part of OpenFace has been used. It outputs a comma-separated file with one row per frame, which is then used as input to the automatic gaze detection module. For real-time analysis, the output of OpenFace can be redirected or piped to the automatic gaze detection module, which is written in Python.
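A small sketch of how that per-frame CSV can be consumed in Python (column names follow current OpenFace output and may differ between versions; the file name is hypothetical):

```python
# Sketch: read OpenFace's per-frame CSV and pull out the gaze estimates.
# Column names follow OpenFace 2.x output and may differ between versions.
import pandas as pd

df = pd.read_csv("participant_01.csv")       # hypothetical OpenFace output file
df.columns = df.columns.str.strip()          # OpenFace pads header names with spaces

gaze = df[["frame", "timestamp", "confidence",
           "gaze_angle_x", "gaze_angle_y"]]  # averaged left/right eye gaze angles

# Keep only frames where the face tracker succeeded.
gaze = gaze[df["success"] == 1]
print(gaze.head())
```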

Annotation for Gaze

The coding scheme for verbal and non-verbal annotations was developed by Zhen, Tanmay and the team. A small 5-minute video clip was annotated for gaze by three annotators from the lab. They did the gaze annotations in three steps:

  • All three annotators individually annotated the same 5-minute segment for gaze orientation, trunk movement, body movement and locomotion. The analysis here covers only gaze, but the remaining parameters can be handled in the same way. At this step the annotators did not look at anyone else's annotations.
  • The annotators then looked at the step 1 annotations of all three annotators and modified their own annotations where they felt the need, based on the others' work. They still worked individually and did not talk to each other while annotating.
  • Finally, all three annotators sat together, discussed, and came up with one single annotation instead of three different ones, using the annotations from steps 1 and 2 as reference.

After obtaining these annotations, a Python script converted them from the ELAN format to frame-by-frame annotations. We therefore have frame-by-frame ground truth from the in-lab annotators and frame-by-frame numerical gaze features from OpenFace. We map these together and use the result as input to the machine learning model for automatic gaze detection.
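A hedged sketch of that conversion step, assuming the annotations are exported from ELAN as tab-delimited intervals of (begin ms, end ms, label) and a 30 fps video; the file name, column layout and frame rate are assumptions:

```python
# Sketch: turn interval annotations exported from ELAN (begin_ms, end_ms, label)
# into one label per video frame, so they line up with OpenFace's per-frame rows.
import csv

FPS = 30.0  # assumed video frame rate

def elan_to_frames(tsv_path, n_frames):
    labels = ["none"] * n_frames
    with open(tsv_path, newline="") as f:
        for begin_ms, end_ms, label in csv.reader(f, delimiter="\t"):
            start = int(float(begin_ms) / 1000.0 * FPS)
            stop = min(int(float(end_ms) / 1000.0 * FPS), n_frames - 1)
            for frame in range(start, stop + 1):
                labels[frame] = label
    return labels

frame_labels = elan_to_frames("gaze_annotations.tsv", n_frames=9000)  # 5 min at 30 fps
```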

Curiosity Results Aggregation

After obtaining results from various raters through Amazon MTurk, we need to aggregate them into a single curiosity value for each slice. This is done using the curiosity and confidence ratings given by each rater for each slice: we take a weighted average of the curiosity ratings, weighted by confidence.
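A minimal sketch of this confidence-weighted aggregation (the sample ratings are hypothetical):

```python
# Sketch: aggregate several raters' curiosity scores for one slice into a
# single value, weighting each score by the rater's reported confidence.
def aggregate_curiosity(curiosity_scores, confidences):
    """curiosity_scores: 0-4 ratings; confidences: the raters' confidence values."""
    total_weight = sum(confidences)
    if total_weight == 0:
        return sum(curiosity_scores) / len(curiosity_scores)  # fall back to plain mean
    return sum(c * w for c, w in zip(curiosity_scores, confidences)) / total_weight

# Hypothetical MTurk ratings for one 10-second slice:
print(aggregate_curiosity([3, 2, 4, 3, 2], [4, 3, 5, 2, 4]))
```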

The results thus obtained are mean curiosity values per slice, while the gaze values, whether from annotations or from the automatic model, are at a per-frame level. Curiosity is subjective and cannot be meaningfully defined per frame, so we chose to define curiosity per unit, where a unit may be 500 milliseconds, 1 second, 2 seconds, 5 seconds or 10 seconds; further units can be added as required. The unit is just a parameter in the Python scripts, so it can be changed by setting the parameter and re-running the script. Currently the analysis uses a unit of 500 milliseconds. To get a per-unit curiosity value, we simply repeat the slice's value for every unit spanned by the 10-second slice; for a 500-millisecond unit, the same curiosity value is repeated 20 times per slice.
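A small sketch of the per-unit expansion (parameter and variable names are illustrative):

```python
# Sketch: expand per-slice curiosity values into per-unit values.
# With UNIT_MS = 500 and 10-second slices, each value is repeated 20 times.
UNIT_MS = 500        # analysis unit, parameterised as in the scripts
SLICE_MS = 10_000    # slice length

def slice_to_units(slice_values, unit_ms=UNIT_MS, slice_ms=SLICE_MS):
    repeats = slice_ms // unit_ms
    return [v for v in slice_values for _ in range(repeats)]

per_unit = slice_to_units([2.8, 1.5])   # two hypothetical slice means -> 40 unit values
```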

Now we have a per-unit mapping of gaze to curiosity, which we can use to find the correlation between gaze and curiosity. Currently this correlation is computed by turning both gaze and curiosity into boolean fields, i.e. curious or not and GOO or not, by setting a threshold on the curiosity value above which a unit is marked as curious.
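A minimal sketch of this step; the threshold value and the sample data are hypothetical, and correlating two boolean series with Pearson's formula gives the phi coefficient:

```python
# Sketch: binarise both signals per unit and correlate them.
import numpy as np

curiosity = np.array([2.8, 2.8, 1.5, 1.5, 3.2, 3.2])   # per-unit curiosity values
goo       = np.array([1,   1,   0,   0,   1,   0  ])   # per-unit GOO flags from the gaze labels

curious = (curiosity >= 2.0).astype(int)                # curious vs. not curious (assumed threshold)

# Pearson correlation of two boolean series is the phi coefficient.
phi = np.corrcoef(curious, goo)[0, 1]
print(f"phi = {phi:.2f}")
```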

Conclusion

From the work done on thin slicing of videos for curiosity annotation, we arrived at an optimal slice length into which the video should be divided to obtain the best annotation results. This is among the earliest work on thin-slice annotation of higher-level human behaviours.

Future Work

A lot more work can be done on building the computational model after annotation and on automatic feature extraction. For the computational model, various fusion approaches, such as early and late fusion, can be tried out. Fusion is important in this work because the features come from several different channels: facial, acoustic and verbal. For automatic feature extraction, more research is needed on the acoustic and verbal channels.
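For illustration only, a minimal sketch of the two fusion styles, with hypothetical features and an arbitrary classifier choice (not the project's actual model):

```python
# Sketch: early fusion concatenates per-channel features before one model;
# late fusion trains a model per channel and combines their predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

facial   = np.random.rand(100, 8)         # per-unit facial features (hypothetical)
acoustic = np.random.rand(100, 5)         # per-unit acoustic features (hypothetical)
labels   = np.random.randint(0, 2, 100)   # curious / not curious (hypothetical)

# Early fusion: one model over the concatenated feature vector.
early = LogisticRegression().fit(np.hstack([facial, acoustic]), labels)

# Late fusion: one model per channel, probabilities averaged afterwards.
m_face  = LogisticRegression().fit(facial, labels)
m_audio = LogisticRegression().fit(acoustic, labels)
late_scores = (m_face.predict_proba(facial)[:, 1] +
               m_audio.predict_proba(acoustic)[:, 1]) / 2
```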

Acknowledgement

The author would like to thank doctoral candidates Tanmay and Zhao for their sincere guidance and help. The author would also like to express gratitude to Prof. Cassell and Dr. Bai for their support during the course of this research. The author takes this opportunity to thank the other doctoral candidates in the Articulab for their tips and guidance wherever required. This work would not have been possible without help from the other interns working on the SCIPR project. Lastly, the author would like to thank the Articulab and CMU for this research opportunity.