Touchless Technologies

IEEE BigMM Grand Challenge 2020

Call For Participation

About

IEEE BigMM is hosting a grand challenge to bring together novel interdisciplinary research focused on improving the social space for members of the community. Motivated by the COVID-19 pandemic, which spreads rapidly through touch and other forms of human-to-human contact, the theme of this year's challenge is touchless technologies, with a particular focus on lipreading and touchless typing.

Motivation

Recent work on touchless typing and a series of works on lipreading suggest that touchless human-computer interaction is a challenging but achievable task. In general, touchless typing refers to an activity in which an individual interacts with a computer without touching any interface: the individual provides gestural cues that are captured via devices such as a camera and/or a brain-computer interface (BCI) headset, and these gestural inputs are then translated into a sequence of clusters, letters, or words. In its current state, the technology is being put forward as assistive technology for people with disabilities; however, pandemics like COVID-19 make us think beyond that, i.e., towards developing touchless typing for everyone.

On the other hand, lipreading involves inferring words from silent speech videos. The use of the speech modality as a touchless interface has increased manifold over the last few decades, especially after the introduction of Automatic Speech Recognition (ASR) systems. However, accuracy constraints, especially across different accents and languages, still present a research challenge. A potential solution is offered by the large-scale presence of both cameras and microphones on mobile devices, and lipreading as a research problem stands right at that intersection. This is the motivation behind including it as a key part of our challenge statement.

Dataset Description

Touchless Typing

A total of 22 volunteers, all university students, participated in the data collection, of whom 19 were male and six were female. Each participant typed 20 words, ten phrases, and five sentences. The exercise was repeated by each participant, yielding a total of 2,234 recordings. The average lengths of the words, phrases, and sentences were 4.33, 10.6, and 18.6 letters, respectively. The words, three to six characters long, were chosen manually such that each of the eight clusters representing the 26 letters of the English alphabet appears in at least one word and the number of unique transitions between clusters within a word is maximised.

OuluVS2

The database contains video recordings of 52 subjects speaking three types of utterances: continuous digit strings, short phrases, and TIMIT sentences. To account for the fact that talking motion is produced in three-dimensional space, six cameras were placed around the speaker to film from five different views simultaneously, resulting in more than 20k video recordings.

Lip Reading in the Wild (LRW)

The dataset consists of up to 1,000 utterances of 500 different words, spoken by hundreds of different speakers. All videos are 29 frames (1.16 seconds) long, and the target word occurs in the middle of the video. The word duration is given in the metadata, from which the start and end frames can be determined.
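As an illustration of the last point, the start and end frames of the word can be estimated from the duration given in the metadata, assuming each clip is 29 frames at 25 fps with the word centred in the clip (the function name below is illustrative):

    # Estimate the start/end frame indices of the spoken word in an LRW clip,
    # assuming 29 frames at 25 fps with the word centred in the clip.
    def word_frame_span(duration_seconds: float, total_frames: int = 29, fps: float = 25.0):
        word_frames = max(1, round(duration_seconds * fps))  # frames covered by the word
        start = (total_frames - word_frames) // 2            # word is centred in the clip
        end = start + word_frames - 1
        return start, end

    # Example: a word lasting 0.60 s spans roughly frames 7 to 21 of the 29-frame clip.
    print(word_frame_span(0.60))  # -> (7, 21)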

Preprocessing Details For LRW Dataset

Since the number of frames varies from clip to clip, participants are expected to use an equal number of frames so that evaluation is uniform. Most clips in the dataset contain 29-31 frames; you are requested to pad each clip by repeating its last frame until it contains 32 frames, for both training and testing.

General Preprocessing Details

In addition, to enable maximum participation and relax hardware requirements, you are requested to downsample the clips to 64x64 pixels. Every video clip should therefore have a uniform shape of 32x64x64.
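A minimal preprocessing sketch combining the two requirements above is given below. It uses OpenCV to read a clip, resizes each frame to 64x64, and pads the clip to 32 frames by repeating the last frame; the grayscale conversion and the function name are illustrative assumptions, not part of an official pipeline.

    import cv2
    import numpy as np

    def load_clip(path: str, target_frames: int = 32, size: int = 64) -> np.ndarray:
        """Read a video, resize frames to size x size, and pad to target_frames
        by repeating the last frame. Returns an array of shape (32, 64, 64)."""
        cap = cv2.VideoCapture(path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # illustrative: single channel
            frames.append(cv2.resize(gray, (size, size)))
        cap.release()
        if not frames:
            raise ValueError("No frames decoded from " + path)
        while len(frames) < target_frames:                   # pad by repeating the last frame
            frames.append(frames[-1].copy())
        return np.stack(frames[:target_frames])              # (32, 64, 64)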


Challenge Overview

Task 1

Given a set of videos containing human subjects speaking some predefined but unknown phrases, the task is to cluster the related videos. Relatedness is defined by two factors: speech content and pose. A key detail is the presence of visemically equivalent words such as "billion" and "million". Clustering of speech videos is a relatively unexplored problem, and this task, in our opinion, will yield innovative architectures as well as interesting insights into solving the problem.

To make things more interesting, the words that are spoken will be disclosed, thus letting you know the correct number of clusters required (a minimal clustering sketch follows the stage list below). The task will be carried out in two stages:

  1. Stage-1: This stage will consist of samples from OuluVS2. Clustering needs to be performed on all ten phrases of OuluVS2. Test video IDs can be accessed from here.
  2. Stage-2: This stage will consist of samples from the LRW dataset. Teams that qualify in Stage-1 will be sent the details of this stage.
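As a purely illustrative Stage-1 baseline (not an official or required method), per-video feature vectors can be clustered with k-means using the known number of phrases as k; the feature extraction step is left to the participant, and the array shapes below are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_videos(video_features: np.ndarray, n_phrases: int = 10) -> np.ndarray:
        """Assign each video to one of n_phrases clusters.
        video_features is an (n_videos, d) array produced by any feature
        extractor of your choice (e.g., flattened 32x64x64 clips or
        embeddings from a pretrained visual model)."""
        kmeans = KMeans(n_clusters=n_phrases, n_init=10, random_state=0)
        return kmeans.fit_predict(video_features)   # one cluster label per video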

Evaluation Criterion

Evaluation will be done by measuring the number of train and test videos assigned to the correct cluster.
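The exact scoring script will come from the organisers; purely as an illustration, one standard way to score cluster assignments against ground-truth labels is best-matching (Hungarian) accuracy, sketched below.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Fraction of videos correctly assigned after optimally matching
        predicted cluster IDs to ground-truth labels."""
        k = int(max(y_true.max(), y_pred.max())) + 1
        counts = np.zeros((k, k), dtype=int)
        for t, p in zip(y_true, y_pred):
            counts[t, p] += 1                                # co-occurrence counts
        rows, cols = linear_sum_assignment(counts, maximize=True)
        return counts[rows, cols].sum() / len(y_true)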

Submission

Submissions can be made by sending an email to aradhyam@iiitd.ac.in and yamank@iiitd.ac.in. The submission email should include the following (an illustrative CSV-writing sketch follows the list):

  1. A CSV file containing cluster predictions for all the training videos
  2. A CSV file containing cluster predictions for all the testing videos
  3. Training and testing code
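The exact file format will follow the demo submission file released with the test data; purely as an illustration, a two-column CSV of video IDs and predicted cluster labels could be written as follows (the column names are assumptions, not the official format).

    import csv

    def write_predictions(path: str, video_ids, cluster_labels) -> None:
        """Write one row per video: its ID and its predicted cluster label."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["video_id", "cluster"])          # assumed header
            writer.writerows(zip(video_ids, cluster_labels))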

Task 2

Given a single image of a person and a word from the LRW dataset, the task is to generate the probable sequence of head and face movements of the person speaking that word. This would further help to generate synthetic data, increase the dataset size, and support generative-learning or meta-learning approaches.

Evaluation Criterion

The evaluation will be performed by measuring:

  1. Mean Squared Error (MSE),
  2. Per-frame Structural Similarity (SSIM) score, and
  3. Average Peak Signal-to-Noise Ratio (PSNR)

obtained on the test samples.
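A minimal sketch of how these per-frame metrics could be computed is shown below, assuming the generated and ground-truth clips are aligned uint8 arrays of identical shape; scikit-image is used here purely for illustration and is not mandated by the challenge.

    import numpy as np
    from skimage.metrics import structural_similarity, peak_signal_noise_ratio

    def frame_metrics(generated: np.ndarray, reference: np.ndarray):
        """Average MSE, SSIM, and PSNR over a pair of (T, H, W) uint8 clips."""
        mse, ssim, psnr = [], [], []
        for gen, ref in zip(generated, reference):
            diff = gen.astype(np.float64) - ref.astype(np.float64)
            mse.append(np.mean(diff ** 2))
            ssim.append(structural_similarity(ref, gen, data_range=255))
            psnr.append(peak_signal_noise_ratio(ref, gen, data_range=255))
        return float(np.mean(mse)), float(np.mean(ssim)), float(np.mean(psnr))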

Submission

Submissions can be made by sending an email to aradhyam@iiitd.ac.in and yamank@iiitd.ac.in. The submission email should include:

  1. Generated videos for the provided words
  2. Train and test code

Task 3

Given a single image of a person, a character sequence, and a keyboard layout (for instance, QWERTY, DHI-ATENSOR, etc.), the task is to generate the probable sequence of head and face movements of the person typing that sequence on the given layout. This would further help to generate synthetic data, increase the dataset size, and support generative-learning or meta-learning approaches.

Evaluation Criterion

The evaluation will be performed by measuring:

  1. Mean Squared Error (MSE) and
  2. Per-frame Structural Similarity (SSIM) score

obtained on test samples.

Submission

Submissions can be made by sending an email to aradhyam@iiitd.ac.in and yamank@iiitd.ac.in. The submission email should include:

  1. Generated videos for the provided phrases
  2. Train and test code

Task 4

Given a video from the touchless typing dataset and a set of readings from the MUSE headset corresponding to different typing videos, the task is to correlate the two modalities and identify the MUSE readings that correspond to the given video.

Evaluation Criterion

Evaluation will be done by measuring the number of train and test videos assigned to the correct cluster.

Submission

Submissions can be made by sending an email to aradhyam@iiitd.ac.in and yamank@iiitd.ac.in. The submission email should include:

  1. A CSV file containing cluster predictions for all the training videos
  2. A CSV file containing cluster predictions for all the testing videos
  3. Training and testing code


Note: Only one submission per day per team is allowed for each of the tasks.

Download Links

Ethical Considerations

  • All participants will be required to honour the agreement and conditions signed when obtaining the dataset.
  • The use of any other external dataset is strictly prohibited.
  • To promote reproducibility, the code must be submitted along with the initial parameters used for training.

System Description Paper Submissions

At the end of the competition, top submissions, based on leaderboard rankings, will be invited to submit system description papers detailing the proposed methodology, improvements, and limitations. More detailed submission guidelines will be provided soon. Submissions should be made via EasyChair and must follow the guidelines of IEEE BigMM 2020. All submissions must conform to the two-column IEEE format. The maximum length of papers considered for evaluation is 4 pages, excluding references.

Important Dates

Competition Time Frame:

  • April 18, 2020: Competition begins.
  • July 18, 2020: Public release of test data along with demo submission file.
  • July 20, 2020: Deadline to make the submissions.

Conference Preparation

  • August 1, 2020: Top submissions will be invited to open-source their code and models for evaluation.
  • August 5, 2020: Invitations sent to top submissions to submit system description papers for the proceedings of IEEE BigMM 2020.
  • TBA: Deadline for submitting the system description papers.

Organizing Committee

Aradhya Mathur (IIITD) - aradhyam@iiitd.ac.in

Yaman Kumar (IIITD, Adobe) - yamank@iiitd.ac.in, ykumar@adobe.com

Henglin Shi (University of Oulu) - Henglin.Shi@oulu.fi

Dr. Rajesh Kumar (Syracuse University, Haverford College) - rkumar@haverford.edu

Dr. Li Liu (University of Oulu) - li.liu@oulu.fi

Dr. Guoying Zhao (University of Oulu) - guoying.zhao@oulu.fi

Dr. Rajiv Ratn Shah (IIITD) - rajivratn@iiitd.ac.in

Terms and Conditions

  • The organizers make no warranties regarding the dataset provided, including but not limited to its correctness or completeness. The members of the organizing committee cannot be held accountable for the usage of the dataset.
  • By submitting results to this competition, you consent to the public release of your scores at this website and at IEEE BigMM workshop and in the associated proceedings, at the task organizers' discretion. Scores may include but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
  • You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  • Team constitution (members of a team) cannot be changed after the evaluation period has begun. Once the competition is over, we will release the gold labels and you will be able to determine results on various system variants you may have developed. We encourage you to report results on all of your systems (or system variants) in the system-description paper. However, we will ask you to clearly indicate the result of your official submission.

References:

  • Rustagi, Shivam, et al. "Touchless Typing using Head Movement-based Gestures." arXiv preprint arXiv:2001.09134 (2020).
  • Anina, Iryna, et al. "Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis." 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). Vol. 1. IEEE, 2015.
  • Chung, Joon Son, and Andrew Zisserman. "Lip reading in the wild." Asian Conference on Computer Vision. Springer, Cham, 2016.
  • Afouras, Triantafyllos, et al. "Deep audio-visual speech recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
  • Kumar, Yaman, et al. "Harnessing GANs for Addition of New Classes in VSR." arXiv preprint arXiv:1901.10139 (2019).
  • Uttam, Shashwat, et al. "Hush-Hush Speak: Speech Reconstruction Using Silent Videos." Proc. Interspeech 2019 (2019): 136-140.
  • Shrivastava, Nilay, et al. "MobiVSR: Efficient and Light-weight Neural Network for Visual Speech Recognition on Mobile Devices." Proc. Interspeech 2019 (2019): 2753-2757.
  • Salik, Khwaja Mohd, et al. "Lipper: Speaker independent speech synthesis using multi-view lipreading." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
  • Kumar, Yaman, et al. "Lipper: Synthesizing thy speech using multi-view lipreading." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
  • Nowosielski, Adam, and Paweł Forczmański. "Touchless typing with head movements captured in thermal spectrum." Pattern Analysis and Applications 22.3 (2019): 841-855.
  • Kumar, Yaman, et al. "Mylipper: A personalized system for speech reconstruction using multi-view visual feeds." 2018 IEEE International Symposium on Multimedia (ISM). IEEE, 2018.
  • Kumar, Yaman, et al. "Harnessing ai for speech reconstruction using multi-view silent video feed." Proceedings of the 26th ACM international conference on Multimedia. 2018.