IEEE BigMM is hosting a grand challenge to bring together novel interdisciplinary research focused on improving the social space for members of the community. Inspired by the COVID-19 pandemic, which spreads rapidly through touch and other forms of human-to-human contact, this year's challenge is themed around touchless technologies; in particular, it focuses on lipreading and touchless typing.
Recent work on touchless typing and a series of works on lipreading suggest that touchless human-computer interaction is a challenging but achievable task. In general, touchless typing refers to an activity in which an individual interacts with a computer without touching any interface: the individual provides gestural cues, which are captured via devices such as a camera and/or a BCI headset and then translated into a sequence of clusters, letters, or words. In its current state, the technology is being put forward as assistive tech for people with disabilities; however, pandemics like COVID-19 compel us to think beyond that, i.e., to develop touchless typing for everyone.
Lipreading, on the other hand, involves inferring words from silent speech videos. The use of the speech modality as a touchless interface has grown manifold over the last few decades, especially since the introduction of Automatic Speech Recognition (ASR) systems. However, accuracy constraints, especially across different accents and languages, still present a research challenge. A potential solution is offered by the large-scale presence of both cameras and microphones on mobile devices, and lipreading as a research problem stands right at that intersection. This is the motivation behind including it as a key part of our challenge statement.
A total of 22 volunteers participated in the data collection, of whom 19 were male and six were female university students. Each participant typed 20 words, ten phrases, and five sentences. The exercise was repeated for each participant, yielding 2,234 recordings in total. The average lengths of the words, phrases, and sentences were 4.33, 10.6, and 18.6 letters, respectively. The words, each three to six characters long, were chosen manually so that each of the 8 clusters representing the 26 letters of the English alphabet is included in at least one word and the number of unique transitions between clusters within a given word is maximized.
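To make the last criterion concrete, the sketch below counts the unique cluster transitions of a candidate word. The 8-way letter-to-cluster mapping used here is purely illustrative; the actual mapping is defined by the challenge's clustered keyboard layout.

```python
# Illustrative only: count unique cluster transitions in a word.
# NOTE: this letter-to-cluster mapping is hypothetical; the real one
# comes from the challenge's clustered keyboard layout.
CLUSTERS = ["abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwxyz"]
LETTER_TO_CLUSTER = {ch: i for i, grp in enumerate(CLUSTERS) for ch in grp}

def unique_transitions(word: str) -> set:
    """Return the set of distinct cluster-to-cluster transitions in a word."""
    ids = [LETTER_TO_CLUSTER[ch] for ch in word.lower()]
    return {(a, b) for a, b in zip(ids, ids[1:]) if a != b}

print(unique_transitions("hello"))  # -> {(2, 1), (1, 3), (3, 4)} (order may vary)
```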
The database contains video recordings of 52 subjects speaking three types of utterances: continuous digit strings, short phrases, and TIMIT sentences. To capture the fact that talking motion is produced in three-dimensional space, we placed six cameras around the speaker to film from five different views simultaneously, resulting in more than 20k video recordings.
The dataset consists of up to 1000 utterances of 500 different words, spoken by hundreds of different speakers. All videos are 29 frames (1.16 seconds) in length, and the word occurs in the middle of the video. The word duration is given in the metadata, from which you can determine the start and end frames.
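Since 29 frames span 1.16 seconds, the clips run at 25 fps; given a word duration in seconds from the metadata, the start and end frames can be recovered as in this minimal sketch:

```python
# Sketch: locate the word inside a 29-frame LRW clip.
FPS = 25          # 29 frames / 1.16 s
CLIP_FRAMES = 29

def word_frame_span(duration_s: float) -> tuple:
    """Return (start, end) frame indices of the word, centered in the clip."""
    n_word = round(duration_s * FPS)
    start = (CLIP_FRAMES - n_word) // 2
    return start, start + n_word - 1

print(word_frame_span(0.48))  # (8, 19): a 12-frame word centered in the clip
```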
Preprocessing Details for the LRW Dataset
Since the number of frames varies from clip to clip, we want participants to use an equal number of frames for uniformity during evaluation. Most clips in the dataset contain 29-31 frames; you are requested to pad each clip by repeating its last frame until it reaches 32 frames, for both training and testing.
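A minimal padding sketch, assuming the clip has been loaded as a NumPy array of frames:

```python
import numpy as np

def pad_to_32(frames: np.ndarray) -> np.ndarray:
    """Pad a (T, H, W) clip to 32 frames by repeating the last frame."""
    if frames.shape[0] >= 32:
        return frames[:32]  # defensively truncate any longer clip
    tail = np.repeat(frames[-1:], 32 - frames.shape[0], axis=0)
    return np.concatenate([frames, tail], axis=0)
```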
General Preprocessing Details
Also, to enable maximum participation and relax hardware requirements, you are requested to downsample the frames of each clip to 64x64. Every video clip should thus have a uniform shape of 32x64x64.
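Putting both steps together, one possible preprocessing pipeline looks as follows. OpenCV is used here for decoding and resizing, and the grayscale conversion is our assumption; keep the color channels if your model needs them.

```python
import cv2
import numpy as np

def preprocess_clip(path: str) -> np.ndarray:
    """Read a video, downsample each frame to 64x64, and pad to 32 frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # grayscale: our assumption
        frames.append(cv2.resize(gray, (64, 64)))
    cap.release()
    clip = np.stack(frames)
    if clip.shape[0] < 32:  # pad by repeating the last frame
        tail = np.repeat(clip[-1:], 32 - clip.shape[0], axis=0)
        clip = np.concatenate([clip, tail], axis=0)
    return clip[:32]  # uniform shape: (32, 64, 64)
```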
Given a set of videos of human subjects speaking some predefined but unknown phrases, the task is to cluster related videos. Relatedness is defined by two factors: speech content and pose. A key detail about the videos is the presence of visemically equivalent phrases, such as "billion" and "million". Clustering speech videos is a relatively unexplored problem, and this task, in our opinion, will yield innovative architectures as well as interesting insights into the problem.
To make things more interesting, the words that are spoken will be disclosed, thus letting you know the correct number of clusters required. The task will be carried out in two stages.
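Because the number of clusters is known, a natural baseline is k-means over per-video embeddings. The sketch below assumes you have already extracted a fixed-length embedding per video; any lip or pose feature extractor would do.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_videos(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    """Assign each video (one embedding row per video) to one of n_clusters."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(embeddings)

# Toy usage: 40 videos with hypothetical 128-d embeddings, 10 known phrases.
labels = cluster_videos(np.random.rand(40, 128), n_clusters=10)
```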
Evaluation will be done by measuring the number of train and test videos assigned to the correct cluster.
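As an unofficial illustration of this metric (not the organizers' scoring script), predicted clusters can be matched to ground-truth clusters with the Hungarian algorithm and the correctly assigned videos counted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of videos correctly assigned, after optimally matching
    predicted cluster IDs to ground-truth cluster IDs."""
    k = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    rows, cols = linear_sum_assignment(-counts)  # maximize total matches
    return counts[rows, cols].sum() / len(y_true)
```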
Submissions can be made by sending an email to aradhyam@iiitd.ac.in and yamank@iiitd.ac.in. The submission email should include:
Given a single image of a person and a word from the LRW dataset, the task is to generate the probable sequence of head and face movements of the person speaking that word. This would further help to generate synthetic data, increase the dataset size, and support generative-learning or meta-learning approaches.
The evaluation will be performed by measuring the scores obtained on the test samples.
Submissions can be made by sending an email to aradhyam@iiitd.ac.in and yamank@iiitd.ac.in. The submission email should include:
Given a single image of a person, a character sequence, and a keyboard layout (for instance, QWERTY, DHI-ATENSOR, etc.), the task is to generate the probable sequence of head and face movements for typing that sequence. This would further help to generate synthetic data, increase the dataset size, and support generative-learning or meta-learning approaches.
The evaluation will be performed by measuring the scores obtained on the test samples.
Submissions can be made by sending an email to aradhyam@iiitd.ac.in and yamank@iiitd.ac.in. The submission email should include:
Given a video from the touchless typing dataset and various readings from the MUSE device, each corresponding to a different typing video, the task is to identify the MUSE reading that corresponds to the given video.
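As a rough baseline sketch (the signal choices here, frame-difference energy and linear resampling, are our assumptions rather than part of the task definition), one could summarize the video as a per-frame motion signal and pick the MUSE reading that correlates best with it:

```python
import numpy as np

def motion_signal(frames: np.ndarray) -> np.ndarray:
    """Per-frame motion energy of a (T, H, W) clip: mean absolute frame difference."""
    return np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))

def best_match(video_sig: np.ndarray, muse_readings: list) -> int:
    """Index of the 1-D MUSE reading most correlated with the video's motion."""
    def norm(x):
        x = np.asarray(x, dtype=np.float64) - np.mean(x)
        return x / (np.linalg.norm(x) + 1e-8)
    v = norm(video_sig)
    grid = np.linspace(0.0, 1.0, len(v))
    scores = [norm(np.interp(grid, np.linspace(0.0, 1.0, len(r)), r)) @ v
              for r in muse_readings]
    return int(np.argmax(scores))
```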
Evaluation will be done by measuring the number of train and test videos matched to the correct MUSE reading.
Submissions can be made by sending an email to aradhyam@iiitd.ac.in and yamank@iiitd.ac.in. The submission email should include:
Note: only one submission per team per day is allowed for any of the tasks.
At the end of the competition, based on leaderboard rankings, the top submissions will be invited to submit system-description/model papers describing the proposed methodology, its improvements, and its limitations. More detailed submission guidelines will be provided soon. Submissions should be made via EasyChair and must follow the guidelines of IEEE BigMM 2020. All submissions must conform to the two-column IEEE format. The maximum paper length considered for evaluation is 4 pages, excluding references.
Competition Time Frame:
Conference Preparation
Aradhya Mathur (IIITD) - aradhyam@iiitd.ac.in
Yaman Kumar (IIITD, Adobe) - yamank@iiitd.ac.in, ykumar@adobe.com
Henglin Shi (University of Oulu) - Henglin.Shi@oulu.fi
Dr. Rajesh Kumar (Syracuse University, Haverford College) - rkumar@haverford.edu
Dr. Li Liu (University of Oulu) - li.liu@oulu.fi
Dr. Guoying Zhao (University of Oulu) - guoying.zhao@oulu.fi
Dr. Rajiv Ratn Shah (IIITD) - rajivratn@iiitd.ac.in