Traffic accidents cause over a million deaths every year, a large fraction of which is attributed to drunk driving. An automated intoxicated-driver detection system in vehicles would be useful in reducing accidents and the related financial costs. Existing solutions require special equipment such as electrocardiograms, infrared cameras, or breathalyzers. In this work, we propose a new dataset called DIF (Dataset of perceived Intoxicated Faces), which contains audio-visual data of intoxicated and sober people obtained from online sources. To the best of our knowledge, this is the first work on automatic bimodal non-invasive intoxication detection. Convolutional Neural Networks (CNN) and Deep Neural Networks (DNN) are trained to compute the video and audio baselines, respectively. A 3D CNN is used to exploit the spatio-temporal changes in the videos. A simple variation of the traditional 3D convolution block is proposed, based on inducing nonlinearity between the spatial and temporal channels. Extensive experiments are performed to validate the approach and baselines.
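The factorized 3D convolution block is described only at a high level here; as a rough illustration, the sketch below splits a 3D convolution into a spatial and a temporal convolution with a nonlinearity in between, in Keras. The layer sizes and input shape are assumptions for illustration, not the paper's exact architecture.

```python
# A minimal sketch (not the paper's exact block) of a 3D convolution factorized
# into spatial and temporal convolutions with a nonlinearity in between.
import tensorflow as tf
from tensorflow.keras import layers

def spatio_temporal_block(x, filters, spatial_kernel=3, temporal_kernel=3):
    """Spatial convolution over each frame, ReLU, then a temporal convolution."""
    x = layers.Conv3D(filters, (1, spatial_kernel, spatial_kernel), padding="same")(x)
    x = layers.ReLU()(x)  # nonlinearity induced between the spatial and temporal parts
    x = layers.Conv3D(filters, (temporal_kernel, 1, 1), padding="same")(x)
    return layers.ReLU()(x)

# Example input: a 10-second clip at 24 fps of 64x64 RGB face crops (assumed shape).
inputs = tf.keras.Input(shape=(240, 64, 64, 3))
outputs = spatio_temporal_block(inputs, filters=32)
model = tf.keras.Model(inputs, outputs)
```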
In this work, the final dataset is created from videos uploaded to YouTube, Periscope, and Twitch. We cannot share the final dataset as the copyright remains with the original owners of the videos.
We can share the links and the features as described below:
Links: the YouTube links to the original videos from which the dataset was created, plus the link to the video processing code (the code to create the dataset from the raw videos). You can recreate the final dataset using these links.
Features: the features extracted from the final dataset along with the train/val/test split information. These include VGG Face fc7 features for the video frames and audio features extracted with the openSMILE library.
If you find any of the above-mentioned data suitable for your work, please fill out this form: https://forms.gle/7yi6iDNNEQu9NQKKA
**NOTE**: The shared data is to be used only for academic/research purposes. You must agree that neither the dataset nor its derivatives will be used for commercial purposes.
It is not possible to recreate exactly the same dataset from the shared links due to factors such as expired video links and the manual steps (e.g., trimming and cleaning) involved in the data processing. The complete codebase consists of four stages, as shown in the diagram below:
The data has been collected from YouTube, Periscope, and Twitch, and the links are aggregated in Excel sheets for both the Drunk and Sober categories. These videos are downloaded and trimmed manually (a download sketch is given below). Some of the links may have expired, or the videos may have been removed from the user's channel.
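For reference, a minimal download sketch is shown below. It assumes the link sheets have been exported to a CSV with a `url` column and uses yt-dlp; the tool actually used for this manual step is not part of the released code, so this is only an assumption.

```python
# A minimal sketch, assuming a CSV export of a link sheet and the yt-dlp tool;
# neither is the project's actual tooling for this manual step.
import csv
import subprocess

def download_links(csv_path, out_dir):
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row["url"]  # hypothetical column name in the exported sheet
            # Expired or removed links will simply fail; check=False keeps going.
            subprocess.run(
                ["yt-dlp", "-f", "mp4", "-o", f"{out_dir}/%(id)s.%(ext)s", url],
                check=False,
            )

# download_links("drunk_links.csv", "raw_videos/drunk")
```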
In the initial approach, based on visual frames, features related to eye gaze, face pose, and facial expressions were extracted (arXiv link: DIF). The video processing pipeline consisted of these steps: Shot Segmentation, Face Detection, Face Tracking, Face Cropping, Face Alignment, Feature Extraction, Face Clustering, and Process Output. For more details about the pipeline, check out the GitHub repo and documentation.
In this work, we modified and extended the pipeline for audio processing and used different visual features such as VGG Face and Inception. These are the details of the modifications and the manual effort involved in each step:
The Shot Segmentation, Face Detection, and Face Tracking steps remained the same, whereas the Face Cropping stage was modified to keep track of the exact frame number in the original video (by saving an additional CSV file for each cropped video that maps each relative frame number to the original video frame number).
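A minimal sketch of the kind of mapping file saved in this modified stage is shown below; the exact file layout and column names used in the released code may differ.

```python
# A minimal sketch: one CSV per cropped video, pairing each relative frame
# number with the corresponding frame number in the original video.
import csv

def save_frame_mapping(csv_path, original_frame_numbers):
    """original_frame_numbers[i] is the source frame used for cropped frame i."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["relative_frame", "original_frame"])  # assumed column names
        for rel, orig in enumerate(original_frame_numbers):
            writer.writerow([rel, orig])

# e.g. cropped frames 0..2 came from frames 1200, 1201, 1203 of the source video
save_frame_mapping("crop_0001_mapping.csv", [1200, 1201, 1203])
```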
The Face Alignment stage can produce some blank videos, which need to be checked and removed manually.
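One way such blank outputs could be flagged automatically is sketched below; this is an assumption about how the check might be done (the pipeline relies on a manual check), using OpenCV and a simple mean-intensity threshold.

```python
# A minimal sketch, not the pipeline's check: a video is treated as blank if
# every sampled frame is (nearly) uniformly dark.
import cv2
import numpy as np

def is_blank_video(path, threshold=5.0, sample_every=10):
    cap = cv2.VideoCapture(path)
    idx, blank = 0, True
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0 and np.mean(frame) > threshold:
            blank = False  # found a frame with real content
            break
        idx += 1
    cap.release()
    return blank
```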
Since the eye gaze and facial expression features from the visual frames are no longer used, the Feature Extraction and Process Output stages were removed.
After face alignment, Face Clustering was done in three or four chunks due to memory constraints when clustering a large number of videos. After each chunk, an offset was added manually to the face ids (the assigned cluster ids) to avoid overlap across chunks. These ids were used only to create a face-wise train/val/test split.
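The offsetting amounts to the merge sketched below, which shifts each chunk's cluster ids so the global face ids never collide; it mirrors the manual step described above rather than the exact code used.

```python
# A minimal sketch of merging per-chunk cluster ids into non-overlapping face ids.
def merge_chunk_labels(chunks):
    """chunks: list of per-chunk cluster-id lists; returns one global id list."""
    merged, offset = [], 0
    for labels in chunks:
        merged.extend(label + offset for label in labels)
        offset += max(labels) + 1  # next chunk starts after the largest id so far
    return merged

# chunk 1 has faces 0..2, chunk 2 has faces 0..1 -> global ids 0..4
print(merge_chunk_labels([[0, 1, 2, 1], [0, 1, 0]]))  # [0, 1, 2, 1, 3, 4, 3]
```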
In the next step, the face-aligned frame videos were broken into 10-second chunks at a frame rate of 24 fps. For the corresponding audio snippet, the original video was trimmed at the timestamps (obtained via the CSV files from the Face Cropping stage) of the first and last frames of the frame video.
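A minimal sketch of this audio trimming step is given below; it assumes ffmpeg is available and the original video's frame rate is known, and the function and file names are illustrative.

```python
# A minimal sketch, assuming ffmpeg is installed: convert the chunk's first and
# last original frame numbers (from the Face Cropping CSV) to timestamps and
# cut the matching audio snippet from the original video.
import subprocess

def extract_audio_snippet(original_video, out_wav, first_orig_frame,
                          last_orig_frame, original_fps):
    start = first_orig_frame / original_fps
    end = (last_orig_frame + 1) / original_fps
    subprocess.run(
        ["ffmpeg", "-y", "-i", original_video,
         "-ss", str(start), "-to", str(end),
         "-vn", "-acodec", "pcm_s16le", out_wav],
        check=True,
    )

# e.g. a chunk covering original frames 1200..1439 of a 29.97 fps source video
extract_audio_snippet("source.mp4", "chunk_0001.wav", 1200, 1439, 29.97)
```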
Visual Baseline
CNN-RNN: For a frame video, features were extracted using pre-trained CNN models such as Inception and VGG Face. The best results were obtained with the fc7 layer of the VGG Face network. VGG-Face Keras model weights: https://gist.github.com/EncodeTS/6bbe8cb8bebad7a672f0d872561782d9
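A rough sketch of per-frame fc7 extraction with a Keras VGG-Face model is shown below; the loader `build_vgg_face()` and the layer name `"fc7"` are assumptions about how the gist's model is defined, not guaranteed names.

```python
# A rough sketch, not the released extraction code: cut a Keras VGG-Face model
# at its fc7 layer and run it over the aligned face frames of one video.
from tensorflow.keras.models import Model

def fc7_extractor(vgg_face_model):
    # The second fully connected layer is assumed to be named "fc7".
    return Model(inputs=vgg_face_model.input,
                 outputs=vgg_face_model.get_layer("fc7").output)

def extract_video_features(frames, extractor):
    """frames: array of shape (num_frames, 224, 224, 3), already preprocessed."""
    return extractor.predict(frames)  # one 4096-d fc7 descriptor per frame

# vgg_face = build_vgg_face()  # hypothetical loader for the gist's model/weights
# features = extract_video_features(frames, fc7_extractor(vgg_face))
```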
Audio Baseline: For each audio sample, the features are computed using the openSMILE library [11], specifically the INTERSPEECH 2010 Paralinguistic Challenge feature set: https://audeering.github.io/opensmile/get-started.html#the-interspeech-2010-paralinguistic-challenge-feature-set
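A minimal sketch of computing the IS10 features with the openSMILE command-line tool is shown below; the path to the IS10_paraling.conf config file depends on the openSMILE installation and is an assumption here.

```python
# A minimal sketch, assuming the SMILExtract binary is on PATH; the config file
# location is an assumption that depends on the openSMILE version/installation.
import subprocess

def extract_is10_features(wav_path, out_arff,
                          config="opensmile/config/is09-13/IS10_paraling.conf"):
    # -O writes the utterance-level functionals (ARFF format) for this config.
    subprocess.run(
        ["SMILExtract", "-C", config, "-I", wav_path, "-O", out_arff],
        check=True,
    )

extract_is10_features("chunk_0001.wav", "chunk_0001_is10.arff")
```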
Given the audio and visual features, the visual CNN-RNN and audio baseline experiments can be reproduced, whereas the 3D CNN experiments require the visual frames, which cannot be shared, and therefore cannot be reproduced exactly.
If you find this work useful for your research, please consider citing our work:
@inproceedings{mehta2019dif,
title={DIF: Dataset of Perceived Intoxicated Faces for Drunk Person Identification},
author={Mehta, Vineet and Katta, Sai Srinadhu and Yadav, Devendra Pratap and Dhall, Abhinav},
booktitle={2019 International Conference on Multimodal Interaction},
pages={367--374},
year={2019},
organization={ACM}
}
The feature extraction code and the training code are available at https://github.com/ivineetm007/drunk-detection
The visual dataset creation code is available at https://github.com/DevendraPratapYadav/gsoc18_RedHenLab/tree/master/video_processing_pipeline
For any questions, please send an email to the authors below:
Vineet Mehta (vinmehta007@gmail.com)
Abhinav Dhall (abhinav.dhall@monash.edu)