LSPD: A large scale dataset for benchmarking pornographic visual content detection and classification
Dinh-Duy Phan, Thanh-Thien Nguyen, Quang-Huy Nguyen, Hoang-Loc Tran, Khac-Ngoc-Khoi Nguyen, Duc-Lung Vu
Description
One of the most concerning cybersecurity problems is content censoring. Pornography detection is particularly difficult. Machine-learning models work efficiently only when a comprehensive dataset is available. This paper introduces a new dataset named Large-Scale dataset for Pornography Detection (LSPD) that intends to advance the standard quality of pornographic visual content classification and sexual object detection/segmentation tasks. The dataset gathers a large-scale corpus of pornographic/normal images and videos containing a rich diversity of context. The images and videos are not only labeled with their representative class but are also annotated by polygon masks of four private sexual objects (breasts, male and female genitals, and anuses). Our dataset contains 500,000 images and 4,000 videos, with 93,810 labeled instances in 50,212 images. To ensure fair use of the dataset, we present a detailed statistical analysis and provide baseline benchmarking scenarios for both image/video classification and instance detection/segmentation tasks. Finally, we evaluate the performance of four object detection and our instance segmentation algorithms on the LSPD dataset in several benchmarking scenarios.
Fig. 1. Sample images in our dataset (left to right): drawing, hentai, normal, porn, and sexy
Summary of LSPD:
Table 1. Resolution of images in the LSPD dataset
Table 2. Distribution of the LSPD dataset
Table 3. Distribution of annotated images in the LSPD
Table 4. Durations of porn videos
Table 5. Durations of non-porn videos
Data Construction
1. Image/video collecting
The porn images were mainly collected from adult content websites on the Internet. The non-porn images were obtained by searching the Google search engine with approximately 250 keywords on several topics including people, nature, urban, rural, cartoon, art, transport, economics, and science. For each keyword, we downloaded approximately 1,000 images from the search results.
The videos were collected from both adult and non-porn video websites. The videos contain varied scenes and are of various qualities, being produced by both amateur and professional users. Among the videos collected were hentai, cartoon, news, film, and music.
2. Image/video filtering and labeling for classification tasks
Image set:
To ensure the quality of the dataset, we checked all downloaded images manually. Each image was retained only if it satisfied the following requirements:
The width and height of the image must be at least 300 pixels.
The content must be sufficiently clear to be recognized by normal people.
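The resolution requirement above can be expressed as a simple predicate. The following is a minimal sketch, not the authors' tooling: the function name `meets_resolution_rule`, the file names, and the `(width, height)` inputs are assumptions for illustration. The clarity requirement was checked manually and is not modeled here.

```python
MIN_SIDE = 300  # minimum width and height in pixels, as stated in the rule above


def meets_resolution_rule(width: int, height: int, min_side: int = MIN_SIDE) -> bool:
    """Return True when both sides are at least `min_side` pixels."""
    return width >= min_side and height >= min_side


# Hypothetical (width, height) pairs standing in for downloaded images.
candidates = {"a.jpg": (640, 480), "b.jpg": (299, 800), "c.jpg": (300, 300)}

# Keep only the images that pass the resolution rule.
kept = [name for name, (w, h) in candidates.items() if meets_resolution_rule(w, h)]
print(kept)
```

In practice the width and height would be read from each image file with an imaging library before applying the predicate.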
After filtering, the image set contains 500,000 images divided into five categories: 200,000 porn images, 150,000 normal images, 50,000 sexy images, 50,000 hentai images, and 50,000 drawings.
Video set:
The video set was divided into only two classes: porn and non-porn. All videos with resolutions lower than 240p were discarded. We also removed videos that did not clearly fit into one of the two classes. The filtered video set contains 4,000 videos: 2,000 porn and 2,000 non-porn.
3. Image annotating for detection tasks
To build the annotated set for evaluating the object detection algorithms, we randomly selected 50,212 images from the porn and hentai classes, then labeled four main sexual objects on these images: male genitals, female genitals, female breasts, and anuses. Using the VGG Image Annotator tool, we annotated every image with polygons and their respective labels describing sexual objects. This annotated information is easily converted into segmentation bitmaps for Mask R-CNN and Cascade Mask R-CNN training, or into rectangular bounding-box coordinates for Single-Shot multi-box Detector (SSD) and YOLOv4 training. During the annotating process, the following rules were applied to ensure the quality of the annotated set:
The annotated sexual objects must be clear and visible, and the annotations must not overlap with other annotations.
The female breast is annotated only when its nipples are visible.
The area of the annotated objects cannot be less than 20 x 20 pixels.
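The conversion from polygon annotations to bounding boxes described above can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' conversion script. It reads the `all_points_x` and `all_points_y` lists that VGG Image Annotator stores for a polygon region. The class id, the image size, and the normalized `class cx cy w h` output line (the text layout commonly used for YOLO training) are assumptions. The size check interprets "20 x 20 pixels" as a bounding box at least 20 pixels wide and 20 pixels high, which is one possible reading of the rule.

```python
from typing import Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, width, height) in pixels


def polygon_to_bbox(shape: dict) -> BBox:
    """Return the tightest axis-aligned box around a VIA polygon region."""
    xs, ys = shape["all_points_x"], shape["all_points_y"]
    x_min, y_min = min(xs), min(ys)
    return x_min, y_min, max(xs) - x_min, max(ys) - y_min


def to_yolo_line(class_id: int, bbox: BBox, img_w: int, img_h: int) -> str:
    """Format a box as 'class cx cy w h' with coordinates normalized to [0, 1]."""
    x, y, w, h = bbox
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"


def is_large_enough(bbox: BBox, min_side: int = 20) -> bool:
    """Assumed reading of the size rule: both box sides are >= 20 pixels."""
    _, _, w, h = bbox
    return w >= min_side and h >= min_side


# Hypothetical polygon on a hypothetical 400 x 300 image; class id 2 is arbitrary.
shape = {
    "name": "polygon",
    "all_points_x": [100, 180, 160, 110],
    "all_points_y": [50, 60, 120, 110],
}
bbox = polygon_to_bbox(shape)
if is_large_enough(bbox):
    print(to_yolo_line(2, bbox, img_w=400, img_h=300))
```

Producing segmentation bitmaps would instead rasterize the same polygon points into a binary mask, which requires an imaging library and is omitted here.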
The annotation process was performed in two phases:
Phase 1: We annotated 30,000 explicit pornographic and hentai images from scratch. These annotations were then available for training object detection models.
Phase 2: The Mask R-CNN model was trained on the annotated data from phase 1. The trained model then predicted the segmentation mask on the remaining 20,000 images to expand our annotated set. The annotations were re-checked by five authors to ensure their correctness. The checking was performed by (1) removing false segmentations, (2) modifying incorrect segmentations, and (3) annotating wrongly detected objects.
Benchmarking
Classification task
In binary classification tasks, predictions are most suitably evaluated through the confusion matrix, which divides them into four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Based on these four values, we define the accuracy, precision, and recall.
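The three metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions, which we assume are the ones intended here: accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), and recall = TP / (TP + FN). The function name and the example counts are hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Guard every division so that empty denominators yield 0.0 instead of an error.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts, for illustration only (not results from the dataset).
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Treating "porn" as the positive class, precision measures how many flagged items are truly pornographic, while recall measures how many pornographic items are caught.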
Detection and segmentation task
To evaluate the detection and segmentation tasks, we selected the mean average precision (mAP).
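As a rough illustration of how mAP is typically computed, the sketch below shows box intersection-over-union (IoU), per-class average precision (AP) as the area under the precision-recall curve, and the mean over classes. Several details are assumptions, since the text does not specify them: the all-point interpolation of the curve, the function names, and the example values. The true-positive flags are taken as given. In a full evaluation they would come from matching each detection to ground truth using an IoU threshold, which this page does not state.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """AP for one class from (confidence, is_true_positive) pairs."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions: List[float] = []
    recalls: List[float] = []
    tp = fp = 0
    for _, hit in ranked:
        tp += hit
        fp += not hit
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class) if ap_per_class else 0.0


# Hypothetical detections for one class with two ground-truth objects.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
print(f"AP: {ap:.4f}, IoU: {overlap:.4f}")
```

For segmentation, the same procedure applies with mask IoU in place of box IoU.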
Disclaimer
THIS DATABASE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The videos and images provided were produced by third parties, who may have retained copyrights. They are provided strictly for non-profit research purposes, under limited and controlled distribution, and are intended to fall under the fair-use limitation. We make no guarantees and accept no responsibility whatsoever for any copyright issue. Use at your own risk.
Download
The LSPD dataset used to support the findings of this study was supplied by the CEAI group under license and so cannot be made freely available. Requests for access to these data should be made to the CEAI group via email: duypd@uit.edu.vn.
Acknowledgement
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM).
Citation
If you use our LSPD dataset, please cite it as:
@article{phanlspd,
title={LSPD: A Large-Scale Pornographic Dataset for Detection and Classification},
author={Phan, Dinh Duy and Nguyen, Thanh Thien and Nguyen, Quang Huy and Tran, Hoang Loc and Nguyen, Khac Ngoc Khoi and Vu, Duc Lung},
journal = {International Journal of Intelligent Engineering and Systems},
publisher = {Intelligent Networks and Systems Society},
volume = {15},
issue = {1},
pages = {198--231}
}
Publications
@article{phanlspd,
title={LSPD: A Large-Scale Pornographic Dataset for Detection and Classification},
author={Phan, Dinh Duy and Nguyen, Thanh Thien and Nguyen, Quang Huy and Tran, Hoang Loc and Nguyen, Khac Ngoc Khoi and Vu, Duc Lung},
journal = {International Journal of Intelligent Engineering and Systems},
publisher = {Intelligent Networks and Systems Society},
volume = {15},
issue = {1},
pages = {198--231}
}
@inproceedings{tran2020additional,
title={Additional learning on object detection: A novel approach in pornography classification},
author={Tran, Hoang-Loc and Nguyen, Quang-Huy and Phan, Dinh-Duy and Nguyen, Thanh-Thien and Nguyen, Khac-Ngoc-Khoi and Vu, Duc-Lung},
booktitle={International Conference on Future Data and Security Engineering},
pages={311--324},
year={2020},
organization={Springer}
}
@INPROCEEDINGS{9140734,
author={Nguyen, Quang-Huy and Nguyen, Khac-Ngoc-Khoi and Tran, Hoang-Loc and Nguyen, Thanh-Thien and Phan, Dinh-Duy and Vu, Duc-Lung},
booktitle={2020 RIVF International Conference on Computing and Communication Technologies (RIVF)},
title={Multi-level detector for pornographic content using CNN models},
year={2020},
pages={1-5},
doi={10.1109/RIVF48685.2020.9140734}}
@article{Phan2021,
title = {A Novel Pornographic Visual Content Classifier based on Sensitive Object Detection},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2021.0120591},
url = {http://dx.doi.org/10.14569/IJACSA.2021.0120591},
year = {2021},
publisher = {The Science and Information Organization},
volume = {12},
number = {5},
author = {Dinh-Duy Phan and Thanh-Thien Nguyen and Quang-Huy Nguyen and Hoang-Loc Tran and Khac-Ngoc-Khoi Nguyen and Duc-Lung Vu}
}
@article{phan2022joint,
title={Joint inter-intra representation learning for pornographic video classification},
author={Phan, Dinh-Duy and Nguyen, Quang-Huy and Nguyen, Thanh-Thien and Tran, Hoang-Loc and Vu, Duc-Lung},
journal={Indonesian Journal of Electrical Engineering and Computer Science},
volume={25},
number={3},
pages={1481--1488},
year={2022}
}
Contacts
Mr. Dinh-Duy Phan
Faculty of Computer Engineering, University of Information Technology, VNU-HCMC
Email: duypd@uit.edu.vn
Tel: +84 977 989 077