Download

File Instructions

The instructions for downloading VFP290K are as follows:

METADATA Information

Our metadata for the VFP290K dataset consists of the following attributes, each described below: light conditions, camera height, background, location, view, scene, occlusion, and number of frames:

Light conditions (M_day, M_night). Light conditions are among the most important factors affecting fallen person detection performance, since image quality depends heavily on them. We consider two light conditions, M_day and M_night; in the VFP290K dataset they are encoded as 0 for M_day and 1 for M_night.

Camera heights (M_low, M_high). Camera height is directly related to modeling CCTV and indoor camera environments. It is not trivial to model all possible height requirements, as there are many types of CCTV devices mounted at varying positions. To address this, we film videos at two camera heights, M_low and M_high, where M_low denotes a camera height of approximately 1 to 3 m and M_high denotes a height above 3 m.

Background (M_street, M_park, M_building). One of the main contributions of our approach is covering numerous backgrounds, from public areas to indoor environments. We divide the background into three sub-categories: street, park, and building, encoded as 0, 1, and 2, respectively.

Location (M_location) & View (M_view). We identify each location within a background by enumerating the videos. The total number of locations is 13, 30, and 6 for the street, park, and building backgrounds, respectively. We likewise identify each view within a location by enumerating the videos. The minimum, average, and maximum numbers of viewpoints per location are 1, 2.6, and 15, respectively. We vary the viewpoints within a given area, such as a playground.

Scene (M_scene). Along with the three background categories, we also categorize the filming locations as scenes. We enumerate scenes by dividing each location by its views, yielding 131 distinct scenes. Each scene is recorded by varying the filming view substantially within its location.

Occlusion (M_occlusion). We specify whether a video contains occlusion, assigning 0 to videos without occlusion and 1 to videos with occlusion.

Number of Frames. We specify the number of frames of each video to control the distribution of the benchmark.
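As a minimal sketch of how these encodings might be used together, the Python snippet below builds metadata records for two videos and filters them. The dictionary keys (light, height, background, and so on) and all values are our own illustrative choices and may not match the keys in the actual metadata files.

# A minimal sketch of the metadata encodings described above.
# NOTE: the dictionary keys and values are hypothetical; the actual
# VFP290K metadata files may use different names.
videos = [
    {"video": "GOPR0001",
     "light": 0,        # 0: M_day, 1: M_night
     "height": 0,       # 0: M_low (about 1-3 m), 1: M_high (above 3 m)
     "background": 1,   # 0: street, 1: park, 2: building
     "location": 4, "view": 2, "scene": 17,
     "occlusion": 0,    # 0: without occlusion, 1: with occlusion
     "n_frames": 1520},
    {"video": "GOPR0002", "light": 1, "height": 1, "background": 0,
     "location": 3, "view": 1, "scene": 5, "occlusion": 1, "n_frames": 980},
]

# Example query: daytime, low-camera videos filmed in a park.
day_low_park = [v["video"] for v in videos
                if v["light"] == 0 and v["height"] == 0 and v["background"] == 1]
print(day_low_park)  # -> ['GOPR0001']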

If you use our dataset, please cite it as follows:

@inproceedings{an2021vfp290k,
  title={VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection},
  author={An, Jaeju and Kim, Jeongho and Lee, Hanbeen and Kim, Jinbeom and Kang, Junhyung and Shin, Saebyeol and Kim, Minha and Hong, Donghee and Woo, Simon S},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021}
}

The total size of the files is about 190 GB.

The data organization is listed as follows:

"VFP290K": {

| - - - Video 1

| - - - images

| - - - clean_xml

| - - - Video 2

. . .

| - - - Video 178

}

NOTE: Each video folder is named 'GOPRXXXX'.
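A minimal sketch of iterating over this layout is shown below. It assumes that images and their XML annotations share basenames across the images and clean_xml folders, which should be verified against the actual release; the root path is a placeholder.

import os

# Walk the VFP290K layout sketched above: each video folder ('GOPRXXXX')
# is assumed to hold an 'images' directory and a 'clean_xml' directory
# whose files share basenames (an assumption to verify on the real data).
root = "VFP290K"  # placeholder path to the extracted dataset

for video in sorted(os.listdir(root)):
    img_dir = os.path.join(root, video, "images")
    xml_dir = os.path.join(root, video, "clean_xml")
    if not (os.path.isdir(img_dir) and os.path.isdir(xml_dir)):
        continue
    for img_name in sorted(os.listdir(img_dir)):
        stem, _ = os.path.splitext(img_name)
        xml_path = os.path.join(xml_dir, stem + ".xml")
        if os.path.exists(xml_path):
            # image/annotation pair is ready for loading
            print(os.path.join(img_dir, img_name), "->", xml_path)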

XML Files

The XML files are structured as follows:

folder       | - - - <main action of the video (in VFP290K, all actions are "swoon")>
filename     | - - - <corresponding image name>
size         | - - - <size of the matched image>
object       | - - - name     | - - - <class label (0: non-fallen person, 1: fallen person)>
             | - - - bndbox   | - - - xmin
                              | - - - ymin
                              | - - - xmax
                              | - - - ymax
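The layout above follows the Pascal VOC annotation style, so a sketch along the following lines should be able to read one file using only Python's standard library. It assumes the coordinates are stored as plain numbers and that the class label sits in the name tag as described.

import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Parse one VFP290K-style XML file into (filename, [(label, box), ...])."""
    root = ET.parse(xml_path).getroot()
    filename = root.findtext("filename")
    objects = []
    for obj in root.iter("object"):
        label = obj.findtext("name")  # "0": non-fallen person, "1": fallen person
        box = obj.find("bndbox")
        coords = tuple(int(float(box.findtext(tag)))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((label, coords))
    return filename, objects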