Vi-Fi Multi-modal Dataset
A large-scale multi-modal dataset to facilitate research on vision-wireless systems.
About Vi-Fi dataset
The Vi-Fi dataset is a large-scale multi-modal dataset consisting of vision, wireless, and smartphone motion sensor data of multiple participants and passer-by pedestrians in both indoor and outdoor scenarios. The vision modality consists of RGB-D video from a mounted camera; the wireless and motion modalities comprise smartphone data from the participants, including WiFi Fine Time Measurements (FTM) and IMU readings.
The Vi-Fi dataset facilitates research on multi-modal systems, especially vision-wireless sensor data fusion, association, and localization.
(Data collection was conducted in accordance with IRB protocols, and subjects' faces have been blurred to protect their privacy.)
The Vi-Fi dataset has been used in several successful publications to tackle real-world challenges, including (1) multi-modal association of vision and phone data (Liu et al. [MobiSys'21 Demo] [IPSN'22]; Cao et al. [SECON'22] [SECON'22 demo]); (2) visual trajectory reconstruction from phone data (Cao et al. [ISACom'23 @MobiCom'23]); and (3) out-of-sight trajectory estimation (Zhang et al. [CVPR'24]). We welcome researchers to propose novel tasks of their own. What's your next task?
Download Vi-Fi Dataset
Download Synchronized Vi-Fi Dataset
Related Research & Applications
ViFiT
Tracking subjects in video is one of the most widely used functions in camera-based IoT applications such as security surveillance, smart-city traffic safety enhancement, and vehicle-to-pedestrian communication. In the computer vision domain, tracking is usually achieved by first detecting subjects and then associating the detected bounding boxes across video frames. Typically, frames are transmitted to a remote site for processing, incurring high latency and network costs. To address this, we propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements). It leverages a transformer's ability to model long-term time-series data. ViFiT is evaluated on the Vi-Fi dataset, a large-scale multimodal dataset collected in 5 diverse real-world scenes, including indoor and outdoor environments. Results demonstrate that ViFiT outperforms the state-of-the-art cross-modal reconstruction approach, the LSTM encoder-decoder X-Translator, and achieves a frame reduction rate as high as 97.76% with IMU and Wi-Fi data.
[ISACom'23 @MobiCom'23 paper], [arXiv], [code], [slides], [video]
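For illustration, here is a minimal PyTorch sketch of the kind of cross-modal reconstruction ViFiT performs: a window of phone readings (IMU and FTM) is encoded with a Transformer encoder and decoded into one bounding box per video frame. The module name, feature dimensions, and window length below are illustrative assumptions, not the released ViFiT implementation.

    # Minimal sketch (not the released ViFiT code): reconstruct per-frame bounding
    # boxes (cx, cy, w, h) from a window of phone readings (IMU + FTM).
    import torch
    import torch.nn as nn

    class PhoneToBoxTransformer(nn.Module):
        def __init__(self, in_dim=11, d_model=128, nhead=4, num_layers=4):
            # in_dim: illustrative 9-axis IMU + FTM range + FTM std-dev per time step
            super().__init__()
            self.embed = nn.Linear(in_dim, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.head = nn.Linear(d_model, 4)  # one bounding box per time step

        def forward(self, phone_seq):
            # phone_seq: (batch, T, in_dim) phone samples resampled to the frame rate
            h = self.encoder(self.embed(phone_seq))
            return self.head(h)                # (batch, T, 4) box trajectory

    # Toy usage: a 3-second window at 10 fps -> 30 frames of reconstructed boxes.
    model = PhoneToBoxTransformer()
    boxes = model(torch.randn(2, 30, 11))
    print(boxes.shape)  # torch.Size([2, 30, 4])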
ViTag
ViTag associates a sequence of vision-tracker-generated bounding boxes with Inertial Measurement Unit (IMU) data and Wi-Fi Fine Time Measurements (FTM) from smartphones. We formulate the problem as association by sequence-to-sequence (seq2seq) translation. In this two-step process, our system first performs cross-modal translation using a multimodal LSTM encoder-decoder network (X-Translator) that translates one modality to another, e.g., reconstructing IMU and FTM readings purely from camera bounding boxes. Second, an association module finds identity matches between the camera and phone domains, where the translated modality is matched against the observed data of the same modality. In contrast to existing works, our approach can associate identities in multi-person scenarios where all users may be performing the same activity. Extensive experiments in real-world indoor and outdoor environments demonstrate that online association on camera and phone data (IMU and FTM) achieves an average Identity Precision Accuracy (IDP) of 88.39% on 1-to-3-second windows, outperforming the state-of-the-art Vi-Fi (82.93%). A further study of modalities within the phone domain shows that FTM improves association performance by 12.56% on average. Finally, results from our sensitivity experiments demonstrate the robustness of ViTag under different noise and environment variations.
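As an illustration of the first (translation) step, the following is a minimal PyTorch sketch of an LSTM encoder-decoder in the spirit of X-Translator, mapping a bounding-box tracklet to a predicted IMU+FTM sequence. The class name, hidden size, and feature dimensions are assumptions for illustration, not the released ViTag code.

    # Minimal sketch of the cross-modal translation step described above: an LSTM
    # encoder-decoder that maps a bounding-box sequence to an IMU+FTM sequence.
    import torch
    import torch.nn as nn

    class BoxToPhoneTranslator(nn.Module):
        def __init__(self, box_dim=4, phone_dim=11, hidden=128):
            super().__init__()
            self.encoder = nn.LSTM(box_dim, hidden, batch_first=True)
            self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, phone_dim)

        def forward(self, boxes):
            # boxes: (batch, T, 4) tracker bounding boxes for one person
            enc_out, _ = self.encoder(boxes)
            dec_out, _ = self.decoder(enc_out)
            return self.head(dec_out)      # (batch, T, phone_dim) predicted IMU+FTM

    # Toy usage: translate a 3-second (30-frame) box tracklet into phone readings;
    # the association step then compares this prediction with each phone's data.
    model = BoxToPhoneTranslator()
    pred_phone = model(torch.randn(1, 30, 4))
    print(pred_phone.shape)  # torch.Size([1, 30, 11])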
Vi-Fi Association
We present a multi-modal system that leverages a user's smartphone WiFi Fine Timing Measurements (FTM) and inertial measurement unit (IMU) sensor data to associate the user detected in camera footage with their corresponding smartphone identifier (e.g., WiFi MAC address). Our approach uses a recurrent multi-modal deep neural network that exploits FTM and IMU measurements, along with the distance between the user and the camera (depth), to learn affinity matrices. As a baseline for comparison, we also present a traditional, non-deep-learning approach that uses bipartite graph matching. Using association accuracy as the key metric for evaluating the fidelity of Vi-Fi in associating human users in the camera feed with their phone IDs, we show that Vi-Fi achieves between 81% (real-time) and 90% (offline) association accuracy.
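To make the baseline idea concrete, below is a minimal sketch of bipartite matching between detected people and phones, using the disagreement between camera depth and FTM range over a window as the cost. The function name, window length, and cost definition are illustrative assumptions rather than the exact Vi-Fi formulation.

    # Illustrative sketch of a bipartite-matching baseline like the one described
    # above: pair each detected person with a phone by comparing the camera depth
    # of the person against the FTM-reported distance of the phone over a window.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_people_to_phones(depths, ftm_dists):
        """depths:    (num_people, T) camera-to-person distance per frame (meters)
           ftm_dists: (num_phones, T) FTM range per frame (meters)
           Returns (person_index, phone_index) pairs minimizing total disagreement."""
        cost = np.abs(depths[:, None, :] - ftm_dists[None, :, :]).mean(axis=-1)
        people, phones = linear_sum_assignment(cost)   # bipartite graph matching
        return list(zip(people, phones))

    # Toy usage: 3 people and 3 phones over a 30-frame (3 s at 10 fps) window.
    rng = np.random.default_rng(1)
    true = rng.uniform(2.0, 15.0, size=(3, 30))
    print(match_people_to_phones(true, true + rng.normal(0, 0.5, size=(3, 30))))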
More interesting work is yet to come!
Data collection
The data collection setup consists of a mounted Stereolabs ZED 2 (RGB-D) camera set to record video at 10 fps, which captures depth information from 0.2 m to 20 m away from the camera. The smartphones exchange FTM messages at 3 Hz with a Google Nest WiFi access point anchored beside the camera. Each smartphone also logs its IMU sensor data at 50 Hz. The smartphones and the camera are connected to the Internet and coarsely synchronized via network time synchronization.
We collect 90 sequences of multi-modal data through experiments. We divide this dataset into two categories: Category A consists of data from experiments conducted in a controlled indoor office environment with 5 legitimate users and no passersby; Category B consists of data from experiments conducted in 5 different real-world outdoor environments with 2-3 actual users, while the rest of the people in view are passersby (up to 12 in our dataset).
Each collected video sequence lasts 3 minutes and contains RGB-D frames (captured at 10 frames/sec), FTM, and 9-axis IMU sensor data (accelerometer, gyroscope, and magnetometer) from Google Pixel 3a smartphones. Each of the legitimate users (5 for Category A and 3 for Category B) carried a Pixel 3a smartphone. The users were not restricted in how they carried the phones (in hand or in a pocket). Our dataset is representative of a diverse set of scenarios, with participants holding smartphones that exchange FTM messages with the AP and record IMU measurements, as well as a varying number of passersby whose phones did not opt in to the FTM and IMU recording.
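Because the modalities arrive at different rates (10 fps video, 3 Hz FTM, 50 Hz IMU) and are only coarsely synchronized, a common first step when using the data is to resample the phone streams onto the camera frame timeline. Below is a minimal sketch of nearest-timestamp alignment; the array layouts are assumptions for illustration and do not reflect the dataset's on-disk format.

    # Minimal sketch: align phone samples (FTM at ~3 Hz, IMU at ~50 Hz) to camera
    # frames (10 fps) by nearest timestamp. Array layouts here are assumptions.
    import numpy as np

    def align_to_frames(frame_ts, sample_ts, samples):
        """frame_ts:  (F,)   camera frame timestamps in seconds
           sample_ts: (S,)   phone sample timestamps in seconds (sorted)
           samples:   (S, D) phone readings (e.g. IMU or FTM values)
           Returns (F, D): for each frame, the temporally nearest phone reading."""
        idx = np.searchsorted(sample_ts, frame_ts)
        idx = np.clip(idx, 1, len(sample_ts) - 1)
        prev_closer = (frame_ts - sample_ts[idx - 1]) < (sample_ts[idx] - frame_ts)
        idx = np.where(prev_closer, idx - 1, idx)
        return samples[idx]

    # Toy usage: 30 frames at 10 fps, IMU at 50 Hz over the same 3 seconds.
    frame_ts = np.arange(30) / 10.0
    imu_ts = np.arange(150) / 50.0
    imu = np.random.randn(150, 9)          # 9-axis IMU readings
    print(align_to_frames(frame_ts, imu_ts, imu).shape)  # (30, 9)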
Labeling and Annotation
To establish ground truth for evaluating cross-modal association accuracy, we manually annotate the participants in the video frames with bounding boxes. We use the tracking module from the ZED SDK API to annotate the pedestrians in the video scene and perform two passes over the data in total. During the first pass, the ZED tracker outputs a track ID and a bounding box for each tracked person at each frame, and the ground truth bounding box labels are manually matched with these track IDs every 10 frames. In the second pass, each frame's ground truth labels for the pedestrians are manually reviewed and corrected where necessary using the Visual Object Tagging Tool (VoTT).
Dataset Format
Vi-Fi Dataset/
    Category A/
        ...
    Category B/
        [sequence_name].tar.gz
            Depth/
            Dist/
            GND/
            RGB_anonymized/
            WiFi/
            IMU/
            time_offsets.txt
            valid_frame_range.txt
            zedBox_3D_[sequence_name].txt
            zedBox_3d_gnd_match.txt
            zedBox_[sequence_name].txt
            zedBox_gnd_match.txt
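As a starting point, the following sketch unpacks one sequence archive and lists its contents according to the layout above. The archive path is a placeholder, and the per-file formats are intentionally not parsed here.

    # Minimal sketch: unpack one sequence archive and list its contents according
    # to the layout above. The path below is a placeholder for an actual sequence;
    # file contents are not parsed since their exact formats are not shown here.
    import tarfile
    from pathlib import Path

    archive = Path("Vi-Fi Dataset/Category B/sequence_name.tar.gz")  # placeholder
    out_dir = archive.with_suffix("").with_suffix("")                # strip .tar.gz

    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)

    # Print each modality folder (Depth/, Dist/, GND/, RGB_anonymized/, WiFi/, IMU/)
    # and the per-sequence text files (time_offsets.txt, zedBox_*.txt, ...).
    for path in sorted(out_dir.rglob("*")):
        print(path.relative_to(out_dir))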
Acknowledgement
This research has been supported by the National Science Foundation (NSF) under Grant Nos. CNS-1901355, CNS-1910170, CNS-1901133, and CNS-2055520.
Thanks to Rashed Rahman, Shardul Avinash, Abbaas Alif, Bhagirath Tallapragada, and Kausik Amancherla for their help with data labeling.
Contacts
Feel free to contact us if you have any questions about our dataset!
Hansi Liu (Rutgers University) <hansiiii AT winlab DOT rutgers DOT edu>
Abrar Alali (Old Dominion University) <aalal003 AT odu DOT edu>
Bryan Bo Cao (Stony Brook University) <boccao AT cs DOT stonybrook DOT edu>
Nicholas Meegan (Rutgers University) <njm146 AT scarletmail DOT rutgers DOT edu>