v2e
Realistic dynamic vision sensor event camera data synthesis from frame-based video
Welcome
v2e is a Python software tool, associated with the CVPR 2021 EVENTVISION workshop paper V2E: From video frames to realistic DVS event camera streams, that synthesizes realistic dynamic vision sensor (DVS) event camera data from any real (or synthetic) conventional frame-based video, using an accurate DVS pixel model that includes DVS nonidealities. v2e optionally uses Super-SloMo synthetic slow motion to upsample standard frame-camera video.
v2e can be used to generate transfer-learning training and evaluation datasets for event cameras from conventional frame-based datasets, and it is currently the only tool that realistically models DVS behavior under low-illumination conditions.
Our paper (below) describes v2e, but its most important contribution is to debunk myths about event cameras that pervade the current computer vision literature.
(17 Aug 2021)
v2e on GitHub
v2e is hosted at https://github.com/SensorsINI/v2e
Vote for new features in this v2e feature poll.
You can try v2e on Google colab
Click the button below
v2e was developed by the Sensors Group of the Inst. of Neuroinformatics, Univ. of Zurich and ETH Zurich.
Information about other datasets and tools is on the Sensors Group webpage.
v2e was awarded a paper finalist honor from the 3rd International Workshop on Event-Based Vision (CVPR-W) https://tub-rip.github.io/eventvision2021/
Jury citation: "For providing an accurate simulation model of DVS pixels under low-light conditions and demonstrating how training using simulated low-light events can improve model performance. Low-light operation is often cited as a strength of event-based sensors and this work provides a valuable simulation and training tool to help the event-based sensing community deliver on this promise."
Credits
Publications using v2e should cite the following paper
Y. Hu, S. C. Liu, and T. Delbruck, “v2e: From video frames to realistic DVS event camera streams,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021 [Online]. Available: http://arxiv.org/abs/2006.07722.
Note that there are two arXiv versions: v1 is much longer and more detailed, while v2 corrects and clarifies some points for the CVPR-W reviewers.
@INPROCEEDINGS{Hu2021-v2e-cvpr-workshop-eventvision2021,
title = "v2e: From Video Frames to Realistic {DVS} Events",
booktitle = "2021 {IEEE/CVF} Conference on Computer Vision and Pattern
Recognition Workshops ({CVPRW})",
author = "Hu, Y and Liu, S C and Delbruck, T",
publisher = "IEEE",
year = 2021,
url = "http://arxiv.org/abs/2006.07722"
}
Creators
v2e was created by Tobi Delbruck, Yuhuang Hu and Zhe He
Contact Yuhuang Hu (yuhuang@ini.uzh.ch) or Tobi Delbruck (tobi@ini.uzh.ch)
Sponsor
This work was funded by University of Zurich, NCCR Robotics, and the Samsung Global Research Neuromorphic Processor Project.
Additional seminal background papers
The original DAVIS paper
Brandli, C., Berner, R., Yang, M., Liu, S.-C., and Delbruck, T. (2014). A 240x180 130 dB 3 us Latency Global Shutter Spatiotemporal Vision Sensor. IEEE Journal of Solid-State Circuits 49, 2333–2341. doi:10.1109/JSSC.2014.2342715.
The original DVS paper
Lichtsteiner, P., Posch, C., and Delbruck, T. (2008). A 128x128 120dB 15us Latency Asynchronous Temporal Contrast Vision Sensor. IEEE Journal of Solid-State Circuits 43, 566–576. doi:10.1109/JSSC.2007.914337.
Super-SloMo
"Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation" by Jiang H., Sun D., Jampani V., Yang M., Learned-Miller E. and Kautz J. [Project] [Paper]
Principle of conversion
The model is illustrated above. v2e optionally uses Super-SloMo, a deep-learning-based artificial slow-motion method, to generate intermediate frames from the original frames. From these intermediate frames, it synthesizes DVS events by realistically modeling the DVS pixel's brightness-change detection mechanism.
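To make the brightness-change mechanism concrete, here is a minimal sketch (not v2e's actual implementation) of idealized DVS event generation from a stack of grayscale frames: each pixel memorizes a log-intensity value and emits one ON or OFF event per full threshold crossing. The function name and threshold value are illustrative.

```python
import numpy as np

def dvs_events_from_frames(frames, times, theta=0.2):
    """Toy sketch of idealized DVS event generation.

    frames: (N, H, W) array of linear intensity values
    times:  (N,) timestamps in seconds
    theta:  nominal ON/OFF log-intensity contrast threshold
    Returns a list of (t, x, y, polarity) tuples. The real v2e model
    adds noise, finite bandwidth, and threshold mismatch on top of this.
    """
    eps = 1e-6
    mem = np.log(frames[0] + eps)            # per-pixel memorized log intensity
    events = []
    for frame, t in zip(frames[1:], times[1:]):
        log_i = np.log(frame + eps)
        diff = log_i - mem
        # each pixel emits one event per full threshold crossing
        n = np.floor(np.abs(diff) / theta).astype(int)
        ys, xs = np.nonzero(n)
        for y, x in zip(ys, xs):
            pol = 1 if diff[y, x] > 0 else -1
            events.extend((t, x, y, pol) for _ in range(n[y, x]))
            mem[y, x] += pol * n[y, x] * theta   # update memorized level
    return events
```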
See our arXiv paper "V2E: From video frames to realistic DVS event camera streams", https://arxiv.org/abs/2006.07722, for details.
v2e vs ESIM
v2e has a simpler front end than ESIM (v2e can only process movies, rather than offering a variety of simulators) but a more realistic DVS pixel model. v2e can process any movie, synthetic or real, so any simulation can be used as input; see the examples below. ESIM is written in C++; v2e is pure Python and is tested on Linux and Windows.
v2e models the following effects in real DVS pixels
Pixel-to-pixel Gaussian temporal contrast threshold variation
Finite, intensity-dependent photoreceptor bandwidth
Leak events (intensity-dependent background activity noise)
Intensity-dependent temporal noise
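Two of these effects can be sketched in a few lines of NumPy (illustrative constants and function names, not v2e's actual code): per-pixel Gaussian sampling of contrast thresholds, and a first-order low-pass filter whose cutoff scales with intensity to mimic the finite, intensity-dependent photoreceptor bandwidth.

```python
import numpy as np

def sample_thresholds(nominal, sigma, shape, rng):
    """Per-pixel Gaussian contrast thresholds (pixel-to-pixel mismatch),
    clipped so every threshold stays positive."""
    return np.clip(rng.normal(nominal, sigma, shape), 0.01, None)

def photoreceptor_step(state, log_intensity, intensity, dt, f3db_max=300.0):
    """One first-order low-pass update whose 3 dB cutoff scales with the
    normalized pixel intensity, so dark pixels respond more slowly.
    The constants here are illustrative, not v2e's actual parameters."""
    f3db = f3db_max * np.maximum(intensity, 0.01)
    alpha = 1.0 - np.exp(-2.0 * np.pi * f3db * dt)
    return state + alpha * (log_intensity - state)
```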
v2e outputs several formats
AVI DVS movies, rendered with either constant-duration, constant-count, or area-count exposure methods
AEDAT-2.0, txt, and hdf5 event files
Converting synthetic video
v2e can convert any video file that OpenCV can read.
The video below shows v2e conversion of a driving scene from "Playing for Benchmarks, ICCV'17".
This video shows conversion from video to synthetic slow motion to clean and noisy DVS events
A small moving white dot was created in Adobe Animate, then rendered to DVS events with clean and noisy DVS model parameters.
Example of converting a video using v2e's GUI gooey interface
Using v2e in colab notebook
Example conversions
Source videos for many of the examples below are available in this folder, embedded below.
DDD17+ DAVIS driving dataset
The v2e events (right), synthesized from the recorded DAVIS intensity frames (left), are very similar to the real DVS events (center).
Input APS Frames
Ground-truth DVS events
v2e events
Horse in motion
Muybridge's famous Horse in Motion example shows that SuperSloMo cannot properly interpolate the frames when a leg moves very quickly, but otherwise it does well. The samples below show v2e output using the default and --dvs_params=noisy options.
Original
SuperSloMo
default
--dvs_params=noisy
UCF-101
The original video (left), from the UCF-101 action recognition dataset, is quite severely undersampled, but SuperSloMo does a good job of interpolating the frames. The resulting v2e events realistically show smooth timing.
Input APS Frames (10x slower)
Converted w/o SloMo
v2e events
Human motion
The example below was recorded by a DAVIS246 with an APS frame rate of 20 Hz (50 ms frame interval). The left movie shows the real DVS events at an effective frame rate of 200 Hz (i.e., an accumulation time of 5 ms per frame). The middle movie shows events emulated from the base APS frames; they are very bursty because they are generated only every 50 ms. The right movie shows the output of v2e using a slowdown_factor of 20, with an output DVS frame rate of 200 Hz; it is much closer to the real DVS data. Data courtesy of Gemma Taverni and Enrico Calabrese, Sensors Group, INI, UZH-ETH Zurich.
Real DVS, at 200Hz frame rate
Without slowdown, generating events using original frames at 20Hz
v2e events using 20x slowdown, at 200Hz
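The fixed 5 ms accumulation used for the 200 Hz movies above can be sketched as a constant-duration event-to-frame renderer (an illustrative sketch, not v2e's internal renderer): events in each window are summed into a signed per-pixel count.

```python
import numpy as np

def accumulate_frame(t, x, y, p, t0, dt, shape):
    """Render DVS events with timestamps in [t0, t0 + dt) into a 2D frame
    by summing signed polarities per pixel (constant-duration exposure).

    t, x, y, p: event arrays; p > 0 means ON, otherwise OFF."""
    frame = np.zeros(shape, dtype=np.int32)
    sel = (t >= t0) & (t < t0 + dt)
    # np.add.at accumulates correctly even with repeated pixel indices
    np.add.at(frame, (y[sel], x[sel]), np.where(p[sel] > 0, 1, -1))
    return frame
```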
Modeling low light: Tennis backhand
The original video, shot at 1280x960 and 60 FPS, is undersampled. It was converted to DAVIS346 events with a slowdown factor of 10, giving an equivalent DVS event timestamp resolution of 1.67 ms. The conversion is good except for some artifacts on the undersampled tennis racket head during the last part of the forward swing. The lower-right conversion uses the v2e 'noisy' settings that mimic DVS under very low illumination, where there is much more noise and the pixel bandwidth is reduced.
gray scale frames
v2e events from "ideal" DVS pixel
interpolated with SuperSloMo
v2e events with DVS 'noisy' model that emulates DVS under low illumination
Effect of realistic pixel modeling
The pixel-to-pixel threshold mismatch, finite photoreceptor bandwidth, and leak events affect the output significantly. The example below compares real DVS data with v2e output from an ideal DVS pixel model and with the realistic pixel effects included. Including these effects makes the v2e output look much more like the real DVS data.
DAVIS frames input to v2e
Real DVS at 10ms integration time
v2e with ideal pixel
v2e with 15Hz cutoff, 5% threshold mismatch, and 0.1Hz leak event rate
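Leak background activity such as the 0.1 Hz rate above can be sketched as a per-pixel Poisson process (an illustrative simplification; v2e's full model additionally varies the leak rate from pixel to pixel):

```python
import numpy as np

def leak_event_counts(rate_hz, dt, shape, rng):
    """Sample per-pixel counts of leak (background ON) events over an
    interval dt, treating leak noise as a homogeneous Poisson process at
    the given rate. rate_hz=0.1 matches the example settings above."""
    return rng.poisson(rate_hz * dt, shape)
```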
Limitations
If the source video is badly exposed or has excessive motion blur or aliasing, then the results will not be very realistic. The example below shows a spinning black bar. During the first part of the motion, the DAVIS frames are sufficiently dense for SuperSloMo to interpolate them. During the last part of the motion, SuperSloMo breaks the bar up into separate objects.
Original DAVIS frames
SuperSloMo interpolated frames
v2e DVS
Real DVS