v2e

Realistic dynamic vision sensor event camera data synthesis from frame-based video

Welcome

v2e is a Python software tool, associated with the CVPR EVENTVISION2021 workshop paper "v2e: From video frames to realistic DVS event camera streams", that synthesizes realistic dynamic vision sensor (DVS) event camera data from any real or synthetic conventional frame-based video, using an accurate DVS pixel model that includes DVS non-idealities. v2e optionally uses Super-SloMo synthetic slow motion to upsample standard frame-camera video.

v2e can be used to generate training and evaluation datasets for event cameras from conventional frame-based datasets (e.g., for transfer learning), and is currently the only tool that realistically models DVS behavior under low illumination conditions.

Our paper (below) describes v2e, but its most important contribution is to debunk myths about event cameras that pervade the current computer vision literature.

(17 Aug 2021)

v2e on GitHub

v2e is hosted at https://github.com/SensorsINI/v2e

Vote for new features in this v2e feature poll.

You can try v2e in a Google Colab notebook.

v2e was developed by the Sensors Group of the Inst. of Neuroinformatics, Univ. of Zurich and ETH Zurich.

Information about other datasets and tools is on the Sensors Group webpage.

v2e was awarded a paper finalist honor from the 3rd International Workshop on Event-Based Vision (CVPR-W) https://tub-rip.github.io/eventvision2021/

Jury citation: "For providing an accurate simulation model of DVS pixels under low-light conditions and demonstrating how training using simulated low-light events can improve model performance. Low-light operation is often cited as a strength of event-based sensors and this work provides a valuable simulation and training tool to help the event-based sensing community deliver on this promise."

Credits

Publications using v2e should cite the following paper

Y. Hu, S. C. Liu, and T. Delbruck, “v2e: From video frames to realistic DVS event camera streams,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021 [Online]. Available: http://arxiv.org/abs/2006.07722.

Note that there are two arXiv versions: v1 is much longer and more detailed, while v2 corrects and clarifies some points for the CVPR-W reviewers.

@INPROCEEDINGS{Hu2021-v2e-cvpr-workshop-eventvision2021,
  title     = "v2e: From Video Frames to Realistic {DVS} Events",
  booktitle = "2021 {IEEE/CVF} Conference on Computer Vision and Pattern Recognition Workshops ({CVPRW})",
  author    = "Hu, Y and Liu, S C and Delbruck, T",
  publisher = "IEEE",
  year      = 2021,
  url       = "http://arxiv.org/abs/2006.07722"
}


Creators

v2e was created by Tobi Delbruck, Yuhuang Hu and Zhe He

Contact Yuhuang Hu (yuhuang@ini.uzh.ch) or Tobi Delbruck (tobi@ini.uzh.ch)

Sponsor

This work was funded by University of Zurich, NCCR Robotics, and the Samsung Global Research Neuromorphic Processor Project.

Additional seminal background papers

The original DAVIS paper

Brandli, C., Berner, R., Yang, M., Liu, S.-C., and Delbruck, T. (2014). A 240x180 130 dB 3 µs Latency Global Shutter Spatiotemporal Vision Sensor. IEEE Journal of Solid-State Circuits 49, 2333–2341. doi:10.1109/JSSC.2014.2342715.

The original DVS paper

Lichtsteiner, P., Posch, C., and Delbruck, T. (2008). A 128x128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor. IEEE Journal of Solid-State Circuits 43, 566–576. doi:10.1109/JSSC.2007.914337.

Super-SloMo

"Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation" by Jiang H., Sun D., Jampani V., Yang M., Learned-Miller E. and Kautz J. [Project] [Paper]

Principle of conversion

The model is illustrated above. v2e optionally uses Super-SloMo deep-learning-based artificial slow motion to generate intermediate frames from the original frames. From these intermediate frames, it synthesizes DVS events by realistically modeling the DVS pixel brightness-change detection mechanism.

See our arXiv paper "v2e: From video frames to realistic DVS event camera streams", https://arxiv.org/abs/2006.07722, for details.
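
As a rough illustration of that brightness-change detection principle (this is not the v2e source code; the function and threshold names are invented for this sketch), an ideal DVS pixel emits an ON or OFF event each time its log intensity moves by more than a contrast threshold from the value memorized at its last event:

import numpy as np

def ideal_dvs_event_counts(log_frames, theta_on=0.2, theta_off=0.2):
    """Sketch of the ideal DVS rule: one ON/OFF event per threshold crossing in log intensity.
    log_frames: iterable of 2-D arrays of log intensity (e.g., the interpolated frames).
    Yields, per frame, a signed map: +n means n ON events at that pixel, -n means n OFF events."""
    frames = iter(log_frames)
    memorized = next(frames).astype(np.float64)  # log intensity at each pixel's last event
    for frame in frames:
        diff = frame - memorized
        on = np.floor(np.maximum(diff, 0.0) / theta_on).astype(int)
        off = np.floor(np.maximum(-diff, 0.0) / theta_off).astype(int)
        memorized += on * theta_on - off * theta_off  # move memorized value by whole thresholds
        yield on - off

The same comparison is applied to every pair of consecutive interpolated frames, which is why finer Super-SloMo interpolation yields finer event timing.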

v2e vs ESIM

v2e has a simpler front end than ESIM (v2e can only process movies, rather than offering a variety of simulators) but a more realistic DVS pixel model. Since v2e can process any movie, synthetic or real, any simulation can be used as input; see the examples below. ESIM is written in C++; v2e is pure Python and is tested on Linux and Windows.

v2e models the following effects of real DVS pixels (a rough sketch of how they enter event generation follows the list):

  • Pixel to pixel Gaussian temporal contrast threshold variation

  • Finite, intensity-dependent photoreceptor bandwidth

  • Leak events (intensity-dependent background activity noise)

  • Intensity-dependent temporal noise
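
To make those bullet points concrete, here is a minimal sketch of how the non-idealities could be layered onto the ideal rule above. This is purely illustrative and not the actual v2e implementation; all names and default values are invented, and in v2e the bandwidth and noise rate also depend on pixel intensity, which is omitted here for brevity.

import numpy as np

rng = np.random.default_rng()

def noisy_dvs_step(new_log, state, dt,
                   theta=0.2,            # nominal contrast threshold (log units)
                   sigma_theta=0.03,     # Gaussian pixel-to-pixel threshold mismatch
                   cutoff_hz=15.0,       # finite photoreceptor bandwidth (fixed here)
                   leak_rate_hz=0.1,     # leak background events
                   noise_rate_hz=1.0):   # temporal (shot) noise events (fixed here)
    """One time step of a non-ideal DVS pixel array; returns ON minus OFF event counts."""
    if not state:  # initialize per-pixel state on the first call
        state['theta'] = theta + sigma_theta * rng.standard_normal(new_log.shape)
        state['lp'] = new_log.astype(np.float64).copy()   # low-passed photoreceptor output
        state['mem'] = new_log.astype(np.float64).copy()  # value memorized at last event
    # 1) first-order low-pass models the finite photoreceptor bandwidth
    alpha = min(dt * 2.0 * np.pi * cutoff_hz, 1.0)
    state['lp'] += alpha * (new_log - state['lp'])
    # 2) leak events: the memorized value slowly drifts, producing background ON events
    state['mem'] -= dt * leak_rate_hz * state['theta']
    # 3) threshold comparison with per-pixel mismatched thresholds
    diff = state['lp'] - state['mem']
    on = np.floor(np.maximum(diff, 0.0) / state['theta']).astype(int)
    off = np.floor(np.maximum(-diff, 0.0) / state['theta']).astype(int)
    state['mem'] += (on - off) * state['theta']
    # 4) temporal noise: random extra ON/OFF events at a fixed per-pixel rate
    noisy = rng.random(new_log.shape) < dt * noise_rate_hz
    sign = rng.random(new_log.shape) < 0.5
    return on - off + (noisy & sign).astype(int) - (noisy & ~sign).astype(int)

A driver loop would call this once per interpolated frame, with dt set to the interpolated-frame interval and state an initially empty dict.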

v2e outputs several formats (an example of loading the event output follows the list):

  • AVI DVS videos, rendered with constant-duration, constant-count, or area-count exposure methods.

  • AEDAT-2.0, txt, and hdf5 event files
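
For example, the text output can be loaded with NumPy, assuming the common one-event-per-line "timestamp x y polarity" layout (the column order and the file name below are assumptions; check the comment header of your own output file):

import numpy as np

# Load a v2e text event file (assumed layout: t [s], x, y, polarity per line).
events = np.loadtxt("v2e-dvs-events.txt", comments="#")
t = events[:, 0]
x, y, p = events[:, 1].astype(int), events[:, 2].astype(int), events[:, 3].astype(int)
print(f"{len(t)} events spanning {t[-1] - t[0]:.3f} s")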

Converting synthetic video

v2e can convert any video file that OpenCV can read.
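
A quick way to check that OpenCV can decode a candidate source video before converting it (the file name is just a placeholder):

import cv2

cap = cv2.VideoCapture("my_source_video.mp4")  # any container/codec OpenCV can decode
ok, frame = cap.read()
print("readable:", ok,
      "| fps:", cap.get(cv2.CAP_PROP_FPS),
      "| frames:", int(cap.get(cv2.CAP_PROP_FRAME_COUNT)))
cap.release()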

The video below shows v2e conversion of a driving scene from "Playing for Benchmarks, ICCV'17".

This video shows conversion from video to synthetic slow motion to clean and noisy DVS events

A small moving white dot was created in Adobe Animate, then rendered to DVS with clean and noisy DVS model parameters

Example of converting a video using v2e's Gooey GUI interface

Using v2e in a Colab notebook

Example conversions

Source videos for many of the examples below are available in this folder.

DDD17+ DAVIS driving dataset

The v2e events (right), synthesized from the recorded DAVIS intensity frames (left), are very similar to the real DVS events (center).

Input APS Frames

Ground-truth DVS events

v2e events

Horse in motion

The murderer Muybridge's famous Horse in Motion example shows that SuperSloMo cannot properly interpolate the frames when a leg moves very quickly, but it is otherwise fine. The clips below show v2e output using the default and --dvs_params=noisy settings; example command lines follow the clips.

  • horse_orig.avi: Original

  • horse_slomo.avi: SuperSloMo interpolated

  • horse-clean.avi: default settings

  • horse-noisy.avi: --dvs_params=noisy
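
For reference, the clean and noisy variants above could be produced with invocations along these lines. This is only a sketch: of the options shown, only --dvs_params=noisy appears on this page; the script name and the input/output arguments are assumptions, so check python v2e.py --help in the repository for the exact interface.

python v2e.py -i horse_orig.avi --output_folder=output/horse-clean
python v2e.py -i horse_orig.avi --output_folder=output/horse-noisy --dvs_params=noisy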

UCF-101

The original video (left) from the UCF-101 action recognition dataset is quite severely undersampled, but SuperSloMo does a good job of interpolating the frames. The resulting v2e events realistically show smooth timing.

Input APS Frames (10x slower)

Converted w/o SloMo

v2e events

Human motion

The example below was recorded by a DAVIS246 with APS frame rate of 20Hz (50ms frame interval). The left movie shows the real DVS events at effective frame rate of 200Hz (i.e. an accumulation time of 5ms per frame). The middle movie shows events emulated from the base APS frames. You can see they are very bursty because they are generated only every 50ms. The right movie shows the output of v2e using a slowdown_factor of 20, with an output DVS frame rate of 200Hz. It is much closer to the real DVS data. Data courtesy Gemma Taverni and Enrico Calabrese, Sensors Group, INI, UZH-ETH Zurich.

Real DVS, at 200Hz frame rate

Without slowdown, generating events using original frames at 20Hz

v2e events using 20x slowdown, at 200Hz

Modeling low light: Tennis backhand

Original video shot at 1280x960 60 FPS is undersampled. Converted to DAVIS346 events with slowdown factor of 10, to equivalent 1.66ms DVS event timestamp resolution. Conversion is good except for some artifacts on the undersampled tennis racket head during last part of forward swing. The lower right conversion uses the v2e 'noisy' conversion settings that mimic DVS under very low illumination, where there is a lot more noise and the pixel bandwidth is reduced.
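
The relation between the source frame rate, the slow-motion factor, and the resulting DVS timestamp resolution is just the interpolated-frame interval; as a worked check of the number quoted above (the function name is illustrative):

def dvs_timestamp_resolution_ms(source_fps, slowdown_factor):
    """Interval between interpolated frames = DVS event timestamp resolution."""
    return 1000.0 / (source_fps * slowdown_factor)

print(dvs_timestamp_resolution_ms(60, 10))  # tennis example: ~1.67 ms (quoted as 1.66 ms above)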

gray scale frames

v2e events from "ideal" DVS pixel

interpolated with SuperSloMo

v2e events with DVS 'noisy' model that emulates DVS under low illumination

Effect of realistic pixel modeling

The pixel-to-pixel threshold mismatch, finite photoreceptor bandwidth, and leak events affect the output significantly. The example below compares real DVS data with v2e output from an ideal DVS pixel and from a pixel with the realistic effects included. Including the effects makes the v2e output look much more like the real DVS.

DAVIS frames input to v2e

Real DVS at 10ms integration time

v2e with ideal pixel

v2e with 15Hz cutoff, 5% threshold mismatch, and 0.1Hz leak event rate
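
As a concrete reading of that last caption, the three numbers correspond to three of the non-idealities listed earlier. A purely illustrative configuration (the key names are invented for this sketch and are not v2e's actual option names; consult the v2e help/documentation for the real ones):

realistic_dvs_params = {
    "photoreceptor_cutoff_hz": 15.0,   # finite bandwidth: 15 Hz cutoff
    "threshold_mismatch_sigma": 0.05,  # 5% pixel-to-pixel contrast threshold variation
    "leak_event_rate_hz": 0.1,         # 0.1 Hz background leak events per pixel
}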

Limitations

If the source video is badly exposed or has excessive motion blur or aliasing, then the results will not be very realistic. The example below is of a spinning black bar. For the first part of the motion, the DAVIS frames are sufficiently dense for SuperSloMo to interpolate them. During the last part of the motion, SuperSloMo breaks the bar up into separate objects.

Original DAVIS frames

SuperSloMo interpolated frames

v2e DVS

Real DVS