Demos of the NeurIPS'24 submission "VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation"
Comparison with the baseline (both models trained on 8x NVIDIA A100 GPUs for 2 epochs)
VideoLLM-online+ (Improved baseline)
Training cost: 24 hours with 48.29T FLOPs
VideoLLM-MoD (Ours)
Training cost: 14 hours with 30.74T FLOPs
1.7x training speedup with similar or even better performance (a minimal sketch of the token-routing idea behind the savings follows below)
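
For context on where the FLOP savings come from: as the title says, VideoLLM-MoD applies mixture-of-depths routing to vision tokens, so that at a given layer only a small fraction of vision tokens receive the full attention/FFN computation while the rest skip the layer through the residual stream. The following is a minimal PyTorch sketch of that routing pattern under our own assumptions; the class name MoDVisionLayer, the capacity ratio, and the stand-in block are illustrative, not the submission's actual code.

    import torch
    import torch.nn as nn

    class MoDVisionLayer(nn.Module):
        # Hypothetical sketch of mixture-of-depths routing for vision tokens:
        # a tiny router keeps only a `capacity` fraction of tokens for the
        # expensive block; the rest bypass the layer via the residual stream.
        def __init__(self, block: nn.Module, hidden_dim: int, capacity: float = 0.125):
            super().__init__()
            self.block = block                      # residual branch: (B, k, D) -> (B, k, D)
            self.router = nn.Linear(hidden_dim, 1)  # per-token importance score
            self.capacity = capacity                # fraction of tokens processed

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, T, D = x.shape
            k = max(1, int(T * self.capacity))
            scores = self.router(x).squeeze(-1)          # (B, T)
            top = scores.topk(k, dim=-1).indices         # (B, k) routed tokens
            idx = top.unsqueeze(-1).expand(-1, -1, D)    # (B, k, D) gather index
            selected = x.gather(1, idx)                  # routed token states
            gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
            delta = self.block(selected) * gate          # gated residual update
            # Unselected tokens pass through unchanged; routed ones get the update.
            return x.scatter(1, idx, selected + delta)

    # Toy usage: 256 vision tokens per step, only ~12.5% get full computation.
    block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    layer = MoDVisionLayer(block, hidden_dim=64, capacity=0.125)
    out = layer(torch.randn(2, 256, 64))  # (2, 256, 64)

Processing only ~1/8 of the vision tokens at most layers is what would reduce per-step compute roughly in line with the 48.29T-to-30.74T FLOP drop reported above; the exact routing and schedule in the submission may differ.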
Case 1: Cutting the broccoli stem (Ours shows better temporal sensitivity)
VideoLLM-online+ (Improved baseline) misses the end of the "cutting" event
VideoLLM-MoD (Ours) precisely notices the end of the "cutting" event at 9.00s
Case 2: Repairing the bicycle (Ours shows better recognition of spatial details)
VideoLLM-online+ (Improved baseline) mistakes the action for "pick up the screwdriver" at 14.00s
VideoLLM-MoD (Ours) correctly captures the "fix the screw on the bicycle bell" detail at 16.50s
Case 3: Picking up the box (Ours shows less hallucination)
VideoLLM-online+ (Improved baseline) incorrectly describes the action as "pick up the tire" at 4.50s
VideoLLM-MoD (Ours) correctly recognizes the action as "pick up the box" at 4.50s, even though the tire is more visually prominent