Demos of the NeurIPS'24 submission "VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation"
Comparison with the baseline (both models trained on 8x NVIDIA A100 GPUs for 2 epochs)
VideoLLM-online+ (Improved baseline)
Training cost: 24 hours with 48.29T FLOPs
VideoLLM-MoD (Ours)
Training cost: 14 hours with 30.74T FLOPs
1.7x training speedup with similar or even better performance (a minimal sketch of the token-routing idea behind the savings follows below)
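
For context on where the FLOP savings come from: as the title says, VideoLLM-MoD applies mixture-of-depths routing to vision tokens, so that at a given layer only a small fraction of vision tokens receive the full attention/FFN computation while the rest skip the layer through the residual stream. The following is a minimal PyTorch sketch of that routing pattern under our own assumptions; the class name MoDVisionLayer, the capacity ratio, and the stand-in block are illustrative, not the submission's actual code.

    import torch
    import torch.nn as nn

    class MoDVisionLayer(nn.Module):
        # Hypothetical sketch of mixture-of-depths routing for vision tokens:
        # a tiny router keeps only a `capacity` fraction of tokens for the
        # expensive block; the rest bypass the layer via the residual stream.
        def __init__(self, block: nn.Module, hidden_dim: int, capacity: float = 0.125):
            super().__init__()
            self.block = block                      # residual branch: (B, k, D) -> (B, k, D)
            self.router = nn.Linear(hidden_dim, 1)  # per-token importance score
            self.capacity = capacity                # fraction of tokens processed

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, T, D = x.shape
            k = max(1, int(T * self.capacity))
            scores = self.router(x).squeeze(-1)          # (B, T)
            top = scores.topk(k, dim=-1).indices         # (B, k) routed tokens
            idx = top.unsqueeze(-1).expand(-1, -1, D)    # (B, k, D) gather index
            selected = x.gather(1, idx)                  # routed token states
            gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
            delta = self.block(selected) * gate          # gated residual update
            # Unselected tokens pass through unchanged; routed ones get the update.
            return x.scatter(1, idx, selected + delta)

    # Toy usage: 256 vision tokens per step, only ~12.5% get full computation.
    block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    layer = MoDVisionLayer(block, hidden_dim=64, capacity=0.125)
    out = layer(torch.randn(2, 256, 64))  # (2, 256, 64)

Processing only ~1/8 of the vision tokens at most layers is what would reduce per-step compute roughly in line with the 48.29T-to-30.74T FLOP drop reported above; the exact routing and schedule in the submission may differ.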
Case 1: Cutting the broccoli stem (Ours shows better temporal sensitivity)
VideoLLM-online+ (Improved baseline) misses the end of the "cutting" event
VideoLLM-MoD (Ours) precisely notices the end of the "cutting" event at 9.00s
Case 2: Repairing the bicycle (Ours shows better recognition of spatial details)
VideoLLM-online+ (Improved baseline) mistakes the action for "pick up the screwdriver" at 14.00s
VideoLLM-MoD (Ours) correctly captures the "fix the screw on the bicycle bell" detail at 16.50s
Case 3: Picking up the box (Ours shows less hallucination)
VideoLLM-online+ (Improved baseline) incorrectly describes the action as "pick up the tire" at 4.50s
VideoLLM-MoD (Ours) correctly recognizes the action as "pick up the box" at 4.50s, even though the tire is more visually prominent