Multi-Modal Attention based Transformer for Video Captioning
Multi-modal Hierarchical Attention-based Dense Video Captioning