Multimodal Learning from Videos