Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos