Course Description: This course covers visual recognition and language understanding, two core areas of artificial intelligence. Students will develop the skills needed to build machine learning and deep learning models that analyze images and text, enabling tasks such as generating image descriptions, visual question answering, and image retrieval. The course emphasizes advanced models, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer networks (e.g., BERT, GPT-3, ViTs), and how they are used to achieve state-of-the-art performance on vision and language tasks.
Learning Objectives:
1. Cultivate an intuitive understanding of the intricate relationship between language and vision.
2. Acquire foundational knowledge in representation learning for images and text.
3. Become familiar with cutting-edge models used in vision and language tasks.
4. Gain hands-on experience implementing these models through practical exercises.
Prerequisites: This course has no formal prerequisites, but basic familiarity with machine learning, deep learning, or computer vision is recommended. Students are expected to be proficient in linear algebra, differential calculus, and basic statistics and probability. Students should also have some proficiency in Python programming or be willing to learn it.