WE ARE HIRING! If you are interested in learning more about or joining our lab, please contact Prof. Kim at <kyunam.kim [at] skku [dot] edu>.
Human learning and comprehension occur through multiple channels, such as reading, listening, observing, and hands-on practice—so why not AI? In our lab, we aim to develop multi-modal AI that can process information from various channels simultaneously, with the goal of integrating these capabilities into mobile and robotic platforms.
The goal of Visual Question Answering (VQA) is to develop a neural-network-based model that answers user questions by integrating visual and linguistic information. To do so, the model must understand the given image, interpret the accompanying question, and generate an appropriate answer. VQA is particularly challenging because it requires multi-modal reasoning and the ability to coherently integrate visual and linguistic data.
Our current research focuses on optimizing VQA models to handle different types of questions. Some questions can be answered by extracting information directly from the image (e.g., "What color is the car?"), while others require knowledge that is not present in the visual content (e.g., "What brand is this car?") and must therefore be sourced from external information. We are working on systematically defining question types based on the knowledge required to answer them, and on optimizing VQA models to handle each type effectively.
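As a concrete illustration of the first case (questions answerable directly from the image), the sketch below runs an off-the-shelf VQA model, the publicly available BLIP checkpoint from Hugging Face transformers. It is only a minimal example of the task setup, not our lab's model, and the image file name is a placeholder.

```python
# Minimal VQA inference sketch using the publicly available BLIP checkpoint
# from Hugging Face transformers (an illustration of the task, not our model).
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("car.jpg").convert("RGB")   # placeholder image path
question = "What color is the car?"            # answerable from the image alone

inputs = processor(image, question, return_tensors="pt")
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Questions of the second kind (e.g., asking for a brand) would additionally require retrieving external knowledge before or during answer generation, which is part of what our question-type taxonomy is meant to capture.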
Related work:
Cho, J., and Kim, K., "MetricVQA: Integrating Visual Question Answering and Object Distance Estimation on Monocular Images," 2024 KIISS-APCIM Joint Conference, Seoul, South Korea, Oct. 31-Nov. 2, 2024.
AI models have become highly capable at recognition, inference, and decision-making, but their ability to act is limited when they must physically interact with an environment, which requires integration with a physical embodiment. This need has driven the emergence of Embodied AI, which addresses challenges such as real-time processing, safety, and generalization. In particular, Vision-Language-Action (VLA) models tackle these challenges by enabling vision-language multi-modal models to directly output action commands that are transmitted to the embedded controller of the physical system. These models have applications in robotics and human-robot interaction, paving the way for more intuitive and capable AI systems.
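To make the interface concrete, the sketch below shows a toy, entirely hypothetical VLA policy in PyTorch: an image and a tokenized instruction go in, and a continuous action vector (e.g., end-effector deltas plus a gripper command) comes out for the robot's controller. All module names, dimensions, and the action layout are illustrative assumptions, not a specific published architecture.

```python
# Schematic VLA policy interface (hypothetical sketch, not a published model):
# a vision-language backbone maps an image + instruction to a continuous
# action vector that a low-level robot controller can execute.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, vision_dim=512, text_dim=512, action_dim=7):
        super().__init__()
        # Stand-ins for pretrained vision and language encoders.
        self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)
        self.text_encoder = nn.EmbeddingBag(30522, text_dim)  # toy vocab size
        self.action_head = nn.Sequential(
            nn.Linear(vision_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, token_ids):
        v = self.vision_encoder(image.flatten(1))          # image features
        t = self.text_encoder(token_ids)                    # instruction features
        return self.action_head(torch.cat([v, t], dim=-1))  # action command

policy = ToyVLAPolicy()
image = torch.rand(1, 3, 224, 224)           # dummy camera frame
token_ids = torch.randint(0, 30522, (1, 8))  # dummy tokenized instruction
action = policy(image, token_ids)            # e.g., [dx, dy, dz, droll, dpitch, dyaw, gripper]
print(action.shape)                          # torch.Size([1, 7])
```

In a real system the action vector would be streamed to the robot's embedded controller at a fixed rate, with safety checks applied before execution.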
We believe that a robotic system equipped with embodied AI holds significant potential to advance both robotics and AI, while offering solutions to many of today’s complex real-world challenges. Our current research is focused on developing a VLA-driven medical robot capable of working collaboratively with medical staff to alleviate their workload, improve the quality of medical services, and enhance patient safety.