MICCAI 2025
Date: September 23rd, 2025
Daejeon Convention Center, Room DCC1-1F-112
8:00 AM - 5:30 PM
Generative AI and large-scale self-supervised foundation models are poised to have a profound impact on human decision making across occupations. Healthcare is one such area, where these models have the capacity to affect patients, clinicians, and other care providers. Medical imaging stands to benefit from these technologies in applications ranging from phantom models to precision AI for interventional imaging. Much of the latest work in AI centers on foundation models for language, vision, and other modalities. Building healthcare-specific foundation models is relevant to our community because experience has shown that standard deep learning models still require substantial conditioning before they become useful for medical imaging. Learning these techniques in a timely fashion will help MICCAI community members not only accelerate their adoption in our field but also advance the science of AI by articulating the requirements such systems must meet. This is an emerging topic, with few systematic courses organized at universities, and the tutorial will therefore be of broad benefit to MICCAI community members.
In this tutorial, we will explore the fundamentals of training, adapting, evaluating, and deploying foundation models and generative AI, with a focus on addressing current and future medical imaging needs. The tutorial will cover models used in natural language processing, computer vision, and multi-modal settings, as well as their applicability to medical imaging. We will examine models trained on non-healthcare domains and their adaptation to domain-specific problems in healthcare. In addition to the fundamentals of these models, we will provide practical demonstrations so that the audience can gain hands-on experience.
The morning session will cover foundation models, including Vision-Language Models (VLMs) and generative models for medical image analytics. We will cover the fundamental aspects of these models, both pre-training and adaptation, and discuss several practical medical use cases so that the audience can gain hands-on experience. In the afternoon, we will focus on recent advances in medical multimodal large language models (MLLMs), such as MedGemini, which integrate diverse data types (clinical reports, medical images, and graph representations) within a unified framework. Focusing on radiology applications, the audience will learn to leverage open-source MLLMs and to implement advanced techniques such as knowledge graph infusion through both lectures and practical demonstrations.
Morning Session (M)
M1. Introduction to Foundation Models -- 8 to 9:30 AM
a. Evolution of Machine learning models
b. Definition of Foundation models
c. What makes a model foundational?
d. Examples of foundational models
e. Frameworks: Self-supervised learning, contrastive learning, masked auto-encoders (see the contrastive-loss sketch after this list)
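As a taste of the contrastive-learning framework listed above, here is a minimal sketch of a symmetric InfoNCE-style loss in PyTorch; the batch size, embedding dimension, and temperature are illustrative assumptions rather than values from any particular model.

```python
# Minimal InfoNCE-style contrastive loss of the kind used in CLIP-style pre-training.
# Batch size, embedding dimension, and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # logits[i, j] compares image i with text j; matching pairs lie on the diagonal.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```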
M2. Vision-Language Models (VLMs) -- 9:30 to 10:00 AM
a. Zero-shot Contrastive Language-Image Pre-training (CLIP)
b. Zero-shot and few-shot inference (see the sketch after this list)
c. Vision-language models for medical imaging (e.g., embedding domain knowledge)
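To illustrate the zero-shot inference above, the sketch below scores an image against text prompts with a CLIP checkpoint from Hugging Face transformers; the checkpoint name, image path, and prompt wording are placeholder assumptions, not the exact materials of the session.

```python
# Zero-shot classification with CLIP via Hugging Face transformers.
# "openai/clip-vit-base-patch32", the image file, and the prompts are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # placeholder image path
prompts = ["a chest X-ray with pneumonia", "a normal chest X-ray"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print({p: float(s) for p, s in zip(prompts, probs[0])})
```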
Coffee break: 10 to 10:30 AM
M3. Fine-tuning foundation models -- 10:30 to 11:10 AM
a. Prompt learning
b. Adapters
c. Linear-probing baselines (see the sketch after this list)
d. Parameter-efficient fine-tuning (e.g., low-rank approximation)
e. Transductive inference for VLMs
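The linear-probing baseline mentioned above reduces to freezing the pre-trained encoder and training only a linear head. The sketch below uses a stand-in encoder and toy tensors; in practice a real backbone (e.g., CLIP or DINO image features) would take its place.

```python
# Linear probing: keep a pre-trained encoder frozen and fit only a linear classifier.
# The encoder here is a stand-in module; feature dimension and class count are assumptions.
import torch
import torch.nn as nn

feat_dim, num_classes = 512, 2
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, feat_dim))  # placeholder backbone
for p in encoder.parameters():
    p.requires_grad = False  # frozen: only the probe below is trained

probe = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

images = torch.randn(8, 3, 224, 224)          # toy batch
labels = torch.randint(0, num_classes, (8,))  # toy labels

with torch.no_grad():
    feats = encoder(images)  # frozen features
loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()
optimizer.step()
print(loss.item())
```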
M4. Foundational models for segmentation -- 11:10 to 11:50 AM
a. Types of foundation models: a data perspective
b. Classification by learning paradigm and usage
c. Zero-shot and adaptation-oriented volumetric foundation models (see the prompt-based sketch after this list)
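Prompt-driven segmentation foundation models typically take an image plus a spatial prompt such as a click. The 2D sketch below uses the open-source segment-anything package as one example; the checkpoint path, input image, and prompt point are placeholders, and volumetric models extend the same prompting idea to 3D.

```python
# Point-prompted segmentation with a SAM-style foundation model (segment-anything).
# The checkpoint file, input image, and click location are placeholder assumptions.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for an RGB slice or frame
predictor.set_image(image)

# A single foreground click at (x, y) = (256, 256), labeled 1 (foreground).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)
```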
M5. Techniques for Improving LLM performance -- 11:50 AM to 12:10 PM
a. LoRA tuning
b. Instruction tuning
c. Retrieval-augmented generation (see the sketch after this list)
d. Fact-checking
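Retrieval-augmented generation prepends retrieved evidence to the prompt before the LLM generates an answer. The sketch below uses simple TF-IDF retrieval over a toy corpus; the snippets and query are invented examples, and the final generation call is left to whichever LLM the session uses.

```python
# Minimal RAG loop: retrieve supporting snippets, then build the augmented prompt.
# The corpus and query are toy examples, not clinical reference material.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Pneumothorax appears as absent lung markings with a visible pleural line.",
    "Cardiomegaly is suggested when the cardiothoracic ratio exceeds 0.5.",
    "Pleural effusion typically blunts the costophrenic angle.",
]
query = "What radiographic signs suggest cardiomegaly?"

vectorizer = TfidfVectorizer().fit(corpus + [query])
scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(corpus))[0]
top_snippets = [corpus[i] for i in scores.argsort()[::-1][:2]]  # top-2 passages

prompt = "Context:\n" + "\n".join(top_snippets) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this augmented prompt would then be passed to the LLM of choice
```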
M6. Deployment considerations of generative AI -- 12:10 to 12:30 PM
a. Datasets for training foundational models
b. Evaluation of foundational models
c. Agentic deployments
Lunch break: 12:30 to 1:30 PM
Afternoon Session (A)
A1. Expanding Large Language Models to Vision: Multimodal LLMs (MLLMs) -- 1:30 to 2:00 PM
a. Understanding the impact and limitations of ChatGPT on healthcare data
b. Overview of the open-source multimodal models
A2. Multimodal LLMs for Radiology -- 2:00 to 2:45 PM
a. Overview of data construction for radiology MLLMs
b. Visual instruction tuning in radiology MLLMs (see the data-record sketch after this list)
c. Reasoning enhancement in radiology MLLMs
d. Applications of radiology MLLMs
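Visual instruction tuning starts from image-text pairs cast as instruction-style conversations. Below is a LLaVA-style record written as a plain Python dict; the field names follow a common open-source convention, and the image path and report text are invented placeholders rather than real patient data.

```python
# One visual instruction-tuning record in a LLaVA-style format (illustrative only).
import json

record = {
    "id": "cxr-000001",                # invented identifier
    "image": "images/cxr-000001.png",  # placeholder image path
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the findings in this chest X-ray."},
        {"from": "gpt", "value": "The cardiomediastinal silhouette is normal. "
                                 "No focal consolidation, effusion, or pneumothorax."},
    ],
}
print(json.dumps(record, indent=2))
```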
A3. Multimodal LLMs for Pathology -- 2:45 to 3:30 PM
a. Patch models and downstream tasks
b. WSI models and downstream tasks
c. Evaluation metrics
Coffee break: 3:30 to 4:00 PM
A4. Multimodal LLMs for Radiology Reports -- 4:00 to 4:30 PM
a. Overview of CXR interpretation and diagnosis
b. Overview of radiograph report generation
c. Current research on VLMs for report generation
A5. Report Evaluation and Error Detection -- 4:30 to 5:00 PM
a. Overview of report evaluation and error detection
b. Evaluation metrics (ROUGE, GREEN, RaTEScore)
c. Practical demonstration: using LLMs to evaluate generated reports and detect errors (see the ROUGE sketch after this list)
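As a small preview of the evaluation demonstration, the sketch below computes ROUGE for a generated report against a reference using the rouge-score package; GREEN and RaTEScore come with their own released toolkits, and the two reports here are toy examples.

```python
# Report evaluation with ROUGE via the rouge-score package (toy reports).
from rouge_score import rouge_scorer

reference = "No focal consolidation. Mild cardiomegaly. No pleural effusion."
candidate = "Mild cardiomegaly without consolidation or pleural effusion."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```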
A6. Error Detection and Fact-Checking with Knowledge Graphs -- 5:00 to 5:30 PM
a. Introduction to knowledge graphs
b. Integration of biomedical knowledge graphs with large language models
c. Current research on knowledge graphs used to enhance MLLMs' contextual reasoning (see the sketch after this list)
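Knowledge-graph infusion can be prototyped by looking up relations for entities mentioned in a draft report and adding them to the prompt used for fact-checking. The sketch below uses networkx with a toy graph; the triples and the simple substring-matching rule are illustrative assumptions.

```python
# Toy knowledge-graph infusion for fact-checking a draft report (illustrative only).
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("pleural effusion", "blunted costophrenic angle", relation="has_sign")
kg.add_edge("cardiomegaly", "cardiothoracic ratio > 0.5", relation="has_sign")

draft_report = "There is cardiomegaly with clear lung fields."

# Collect facts for any graph entity that appears in the report (naive matching).
facts = []
for entity in kg.nodes:
    if entity in draft_report.lower():
        for _, obj, data in kg.out_edges(entity, data=True):
            facts.append(f"{entity} {data['relation']} {obj}")

prompt = ("Known facts:\n" + "\n".join(facts) +
          "\n\nCheck the following report for factual errors:\n" + draft_report)
print(prompt)  # this prompt would be passed to the LLM/MLLM used for fact-checking
```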
Familiarity with machine learning principles at a graduate level is expected of the participants.
To become familiar with the latest foundation models, covering both pre-training and adaptation (e.g., parameter-efficient fine-tuning)
To learn how foundation models can support multimodal medical imaging research
To gain hands-on experience using these models for standard tasks in healthcare
Develop a comprehensive understanding of Foundation Models (FMs) and Multimodal Large Language Models (MLLMs) and their transformative impact on medical image analytics.
Acquire practical skills in deploying open-source FMs and MLLMs for medical image analytics.
Implement advanced techniques, such as leveraging eye gaze and knowledge graphs, to enhance model performance.
Ismail Ben Ayed
Full Professor at ÉTS Montréal
Tanveer Syeda-Mahmood
IBM Fellow, Chief Scientist
Razi Mahmood
PhD student at Rensselaer Polytechnic Institute
Julio Silva Rodriguez
Postdoc at ÉTS Montréal
Yunsoo Kim
PhD candidate at University College London
Weidi Xie
Associate Professor at Shanghai Jiao Tong University
Sophie Ostmeier
Postdoc at Stanford
Luping Zhou
Associate Professor at University of Sydney
Curtis Langlotz
Professor at Stanford
Chaoyi Wu
Assistant Professor at Shanghai Jiao Tong University
Honghan Wu
Professor at University of Glasgow