MICCAI 2025
Date: September 23rd, 2025
Daejeon Convention Center, Room DCC1-1F-112
8:00 AM - 5:30 PM
Generative AI and large-scale self-supervised foundation models are poised to have a profound impact on human decision making across occupations. Healthcare is one such area, where these models have the capacity to affect patients, clinicians, and other care providers. Medical imaging stands to benefit from these technologies in applications ranging from phantom models to precision AI for interventional imaging. Much of the latest work in AI centers on foundation models for language, vision, and other modalities. Building healthcare-specific foundation models is relevant to our community because experience has shown that standard deep learning models still require substantial conditioning before they become useful for medical imaging. Learning these techniques in a timely fashion will help MICCAI community members not only accelerate their adoption in our field but also advance the science of AI by articulating the requirements such systems must meet. This is an emerging topic, with few systematic courses organized at universities, and the tutorial will therefore be of broad benefit to MICCAI community members.
In this tutorial, we will explore the fundamentals of training, adapting, evaluating, and deploying foundation models and generative AI, with a focus on addressing current and future medical imaging needs. The tutorial will cover models used in natural language processing, computer vision, and multi-modal settings, as well as their applicability to medical imaging. We will examine models trained on non-healthcare domains and their adaptation to domain-specific problems in healthcare. In addition to the fundamentals of these models, we will provide practical demonstrations so that the audience can gain hands-on experience.
The morning session will cover foundation models, including Vision-Language Models (VLMs) and generative models for medical image analytics. We will cover the fundamental aspects of these models, both pre-training and adaptation, and discuss several practical medical use cases so that the audience can gain hands-on experience. In the afternoon, we will focus on recent advances in medical multimodal large language models (MLLMs), such as MedGemini, which integrate diverse data types (clinical reports, medical images, and graph representations) within a unified framework. Focusing on radiology applications, the audience will learn to leverage open-source MLLMs and to implement advanced techniques such as knowledge graph infusion through both lectures and practical demonstrations.
Morning Session (M)
M1. Introduction to Foundation Models -- 8 to 9:30 AM
a. Evolution of Machine learning models
b. Definition of Foundation models
c. What makes a model foundational?
d. Examples of foundational models
e. Frameworks: Self-supervised learning, contrastive learning, masked auto-encoders (see the contrastive-loss sketch after this list)
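As a taste of the contrastive-learning framework listed above, here is a minimal sketch of a symmetric InfoNCE-style loss in PyTorch; the batch size, embedding dimension, and temperature are illustrative assumptions rather than values from any particular model.

```python
# Minimal InfoNCE-style contrastive loss of the kind used in CLIP-style pre-training.
# Batch size, embedding dimension, and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # logits[i, j] compares image i with text j; matching pairs lie on the diagonal.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```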
M2. Vision-Language Models (VLMs) -- 9:30 to 10:00 AM
a. Zero-shot Contrastive Language-Image Pre-training (CLIP)
b. Zero-shot and few-shot inference (see the sketch after this list)
c. Vision-language models for medical imaging (e.g., embedding domain knowledge)
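To illustrate the zero-shot inference above, the sketch below scores an image against text prompts with a CLIP checkpoint from Hugging Face transformers; the checkpoint name, image path, and prompt wording are placeholder assumptions, not the exact materials of the session.

```python
# Zero-shot classification with CLIP via Hugging Face transformers.
# "openai/clip-vit-base-patch32", the image file, and the prompts are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # placeholder image path
prompts = ["a chest X-ray with pneumonia", "a normal chest X-ray"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print({p: float(s) for p, s in zip(prompts, probs[0])})
```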
Coffee break: 10 to 10:30 AM
M3. Fine-tuning foundation models -- 10:30 to 11:10 AM
a. Prompt learning
b. Adapters
c. Linear-probing baselines (see the sketch after this list)
d. Parameter-efficient fine-tuning (e.g., low-rank approximation)
e. Transductive inference for VLMs
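The linear-probing baseline mentioned above reduces to freezing the pre-trained encoder and training only a linear head. The sketch below uses a stand-in encoder and toy tensors; in practice a real backbone (e.g., CLIP or DINO image features) would take its place.

```python
# Linear probing: keep a pre-trained encoder frozen and fit only a linear classifier.
# The encoder here is a stand-in module; feature dimension and class count are assumptions.
import torch
import torch.nn as nn

feat_dim, num_classes = 512, 2
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, feat_dim))  # placeholder backbone
for p in encoder.parameters():
    p.requires_grad = False  # frozen: only the probe below is trained

probe = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

images = torch.randn(8, 3, 224, 224)          # toy batch
labels = torch.randint(0, num_classes, (8,))  # toy labels

with torch.no_grad():
    feats = encoder(images)  # frozen features
loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()
optimizer.step()
print(loss.item())
```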
M4. Foundational models for segmentation -- 11:10 to 11:50 AM
a. Types of foundation models: a data perspective
b. Classification by learning paradigm and usage
c. Zero-shot and adaptation-oriented volumetric foundation models (see the prompt-based sketch after this list)
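Prompt-driven segmentation foundation models typically take an image plus a spatial prompt such as a click. The 2D sketch below uses the open-source segment-anything package as one example; the checkpoint path, input image, and prompt point are placeholders, and volumetric models extend the same prompting idea to 3D.

```python
# Point-prompted segmentation with a SAM-style foundation model (segment-anything).
# The checkpoint file, input image, and click location are placeholder assumptions.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for an RGB slice or frame
predictor.set_image(image)

# A single foreground click at (x, y) = (256, 256), labeled 1 (foreground).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)
```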
M5. Techniques for Improving LLM performance -- 11:50 AM to 12:10 PM
a. LoRA tuning
b. Instruction tuning
c. Retrieval-augmented generation (see the sketch after this list)
d. Fact-checking
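Retrieval-augmented generation prepends retrieved evidence to the prompt before the LLM generates an answer. The sketch below uses simple TF-IDF retrieval over a toy corpus; the snippets and query are invented examples, and the final generation call is left to whichever LLM the session uses.

```python
# Minimal RAG loop: retrieve supporting snippets, then build the augmented prompt.
# The corpus and query are toy examples, not clinical reference material.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Pneumothorax appears as absent lung markings with a visible pleural line.",
    "Cardiomegaly is suggested when the cardiothoracic ratio exceeds 0.5.",
    "Pleural effusion typically blunts the costophrenic angle.",
]
query = "What radiographic signs suggest cardiomegaly?"

vectorizer = TfidfVectorizer().fit(corpus + [query])
scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(corpus))[0]
top_snippets = [corpus[i] for i in scores.argsort()[::-1][:2]]  # top-2 passages

prompt = "Context:\n" + "\n".join(top_snippets) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this augmented prompt would then be passed to the LLM of choice
```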
M6. Deployment considerations of generative AI -- 12:10 to 12:30 PM
a. Datasets for training foundational models
b. Evaluation of foundational models
c. Agentic deployments
Lunch break: 12:30 to 1:30 PM
Afternoon Session (A)
A1. Expanding Large Language Models to Vision: Multimodal LLMs (MLLMs) -- 1:30 to 2:00 PM
a. Understanding the impact and limitations of ChatGPT on healthcare data
b. Overview of the open-source multimodal models
A2. Multimodal LLMs for Radiology -- 2:00 to 2:45 PM
a. Overview of data construction for radiology MLLMs
b. Visual instruction tuning in radiology MLLMs (see the data-record sketch after this list)
c. Reasoning enhancement in radiology MLLMs
d. Applications of radiology MLLMs
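Visual instruction tuning starts from image-text pairs cast as instruction-style conversations. Below is a LLaVA-style record written as a plain Python dict; the field names follow a common open-source convention, and the image path and report text are invented placeholders rather than real patient data.

```python
# One visual instruction-tuning record in a LLaVA-style format (illustrative only).
import json

record = {
    "id": "cxr-000001",                # invented identifier
    "image": "images/cxr-000001.png",  # placeholder image path
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the findings in this chest X-ray."},
        {"from": "gpt", "value": "The cardiomediastinal silhouette is normal. "
                                 "No focal consolidation, effusion, or pneumothorax."},
    ],
}
print(json.dumps(record, indent=2))
```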
A3. Multimodal LLMs for Pathology -- 2:45 to 3:30 PM
a. Patch models and downstream tasks
b. WSI models and downstream tasks
c. Evaluation metrics
Coffee break: 3:30 to 4:00 PM
A4. Multimodal LLMs for Radiology Reports -- 4:00 to 4:30 PM
a. Overview of CXR interpretation and diagnosis
b. Overview of radiograph report generation
c. Current research on VLMs for report generation
A5. Report Evaluation and Error Detection -- 4:30 to 5:00 PM
a. Overview of report evaluation and error detection
b. Evaluation metrics (ROUGE, GREEN, RaTEScore)
c. Practical demonstration: using LLMs to evaluate generated reports and detect errors (see the ROUGE sketch after this list)
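As a small preview of the evaluation demonstration, the sketch below computes ROUGE for a generated report against a reference using the rouge-score package; GREEN and RaTEScore come with their own released toolkits, and the two reports here are toy examples.

```python
# Report evaluation with ROUGE via the rouge-score package (toy reports).
from rouge_score import rouge_scorer

reference = "No focal consolidation. Mild cardiomegaly. No pleural effusion."
candidate = "Mild cardiomegaly without consolidation or pleural effusion."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```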
A6. Error Detection and Fact-Checking with Knowledge Graphs -- 5:00 to 5:30 PM
a. Introduction to knowledge graphs
b. Integration of biomedical knowledge graphs with large language models
c. Current research on knowledge graphs used to enhance MLLMs' contextual reasoning (see the sketch after this list)
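Knowledge-graph infusion can be prototyped by looking up relations for entities mentioned in a draft report and adding them to the prompt used for fact-checking. The sketch below uses networkx with a toy graph; the triples and the simple substring-matching rule are illustrative assumptions.

```python
# Toy knowledge-graph infusion for fact-checking a draft report (illustrative only).
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("pleural effusion", "blunted costophrenic angle", relation="has_sign")
kg.add_edge("cardiomegaly", "cardiothoracic ratio > 0.5", relation="has_sign")

draft_report = "There is cardiomegaly with clear lung fields."

# Collect facts for any graph entity that appears in the report (naive matching).
facts = []
for entity in kg.nodes:
    if entity in draft_report.lower():
        for _, obj, data in kg.out_edges(entity, data=True):
            facts.append(f"{entity} {data['relation']} {obj}")

prompt = ("Known facts:\n" + "\n".join(facts) +
          "\n\nCheck the following report for factual errors:\n" + draft_report)
print(prompt)  # this prompt would be passed to the LLM/MLLM used for fact-checking
```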
Familiarity with machine learning principles at a graduate level is expected of the participants.
To become familiar with the latest foundation models, covering both pre-training and adaptation (e.g., parameter-efficient fine-tuning)
To learn how foundation models can support multimodal medical imaging research
To gain hands-on experience using these models for standard tasks in healthcare
Develop a comprehensive understanding of Foundation Models (FMs) and Multimodal Large Language Models (MLLMs) and their transformative impact on medical image analytics.
Acquire practical skills in deploying open-source FMs and MLLMs for medical image analytics.
Implement advanced techniques, such as leveraging eye gaze and knowledge graphs, to enhance model performance.
Ismail Ben Ayed
Full Professor at ÉTS Montréal
Tanveer Syeda-Mahmood
IBM Fellow, Chief Scientist
Razi Mahmood
PhD student at Rensselaer Polytechnic Institute
Julio Silva Rodriguez
Postdoc at ÉTS Montréal
Yunsoo Kim
PhD candidate at University College London
Weidi Xie
Associate Professor at Shanghai Jiao Tong University
Sophie Ostmeier
Postdoc at Stanford
Luping Zhou
Associate Professor at University of Sydney
Curtis Langlotz
Professor at Stanford
Chaoyi Wu
Assistant Professor at Shanghai Jiao Tong University
Honghan Wu
Professor at University of Glasgow