MICCAI 2026 Tutorial

From Foundational to Multimodal Models for Medical Imaging (FMLLM)


Date: October 4th or 8th, 2026

TBD
8:30 AM - 1:00 PM

Description

Generative AI and large-scale self-supervised foundation models are poised to have a profound impact on human decision-making across occupations. Healthcare is one area where such models can affect patients, clinicians, and other care providers. Medical imaging could benefit from these technologies in applications ranging from phantom models to precision AI for interventional imaging. Much of the latest work in AI centers on foundation models for language, vision, and other modalities. Building healthcare-specific foundation models is relevant to our community because experience has shown that standard deep learning models still need considerable adaptation before they are useful for medical imaging. Learning these techniques in a timely fashion will help MICCAI community members not only accelerate their adoption in our field but also advance the science of AI by articulating the requirements such systems must meet. This is an emerging topic for which few universities offer systematic courses, so the tutorial will benefit MICCAI community members.

In this tutorial, we will explore the fundamentals of training, adaptation, evaluation, and deployment of foundation models and generative AI, with a focus on addressing current and future medical imaging needs. The tutorial will cover models used in natural language processing and computer vision, as well as multimodal models, and their applicability to medical imaging. We will explore models trained on non-healthcare domains and their adaptation to domain-specific problems in healthcare. In addition to the fundamentals of these models, we will provide practical demonstrations so that the audience can gain hands-on experience.

The morning session will cover Foundational Models, including Vision-Language Models (VLMs) and Generative Models for medical image analytics. We will cover the fundamental aspects of these models, in both pre-training and adaptation, and discuss several practical medical use cases so that the audience can gain hands-on experience. In the afternoon, we will focus on recent advances in medical multimodal large language models (MLLMs), such as MedGemini, which integrate diverse data types (clinical reports, medical images, and graph representations) within a unified framework. Focusing on radiology applications, the audience will learn, through both lectures and practical demonstrations, to leverage open-source MLLMs and to implement advanced techniques such as knowledge graph infusion.

Schedule

1. Introduction to Foundation Models -- 8:30 to 9:00 AM

a. Foundation models - Evolution, definition

b. Transformer architecture

c. Vision transformers

2. Encoders -- 9:00 to 9:30 AM

a. Variational and masked auto-encoders

b. Image captioning models

c. Zero-shot Contrastive Language-Image Pre-training (CLIP)

d. Vision-language models for medical imaging (e.g., embedding domain knowledge)

3. Vision-language models for text generation -- 9:30 to 10:30 AM

a. LLaVA-style models

b. Generative reporting models for medical images

c. Evaluating VLM text generation for medical images

Coffee break: 10:30 to 11:00 AM

4. VLMs for image generation -- 11:00 to 11:30 AM

a. Diffusion models

b. Synthetic medical image generation

5. Improving performance of foundation models -- 11:30 AM to 12:00 PM

a. Parameter-efficient fine-tuning (PEFT)

b. Retrieval-augmented generation

c. Inference scaling

6. Foundational models for segmentation -- 12:00 to 12:30 PM

a. U-Net and its variants for medical image segmentation

b. SAM and MedSAM models

7. Advanced architectures -- 12:30 to 1:00 PM

a. Multimodal LLMs for pathology

b. Multimodal fusion models based on graphs

Familiarity with machine learning principles at a graduate level is expected of the participants.

Learning objectives

- Become familiar with the latest foundation models, including both pre-training and adaptation (e.g., parameter-efficient fine-tuning).
- Learn how foundation models can be relevant to multimodal medical imaging research.
- Gain hands-on experience in using the models for some standard tasks in healthcare.
- Develop a comprehensive understanding of Foundation Models (FMs) and Multimodal Large Language Models (MLLMs) and their transformative impact on medical image analytics.
- Acquire practical skills in deploying open-source FMs and MLLMs for medical image analytics.
- Implement advanced techniques, such as leveraging eye-gaze data and knowledge graphs, to enhance model performance.