Large Language Models for Healthcare Applications
Department of Digital Medical Technologies
Holon Institute of Technology
Course 43005, Spring 2025, Lecturer: Dr. Alexander (Sasha) Apartsin
The course "Large Language Models for Healthcare Applications" provides an in-depth exploration of Large Language Models (LLMs), a key category within Generative AI (GenAI), with a focus on Natural Language Processing (NLP) tasks relevant to healthcare contexts.
Through a balanced integration of theoretical knowledge and end-to-end implementation projects, students will develop practical skills in designing, building, and deploying NLP applications based on advanced LLM techniques, using cutting-edge software libraries and tools.
GenAI-Powered Learning for Sharper Clinical Reasoning
Mai Werthaim, Maya Kimhi
Abstract
The diagnostic process is inherently interactive, often driven by a clinician’s ability to ask the right questions in response to incomplete or ambiguous patient information. However, current diagnostic benchmarks for large language models (LLMs) typically rely on fully disclosed patient cases, overlooking the critical role of iterative inquiry and adaptive reasoning. In this work, we introduce Q4Dx, a novel benchmark designed to evaluate the capacity of LLMs to perform interactive diagnosis through patient interrogation. Our benchmark simulates a realistic clinical scenario in which only partial symptom descriptions are initially revealed, requiring the model to engage in strategic questioning to acquire the missing diagnostic cues.
To construct Q4Dx, we sample from a structured clinical database of disease records associated with symptom sets. For each sampled disease, we generate multiple natural-language case descriptions that reveal varying proportions of its associated symptoms, ranging from minimal to moderately detailed narratives. This variation allows us to control diagnostic difficulty and systematically evaluate model behaviour under different levels of information completeness.
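As an illustrative sketch only (the record structure, symptom list, and prompt wording below are assumptions, not the benchmark's actual format), the following snippet shows how the proportion of revealed symptoms can be controlled when building a case-description prompt for a sampled disease.

```python
# Minimal sketch: control how many symptoms of a sampled disease are revealed
# in a generated case description. Data and prompt wording are illustrative.
import random

def build_case_prompt(disease: str, symptoms: list[str], reveal_ratio: float, seed: int = 0) -> dict:
    """Pick a subset of symptoms and build an LLM prompt for a patient narrative."""
    rng = random.Random(seed)
    k = max(1, round(reveal_ratio * len(symptoms)))  # always reveal at least one cue
    revealed = rng.sample(symptoms, k)
    hidden = [s for s in symptoms if s not in revealed]
    prompt = (
        "Write a short first-person patient description that mentions ONLY these "
        f"symptoms: {', '.join(revealed)}. Do not name any diagnosis."
    )
    return {"disease": disease, "revealed": revealed, "hidden": hidden, "prompt": prompt}

# Example: a sparse (0.2) and a moderately detailed (0.6) case for the same disease.
record = {"disease": "pneumonia",
          "symptoms": ["fever", "productive cough", "pleuritic chest pain",
                       "shortness of breath", "fatigue"]}
for ratio in (0.2, 0.6):
    case = build_case_prompt(record["disease"], record["symptoms"], ratio)
    print(ratio, case["revealed"])
```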
Diagnosing Through the Noise: Understanding Patient Self-Descriptions
Liel Sheri, Eden Mama
Abstract
PatientSignal is a project that examines how well different NLP models can classify patient-reported symptoms when the input is messy, unclear, or written in everyday language rather than formal clinical terms. To simulate real-world communication, we took clean symptom descriptions and used the LLaMA 3.1 model to add realistic noise, such as hesitations, repeated phrases, and off-topic comments, mimicking how elderly patients or those under stress might describe their condition. We created three versions of the dataset: clean, medium noise, and heavy noise, and used them to train and evaluate several models. These included a basic Naive Bayes classifier using TF-IDF, as well as more advanced language models: BERT, ClinicalBERT (trained on real clinical notes), and FLAN-T5, a general-purpose text-to-text model. All models were tested on the same dataset across the three noise levels to see how accuracy changes as the input becomes more challenging to understand. The results showed that while all models perform well on clean text, FLAN-T5 is much more robust when the descriptions become noisy and confusing, making it more suitable for real-life medical scenarios. The full dataset, models, and training code are available on GitHub for others to explore.
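As a hedged illustration (toy examples rather than the released dataset), the snippet below sketches the TF-IDF + Naive Bayes baseline described above, trained on clean descriptions and evaluated on clean, medium-noise, and heavy-noise inputs; the texts and labels are invented placeholders.

```python
# Minimal sketch of the TF-IDF + Naive Bayes baseline across noise levels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Placeholder examples; the real data are symptom descriptions labelled by condition.
train_texts = ["sharp chest pain when breathing", "constant headache and nausea",
               "itchy rash on both arms", "burning feeling when urinating"]
train_labels = ["respiratory", "neurological", "dermatological", "urological"]

noise_levels = {
    "clean":  ["burning when urinating", "rash and itching on the arm"],
    "medium": ["well, uh, it burns a bit when I go to the bathroom",
               "this rash, it itches, you know"],
    "heavy":  ["so my neighbour said, anyway, uh, it kind of burns when, um, I pee I think",
               "I was gardening, lovely weather, and now my arm is all itchy and red somehow"],
}
noise_labels = ["urological", "dermatological"]  # same order at every noise level

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

for level, texts in noise_levels.items():
    print(level, accuracy_score(noise_labels, model.predict(texts)))
```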
Monitoring Medication Questions to Detect Emerging Safety Concerns
Dvora Goncharok, Arbel Shifman
Abstract
Online medical forums are a rich and underutilized source of insight into patient concerns, especially regarding medication use. Among the many questions users pose, some may signal confusion, misuse, or even early warning signs of a developing health crisis. Detecting these critical questions, i.e., those that may precede severe adverse events or life-threatening complications, is vital for timely intervention and improving patient safety.
This study introduces a novel annotated dataset of medication-related questions extracted from online forums. Each entry is manually labelled for criticality based on clinical risk factors. We benchmark the performance of six traditional machine learning classifiers using TF-IDF textual representations, alongside three state-of-the-art large language model (LLM)-based classification approaches that leverage deep contextual understanding.
Our results highlight the potential of classical and modern methods to support real-time triage and alert systems in digital health spaces. The curated dataset is made publicly available to encourage further research at the intersection of patient-generated data, natural language processing, and early warning systems for critical health events.
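For illustration only, the sketch below shows one plausible form of the LLM-based classification arm: zero-shot criticality labelling of a forum question with an instruction-tuned chat model. The model name, prompt wording, and two-label scheme are assumptions rather than the study's actual configuration.

```python
# Minimal sketch: zero-shot criticality labelling of a medication question.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_criticality(question: str) -> str:
    prompt = (
        "You triage medication-related questions from online forums.\n"
        "Label the question CRITICAL if it may precede a severe adverse event "
        "(e.g. overdose, dangerous interaction, severe reaction), otherwise NON-CRITICAL.\n"
        f"Question: {question}\n"
        "Answer with exactly one label."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify_criticality(
    "I accidentally took my blood thinner twice tonight and my gums won't stop bleeding. What should I do?"
))
```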
Detection of Adverse Drug Reactions in Clinical Sentences
Naveh Nissan, Nicole Poliak
Abstract
Identifying adverse drug reactions (ADRs) in clinical narratives is a critical task in medical natural language processing, supporting pharmacovigilance and patient safety efforts. This study focuses on sentence-level ADR classification and compares three methodological paradigms: traditional TF-IDF features with logistic regression, contextual embeddings using Sentence-BERT (SBERT), and generative large language models (LLMs) used in zero-shot and few-shot classification settings.
All methods are evaluated on annotated sentences from medical case reports using standard classification metrics, including accuracy, precision, recall, and F1-score. Beyond performance evaluation, the study explores trade-offs between predictive effectiveness and model complexity, including computational cost, data requirements, and ease of deployment in clinical pipelines. The comparative analysis aims to inform the selection of ADR classification approaches based on practical constraints and application needs in real-world healthcare settings.
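The snippet below is a minimal sketch of the middle paradigm, Sentence-BERT embeddings feeding a lightweight classifier; the checkpoint, toy sentences, and labels are illustrative assumptions, not the annotated case-report data used in the study.

```python
# Minimal sketch: SBERT sentence embeddings with a logistic-regression head.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

sentences = [
    "The patient developed severe hepatotoxicity after starting methotrexate.",  # ADR
    "Rash and pruritus appeared two days after amoxicillin was initiated.",       # ADR
    "The patient was discharged on aspirin 100 mg daily.",                        # no ADR
    "Blood pressure remained stable throughout the admission.",                   # no ADR
]
labels = [1, 1, 0, 0]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose SBERT checkpoint
X = encoder.encode(sentences)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
test = ["He reported dizziness and vomiting shortly after the first dose of lithium."]
print(clf.predict(encoder.encode(test)))  # expected: [1]
```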
Listening to Social Signals for Mental Health Insights
Dudi Saadia, Shahar Sadon, Shanel Asulin
Abstract
The rise of social media as a dominant form of online expression has created new opportunities for identifying indicators of mental health conditions such as depression and post-traumatic stress disorder (PTSD) through textual analysis. This study investigates the application of generative large language models (LLMs) for detecting signs of psychological distress in user-generated posts and comments. Leveraging the generative capabilities of modern LLMs, we explore zero-shot and few-shot classification settings to analyze linguistic patterns associated with mental health conditions, using publicly available annotated datasets.
We compare the performance of generative models against traditional discriminative baselines, focusing on classification accuracy, robustness to informal language, and adaptability to varied writing styles. Our findings demonstrate that generative LLMs offer a promising approach for scalable and nuanced mental health signal detection across diverse social media platforms.
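As an illustrative sketch (the checkpoint and prompt format are assumptions, and a single post is no substitute for evaluation on the annotated datasets), the snippet below shows the shape of a few-shot prompt for distress detection with an instruction-tuned seq2seq model.

```python
# Minimal sketch: few-shot prompting of an instruction-tuned seq2seq model.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

few_shot = (
    "Classify each post as DISTRESS or NO_DISTRESS.\n"
    "Post: I haven't slept in days and I can't stop crying. -> DISTRESS\n"
    "Post: Finally finished my exams, time to celebrate! -> NO_DISTRESS\n"
    "Post: Nothing feels worth it anymore, I just want to disappear. ->"
)
print(generator(few_shot, max_new_tokens=5)[0]["generated_text"])
```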
Protecting Privacy, Empowering Mental Health Research
Uriel Atzmon, Victoria Chuykina
Abstract
Clinical notes on mental health frequently contain extended forms of Protected Health Information (PHI), including sensitive details about patients' families, workplaces, and other identifiable elements. Adequate anonymization of such data is critical to enabling data sharing and model training while preserving patient privacy. In this study, we propose a method for anonymizing mental health-related clinical texts using generative large language models (LLMs). Our approach aims to remove PHI while retaining clinically relevant information necessary for diagnostic and treatment-related tasks. To train and evaluate the system, we generate a synthetic dataset in which mental health symptoms and diverse forms of PHI are systematically injected. The anonymization process is framed as a translation task in which generative LLMs produce de-identified versions of the notes. We assess the model's performance along two axes: (1) whether clinically relevant questions can still be accurately answered from the anonymized text, and (2) whether PHI can be successfully obfuscated to prevent its retrieval. Our results demonstrate the feasibility of generative models for effective and privacy-preserving transformation of sensitive clinical narratives.
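The sketch below illustrates the translation-style framing and the simpler of the two evaluation axes, checking that injected PHI strings are no longer retrievable verbatim from the rewritten note; the model name, prompt, and synthetic note are assumptions, and the QA-based utility check is omitted.

```python
# Minimal sketch: anonymization as a rewrite prompt, plus a verbatim-leakage check.
from openai import OpenAI

client = OpenAI()

note = ("Pt. Daniel Cohen, works at Intel Haifa, reports worsening insomnia and "
        "panic attacks since his divorce; wife Rina contacted the clinic.")
injected_phi = ["Daniel Cohen", "Intel Haifa", "Rina"]  # synthetic identifiers

rewrite_prompt = (
    "Rewrite the clinical note below, removing all names, employers, and other "
    "identifying details while keeping every symptom and clinically relevant fact.\n\n"
    + note
)
anon = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": rewrite_prompt}],
    temperature=0,
).choices[0].message.content

leaked = [phi for phi in injected_phi if phi.lower() in anon.lower()]
print(anon)
print("Leaked PHI:", leaked or "none")
```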
From Findings to Insight: Automated Radiology Impressions
Netanel Ohev Shalom, Yaniv Grosberg, Aviel Shmuel
Abstract
Radiology impressions provide essential diagnostic summaries from detailed report findings and are integral to clinical workflows. Automating impression generation holds promise for reducing radiologist workload and improving report consistency. In this study, we explore the use of medical language models to generate impressions from the findings sections of radiology reports. Using domain-adapted transformer architectures, we compare three learning paradigms: zero-shot prompting, few-shot prompting, and full fine-tuning.
The task is formulated as a controlled text generation problem, and performance is evaluated using standard text similarity metrics such as ROUGE and BERTScore, measuring alignment with reference impressions. This comparative framework provides insight into the trade-offs between deployment complexity, training requirements, and output fidelity across different generative strategies for clinical summarization.
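As a minimal sketch of the evaluation step (toy strings, and assuming the Hugging Face `evaluate` library rather than the authors' exact tooling), the snippet below scores a generated impression against a reference with ROUGE and BERTScore.

```python
# Minimal sketch: text-similarity scoring of a generated impression.
import evaluate

predictions = ["No acute cardiopulmonary abnormality."]
references  = ["No acute cardiopulmonary process."]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```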
Analysing ER Complaints with AI to Prioritize Critical Cases Faster
Nofar Kedmi, Diana Akoshvili
Abstract
Triage is a time-critical process that determines the urgency of patient care, often based on verbal descriptions provided by patients or clinical staff in real time. Frequently derived from speech and transcribed into free text, these descriptions form the basis for assessing whether immediate medical intervention is required. LangTriage is an NLP-based system designed to classify patient cases into urgency levels, such as urgent or non-urgent, based on natural language descriptions.
We generate synthetic case narratives to train and evaluate the system by translating structured clinical measurements (e.g., vitals, symptoms, demographics) into realistic free-text descriptions. These synthetic texts simulate real-world triage scenarios and enable the exploration of various large language model (LLM)-based classifiers. By integrating structured data translation, synthetic data generation, and urgency classification, LangTriage aims to support faster, more consistent, and scalable triage decision-making in high-demand healthcare environments.
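For illustration, the sketch below turns one hypothetical structured triage record into a prompt that asks an LLM to verbalize it as an informal free-text description; the field names, values, and prompt are invented, not the project's actual schema.

```python
# Minimal sketch: structured triage record -> prompt for a synthetic free-text narrative.
record = {
    "age": 67, "sex": "F",
    "heart_rate": 118, "systolic_bp": 88, "spo2": 91,
    "chief_complaint": "chest tightness and dizziness for 2 hours",
}

def to_narrative_prompt(r: dict) -> str:
    vitals = (f"heart rate {r['heart_rate']} bpm, blood pressure {r['systolic_bp']} systolic, "
              f"oxygen saturation {r['spo2']}%")
    return (
        "Write a short, informal triage-desk description of this patient, as a nurse "
        f"might say it aloud: {r['age']}-year-old {r['sex']} with "
        f"{r['chief_complaint']}; {vitals}. Do not state an urgency level."
    )

print(to_narrative_prompt(record))
# The resulting synthetic narrative, paired with an urgency label derived from the
# structured vitals, becomes one training example for the LLM-based classifier.
```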
Your AI Shield Against Health Misinformation
Sara Mangistu, Michelle Zalevsky
Abstract
The proliferation of medical misinformation on social media presents a growing threat to public health, particularly during global health emergencies. This study investigates the use of large language models (LLMs) to detect and classify medical misinformation in user-generated content. We frame the task as a claim verification problem, leveraging real-world datasets such as COVID19-Fake-News, PubHealth, and HealthLiesRate (HLR) to train and evaluate LLM-based classifiers. The models are tasked with determining the veracity of health-related claims, distinguishing between true and false information. By applying and adapting LLMs to this domain, we aim to enhance the scalability and accuracy of misinformation detection systems and contribute to safer online health discourse.
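As a hedged example (an off-the-shelf entailment model used as a weak zero-shot baseline, not the fine-tuned classifiers trained on the datasets above), the snippet below labels a single health claim with the Hugging Face zero-shot-classification pipeline.

```python
# Minimal sketch: zero-shot claim labelling as a baseline for misinformation detection.
from transformers import pipeline

verifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
claim = "Drinking bleach cures COVID-19."
result = verifier(claim, candidate_labels=["true health information", "false health information"])
print(result["labels"][0], round(result["scores"][0], 3))
```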
AI-Driven Sentiment Analysis to Understand How Patients Truly Feel About Their Medications
Nikol Jabotinski, Yuval Elisha
Abstract
PharmaFeel is a Natural Language Processing (NLP) project focused on analyzing unstructured patient reviews related to medications to extract sentiment based on patient experiences. Each review is classified as positive, neutral, or negative to capture public perception and satisfaction with specific treatments. The project compares traditional machine learning approaches with modern large language model (LLM)-based techniques to evaluate their effectiveness in handling informal, subjective, and often noisy patient-generated content. By leveraging diverse NLP strategies, PharmaFeel aims to support healthcare providers, researchers, and pharmaceutical stakeholders in understanding medication efficacy and side effects from the patient perspective.
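Purely as an illustration of the LLM-era end of the comparison (the general-domain checkpoint and its two-class output differ from the project's three-class setup), the snippet below applies an off-the-shelf transformer sentiment pipeline to an informal patient review.

```python
# Minimal sketch: off-the-shelf sentiment model on a noisy, informal patient review.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")  # assumed checkpoint
review = "been on this med 3 weeks, headaches gone but i feel like a zombie all day tbh"
print(sentiment(review))
```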
Tracking What Matters: Detecting Changes in Patient Narratives
Gabrielle Maor, Shay Sason
Abstract
Timely detection of significant changes in patient status is critical in home care settings, where early intervention can prevent deterioration and reduce hospital readmissions. This study presents an approach for identifying clinically meaningful status changes by combining caregiver-provided textual descriptions with structured vital sign measurements. We leverage large language models (LLMs) to analyze and interpret natural language reports, often informal and context-rich, in conjunction with physiological data to determine whether a significant change has occurred. The system is designed to support remote monitoring and assist clinical decision-making by flagging cases that require medical attention. By integrating unstructured and structured inputs, this LLM-based framework enhances the reliability and responsiveness of home care monitoring systems.
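The sketch below illustrates one way the two input types could be combined into a single LLM query that flags significant change; the field names, vitals, caregiver note, prompt, and model choice are all assumptions rather than the system's actual design.

```python
# Minimal sketch: caregiver free text plus structured vitals in one flagging query.
from openai import OpenAI

client = OpenAI()

caregiver_note = ("She barely touched breakfast, seems more confused than usual, "
                  "and napped most of the morning.")
vitals = {"baseline": {"hr": 72, "spo2": 97, "temp_c": 36.6},
          "today":    {"hr": 96, "spo2": 92, "temp_c": 38.1}}

prompt = (
    "You monitor home-care patients. Given the caregiver report and vital signs, "
    "answer SIGNIFICANT_CHANGE or NO_SIGNIFICANT_CHANGE and give a one-sentence reason.\n"
    f"Caregiver report: {caregiver_note}\n"
    f"Baseline vitals: {vitals['baseline']}\n"
    f"Today's vitals: {vitals['today']}"
)
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(reply.choices[0].message.content)
```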