We are extending the registration deadline to 7/31 to accommodate the great interest from the community, with 180+ registrants! See you all at TTIC.
Day 1
8:30 - 9:00 Walk in & Registration & Breakfast
9:00 Welcoming remarks
9:15 - 10:15 Keynote talk: Mohit Bansal
10:15 - 11:00 Invited talk: Ranjay Krishna
11:00 - 11:15 Break
11:15 - 12:00 Invited talk: Xiaolong Wang
12:00 - 12:45 Invited talk: Pulkit Agrawal
12:45 - 1:45 Lunch
1:45 - 2:45 Keynote talk: Johan Schalkwyk
2:45 - 3:30 Invited talk: Yu Cheng
3:30 - 3:45 Break
3:45 - 4:30 Invited talk: Ye Zhu
4:30 - 5:00 Lightning Talks
5:00 - 6:00 Poster Presentations/Social Hour
Day 2
8:30 - 9:00 Walk in & Registration & Breakfast
9:15 - 10:15 Keynote talk: Navdeep Jaitly
10:15 - 11:00 Invited talk: Saining Xie
11:00 - 11:15 Break
11:15 - 12:00 Invited talk: Yuxiong Wang
12:00 - 1:30 Lunch (posters from 12:30)
1:30 - 2:15 Invited talk: Yunzhu Li
2:15 - 3:00 Invited talk: Byung-Cheol Min
3:00 - 3:30 Break
3:30 - 4:15 Invited talk: Manling Li
4:15 - 5:00 Panel discussion: Navdeep Jaitly, Byung-Cheol Min, Greg Shakhnarovich (TTIC), Yunzhu Li, Manling Li
5:00 - Close
Abstract: In this talk, I will present our journey of building large-scale multimodal pretrained (generative) models across various modalities (text, images, videos, audio, layouts, etc.) and of enhancing their important aspects such as unification (for generalizability, shared knowledge, and efficiency), interpretable agent planning/programming (for controllability and faithfulness), and evaluation (of fine-grained skills, faithfulness, and social biases). We will start by discussing early cross-modal vision-and-language pretraining models (LXMERT). We will then look at early unified models (VL-T5) that combine several multimodal tasks (such as visual QA, referring expression comprehension, visual entailment, visual commonsense reasoning, captioning, and multimodal translation) by treating all tasks as text generation. We will next look at recent, progressively more unified models (with joint objectives and architecture, as well as newer unified modalities during encoding and decoding) such as textless video-audio transformers (TVLT), vision-text-layout transformers for universal document processing (UDOP), interactive, interleaved, composable any-to-any text-audio-image-video multimodal generation (CoDi, CoDi-2), and unified and efficient adaptation of any control to any image/video diffusion model (Ctrl-Adapter) or for generalizable video-language reasoning (CREMA). Second, we will discuss interpretable and controllable multimodal generation (to improve faithfulness) via LLM-agent-based planning and programming, such as layout-controllable image generation via visual programming (VPGen), consistent multi-scene video generation via LLM-guided planning (VideoDirectorGPT), open-domain, open-platform diagram generation (DiagrammerGPT), and LLM-based adaptive environment generation for training embodied agents (EnvGen). I will conclude with important faithfulness and bias evaluation aspects of multimodal generation models, based on fine-grained skill and social bias evaluation (DALL-Eval), interpretable and explainable visual programs (VPEval), as well as reliable fine-grained evaluation via Davidsonian semantics based scene graphs (DSG).
Bio:
Dr. Mohit Bansal is the John R. & Louise S. Parker Distinguished Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science department at UNC Chapel Hill. He received his PhD from UC Berkeley in 2013 and his BTech from IIT Kanpur in 2008. His research expertise is in natural language processing and multimodal machine learning, with a particular focus on multimodal generative models, grounded and embodied semantics, faithful language generation, and interpretable, efficient, and generalizable deep learning. He is a recipient of the Early Career Award for Scientists and Engineers (ECASE), the IIT Kanpur Young Alumnus Award, the DARPA Director's Fellowship, the NSF CAREER Award, the Google Focused Research Award, the Microsoft Investigator Fellowship, the Army Young Investigator Award (YIP), the DARPA Young Faculty Award (YFA), and outstanding paper awards at ACL, CVPR, EACL, COLING, and CoNLL. He has been a keynote speaker at the AACL 2023, CoNLL 2023, and INLG 2022 conferences. His service includes serving as EMNLP and CoNLL Program Co-Chair, on the ACL Executive Committee and the ACM Doctoral Dissertation Award Committee, as ACL Americas Sponsorship Co-Chair, and as Associate/Action Editor for the TACL, CL, IEEE/ACM TASLP, and CSL journals. Webpage: https://www.cs.unc.edu/~mbansal/
In this talk, we will explore the topic of large language models, specifically multimodal large language models, and how they can help to promote language inclusivity. As technology advances, it is important that we develop tools that are accessible to everyone, regardless of their native language. Multimodal large language models have the potential to bridge the gap between languages by modeling semantically across spoken, written, and image modalities. This talk will discuss the research efforts in developing these models, the challenges involved, and the potential they hold for the future.
Bio:
Johan Schalkwyk, a (former) Google Fellow, has been a leader in the speech industry for over 25 years. His passion is to make speech a usable interface that everyone in the world uses. He was instrumental in Google DeepMind’s Multimodal perception and Large Language Model efforts.
In 2008, Johan built the first search by voice experience in the world, Google Voice Search. He has led Google's speech team, bringing research innovations such as on-device and neural models to products from Google Assistant to YouTube for over 80 languages.
Most recently Johan has joined the exciting world of start-ups at Sesame AI, where he is working towards solving fluid conversations with large language models. When not building speech recognizers, Johan enjoys mountain biking around the world, cooking Vietnamese food, and baking desserts.
Multimodal models have changed the landscape of what is possible with machine learning models -- from better, more controllable generative models of images and videos described by text, to joint models of speech and text, to multimodal models that can power robotics. In this talk, we discuss some of the challenges in multimodal learning and show how to handle these problems. We show how we can model language using diffusion models over latent variables and show how high-resolution images can be modeled with a multi-resolution diffusion model. We also show how we can improve diversity in generations from diffusion models using autoregressive latent variables. In the domains of speech and text, we show how discretization is an effective tokenization technique. Finally, as a note of caution, we show that multimodal models, in spite of their power, still have blind spots that can reduce their effectiveness. We show that visual language models do not come close to human accuracy on visual understanding tasks used in intelligence tests, such as Raven's matrices, and explore some of the reasons why this is the case.
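As a concrete illustration of discretization-as-tokenization for continuous signals such as speech, here is a minimal sketch (not the models from the talk): feature frames are mapped to nearest-codebook ids, yielding discrete tokens a language model can consume. The random codebook stands in for a learned vector-quantization codebook.

```python
# A minimal sketch of discretization-as-tokenization: continuous "speech" frames
# are mapped to the nearest codebook entry, producing discrete token ids.
# The codebook here is random; in practice it would be learned (e.g., via VQ).
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))   # 16 codes, 4-d features
frames = rng.standard_normal((10, 4))     # 10 toy frames

# Nearest-neighbor assignment: each frame becomes one discrete token id.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)
print(tokens)                             # array of ids in [0, 16)
```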
Bio:
Navdeep Jaitly is a Research Scientist at Apple Machine Learning where he leads a team of researchers working on fundamental techniques for Machine Learning with an emphasis on speech and language. He got his PhD from the University of Toronto under the supervision of Geoffrey Hinton in the foundational days of Deep Learning. During a PhD internship at Google, he demonstrated how Deep Neural Networks could revolutionize speech recognition; this work was part of a 2012 paper that received the test-of-time best paper award from IEEE Signal Processing Magazine in 2022. After his PhD he joined Google Brain, working on sequence models and introducing methods such as Listen, Attend and Spell, Adversarial Autoencoders, and Pointer Networks. He has also held machine learning research positions at Nvidia, Google Brain Robotics, D. E. Shaw, and the National Labs.
This talk provides an overview of our recent work in multimodal learning. We start by exploring the visual shortcomings of multimodal large language models, followed by a discussion on how to enhance LLMs with better and more precise visual grounding. Our approach incorporates mechanisms such as visual self-supervised learning, human-like visual search and system II reasoning into multimodal LLMs. By integrating an informed visual search algorithm, we enable LLMs to identify relevant information within a multitude of stimuli and interact more effectively with real-world data. We also ground LLMs in real-life experiences using actionable environments like street view imagery, enriching their sensory grounding and resonating with urban life nuances. Finally, we'll talk about the opportunities and challenges of building Cambrian-1, our recent fully open project on multimodal foundation models. This line of research aims to empower LLMs to interact with and understand the sensory-rich world in a more realistic and meaningful way.
Bio:
Saining Xie is an Assistant Professor of Computer Science at the Courant Institute of Mathematical Sciences at New York University and is affiliated with NYU Center for Data Science. He is also a visiting faculty researcher at Google Research. Before joining NYU in 2023, he was a research scientist at FAIR, Meta. In 2018, he received his Ph.D. degree in computer science from the University of California San Diego. His research focuses on computer vision and machine learning, particularly scalable visual generation, understanding and representation learning. His work has been recognized with the Marr Prize honorable mention, CVPR best paper finalists and an Amazon research award.
Compositionality is a fundamental characteristic of both human vision and natural language. It allows us to recognize new scenes and understand new sentences as a composition of previously seen atoms (e.g., objects in images or words in a sentence). Although scholars have spent decades injecting compositional priors into machine learning models, these priors have fallen away with the recent rise of large-scale models trained on internet-scale data. In this talk, I will first formalize the notion of compositionality for vision and language by drawing on the cognitive science literature. With this formalization, we evaluate whether today's best models (including GPT-4V and Gemini) are compositional, uncovering that they perform close to random chance. Next, we will draw on additional priors from neuroscience and cognitive science experiments on human subjects to suggest architectural changes and training algorithms that encourage the emergence of compositionality. We will then utilize the same formalism to evaluate generative models, embodied AI, and tool usage, showcasing that they too are not compositional, and demonstrate mechanisms to improve them. Finally, we explore mechanisms to augment models with tools to improve their limitations.
Bio:
Ranjay Krishna is an Assistant Professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. His research lies at the intersection of computer vision and human-computer interaction. This research has received best paper awards, outstanding paper awards, and oral presentations at CVPR, ACL, CSCW, NeurIPS, UIST, and ECCV, and has been reported by Science, Forbes, the Wall Street Journal, and PBS NOVA. His research has been supported by Google, Amazon, Cisco, Toyota Research Institute, NSF, ONR, and Yahoo. He holds a bachelor's degree in Electrical & Computer Engineering and in Computer Science from Cornell University, a master's degree in Computer Science from Stanford University, and a Ph.D. in Computer Science from Stanford University.
In recent years, CLIP has become a cornerstone in multimodal intelligence, serving as a foundational model bridging different modalities. Various versions of CLIP are widely employed, especially as the vision encoder in many multimodal large language models. However, CLIP is not without its flaws. Recent studies have found that CLIP often encodes only very coarse-grained concepts from visual inputs, ignoring much useful information and thereby complicating downstream tasks.
In this talk, we present a path to address this drawback of CLIP. First, the information insufficiency can be linked to the feature suppression problem in contrastive learning. When feature suppression occurs, the model may capture only a limited portion of the information from the input data while overlooking other potentially valuable content. We introduce a multistage contrastive learning (MCL) framework that, unlike standard contrastive learning which captures a single biased feature distribution, progressively learns previously unlearned features at each stage. MCL ultimately learns more useful information across different attributes to better distinguish inputs.
By fine-tuning only the MLP layers of CLIP through MCL on fine-grained, carefully aligned image-text pair data, we obtain different MLP parameter sets at each stage. These parameter sets act as experts, which are used to initialize a Mixture of Experts (MoE) CLIP, which integrates learned information across different MCL stages. This results in a stronger MoE-CLIP with significantly improved performance.
We aim for this talk to highlight the deficiencies of CLIP as a multimodal foundation model and to provide valuable insights and ideas to the community on how to enhance it.
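The following is a minimal, hypothetical sketch of the composition step described above: per-stage MLP heads act as experts that a gate mixes on top of frozen CLIP features. The class names (MLPHead, MoEHead), dimensions, and gating scheme are illustrative assumptions, not the released MCL/MoE-CLIP implementation.

```python
# Sketch of the multistage-experts idea: each MCL stage yields its own MLP head;
# a learned gate mixes the heads (MoE-style) on top of frozen CLIP features.
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

class MoEHead(nn.Module):
    """Mixture over per-stage MLP heads, initialized from the MCL stages."""
    def __init__(self, stage_heads, dim=512):
        super().__init__()
        self.experts = nn.ModuleList(stage_heads)
        self.gate = nn.Linear(dim, len(stage_heads))
    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)           # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], 1)  # (B, E, D)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)     # (B, D)

# Each stage head would be fine-tuned with a contrastive loss on features the
# earlier stages failed to capture; here we only show the MoE composition.
heads = [MLPHead() for _ in range(3)]
moe = MoEHead(heads)
features = torch.randn(4, 512)   # stand-in for frozen CLIP features
print(moe(features).shape)       # torch.Size([4, 512])
```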
Bio:
Yu Cheng is an Associate Professor of Computer Science and Engineering at the Chinese University of Hong Kong. From 2018 to 2023, he was a Principal Researcher at Microsoft Research Redmond. Before that, he was a Research Staff Member at MIT-IBM Watson AI Lab. Dr. Cheng received his Ph.D. from Northwestern University in 2015 and his B.S. degree from Tsinghua University in 2010. His research covers deep learning in general, with specific interests in model compression & efficiency, deep generative models, and large multimodal/language models. From 2021 to 2023, he led several teams to productize these techniques for Microsoft-OpenAI core models (Copilot, DALL-E-2, ChatGPT, GPT-4).
Dr. Cheng serves as a Senior Area Chair for NeurIPS and ICML, and Area Chair for CVPR, ICLR, ACL, NAACL, and EMNLP. His papers have won the Outstanding Paper Award in NeurIPS 2023, the Best Student Paper Honorable Mention in WACV 2021, and the Best Paper Finalist in SDM 2015. He is an affiliate faculty at Tsinghua University, Shanghai Jiao Tong University, Fudan University, and University of Science and Technology of China.
As robots become more integrated into our daily lives, accurate and contextually relevant room segmentation is essential for enhancing robotic navigation. In this talk, I will introduce SeLRoS: Semantic Layering in Room Segmentation via Large Language Models (LLMs), an innovative approach that enriches traditional 2D map-based segmentation with semantic data. Conventional methods focus primarily on geometric aspects, often leading to inaccuracies due to furniture and obstacles. SeLRoS addresses these limitations by integrating semantic information, such as object identification and spatial relationships, using LLMs. This approach provides a new framework for interpreting and organizing complex information about each segmented area, significantly improving accuracy. I will discuss how SeLRoS enhances room segmentation and its effectiveness, demonstrated through application across 30 diverse 3D environments. This talk aims to redefine room segmentation by integrating semantic information, ultimately contributing to more efficient and intelligent robotic systems.
Bio:
Dr. Byung-Cheol Min is an Associate Professor and University Faculty Scholar in the Department of Computer and Information Technology at Purdue University. He is the inaugural director of the Applied AI Research Center and leads his research lab, SMART Lab (www.smart-laboratory.org). Dr. Min received a B.S. degree in Electronics Engineering and an M.S. in Electronics and Radio Engineering with a specialization in Automatic Control from Kyung Hee University in 2008 and 2010, respectively. He earned his Ph.D. in Computer and Information Technology with a specialization in Robotics from Purdue University in 2014. He also served as a postdoctoral fellow at the Robotics Institute of Carnegie Mellon University from 2014 to 2015. Dr. Min’s research interests lie at the intersection of human-robot interaction and multi-robot systems. He explores problems of planning and control, algorithms, and robot learning, applying them to field robotics and assistive technology. His work focuses on designing algorithms and systems to enable multiple robots to collaborate in a distributed manner and to function as part of multi-human-multi-robot teams. He also investigates how learning methods can enable robots to flexibly interact with humans in diverse situations. Dr. Min received the NSF CAREER Award in 2019 and has garnered numerous accolades from Purdue University, including the Purdue PPI Outstanding Faculty Award in Discovery in 2019, the Purdue CIT Outstanding Graduate Mentor Award in 2019, and the Purdue PPI Interdisciplinary Research Collaboration Award in 2021. He was named a Purdue University Faculty Scholar in 2021.
Foundation models, such as GPT-4 Vision, have marked significant achievements in the fields of natural language and vision, demonstrating exceptional abilities to adapt to new tasks and scenarios. However, physical interaction—such as cooking, cleaning, or caregiving—remains a frontier where foundation models and robotic systems have yet to achieve the desired level of adaptability and generalization. In this talk, I will discuss the opportunities for incorporating foundation models into classic robotic pipelines to endow robots with capabilities beyond those achievable with traditional robotic tools. The talk will focus on three key improvements in (1) task specification, (2) low-level, and (3) high-level scene modeling. The core idea behind this series of research is to introduce novel representations and integrate structural priors into robot learning systems, incorporating the commonsense knowledge learned from foundation models to achieve the best of both worlds. I will demonstrate how such integration allows robots to interpret instructions given in free-form natural language and enables category-level generalization for free. I will also show how the foundation models can be augmented with additional memory mechanisms, like an action-conditioned scene graph, for a wide range of real-world manipulation tasks involving rigid, articulated, and nested objects (e.g., Matryoshka dolls), and deformable objects. Towards the end of the talk, I will discuss the limitations of the current foundation models, challenges that still lie ahead, and potential avenues to address these challenges.
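As a toy illustration of the action-conditioned scene-graph memory mentioned above, the sketch below keeps objects as nodes and spatial relations as edges, and updates the relations when an action is applied. The object names and update rules are purely hypothetical, not the talk's actual representation.

```python
# Toy action-conditioned scene graph: nodes are objects, edges are spatial
# relations, and applying an action rewrites the affected relations.
scene = {"nodes": {"doll_outer", "doll_inner", "table"},
         "edges": {("doll_inner", "inside", "doll_outer"),
                   ("doll_outer", "on", "table")}}

def apply_action(graph, action, obj, target):
    """Update relations after an action; only a couple of toy rules are shown."""
    edges = set(graph["edges"])
    if action == "take_out":
        edges.discard((obj, "inside", target))
        edges.add((obj, "on", "table"))
    elif action == "place_on":
        edges.add((obj, "on", target))
    graph["edges"] = edges
    return graph

print(apply_action(scene, "take_out", "doll_inner", "doll_outer")["edges"])
```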
Bio:
Yunzhu Li is an Assistant Professor of Computer Science at Columbia University. Before joining Columbia, he was an Assistant Professor at UIUC CS and spent time as a Postdoc at Stanford, collaborating with Fei-Fei Li and Jiajun Wu. Yunzhu earned his PhD from MIT under the guidance of Antonio Torralba and Russ Tedrake. His work stands at the intersection of robotics, computer vision, and machine learning, with the goal of helping robots perceive and interact with the physical world as dexterously and effectively as humans do. Yunzhu’s work has been recognized through the Best Systems Paper Award and the Finalist for Best Paper Award at the Conference on Robot Learning (CoRL). Yunzhu is also the recipient of the Sony Faculty Innovation Award, the Adobe Research Fellowship, and was selected as the First Place Recipient of the Ernst A. Guillemin Master’s Thesis Award in Artificial Intelligence and Decision Making at MIT. His research has been published in top journals and conferences, including Nature, Science, NeurIPS, CVPR, and RSS, and featured by major media outlets, including CNN, BBC, The Wall Street Journal, Forbes, The Economist, and MIT Technology Review.
Having a humanoid robot operate like a human has been a long-standing goal in robotics. The humanoid robot provides a general-purpose platform for conducting the diverse tasks we do in our daily lives. In this talk, we study learning-based approaches for both the mobility and manipulation skills of the humanoid robot, with the goal of generalization to diverse tasks, objects, and scenes. I will discuss how to perform whole-body control in humanoids with rich, diverse, and expressive motions. I will also share some lessons we learned from developing three different teleoperation systems to operate the humanoid robots.
Bio:
Xiaolong Wang is an Assistant Professor in the ECE department at the University of California, San Diego. He received his Ph.D. in Robotics at Carnegie Mellon University. His postdoctoral training was at the University of California, Berkeley. His research focuses on the intersection between computer vision and robotics. His specific interest lies in learning 3D and dynamics representations from videos and physical robotic interaction data. These comprehensive representations are utilized to facilitate the learning of human-like robot skills, with the goal of generalizing the robot to interact effectively with a wide range of objects and environments in the real physical world. He is the recipient of the NSF CAREER Award, Intel Rising Star Faculty Award, and Research Awards from Sony, Amazon, Adobe, and Cisco.
Data fuels modern computer vision models. But the challenge of limited supervision from human-created annotations never ends. The traditional pre-training-and-fine-tuning paradigm becomes inadequate when developing models for fine-grained visual recognition and localization, generalist visual comprehension, and more. In this talk, I discuss our recent efforts towards bridging the gap between limited supervision and increasingly complex visual and multimodal tasks, based on a variety of foundation models including visual foundation models, large language models, large multimodal models, and generative models. We develop versatile strategies to adapt or transfer knowledge from these foundation models, minimizing dependency on expensive human supervision. We address several key questions that have recently arisen in computer vision: 1) Can we develop foundation models capable of tackling more complex tasks with reduced supervision? 2) Is there inherent synergy between models trained on different modalities, and can we further leverage such synergy to create a more powerful supermodel by composing heterogeneous foundation models? 3) How can we advance existing foundation models into the 3D world? Throughout the talk, I demonstrate the potential of scaling up in-the-wild visual and multimodal learning but with minimal human supervision.
Bio:
Yuxiong Wang is an Assistant Professor in the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign. He is also affiliated with the National Center for Supercomputing Applications (NCSA). He received a Ph.D. in robotics from Carnegie Mellon University. His research interests lie in computer vision, machine learning, and robotics, with a particular focus on meta-learning, open-world perception, multimodal learning, and generative modeling. He is a recipient of awards including the Amazon Faculty Research Award, the ECCV Best Paper Honorable Mention Award, and CVPR Best Paper Award finalist selections. He was selected to participate in the National Academy of Engineering's (NAE) Frontiers of Engineering symposium. For details: https://yxw.cs.illinois.edu/.
Multimodality generation tasks have attracted much research attention in recent years, empowered by state-of-the-art diffusion generative models. On the downstream side of text-to-image diffusion models, many existing works seek to control the generated output by fine-tuning the pre-trained models with auxiliary supervision or by learning extra neural networks to impose better interpretability. In this talk, I will introduce several of our recent works that tackle those applications in a learning-free paradigm, with a high-level design philosophy inspired by fundamental studies in mathematics and thermodynamics. This line of work showcases the great potential of leveraging theoretical knowledge to enable state-of-the-art multimodal generation, opening up the exploration of Math4ML and Physics4ML.
Bio:
Dr. Ye Zhu is currently a postdoctoral research associate in Computer Science at Princeton University. Her main research focuses on multimodality generative models, computer vision, and machine learning for astrophysics. She holds a Ph.D. in Computer Science from Illinois Tech; an M.S. and B.S. in Mechanical Engineering from Shanghai Jiao Tong University; and a Diplome d'Ingenieur from Ecole Polytechnique in France. She is a recipient of the NeurIPS scholar award and the ACM Women scholarship, a co-organizer of the Responsible Generative AI workshop at CVPR, and a regular reviewer for NeurIPS, ICLR, ICML, CVPR, ECCV, and ICCV.
While Large Language Models excel at language processing, Large Agent Models are designed to interact with the environment. This transition poses significant challenges in understanding low-level visual details and in long-horizon reasoning for effective goal interpretation and decision-making. Despite the impressive performance of LLMs/VLMs on various benchmarks, these models perceive images as bags of words (semantic concepts). In detail, they use semantic understanding as a shortcut but lack the ability to recognize geometric structures or solve spatial problems such as mazes.
To help models acquire knowledge of the physical world, we focus on two dimensions: (1) From high-level semantic to low-level geometric understanding: we introduce a low-level visual description language that serves as geometric tokens, allowing the abstraction of multimodal low-level geometric structures. (2) From fast thinking to slow thinking: we propose to quantify long-horizon reasoning by incorporating Markov Decision Process (MDP) based decision-making. The key difference between language models and agent models lies in their decision-making capabilities. This fundamental difference necessitates a shift in how we approach the development of large agent models, focusing on both geometric understanding and long-term planning to create more capable embodied AI agents.
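For readers unfamiliar with MDP-based decision-making, here is a textbook value-iteration sketch on a toy problem; it illustrates the generic formalism (states, actions, transitions, rewards), not the specific formulation used in the talk.

```python
# Standard value iteration over a tiny MDP: compute state values, then extract
# a greedy policy. P[s][a] -> list of (prob, next_state); R[s][a] -> reward.
def value_iteration(states, actions, P, R, gamma=0.9, iters=100):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in actions) for s in states}
    policy = {s: max(actions, key=lambda a: R[s][a] +
                     gamma * sum(p * V[s2] for p, s2 in P[s][a])) for s in states}
    return V, policy

# Toy 2-state maze-like example: "go" moves toward the goal, "stay" does not.
states, actions = ["start", "goal"], ["go", "stay"]
P = {"start": {"go": [(1.0, "goal")], "stay": [(1.0, "start")]},
     "goal": {"go": [(1.0, "goal")], "stay": [(1.0, "goal")]}}
R = {"start": {"go": 1.0, "stay": 0.0}, "goal": {"go": 0.0, "stay": 0.0}}
print(value_iteration(states, actions, P, R))
```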
Bio:
Manling Li is a postdoc at Stanford University and an incoming Assistant Professor at Northwestern University. She obtained her PhD in Computer Science at the University of Illinois Urbana-Champaign in 2023. She works at the intersection of language, vision, and robotics. Her work on multimodal knowledge extraction won the ACL'20 Best Demo Paper Award, and her work on scientific information extraction from COVID literature won the NAACL'21 Best Demo Paper Award. She was a recipient of the Microsoft Research PhD Fellowship in 2021, an EECS Rising Star in 2022, a DARPA Riser in 2022, etc. She served on the Organizing Committee of ACL 2025 and EMNLP 2024, and has delivered tutorials on multimodal knowledge at IJCAI'23, CVPR'23, NAACL'22, AAAI'21, ACL'21, etc. Additional information is available at https://limanling.github.io/.
Unlike natural language and image processing, where internet data is readily available for training foundation models, data for robot learning is not. I will discuss how simulators can be used to learn complex and generalizable sensorimotor skills in a manner that reduces human effort and is easily scaled to many tasks. I will elaborate using the following case studies:
(i) a dexterous manipulation system capable of re-orienting novel objects of complex shapes and peeling vegetables.
(ii) a quadruped robot capable of fast locomotion, manipulation, and whole-body control on diverse natural terrains.
(iii) a lifelong learning robotic agent that can request and learn new rigid-object manipulation skills in a few minutes.
Next, I will discuss some algorithmic ideas aimed at mitigating human effort in reward design, hyper-parameter tuning and enabling seamless combination of learning signals from demonstrations, rewards, and the agent's self-exploration. The resulting framework provides a way to collect high-quality data for multiple tasks that can be used to train a foundation model for robot intelligence.
Bio:
Pulkit Agrawal is an Associate Professor in the Department of Electrical Engineering and Computer Science at MIT, where he directs the Improbable AI Lab. He is interested in robotics and learning methods for control. Pulkit's work received the Best Paper Award at the Conference on Robot Learning 2021 and the Best Student Paper Award at the Conference on Computer Supported Collaborative Learning 2011. He is a recipient of the Sony Faculty Research Award, Salesforce Research Award, Amazon Research Award, and a Fulbright fellowship. Before joining MIT, Pulkit received his Ph.D. from UC Berkeley and a Bachelor's degree from IIT Kanpur, where he was awarded the Director's Gold Medal.
Yuyang Jiang (University of Chicago), Chacha Chen (University of Chicago), Dang Minh Nguyen, Benjamin M. Mervak, Chenhao Tan
Abstract:
GPT-4V's purported strong multimodal abilities raise interest in using it to automate radiology report writing, but thorough evaluations are lacking. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-RAY. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly in both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (ground-truth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which ground-truth conditions are present on the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given ground-truth conditions in report synthesis, its generated reports are less correct and less natural-sounding than those of a fine-tuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.
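A minimal sketch of the two-step decomposition is shown below, with a placeholder vlm callable standing in for GPT-4V; the condition list and prompts are illustrative, not the exact ones used in the paper.

```python
# Two-step evaluation harness: (1) image reasoning predicts condition labels,
# (2) report synthesis writes a report from ground-truth labels (no image).
CONDITIONS = ["Cardiomegaly", "Edema", "Pneumonia", "Pleural Effusion"]

def image_reasoning(vlm, image):
    """Step 1: ask which condition labels are present in the X-ray."""
    prompt = ("Which of the following findings are present? "
              + ", ".join(CONDITIONS) + ". Answer with a comma-separated list.")
    answer = vlm(prompt, image=image)
    return [c for c in CONDITIONS if c.lower() in answer.lower()]

def report_synthesis(vlm, gold_labels):
    """Step 2: write a report consistent with the given ground-truth labels."""
    prompt = ("Write a chest X-ray report consistent with these findings: "
              + (", ".join(gold_labels) or "no acute findings") + ".")
    return vlm(prompt, image=None)

# Stubbed usage; a real evaluation would score step 1 against the dataset's
# condition labels and step 2 with lexical / clinical-efficacy metrics.
fake_vlm = lambda prompt, image=None: "Edema and pleural effusion are noted."
print(image_reasoning(fake_vlm, image="xray.png"))
print(report_synthesis(fake_vlm, ["Edema"]))
```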
Abstract:
Large language models are moving into a multi-modal space covering text, images, audio, and other media, yet scientific datasets possess features that are not represented by this paradigm. In cosmology, for instance, galaxies are embedded within dark matter-dominated collapsed objects, termed halos. A galaxy can be described in many ways: a star formation history (SFH) characterizing the evolution of stellar content, magnitudes describing luminosity across different wavelengths (MAG), images, or the matter distribution of the hosting halo. Despite their differences, these descriptions represent the same cosmological object. We propose the Object Foundation Model (OFM) to learn disparate, yet synonymous representations of objects in scientific domains. OFM learns the general underlying representation of sparse and indeterminate data and facilitates robust predictions on diverse inputs. Unlike traditional deep learning with architecture-specific predictions, the OFM predictions are requested via a language-based key that varies with the user query. We find that the embedded space of a contrastively trained model joins the disparate embeddings into a more unified space, improving the interpolation across object representations. The final model can make predictions of a galaxy's stellar mass given its star formation history or magnitudes, with similar performance.
Abstract:
Sensing technology is widely used for comprehending the physical world, with numerous modalities explored in past decades. While there has been considerable work on multi-modality learning, it typically requires data from all modalities to be paired. How to leverage multi-modality data with only partial pairings remains an open problem.
To tackle this challenge, we introduce the BABEL framework, encompassing the neural network architecture, data preparation and processing, as well as the training strategies. BABEL serves as a scalable pre-trained multi-modal sensing neural network, currently aligning six sensing modalities, namely Wi-Fi, mmWave, IMU, LiDAR, video, and depth. To overcome the scarcity of completely paired data, the key idea of BABEL is to transform the N-modality alignment into a series of two-modality alignments by devising an expandable network architecture. This concept is realized via a series of novel techniques, including pre-trained modality towers that capitalize on available single-modal networks, and an adaptive training strategy that balances the contribution of the newly incorporated modality with the previously established modality alignment.
Evaluation demonstrates BABEL's outstanding performance on eight human activity recognition datasets compared to various baselines, e.g., the top multi-modal sensing framework, single-modal sensing networks, and multi-modal large language models. BABEL not only effectively fuses multiple available modalities (up to a 22% accuracy increase), but also enhances the performance of individual modalities (a 12% average accuracy improvement). Case studies also highlight exciting application scenarios empowered by BABEL, including cross-modality retrieval (i.e., sensing imaging) and bridging LLMs for sensing comprehension.
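The sketch below illustrates, under stated assumptions, how an N-modality alignment can be reduced to a series of two-modality contrastive alignments by adding one modality tower at a time against an already-anchored tower. The dimensions, InfoNCE loss, and training loop are illustrative, not the BABEL implementation.

```python
# Pairwise expansion: each new modality tower is aligned to an existing tower
# with a contrastive (InfoNCE) loss, using only pairwise data for that step.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

towers = {"imu": nn.Linear(64, 128)}   # stand-in for a pre-trained anchor tower

def add_modality(name, in_dim, paired_with, paired_batch, new_batch, steps=100):
    """Align a new modality tower to an existing one, keeping earlier towers fixed."""
    towers[name] = nn.Linear(in_dim, 128)
    opt = torch.optim.Adam(towers[name].parameters(), lr=1e-3)
    for _ in range(steps):
        with torch.no_grad():          # do not disturb previously learned alignments
            anchor = towers[paired_with](paired_batch)
        loss = info_nce(towers[name](new_batch), anchor)
        opt.zero_grad()
        loss.backward()
        opt.step()

# e.g., add a depth tower using IMU-depth pairs only (toy tensors)
add_modality("depth", in_dim=256, paired_with="imu",
             paired_batch=torch.randn(32, 64), new_batch=torch.randn(32, 256))
```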
Niksa Praljak (University of Chicago), Hugh Yeh, Miranda Moore, Michael Socolich, Andrew Ferguson, Rama Ranganathan
Abstract:
Evolution-based deep generative models represent a new strategy for capturing the fundamental rules underlying protein folding and function. To date, these models have been purely sequence-based, learning patterns of amino acid interactions through statistical analysis of databases of protein sequences. Here, we introduce a multimodal model called PenCL that uses a database of ~45M annotations to integrate two large language models – a protein sequence model and a scientific literature model – in a joint embedding space. Once trained, PenCL can guide an autoregressive diffusion process to generate novel candidate sequences from text prompts. We show that PenCL enables natural language text-prompted retrieval of protein functional properties and phylogenetic relationships that go beyond simple homology-based annotation approaches and, most importantly, generates artificial signaling proteins and enzymes that display natural-like functional properties, both in vitro and in vivo. Interestingly, we show that the information content of the text prompt can control the novelty and diversity of functional proteins. This work opens up a path for natural language prompt engineering as a general strategy for protein engineering and design.
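As a toy illustration of text-prompted retrieval in a joint sequence-text embedding space, the sketch below ranks protein sequences by cosine similarity to a prompt embedding. The stubbed encoders and dimensions are assumptions, not PenCL's actual models.

```python
# Retrieval in a shared embedding space: embed the text prompt and each protein
# sequence, then rank sequences by cosine similarity to the prompt.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(prompt, sequences, embed_text, embed_seq, top_k=3):
    """Rank protein sequences by similarity to a natural-language prompt."""
    query = embed_text(prompt)
    scored = [(cosine(query, embed_seq(s)), s) for s in sequences]
    return sorted(scored, reverse=True)[:top_k]

# Stubbed encoders map into the same (here 8-d) joint space.
rng = np.random.default_rng(0)
embed_text = lambda text: rng.standard_normal(8)
embed_seq = lambda seq: rng.standard_normal(8)
print(retrieve("a kinase involved in signaling", ["MKT...", "GSA...", "PLV..."],
               embed_text, embed_seq))
```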
Abstract:
The remarkable capabilities of Large Language Models (LLMs) in text generation have been widely recognized. However, their inefficiency in generating text at the token level leaves room for improvement, and adapting these models to new data remains a challenging task. To tackle these challenges, we introduce a novel approach to language modeling -- Chunk-Distilled Language Modeling (CD-LM). By integrating deep neural networks with a straightforward retrieval module, our method allows the generation of text chunks containing fine-grained information through multiple tokens at a single decoding step. Our retrieval framework enables flexible construction of model- or domain-specific datastores, either leveraging the internal knowledge of pre-trained or fine-tuned models, or incorporating expert insights from human-annotated corpora. This adaptability allows for enhanced control over the language model distribution without necessitating additional training. We present a formal formulation of our CD-LM framework, along with quantifiable performance metrics, demonstrating its efficacy in optimizing language model performance and efficiency across a diverse set of downstream tasks, including language modeling, text generation, and domain adaptation.
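A toy sketch of chunk-level decoding with a retrieval datastore, in the spirit of the summary above: when the recent context matches a stored prefix, a whole multi-token chunk is emitted in one step; otherwise decoding falls back to a single token. The datastore keying, matching rule, and dummy next-token function are illustrative assumptions.

```python
# Chunk-level decoding with a retrieval datastore (toy, string-token version).
from collections import defaultdict

datastore = defaultdict(list)
datastore[("large", "language")].append(("models", "are", "powerful"))  # prefix -> chunk

def decode(prompt_tokens, next_token_fn, max_len=12):
    out = list(prompt_tokens)
    while len(out) < max_len:
        key = tuple(out[-2:])              # match on the last two tokens
        chunks = datastore.get(key)
        if chunks:
            out.extend(chunks[0])          # emit a whole retrieved chunk at once
        else:
            out.append(next_token_fn(out)) # fall back to token-by-token decoding
    return out

# Dummy base model: always predicts "the" (stands in for an LM's next-token step).
print(decode(["large", "language"], next_token_fn=lambda ctx: "the"))
```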
Pushkar Shukla (TTIC), Aditya Chinchure, Gaurav Bhatt, Kiri Salij, Leonid Sigal, Kartik Hosanagar, Matthew A. Turk
Abstract:
Text-to-Image (TTI) generative models have shown great progress in the past few years in terms of their ability to generate complex and high-quality imagery. At the same time, these models have been shown to suffer from harmful biases, including exaggerated societal biases (e.g., gender, ethnicity), as well as incidental correlations that limit such models' ability to generate more diverse imagery. In this paper, we propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt, using counterfactual reasoning. Unlike other works that evaluate generated images on a predefined set of bias axes, our approach automatically identifies potential biases that might be relevant to the given prompt and measures those biases. In addition, our paper extends quantitative scores with post-hoc explanations in terms of semantic concepts in the images generated. We show that our method is uniquely capable of explaining complex multi-dimensional biases through semantic concepts, as well as the intersectionality between different biases, for any given prompt. We perform extensive user studies to illustrate that the results of our method and analysis are consistent with human judgements.
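A hedged sketch of the counterfactual evaluation loop described above: generate images for counterfactual variants of a prompt, detect semantic concepts in each, and compare the resulting distributions. The generate_images and detect_concepts callables are placeholders for a TTI model and a concept detector; the prompt variants shown are illustrative.

```python
# Collect one concept distribution per counterfactual prompt variant; a bias
# score would then compare these distributions with a divergence measure.
from collections import Counter

def bias_distributions(variants, generate_images, detect_concepts, n=8):
    """One Counter of detected concepts per counterfactual prompt variant."""
    out = {}
    for prompt in variants:
        concepts = Counter()
        for image in generate_images(prompt, n=n):
            concepts.update(detect_concepts(image))
        out[prompt] = concepts
    return out

# Stubbed usage; a real setup would plug in a TTI model and a concept detector.
dists = bias_distributions(
    ["a photo of a doctor", "a photo of a female doctor", "a photo of a male doctor"],
    generate_images=lambda prompt, n: [prompt] * n,
    detect_concepts=lambda image: ["stethoscope"])
print(dists)
```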
Linzhan Mou, Jun-Kun Chen (UIUC), Yu-Xiong Wang
Abstract:
Recent advances in multimodal intelligence have enabled groundbreaking applications across various domains. Inspired by these developments, we propose a novel approach for editing 4D (dynamic 3D) environments using natural language instructions. By treating 4D scenes as pseudo-3D, we apply video and static 3D editing techniques to address temporal and spatial consistency challenges. Our method enhances multimodal text-to-image diffusion models like Stable Diffusion and Instruct-Pix2Pix, with an anchor-aware attention module for batch processing and consistent frame-to-frame edits. We incorporate optical flow-guided appearance propagation within a sliding window framework for precise, coherent modifications, capturing motion smoothly. Additionally, a depth-based projection system and a correspondence-aware attention mechanism manage pseudo-3D scene data, maintaining spatial integrity and geometric consistency. An iterative editing process ensures each modification converges towards a cohesive and polished final result. By developing this dynamic 4D scene editing tool, our research aims to set a new standard in multimodal AI. This tool is applicable to both monocular and complex multi-camera scenarios, advancing the capabilities of natural language processing, computer vision, and robotics. Our work bridges the gap between user-friendly interaction and complex scene editing, paving the way for more realistic and interactive virtual environments. Code and more results are available at https://immortalco.github.io/Instruct-4D-to-4D/.
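A schematic of the sliding-window propagation idea, with edit and warp_with_flow as placeholders for a diffusion-based editor and an optical-flow warp; this only illustrates the control flow under those assumptions, not the actual Instruct-4D-to-4D method.

```python
# Sliding-window editing loop: edit each window's anchor frame, then propagate
# the edit to the remaining frames in the window via a (placeholder) flow warp.
def edit_sequence(frames, edit, warp_with_flow, window=4):
    edited = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        anchor = edit(chunk[0])                 # edit the window's anchor frame
        edited.append(anchor)
        prev = anchor
        for frame in chunk[1:]:                 # propagate the edit within the window
            prev = warp_with_flow(prev, source=frame)
            edited.append(prev)
    return edited

# Stubbed usage:
frames = [f"frame_{i}" for i in range(8)]
print(edit_sequence(frames, edit=lambda f: f + "_edited",
                    warp_with_flow=lambda e, source: source + "_propagated"))
```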