Papers

Poster Session #1:

Dissenting Explanations: Leveraging Disagreement to Reduce Model Overreliance

Authors: Omer Reingold, Judy Hanwen Shen, Aditi Talati 

Abstract: While explainability is a desirable characteristic of increasingly complex black-box models, modern explanation methods have been shown to be inconsistent and contradictory. The semantics of explanations is not always fully understood – to what extent do explanations “explain” a decision and to what extent do they merely advocate for a decision? Can we help humans gain insights from explanations accompanying correct predictions and not over-rely on incorrect predictions advocated for by explanations? With this perspective in mind, we introduce the notion of dissenting explanations: conflicting predictions with accompanying explanations. We first explore the advantage of dissenting explanations in the setting of model multiplicity, where multiple models with similar performance may have different predictions. In such cases, dissenting explanations can be provided by invoking the explanations of disagreeing models. Through a pilot study, we demonstrate that dissenting explanations reduce overreliance on model predictions, without reducing overall accuracy. Motivated by the utility of dissenting explanations, we present both global and local methods for their generation.
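
As a concrete illustration of the model-multiplicity idea (our own sketch, not the authors’ implementation), the snippet below trains a small pool of similarly accurate models, finds one that disagrees with a primary model on a given instance, and returns its prediction together with a stand-in explanation; the synthetic dataset, random-forest pool, and use of feature importances as the “explanation” are all illustrative assumptions.

```python
# Illustrative sketch (ours, not the authors' code): in a pool of similarly
# accurate models, find one that disagrees with the primary model on an input
# and surface its prediction plus a stand-in "explanation".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small set of near-equivalent models (same data, different seeds).
models = [RandomForestClassifier(n_estimators=50, random_state=s).fit(X_tr, y_tr)
          for s in range(5)]

def dissenting_explanation(x, primary_idx=0):
    """Return (label, explanation) from a model that disagrees with the primary model."""
    x = np.asarray(x).reshape(1, -1)
    primary_pred = models[primary_idx].predict(x)[0]
    for i, m in enumerate(models):
        if i != primary_idx and m.predict(x)[0] != primary_pred:
            # Feature importances stand in for a per-instance explanation method.
            return m.predict(x)[0], m.feature_importances_
    return None  # every model in the pool agrees on this instance

print(dissenting_explanation(X_te[0]))
```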

Poster link.


A More Robust Baseline for Active Learning by Injecting Randomness to Uncertainty Sampling

Authors: Po-Yi Lu, Chun-Liang Li, Hsuan-Tien Lin

Abstract: Active learning is important for human-computer interaction in the domain of machine learning. It strategically selects important unlabeled examples that need human annotation, reducing the labeling workload. One strong baseline strategy for active learning is uncertainty sampling, which determines importance by model uncertainty. Nevertheless, uncertainty sampling sometimes fails to outperform random sampling, thus not achieving the fundamental goal of active learning. To address this, this work investigates a simple yet overlooked remedy: injecting some randomness into uncertainty sampling. The remedy rescues uncertainty sampling from failure cases while maintaining its effectiveness in success cases. Our analysis reveals how the remedy balances the bias in the original uncertainty sampling with a small variance. Furthermore, we empirically demonstrate that injecting a mere 10% of randomness achieves competitive performance across many benchmark datasets. The findings suggest that randomness-injected uncertainty sampling can serve as a more robust baseline and a preferred choice for active learning practitioners.
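
A minimal sketch of what such randomness injection could look like, assuming a margin-based uncertainty score and a fixed 10% random share of each labeling batch (our illustration, not the paper’s code):

```python
# Minimal sketch (our illustration, not the paper's code) of randomness-injected
# uncertainty sampling: spend ~10% of each labeling batch on random queries and
# the rest on the smallest-margin (most uncertain) unlabeled points.
import numpy as np

def select_queries(probs, budget, random_fraction=0.1, rng=None):
    """probs: (n_unlabeled, n_classes) class probabilities from the current model."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_random = int(round(random_fraction * budget))
    n_uncertain = budget - n_random

    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]    # top-1 minus top-2 probability
    uncertain = np.argsort(margin)[:n_uncertain]          # smallest margin = most uncertain
    remaining = np.setdiff1d(np.arange(len(probs)), uncertain)
    random_picks = rng.choice(remaining, size=n_random, replace=False)
    return np.concatenate([uncertain, random_picks])

probs = np.random.default_rng(1).dirichlet(np.ones(3), size=100)
print(select_queries(probs, budget=10))
```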

Poster link.


Partial Label Learning meets Active Learning: Enhancing Annotation Efficiency through Binary Questioning

Authors: Shivangana Rawat, Chaitanya Devaguptapu, Vineeth N. Balasubramanian 

Abstract: Supervised learning is an effective approach to machine learning, but it can be expensive to acquire labeled data. Active learning (AL) and partial label learning (PLL) are two techniques that can be used to reduce the annotation costs of supervised learning. AL is a strategy for reducing the annotation budget by selecting and labeling the most informative samples, while PLL is a weakly supervised learning approach to learn from partially annotated data by identifying the true hidden label. In this paper, we propose a novel approach that combines AL and PLL techniques to improve annotation efficiency. Our method leverages AL to select informative binary questions and PLL to identify the true label from the set of possible answers. We conduct extensive experiments on various benchmark datasets and show that our method achieves state-of-the-art (SoTA) performance with significantly reduced annotation costs. Our findings suggest that our method is a promising solution for cost-effective annotation in real-world applications.

Poster link.


ConceptEvo: Interpreting Concept Evolution in Deep Learning Training

Authors: Haekyu Park, Seongmin Lee, Benjamin Hoover, Austin P Wright, Omar Shaikh, Rahul Duggal, Nilaksh Das, Kevin Li, Judy Hoffman, Duen Horng Chau

Abstract: We present CONCEPTEVO, a unified interpretation framework for deep neural networks (DNNs) that reveals the inception and evolution of learned concepts during training. Our work fills a critical gap in DNN interpretation research, as existing methods focus on post-hoc interpretation after training. CONCEPTEVO presents two novel technical contributions: (1) an algorithm that generates a unified semantic space that enables side-by-side comparison of different models during training; and (2) an algorithm that discovers and quantifies important concept evolutions for class predictions. Through a large-scale human evaluation with 260 participants and quantitative experiments, we show that CONCEPTEVO discovers evolutions across different models that are meaningful to humans and important for predictions. CONCEPTEVO works for both modern (ConvNeXt) and classic DNNs (e.g., VGGs, InceptionV3).

Poster link. 


Human-in-the-Loop Out-of-Distribution Detection with False Positive Rate Control

Authors: Harit Vishwakarma*, Heguang Lin*, Ramya Korlakai Vinayak 

Abstract: Robustness to Out-of-Distribution (OOD) samples is essential for the successful deployment of machine learning models in the open world. Since it is not possible to have a priori access to a variety of OOD data before deployment, several recent works have focused on designing scoring functions to quantify OOD uncertainty. These methods often find a threshold that achieves 95% true positive rate (TPR) on the In-Distribution (ID) data used for training and use this threshold for detecting OOD samples. However, this can lead to a very high false positive rate (FPR): in a comprehensive evaluation on the Open-OOD benchmark, the FPR can range between 60% and 96% on several ID and OOD dataset combinations. In contrast, practical systems deal with a variety of OOD samples on the fly, and critical applications, e.g., medical diagnosis, demand guaranteed control of the FPR. To meet these challenges, we propose a mathematically grounded framework for human-in-the-loop OOD detection, wherein expert feedback is used to update the threshold. This allows the system to adapt to variations in the OOD data while adhering to the quality constraints. We propose an algorithm that uses anytime-valid confidence intervals based on the Law of Iterated Logarithm (LIL). Our theoretical results show that the system meets FPR constraints while minimizing the human feedback for points that are in-distribution. Another key feature of the system is that it can work with any existing post-hoc OOD uncertainty-quantification methods. We evaluate our system empirically on a mixture of benchmark OOD datasets in image classification tasks, with CIFAR-10 and CIFAR-100 as in-distribution datasets, and show that our method can maintain an FPR of at most 5% while maximizing the TPR.
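
For context, the static baseline the abstract critiques can be sketched as below (a generic recipe, not the proposed method): fix the threshold at the 95% TPR point on in-distribution validation scores and flag lower-scoring inputs as OOD. The paper’s framework instead updates this threshold online from expert feedback using LIL-based anytime-valid confidence intervals, which this sketch omits.

```python
# Sketch of the common static baseline (not the proposed human-in-the-loop method):
# fix the OOD threshold at the 95% TPR point on ID validation scores and flag
# lower-scoring inputs as OOD. Assumes higher score = more ID-like.
import numpy as np

def fit_tpr_threshold(id_scores, tpr=0.95):
    """Threshold that keeps `tpr` of in-distribution samples above it."""
    return np.quantile(id_scores, 1.0 - tpr)

def flag_ood(scores, threshold):
    return scores < threshold

id_val_scores = np.random.default_rng(0).normal(loc=5.0, scale=1.0, size=10_000)
tau = fit_tpr_threshold(id_val_scores)
print(tau, flag_ood(np.array([2.0, 6.0]), tau))
```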

Poster link.


Personalized Prediction of Recurrent Stress Events Using Self-Supervised Learning on Multimodal Time-Series Data

Authors: Tanvir Islam, Peter Washington 

Abstract: Chronic stress can significantly affect physical and mental health. The advent of wearable technology allows for the tracking of physiological signals, potentially leading to innovative stress prediction and intervention methods. However, challenges such as label scarcity and data heterogeneity render stress prediction difficult in practice. To counter these issues, we have developed a multimodal personalized stress prediction system using wearable biosignal data. We employ self-supervised learning (SSL) to pre-train the models on each subject’s data, allowing the models to learn the baseline dynamics of the participant’s biosignals prior to fine-tuning on the stress prediction task. We test our model on the Wearable Stress and Affect Detection (WESAD) dataset, demonstrating that our SSL models outperform non-SSL models while utilizing less than 5% of the annotations. These results suggest that our approach can personalize stress prediction to each user with minimal annotations. This paradigm has the potential to enable personalized prediction of a variety of recurring health events using complex multimodal data streams.
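
A hedged sketch of the per-subject recipe described above; build_encoder, ssl_pretrain, and finetune are hypothetical callables standing in for the model constructor, the self-supervised pre-training step, and supervised fine-tuning:

```python
# Hedged sketch of the personalization recipe, not the paper's implementation;
# all helpers are hypothetical placeholders.
def personalize(subjects, build_encoder, ssl_pretrain, finetune):
    """Train one stress-prediction model per subject from mostly unlabeled biosignals."""
    models = {}
    for subject in subjects:
        encoder = build_encoder()
        # Learn the subject's baseline biosignal dynamics without any stress labels.
        encoder = ssl_pretrain(encoder, subject["unlabeled_windows"])
        # Fine-tune on the handful of labeled windows (<5% of annotations).
        models[subject["id"]] = finetune(encoder, subject["labeled_windows"])
    return models
```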

Poster link.


Are Good Explainers Secretly Human-in-the-Loop Active Learners?

Authors: Emma Thuong Nguyen, Abhishek Ghose 

Abstract: Explainable AI (XAI) techniques have become popular for multiple use-cases in the past few years. Here we consider their use in studying model predictions to gather additional training data. We argue that this is equivalent to Active Learning, where the query strategy involves a human-in-the-loop. We provide a mathematical approximation for the role of the human, and present a general formalization of the end-to-end workflow. This enables us to rigorously compare this use with standard Active Learning algorithms, while allowing for extensions to the workflow. An added benefit is that their utility can be assessed via simulation instead of conducting expensive user studies. We also present some initial promising results.

Poster link.


Prediction without Preclusion: Recourse Verification with Reachable Sets

Authors: Avni Kothari, Bogdan Kulynych, Tsui-Wei Weng, Berk Ustun 

Abstract: Machine learning models are often used to decide who will receive a loan, a job interview, or a public service. Standard techniques to build these models use features that characterize people but overlook their actionability. In domains like lending and hiring, models can assign predictions that are fixed, meaning that consumers who are denied loans and interviews are permanently locked out from access to credit and employment. In this work, we introduce a formal testing procedure, called recourse verification, to flag models that assign these “predictions without recourse.” We develop machinery to reliably test the feasibility of recourse for any model given user-specified actionability constraints. We demonstrate how these tools can ensure recourse and adversarial robustness in real-world datasets and use them to study the infeasibility of recourse in real-world lending datasets. Our results highlight how models can inadvertently assign fixed predictions that permanently bar access and the need to design algorithms that account for actionability when developing models and providing recourse.

Poster link.


Interactively Optimizing Layout Transfer for Vector Graphics

Authors: Jeremy Warner, Shuyao Zhou, Bjorn Hartmann 

Abstract: Vector graphics are an industry-standard way to represent and share a broad range of visual designs. Designers often explore layout alternatives and generate them by moving and resizing elements. The motivation for this can include establishing a different visual flow, adapting a design to a different aspect ratio, standardizing spacing, or redirecting the design’s visual emphasis. Existing designs can serve as a source of inspiration for layout modification across these goals. However, generating these layout alternatives still requires significant manual effort in rearranging large groups of elements. We present VLT, short for Vector Layout Transfer, a novel tool that provides new techniques (Table 1) for transforming designs, enabling the flexible transfer of layouts between designs. It provides designers with multiple levels of semantic layout editing controls, powered by automatic graphics correspondence and layout optimization algorithms.

Poster link. 


Towards Semantically-Aware UI Design Tools: Design, Implementation and Evaluation of Semantic Grouping Guidelines

Authors: Peitong Duan, Bjorn Hartmann, Karina Nguyen, Yang Li, Marti Hearst, Meredith Ringel Morris 

Abstract: A coherent semantic structure, where semantically-related elements are appropriately grouped, is critical for proper understanding of a UI. Ideally, UI design tools should help designers establish coherent semantic grouping. To work towards this, we contribute five semantic grouping guidelines that capture how human designers think about semantic grouping and are amenable to implementation in design tools. They were obtained from empirical observations on existing UIs, a literature review, and iterative refinement with UI experts’ feedback. We validated our guidelines through an expert review and heuristic evaluation; results indicate these guidelines capture valuable information about semantic structure. We demonstrate the guidelines’ use for building systems by implementing a set of computational metrics. These metrics detected many of the same severe issues that human design experts marked in a comparative study. Running our metrics on a larger UI dataset suggests many real UIs exhibit grouping violations.

Poster link.


Mitigating Label Bias via Decoupled Confident Learning

Authors: Yunyi Li, Maria De-Arteaga, Maytal Saar-Tsechansky 

Abstract: Growing concerns regarding algorithmic fairness have led to a surge in methodologies to mitigate algorithmic bias. However, such methodologies largely assume that observed labels in training data are correct. This is problematic because bias in labels is pervasive across important domains, including healthcare, hiring, and content moderation. In particular, human-generated labels are prone to encoding societal biases. While the presence of labeling bias has been discussed conceptually, there is a lack of methodologies to address this problem. We propose a pruning method—Decoupled Confident Learning (DeCoLe)—specifically designed to mitigate label bias. After illustrating its performance on a synthetic dataset, we apply DeCoLe in the context of hate speech detection, where label bias has been recognized as an important challenge, and show that it successfully identifies biased labels and outperforms competing approaches.

Poster link.


Towards Mitigating Spurious Correlations in Image Classifiers with Simple Yes-no Feedback

Authors: Seongmin Lee, Ali Payani, Duen Horng Chau 

Abstract: Modern deep learning models have achieved remarkable performance. However, they often rely on spurious correlations between data and labels that exist only in the training data, resulting in poor generalization performance. We present CRAYON (Correlation Rectification Algorithms by Yes Or No), effective, scalable, and practical solutions to refine models with spurious correlations using simple yes-no feedback on model interpretations. CRAYON addresses key limitations of existing approaches that heavily rely on costly human intervention and empowers popular model interpretation techniques to mitigate spurious correlations in two distinct ways: CRAYON-ATTENTION guides saliency maps to focus on relevant image regions, and CRAYON-PRUNING prunes irrelevant neurons to remove their influence. Extensive evaluation on three benchmark image datasets and three state-of-the-art methods demonstrates that our methods effectively mitigate spurious correlations, achieving comparable or even better performance than existing approaches that require more complex feedback.
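
The pruning idea can be illustrated with a small sketch (ours, not the CRAYON implementation): mask the activations of hidden units that yes-no feedback marked as irrelevant, here via a PyTorch forward hook on a toy model with placeholder neuron indices.

```python
# Illustrative sketch (not the CRAYON code): silence hidden units that a user's
# yes-no feedback marked as irrelevant by masking their activations with a
# forward hook. The model, layer, and flagged indices are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
irrelevant = [3, 7, 12]          # neuron indices the user answered "no" to
mask = torch.ones(32)
mask[irrelevant] = 0.0

def prune_hook(module, inputs, output):
    return output * mask         # zero out the flagged activations

model[1].register_forward_hook(prune_hook)   # hook on the ReLU output
logits = model(torch.randn(4, 64))
print(logits.shape)
```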

Poster link.


Semi-supervised Concept Bottleneck Models

Authors: Jeeon Bae, Sungbin Shin, Namhoon Lee 

Abstract: Concept bottleneck models (CBMs) enhance the interpretability of deep neural networks by adding a concept layer between the input and output layers. However, this improvement comes at the cost of labeling concepts, which can be prohibitively expensive. To tackle this issue, we develop a semi-supervised learning (SSL) approach to CBMs that can make accurate predictions given only a handful of concept annotations. Our approach incorporates a strategy for effectively regulating erroneous pseudo-labels within the standard SSL approaches. We conduct experiments on a range of labeling scenarios and show that our approach can reduce the labeling cost significantly without sacrificing prediction performance.

Paper link.

Poster link.


Demystifying the Role of Feedback in GPT Self-Repair for Code Generation

Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama 

Abstract: Large Language Models (LLMs) have shown remarkable aptitude in generating code from natural language specifications, but still struggle on challenging programming tasks. Self-repair—in which the user provides executable unit tests and the model uses these to debug and fix mistakes in its own code—may improve performance in these settings without significantly altering the way in which programmers interface with the system. However, existing studies on how and when self-repair works effectively have been limited in scope, and one might wonder how self-repair compares to keeping a software engineer in the loop to give feedback on the code model’s outputs. In this paper, we analyze GPT-3.5 and GPT-4’s ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. We find that when the cost of generating both feedback and repaired code is taken into account, performance gains from self-repair are marginal and can only be seen with GPT-4. In contrast, when human programmers are used to provide feedback, the success rate of repair increases by as much as 57%. These findings suggest that self-repair still trails far behind what can be achieved with a feedback-giving human kept closely in the loop.
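
A generic self-repair loop in the spirit the abstract describes might look like the following hedged sketch; generate, critique, and run_tests are hypothetical callables standing in for the code model, the feedback step (model or human), and a unit-test harness:

```python
# Hedged sketch of a generic self-repair loop; all callables are hypothetical
# placeholders, not the paper's setup.
def self_repair(spec, generate, critique, run_tests, max_rounds=3):
    code = generate(spec, feedback=None)           # initial program from the spec
    for _ in range(max_rounds):
        passed, log = run_tests(code)              # execute the user-provided unit tests
        if passed:
            return code
        feedback = critique(spec, code, log)       # explain what went wrong (model or human)
        code = generate(spec, feedback=feedback)   # repair conditioned on the feedback
    return code
```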

Poster link.


Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus

Authors: Gang Li, Yang Li 

Abstract: Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, with the hope of bypassing challenging tasks of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, although the use of view hierarchies could offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that only takes the screenshot of the UI and a region of interest on the screen—the focus—as the input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore multi-task learning and few-shot prompting capacities of the proposed models, demonstrating promising results in the multi-task learning direction.

Poster link.


SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text

Authors: Charumathi Badrinath, Weiwei Pan, Finale Doshi-Velez 

Abstract: A common way to explore text corpora is through low-dimensional projections of the documents, where one hopes that thematically similar documents will be clustered together in the projected space. However, popular algorithms for dimensionality reduction of text corpora, like Latent Dirichlet Allocation (LDA), often produce projections that do not capture human notions of document similarity. We propose a semi-supervised human-in-the-loop LDA-based method for learning topics that preserve semantically meaningful relationships between documents in low-dimensional projections. On synthetic corpora, our method yields more interpretable projections than baseline methods with only a fraction of labels provided. On a real corpus, we obtain qualitatively similar results.

Poster link.


Active Reinforcement Learning from Demonstration in Continuous Action Spaces

Authors: Ming-Hsin Chen, Si-An Chen, Hsuan-Tien Lin 

Abstract: Learning from Demonstration (LfD) is a human-in-the-loop paradigm that aims to overcome the limitations of safety considerations and weak data efficiency in Reinforcement Learning (RL). Active Reinforcement Learning from Demonstration (ARLD) takes LfD a step further by actively involving the human expert only during critical moments, reducing the costs associated with demonstrations. While successful ARLD strategies have been developed for RL environments with discrete actions, their potential in continuous action environments has not been thoroughly explored. In this work, we propose a novel ARLD strategy specifically designed for continuous environments. Our strategy involves estimating the uncertainty of the current RL agent directly from the variance of the stochastic policy within the state-of-the-art Soft Actor-Critic RL model. We demonstrate that our strategy outperforms both a naive attempt to adapt existing ARLD strategies to continuous environments and the passive LfD strategy. These results validate the potential of ARLD in continuous environments and lay the foundation for future research in this direction.
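
A minimal sketch of such a querying rule, under our own assumptions (policy is a hypothetical callable returning the Gaussian policy’s per-action mean and standard deviation):

```python
# Minimal sketch, not the paper's algorithm: ask the human demonstrator to act
# whenever the stochastic (Gaussian) policy's own standard deviation is large.
import numpy as np

def should_query_expert(policy, state, std_threshold=0.5):
    """policy(state) -> (mean_action, std_action) of the stochastic policy."""
    _, std = policy(state)
    return float(np.mean(std)) > std_threshold   # high variance => request a demonstration
```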

Paper link.

Poster link.


HateXplain2.0: An Explainable Hate Speech Detection Framework Utilizing Subjective Projection from Contextual Knowledge Space to Disjoint Concept Space

Authors: Md Fahim, Md Shihab Shahriar, Mohammad Sabik Irbaz, Syed Ishtiaque Ahmed, Mohammad Ruhul Amin 

Abstract: Finetuning large pre-trained language models on specific datasets is a popular approach in Natural Language Processing (NLP) classification tasks. However, this can lead to overfitting and reduce model explainability. In this paper, we propose a framework that uses the projection of sentence representations onto task-specific conceptual spaces for improved explainability. Each conceptual space corresponds to a class and is learned through a transformer operator optimized during classification tasks. The dimensions of the concept spaces can be trained and optimized. Our framework shows that each dimension is associated with specific words which represent the corresponding class. To optimize the training of the operators, we introduce intra- and inter-space losses. Experimental results on two datasets demonstrate that our model achieves better accuracy and explainability. On the HateXplain dataset, our model shows at least a 10% improvement in various explainability metrics.

Poster link.


Iterative Disambiguation: Towards LLM-Supported Programming and System Design

Authors: J.D. Zamfirescu-Pereira, Bjorn Hartmann 

Abstract: LLMs offer unprecedented capabilities for generating code and prose; creating systems that take advantage of these capabilities can be challenging. We propose an artifact-centered iterative disambiguation process for using LLMs to iteratively refine an LLM-based system of subcomponents, each of which is in turn defined and/or implemented by an LLM. A system implementing this process could expand the experience of end-user computing to include user-defined programs capable of nearly any computable activity; here, we propose one approach to explore iterative disambiguation for end-user system design.

Poster link.


Adaptive interventions for both accuracy and time in AI-assisted human decision making

Authors: Siddharth Swaroop, Zana Buçinca, Finale Doshi-Velez 

Abstract: In settings where users are both time-pressured and need high accuracy, such as doctors working in Emergency Rooms, we want to provide AI assistance that both increases accuracy and reduces time. However, different types of AI assistance have different benefits: some reduce time taken while increasing overreliance on AI, while others do the opposite. We therefore want to adapt what AI assistance we show depending on various properties (of the question and of the user) in order to best trade off our two objectives. We introduce a study where users have to prescribe medicines to aliens, and use it to explore the potential for adapting AI assistance. We find evidence that it is beneficial to adapt our AI assistance depending on the question, leading to good tradeoffs between time taken and accuracy. Future work will consider machine learning algorithms (such as reinforcement learning) to adapt AI assistance automatically and quickly.

Poster link.


Do Users Write More Insecure Code with AI Assistants?

Authors: Neil Perry*, Megha Srivastava*, Deepak Kumar, Dan Boneh 

Abstract: We conduct the first large-scale user study examining how users interact with an AI Code assistant to solve a variety of security related tasks across different programming languages. Overall, we find that participants who had access to an AI assistant based on OpenAI’s codex-davinci-002 model wrote less secure code than those without access. Additionally, participants with access to an AI assistant were more likely to believe they wrote secure code than those without access to the AI assistant. Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities. Finally, in order to better inform the design of future AI Assistants, we provide an in-depth analysis of participants’ language and interaction behavior, as well as release our user interface as an instrument to conduct similar studies in the future.

Poster link.


Exploring Mobile UI Layout Generation using Large Language Models Guided by UI Grammar

Authors: Yuwen Lu, Ziang Tong, Anthea Qinyi Zhao, Chengzhi Zhang, Toby Jia-Jun Li 

Abstract: The recent advances in Large Language Models (LLMs) have stimulated interest among researchers and industry professionals, particularly in their application to tasks concerning mobile user interfaces (UIs). This position paper investigates the use of LLMs for UI layout generation. Central to our exploration is the introduction of UI grammar, a novel approach we propose to represent the hierarchical structure inherent in UI screens. The aim of this approach is to guide the generative capacities of LLMs more effectively and improve the explainability and controllability of the process. Initial experiments conducted with GPT-4 showed the promising capability of LLMs to produce high-quality user interfaces via in-context learning. Furthermore, our preliminary comparative study suggested the potential of the grammar-based approach in improving the quality of generative results in specific aspects.

Poster link.


Discovering User Types: Characterization of User Traits by Task-Specific Behaviors in Reinforcement Learning

Authors: Lars Lien Ankile*, Brian Ham*, Kevin Mao, Eura Shin, Siddharth Swaroop, Finale Doshi-Velez, Weiwei Pan

Abstract: When assisting human users in reinforcement learning (RL), we can represent users as RL agents and study key parameters, called user traits, to inform intervention design. We study the relationship between user behaviors (policy classes) and user traits. Given an environment, we introduce an intuitive tool for studying the breakdown of “user types”: broad sets of traits that result in the same behavior. We show that seemingly different real-world environments admit the same set of user types and formalize this observation as an equivalence relation defined on environments. By transferring intervention design between environments within the same equivalence class, we can help rapidly personalize interventions.

Poster link.


Symbiotic Co-Creation with AI

Authors: Ninon Lizé Masclef 

Abstract: The quest for symbiotic co-creation between humans and artificial intelligence (AI) has received considerable attention in recent years. This paper explores the challenges and opportunities associated with human-AI interaction, focusing on the unique qualities that distinguish symbiotic interactions from conventional human-to-tool relationships. The role of representation learning and multimodal models in enabling symbiotic co-creation is discussed, emphasising their potential to overcome the limitations of language and tap into deeper layers of symbolic representation. In addition, the concept of AI as design material is explored, highlighting how the latent spatial representation of generative models becomes a field of possibilities for human creators. It also explores novel creative affordances of AI interfaces, including combinational, exploratory and transformational creativity. The paper concludes by highlighting the transformative potential of AI in enhancing human creativity and shaping new frontiers of collaborative creation.

Poster link.


Designing interactions with AI to support the scientific peer review process

Authors: Lu Sun, Stone Tao, Junjie Hu, Steven Dow

Abstract: Peer review processes include a series of activities, from review writing to meta-review authoring. Recent advances in AI exhibit the potential to augment complex human writing activities. However, it is still not clear how to design interactive systems that leverage AI to support the scientific peer review process, and what the potential trade-offs are. In this paper, we prototype a system, MetaWriter, which uses three forms of AI to support meta-review authoring and offers useful functionalities including review aspect highlights, viewpoint extraction, and hybrid draft generation. In a within-subjects experiment, 32 participants wrote meta-reviews using MetaWriter and a baseline environment with no machine support. We show that MetaWriter can expedite and improve the meta-review authoring process, but participants raised concerns about trust, over-reliance, and agency. We further discuss insights on designing interactions with AI to support the scientific peer review process.

Poster link.

Poster Session #2:


feather - a Python SDK to share and deploy models

Authors: Nihir Vedd*, Paul Riga*

Abstract: At its core, feather was a tool that allowed model developers to build shareable user interfaces for their models in under 20 lines of code. Using the Python SDK, developers specified visual components that users would interact with (e.g., a FileUpload component to allow users to upload a file). Our service then provided 1) a URL that allowed others to access and use the model visually via a user interface; 2) an API endpoint to allow programmatic requests to a model. In this paper, we discuss feather’s motivations and the value we intended to offer AI researchers and developers. For example, the SDK can support multi-step models and can be extended to run automatic evaluation against held-out datasets. We additionally provide comprehensive technical and implementation details. N.B. feather is presently a dormant project. We have open sourced our code for research purposes: https://github.com/feather-ai/.

Poster link.


How vulnerable are doctors to unsafe hallucinatory AI suggestions? A framework for evaluation of safety in clinical human-AI cooperation

Authors: Paul Festor*, Myura Nagendran*, Anthony C Gordon, Matthieu Komorowski, Aldo A. Faisal 

Abstract: As artificial intelligence-based decision support systems aim at assisting human specialists in high-stakes environments, studying the safety of the human-AI team as a whole is crucial, especially in the light of the danger posed by hallucinatory AI treatment suggestions from now ubiquitous large language models. In this work, we propose a method for safety assessment of the human-AI team in high-stakes decision-making scenarios. By studying the interactions between doctors and a decision support tool in a physical intensive care simulation centre, we conclude that most unsafe (i.e. potentially hallucinatory) AI recommendations would be stopped by the clinical team. Moreover, eye-tracking-based attention measurements indicate that doctors focus more on unsafe than safe AI suggestions.

Poster link.


Informed Novelty Detection in Sequential Data by Per-Cluster Modeling

Authors: Linara Adilova, Siming Chen, Michael Kamp

Abstract: Novelty detection in discrete sequences is a challenging task, since deviations from the process generating the normal data are often small or intentionally hidden. In many applications data is generated by several distinct processes so that models trained on all the data tend to overgeneralize and novelties remain undetected. We propose to approach this challenge through decomposition: by clustering the data we break down the problem, obtaining simpler modeling tasks in each cluster which can be modeled more accurately. However, this comes at a cost, since the amount of training data per cluster is reduced. This is a particular problem for discrete sequences where state-of-the-art models are data-hungry. The success of this approach thus depends on the quality of the clustering, i.e., whether the individual learning problems are sufficiently simpler than the joint problem. In this paper we adapt a state-of-the-art visual analytics tool for discrete sequence clustering to obtain informed clusters from domain experts, since clustering discrete sequences automatically is a challenging and domain-specific task. We use LSTMs to further model each of the clusters. Our empirical evaluation indicates that this informed clustering outperforms automatic ones and that our approach outperforms standard novelty detection methods for discrete sequences in three real-world application scenarios.

Poster link.


Workflow Discovery from Dialogues in the Low Data Regime

Authors: Amine El hattami, Issam H. Laradji, Stefania Raimondo, David Vazquez, Pau Rodriguez, Christopher Pal

Abstract: Text-based dialogues are now widely used to solve real-world problems. In cases where solution strategies are already known, they can sometimes be codified into workflows and used to guide humans or artificial agents through the task of helping clients. In this work, we introduce a new problem formulation that we call Workflow Discovery (WD) in which we are interested in the situation where a formal workflow may not yet exist. Still, we wish to discover the set of actions that have been taken to resolve a particular problem. We also examine a sequence-to-sequence (Seq2Seq) approach for this novel task using multiple Seq2Seq models. We present experiments where we extract workflows from dialogues in the Action-Based Conversations Dataset (ABCD) and the MultiWOZ dataset. We propose and evaluate an approach that conditions models on the set of possible actions, and we show that using this strategy, we can improve WD performance in the out-of-distribution setting. Further, on ABCD a modified variant of our Seq2Seq method achieves state-of-the-art performance on related but different tasks of Action State Tracking (AST) and Cascading Dialogue Success (CDS) across many evaluation metrics.

Poster link.


Uncertainty Fingerprints: Interpreting Model Decisions with Human Conceptual Hierarchies

Authors: Angie Boggust, Hendrik Strobelt, Arvind Satyanarayan

Abstract: Understanding machine learning model uncertainty is essential to comprehend model behavior, ensure safe deployment, and intervene appropriately. However, model confidences treat the output classes independently, ignoring relationships between classes that can reveal reasons for uncertainty, such as model confusion between related classes or an input with multiple valid labels. By leveraging human knowledge about related classes, we expand model confidence values into a hierarchy of concepts, creating an uncertainty fingerprint. An uncertainty fingerprint describes the model’s confidence in every possible decision, distinguishing how the model proceeded from a broad idea to its precise prediction. Using hierarchical entropy, we compare fingerprints based on the model’s decision-making process to categorize types of model uncertainty, identify common failure modes, and update dataset hierarchies.
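
The fingerprint construction can be illustrated with a small sketch (ours, not the paper’s code): expand flat class confidences over a hypothetical two-level concept hierarchy by summing each concept’s leaf probabilities.

```python
# Our illustration (not the paper's code): expand flat class confidences into a
# fingerprint over a hypothetical two-level concept hierarchy.
import numpy as np

hierarchy = {
    "animal": {"dog": 0, "cat": 1},       # leaf name -> class index
    "vehicle": {"car": 2, "truck": 3},
}

def fingerprint(probs, hierarchy):
    """probs: softmax output over the leaf classes."""
    fp = {}
    for concept, leaves in hierarchy.items():
        fp[concept] = float(sum(probs[i] for i in leaves.values()))
        for leaf, i in leaves.items():
            fp[leaf] = float(probs[i])
    return fp

print(fingerprint(np.array([0.40, 0.35, 0.15, 0.10]), hierarchy))
# High "animal" confidence split between dog and cat: confusion among related classes.
```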

Poster link.


LeetPrompt: Leveraging Collective Human Intelligence to Study LLMs

Authors: Sebastin Santy, Ayana Bharadwaj, Sahith Dambekodi, Alex Albert, Cathy Lang Yuan, Ranjay Krishna

Abstract: Writing effective instructions (or prompts) is rapidly evolving into a dark art, spawning websites dedicated to collecting, sharing, and even selling instructions. Yet, the research efforts evaluating large language models (LLMs) either limit instructions to a predefined set or worse, make anecdotal claims without rigorously testing sufficient instructions. In reaction to this cottage industry of instruction design, we introduce LEETPROMPT: a platform where people can interactively explore the space of instructions to solve problems. LEETPROMPT automatically evaluates human-LLM interactions to provide insights about both LLMs as well as human-interaction behavior. With LEETPROMPT, we conduct a within-subjects user study (N = 20) across 10 problems from 5 domains: biology, physics, math, programming, and general knowledge. By analyzing 1178 instructions used to invoke GPT-4, we present the following findings: First, we find that participants are able to design instructions for all tasks, including those that problem setters deemed unlikely to be solved. Second, all automatic mechanisms fail to generate instructions to solve all tasks. Third, the lexical diversity of instructions is significantly correlated with whether people were able to solve the problem, highlighting the need for diverse instructions when evaluating LLMs. Fourth, many instruction strategies are unsuccessful, highlighting the misalignment between participants’ conceptual model of the LLM and its functionality. Fifth, participants with prompting and math experience spend significantly more time on LEETPROMPT. Sixth, we find that people use more diverse instruction strategies than these automatic baselines. Finally, LEETPROMPT facilitates a learning effect: participants self-reported improvement as they solved each subsequent problem.

Poster link.


State trajectory abstraction and visualization method for explainability in reinforcement learning

Authors: Yoshiki Takagi, Roderick Tabalba, Jason Leigh

Abstract: Explainable AI (XAI) has demonstrated the potential to help reinforcement learning (RL) practitioners understand how RL models work. However, XAI for users who have considerable domain knowledge but lack machine learning (ML) expertise is understudied. Solving such a problem would enable RL experts to communicate with domain experts in producing ML solutions that better meet their intentions. This study examines a trajectory-based approach to the problem. Trajectory-based XAI appears promising in enabling non-RL experts to understand an RL model’s behavior by viewing a visual representation of the behavior that consists of trajectories that depict the transitions between the major states of the RL models. This paper proposes a framework to create and evaluate a visual representation of RL models’ behavior that is easy to understand for both RL and non-RL experts.

Poster link.


How Can AI Reason Your Character?

Authors: Dongsu Lee, Minhae Kwon

Abstract: Inference of decision preferences through observation of others’ behavior is a crucial skill for artificial agents to collaborate with humans. While some attempts have been made in this realm, the inference speed and accuracy of current methods still need improvement. The main obstacle to achieving higher accuracy lies in the stochastic nature of human behavior, a consequence of the stochastic reward system underlying human decision-making. To address this, we propose an instant inference network (IIN) that infers the stochastic character of partially observable agents. The agent’s character is parameterized by weights assigned to reward components in reinforcement learning, resulting in a singular policy for each character. To train the IIN for inferring diverse characters, we develop a universal policy comprising a set of policies reflecting different characters. Once the IIN is trained to cover diverse characters using the universal policy, it can return character parameters instantly by receiving behavior trajectories. The simulation results confirm that the inference accuracy of the proposed solution outperforms state-of-the-art algorithms, despite having lower computational complexity.

Poster link.


Human-Aligned Calibration for AI-Assisted Decision Making

Authors: Nina L. Corvelo Benz, Manuel Gomez Rodriguez

Abstract: Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. Then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has been often argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulty developing a good sense of when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values—an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker’s confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone in the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker’s confidence on her own predictions is a sufficient condition for alignment. Experiments on four different AI-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.

Poster link.


Breadcrumbs to the Goal: Goal-Conditioned Exploration from Human-in-the-loop feedback

Authors: Marcel Torne Villasevil, Max Balsells i Pamies, Zihan Wang, Samedh Desai, Tao Chen, Pulkit Agrawal, Abhishek Gupta

Abstract: Exploration and reward specification are fundamental and intertwined challenges for reinforcement learning. Solving sequential decision making tasks with a non-trivial element of exploration requires either specifying carefully designed reward functions or relying on novelty-seeking exploration bonuses. Human supervisors can provide effective guidance in the loop to direct the exploration process, but prior methods to leverage this guidance require constant synchronous high-quality human feedback, which is expensive and impractical to obtain. In this work, we present a technique called Human Guided Exploration (HuGE), which uses low-quality feedback from non-expert users that may be sporadic, asynchronous, and noisy. HuGE guides exploration for reinforcement learning not only in simulation, but also in the real world, all without meticulous reward specification. The key concept involves bifurcating human feedback and policy learning: human feedback steers exploration, while self-supervised learning from the exploration data yields unbiased policies. This procedure can leverage noisy, asynchronous human feedback to learn policies with no hand-crafted reward design or exploration bonuses. HuGE is able to learn a variety of challenging multi-stage robotic navigation and manipulation tasks in simulation using crowdsourced feedback from non-expert users. Moreover, this paradigm can be scaled to learning directly on real-world robots.

Poster link.


Designing Data: Proactive Data Collection and Iteration for Machine Learning Using Reflexive Planning, Monitoring, and Density Estimation

Authors: Aspen K Hopkins, Fred Hohman, Luca Zappella, Dominik Moritz, Xavier Suau

Abstract: Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time-intensive and rarely comprehensive. Thus, new methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real-world variability. We present designing data, an iterative, bias-mitigating approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model using density estimation. We instantiate designing data through our own data collection and applied ML case study. We find models trained on “designed” datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.
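
The Data Familiarity step can be sketched as follows, under our own assumptions (not the paper’s implementation): fit a density estimate on embeddings of already collected data and treat low-density candidates as unfamiliar.

```python
# Sketch under our own assumptions, not the paper's implementation: low-density
# candidates under a KDE fit on collected-data embeddings are "unfamiliar".
import numpy as np
from sklearn.neighbors import KernelDensity

collected = np.random.default_rng(0).normal(size=(500, 16))             # stand-in embeddings
candidates = np.random.default_rng(1).normal(loc=2.0, size=(100, 16))

kde = KernelDensity(bandwidth=1.0).fit(collected)
familiarity = kde.score_samples(candidates)                             # log-density per sample
unfamiliar_idx = np.argsort(familiarity)[:10]                           # 10 least familiar points
print(unfamiliar_idx)
```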

Poster link.


Language Models can Solve Computer Tasks

Authors: Geunwoo Kim, Pierre Baldi, Stephen McAleer

Abstract: Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent Recursively Criticizes and Improves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. We compare multiple LLMs and find that RCI with the InstructGPT3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function. Furthermore, we demonstrate RCI prompting’s effectiveness in enhancing LLMs’ reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting. We find that RCI combined with CoT performs better than either separately. Our code can be found here: https://github.com/posgnu/rci-agent.
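
A hedged sketch of the RCI loop as the abstract describes it; llm is a hypothetical text-completion callable rather than a specific API:

```python
# Hedged sketch of Recursively Criticize and Improve (RCI); `llm` is a
# hypothetical text-completion callable, and the prompts are illustrative.
def rci(task, llm, rounds=2):
    answer = llm(f"Task: {task}\nPropose a sequence of computer actions to solve it.")
    for _ in range(rounds):
        critique = llm(f"Task: {task}\nProposed actions: {answer}\n"
                       "Review these actions and point out any problems.")
        answer = llm(f"Task: {task}\nProposed actions: {answer}\nCritique: {critique}\n"
                     "Improve the actions based on the critique.")
    return answer
```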

Appendix link.

Poster link.


Computational Approaches for App-to-App Retrieval and Design Consistency Check

Authors: Seokhyeon Park*, Wonjae Kim*, Young-Ho Kim, Jinwook Seo

Abstract: Extracting semantic representations from mobile user interfaces (UI) and using the representations for designers’ decision-making processes have shown the potential to be effective computational design support tools. Current approaches rely on machine learning models trained on small-sized mobile UI datasets to extract semantic vectors and use screenshot-to-screenshot comparison to retrieve similar-looking UIs given query screenshots. However, the usability of these methods is limited because they are often not open-sourced and have complex training pipelines for practitioners to follow, and are unable to perform screenshot set-to-set (i.e., app-to-app) retrieval. To this end, we (1) employ visual models trained with large web-scale images and test whether they could extract a UI representation in a zero-shot way and outperform existing specialized models, and (2) use mathematically founded methods to enable app-to-app retrieval and design consistency analysis. Our experiments show that our methods not only improve upon previous retrieval models but also enable multiple new applications.

Poster link.


CHILLI: A data context-aware perturbation method for XAI

Authors: Saif Anwar, Nathan Griffiths, Abhir Bhalerao, Thomas Popham, Mark Bell

Abstract: The trustworthiness of Machine Learning (ML) models can be difficult to assess, but is critical in high-risk or ethically sensitive applications. Many models are treated as a ‘black-box’ where the reasoning or criteria for a final decision is opaque to the user. To address this, some existing Explainable AI (XAI) approaches approximate model behaviour using perturbed data. However, such methods have been criticised for ignoring feature dependencies, with explanations being based on potentially unrealistic data. We propose a novel framework, CHILLI, for incorporating data context into XAI by generating contextually aware perturbations, which are faithful to the training data of the base model being explained. This is shown to improve both the soundness and accuracy of the explanations.

Poster link.


Towards Never-ending Learning of User Interfaces

Authors: Jason Wu, Rebecca Krosnick, Eldon Schoop, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols

Abstract: Machine learning models have been trained to predict semantic information about user interfaces (UIs) to make apps more accessible, easier to test, and to automate. Currently, most models rely on datasets of static screenshots that are labeled by human crowd-workers, a process that is costly and surprisingly error-prone for certain tasks. For example, workers labeling whether a UI element is “tappable” from a screenshot must guess using visual signifiers, and do not have the benefit of tapping on the UI element in the running app and observing the effects. In this paper, we present the Never-ending UI Learner, an app crawler that automatically installs real apps from a mobile app store and crawls them to infer semantic properties of UIs by interacting with UI elements, discovering new and challenging training examples to learn from, and continually updating machine learning models designed to predict these semantics. The Never-ending UI Learner so far has crawled for more than 5,000 device-hours, performing over half a million actions on 6,000 apps to train a highly accurate tappability model.

Poster link.


Neuro-Symbolic Models of Human Moral Judgment: LLMs as Automatic Feature Extractors

Authors: Joe Kwon, Sydney Levine, Joshua B. Tenenbaum

Abstract: As AI systems gain prominence in society, concerns about their safety become crucial to address. There have been repeated calls to align powerful AI systems with human morality. However, attempts to do this have used black-box systems that cannot be interpreted or explained. In response, we introduce a methodology leveraging the natural language processing abilities of large language models (LLMs) and the interpretability of symbolic models to form competitive neuro-symbolic models for predicting human moral judgment. Our method involves using LLMs to extract morally-relevant features from a stimulus and then passing those features through a cognitive model that predicts human moral judgment. This approach achieves state-of-the-art performance on the MoralExceptQA benchmark, improving on the previous F1 score by 20 points and accuracy by 18 points, while also enhancing model interpretability by baring all key features in the model's computation. We propose future directions for harnessing LLMs to develop more capable and interpretable neuro-symbolic models, emphasizing the critical role of interpretability in facilitating the safe integration of AI systems into society.
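
A hedged sketch of the two-stage pipeline outlined above; llm_extract is a hypothetical callable mapping a scenario’s text to binary moral features, the feature names are invented for illustration, and a logistic regression stands in for the cognitive model so its weights remain interpretable:

```python
# Hedged sketch, not the paper's model: LLM-extracted features feed a simple
# interpretable classifier; features and helpers are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["harm_prevented", "rule_violated", "consent_given"]   # illustrative only

def featurize(scenarios, llm_extract):
    # llm_extract(text) -> {feature name: 0/1} judged by the LLM
    return np.array([[llm_extract(s)[f] for f in FEATURES] for s in scenarios])

def fit_cognitive_model(scenarios, judgments, llm_extract):
    X = featurize(scenarios, llm_extract)
    return LogisticRegression().fit(X, judgments)   # one interpretable weight per feature
```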

Poster link.


Rethinking Model Evaluation as Narrowing the Socio-Technical Gap

Authors: Q. Vera Liao, Ziang Xiao

Abstract: The recent development of generative and large language models (LLMs) poses new challenges for model evaluation that the research community and industry are grappling with. While the versatile capabilities of these models ignite excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single, so-called “general-purpose” model. In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments for whether and how much human needs in downstream use cases can be satisfied by the given model (socio-technical gap). By drawing on lessons from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world socio-requirements and embrace diverse evaluation methods with an acknowledgment of trade-offs between realism to socio-requirements and pragmatic costs to conduct the evaluation. By mapping HCI and current NLG evaluation methods, we identify opportunities for evaluation methods for LLMs to narrow the socio-technical gap and pose open questions.

Poster link.


Large Language Models as a Proxy For Human Evaluation in Assessing the Comprehensibility of Disordered Speech Transcription

Authors: Katrin Tomanek, Jimmy Tobin, Subhashini Venugopalan, Richard Cave, Katie Seaver, Rus Heywood, Jordan R Green

Abstract: Automatic Speech Recognition (ASR) systems, despite significant advances in recent years, still have much room for improvement, particularly in the recognition of disordered speech. Even so, erroneous transcripts from ASR models can help people with disordered speech be better understood. Evaluating the efficacy of ASR for this use case requires a methodology for measuring the impact of transcription errors on the intended meaning and comprehensibility. Human evaluation is the gold standard for this, but it can be laborious, slow, and expensive. Here, we tuned and evaluated large language models (LLMs) and found them to be a better proxy for human evaluators compared to typical sentence similarity metrics. We further present a case study of using our approach to make ASR model deployment decisions in a live video conversation setting.

Poster link.


Designing Decision Support Systems Using Counterfactual Prediction Sets

Authors: Eleni Straitouri, Manuel Gomez Rodriguez

Abstract: Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to help human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support system may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of system from the perspective of online learning and develop a methodology based on the successive elimination algorithm that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption on the experts’ predictions over the prediction sets to achieve an exponential improvement in regret in comparison with vanilla successive elimination. We conduct a large-scale human subject study (n = 2,751) to verify our counterfactual monotonicity assumption and compare our methodology to several competitive baselines. The results suggest that decision support systems that limit experts’ level of agency may be practical and may offer greater performance than those allowing experts to always exercise their own agency.
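
For context, the prediction sets such a system shows to the expert can be constructed with a standard split-conformal recipe, sketched below (a generic construction, not the paper’s methodology; varying alpha yields the nested sets the method exploits):

```python
# Generic split-conformal construction of prediction sets (not the paper's method).
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Return one label set per test point with ~(1 - alpha) marginal coverage."""
    # Nonconformity score: one minus the probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```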

Poster link.


PromptCrafter: Crafting Text-to-Image Prompt through Mixed-Initiative Dialogue with LLM

Authors: Seungho Baek, Hyerin Im, Jiseung Ryu, Juhyeong Park, Takyeon Lee

Abstract: Text-to-image generation models are able to generate images across a diverse range of subjects and styles based on a single prompt. Recent works have proposed a variety of interaction methods that help users understand the capabilities of models and utilize them. However, how to support users in efficiently exploring the model’s capability and creating effective prompts remains an open research question. In this paper, we present PromptCrafter, a novel mixed-initiative system that allows step-by-step crafting of text-to-image prompts. Through the iterative process, users can efficiently explore the model’s capability and clarify their intent. PromptCrafter also supports users in refining prompts by answering clarifying questions generated by a Large Language Model. Lastly, users can revert to a desired step by reviewing the work history. In this workshop paper, we discuss the design process of PromptCrafter and our plans for follow-up studies.

Poster link.


Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

Authors: Zhenhui Ye*, Ziyue Jiang*, Yi Ren*, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao

Abstract: We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long talking person video with its audio track as the training data and arbitrary texts as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not been technically achieved yet due to two challenges: (1) it is challenging for a traditional multi-speaker Text-to-Speech system to mimic the timbre of out-of-domain audio, and (2) it is hard to render high-fidelity and lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that well disentangles text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes the aforementioned two challenges and generates identity-preserving speech and realistic talking person video. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos.

Paper link to be added.

Poster link.


Participatory Personalization in Classification

Authors: Hailey James, Chirag Nagpal, Katherine A Heller, Berk Ustun

Abstract: Machine learning models are often personalized with information that is protected, sensitive, self-reported, or costly to acquire. These models use information about people, but do not facilitate nor inform their consent. Individuals cannot opt out of reporting personal information to a model, nor tell if they benefit from personalization in the first place. We introduce a family of classification models, called participatory systems, that let individuals opt into personalization at prediction time. We present a model-agnostic algorithm to learn participatory systems for personalization with categorical group attributes. We conduct a comprehensive empirical study of participatory systems in clinical prediction tasks, benchmarking them with common approaches for personalization and imputation. Our results demonstrate that participatory systems can facilitate and inform consent while improving performance and data use across all groups who report personal data.
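
A minimal sketch of the opt-in idea, assuming one integer-coded group attribute and scikit-learn classifiers: train a generic model without the attribute and a personalized one with it, and let the individual decide at prediction time which is used. This illustrates prediction-time opt-in only, not the paper's model-agnostic learning algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class OptInPersonalizedClassifier:
    """Illustrative opt-in personalization: a generic model trained without
    the group attribute and a personalized model trained with it. At
    prediction time the individual chooses whether to report the attribute."""

    def fit(self, X, group, y):
        # X: feature matrix, group: integer-coded group attribute, y: labels
        self.generic = LogisticRegression(max_iter=1000).fit(X, y)
        Xg = np.column_stack([X, group])
        self.personalized = LogisticRegression(max_iter=1000).fit(Xg, y)
        # Training-set accuracy gain that could be shown before opting in.
        self.reported_gain = self.personalized.score(Xg, y) - self.generic.score(X, y)
        return self

    def predict(self, x, group=None):
        if group is None:                              # individual opted out
            return self.generic.predict(x.reshape(1, -1))[0]
        return self.personalized.predict(np.append(x, group).reshape(1, -1))[0]
```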

Poster link.


Toward Model Selection Through Measuring Dataset Similarity on TensorFlow Hub

Authors: SeungYoung Oh, Hyunmin Lee, JinHyun Han, Hyunggu Jung 

Abstract: For novice developers, it is a challenge to select the most appropriate model without prior knowledge of artificial intelligence (AI) development. The main goal of our system is to provide an automated approach to presenting models through dataset similarity. We present a system that allows novice developers to select the best model among existing models in the TensorFlow Hub (TF Hub) online community. Our strategy was to use the similarity of two datasets as a measure to determine the best model. Through a systematic review, we identified several limitations, each of which corresponds to a function to be implemented in our proposed system. We then created a model selection system that enables novice developers to select the most appropriate ML model without prior knowledge of AI by implementing the identified functions. The analysis of this study reveals that our proposed system performed better by successfully addressing three out of six identified limitations.
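
One simple way to operationalize dataset similarity, assuming each dataset is summarized by a feature matrix: compare mean feature vectors with cosine similarity and recommend the candidate model whose training data is closest. Both the measure and the `candidates` mapping below are assumptions for illustration, not necessarily the system's actual design.

```python
import numpy as np

def dataset_similarity(features_a, features_b):
    """Cosine similarity between the mean feature vectors of two datasets
    (an assumed, simple proxy for dataset similarity)."""
    mu_a, mu_b = features_a.mean(axis=0), features_b.mean(axis=0)
    return float(mu_a @ mu_b / (np.linalg.norm(mu_a) * np.linalg.norm(mu_b)))

def recommend_model(user_features, candidates):
    """candidates: hypothetical mapping {model_name: training-set feature matrix}.
    Returns the model whose training data looks most like the user's dataset."""
    scores = {name: dataset_similarity(user_features, feats)
              for name, feats in candidates.items()}
    return max(scores, key=scores.get), scores
```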

Poster link.


Creating a Bias-Free Dataset of Food Delivery App Reviews with Data Poisoning Attacks

Authors: Hyunmin Lee, SeungYoung Oh, JinHyun Han, Hyunggu Jung

Abstract: Artificial Intelligence (AI) models have created many benefits and achievements in our time. However, they also have the potential to cause unexpected consequences if the models are biased. One reason why AI models become biased is data poisoning attacks. Therefore, it is important for AI model developers to understand how biased their training data is when preparing a training dataset in order to develop fair AI models. While researchers have published several datasets for training purposes, existing studies have not taken into account the possibility of data poisoning attacks arising from bias in the dataset. To address this gap, we created and validated a dataset that reflects the possibility of bias in individual reviews of food delivery apps. This work contributes to the community of AI model developers who aim to create fair AI models by proposing a bias-free dataset of food delivery app reviews with data poisoning attacks as an example.

Poster link.


Co-creating a globally interpretable model with human input

Authors: Rahul Nair 

Abstract: We consider an aggregated human-AI collaboration aimed at generating a joint interpretable model. The model takes the form of Boolean decision rules, where human input is provided as logical conditions or partial templates. This focus on the combined construction of a model offers a different perspective on joint decision making. Previous efforts have typically focused on aggregating outcomes rather than decision logic. We demonstrate the proposed approach through two examples and highlight the usefulness and challenges of the approach.

Poster link.


An Interactive Human-Machine Learning Interface for Collecting and Learning from Complex Annotations

Authors: Jonathan Matthew Erskine, Raul Santos-Rodriguez, Alexander Hepburn, Matt Clifford 

Abstract: Human-Computer Interaction has been shown to lead to improvements in machine learning systems by boosting model performance, accelerating learning and building user confidence. In this work, we propose a human-machine learning interface for binary classification tasks with the goal of allowing humans to provide richer forms of supervision and feedback that go beyond standard binary labels as annotations for a dataset. We aim to reverse the expectation that human annotators adapt to the constraints imposed by labels, by allowing for extra flexibility in the form in which supervision information is collected. For this, we introduce the concept of task-oriented meta-evaluations and propose a prototype tool to efficiently capture human insights or knowledge about a task. Finally, we discuss the challenges facing future extensions of this work.

Poster link.

Video: https://drive.google.com/file/d/1drjwbFzTQGhCV_mzgujt9REIrq5fcJqc/view?usp=share_link 

Virtual Presentations:


Adaptive User-centered Neuro-symbolic Learning for Multimodal Interaction with Autonomous Systems

Authors: Amr Gomaa, Michael Feld

Abstract: Recent advances in machine learning, particularly deep learning, have enabled autonomous systems to perceive and comprehend objects and their environments in a perceptual subsymbolic manner. These systems can now perform object detection, sensor data fusion, and language understanding tasks. However, there is a growing need to enhance these systems to understand objects and their environments more conceptually and symbolically. It is essential to consider both the explicit teaching provided by humans (e.g., describing a situation or explaining how to act) and the implicit teaching obtained by observing human behavior (e.g., through the system’s sensors) to achieve this level of powerful artificial intelligence. Thus, the system must be designed with multimodal input and output capabilities to support implicit and explicit interaction models. In this position paper, we argue for considering both types of inputs, as well as human-in-the-loop and incremental learning techniques, for advancing the field of artificial intelligence and enabling autonomous systems to learn like humans. We propose several hypotheses and design guidelines and highlight a use case from related work to achieve this goal.

Poster link.

Video: https://youtu.be/_Bba6s_qkmA 


Crowdsourced Clustering via Active Querying: Practical Algorithm with Theoretical Guarantees

Authors: Yi Chen, Ramya Korlakai Vinayak, Babak Hassibi

Abstract: We propose a novel, practical, simple, and computationally efficient active querying algorithm for crowdsourced clustering that does not require knowledge of unknown problem parameters. We show that our algorithm succeeds in recovering the clusters when the crowdworkers provide answers with an error probability less than 1/2 and provide sample complexity bounds on the number of queries made by our algorithm to guarantee successful clustering. While the bounds depend on the error probabilities, the algorithm itself does not require this knowledge. In addition to the theoretical guarantees, we implement and deploy the proposed algorithm on a real crowdsourcing platform to characterize its performance in real-world settings.
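
To make the querying model concrete, the toy version below asks several noisy workers whether two items belong to the same cluster, takes a majority vote, and merges items with union-find. The actual algorithm is adaptive and comes with sample-complexity guarantees that this non-adaptive sketch does not reproduce; `ask_same_cluster` is a hypothetical oracle returning one worker's 0/1 answer.

```python
from collections import defaultdict

def crowd_cluster(items, ask_same_cluster, votes_per_pair=5):
    """Toy crowdsourced clustering: majority-vote repeated pairwise
    same-cluster queries and merge items with union-find."""
    parent = {i: i for i in items}

    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in items:
        for b in items:
            if a < b and find(a) != find(b):
                yes = sum(ask_same_cluster(a, b) for _ in range(votes_per_pair))
                if yes > votes_per_pair / 2:
                    parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in items:
        clusters[find(i)].append(i)
    return list(clusters.values())
```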

Poster link.

Video: https://youtu.be/vP_z8phL2K8


Black-Box Batch Active Learning for Regression

Authors: Andreas Kirsch

Abstract: Batch active learning is a popular approach for efficiently training machine learning models on large, initially unlabeled datasets, which repeatedly acquires labels for a batch of data points. However, many recent batch active learning methods are white-box approaches limited to differentiable parametric models: they score unlabeled points using acquisition functions based on model embeddings or first- and second-order derivatives. In this paper, we propose black-box batch active learning for regression tasks as an extension of white-box approaches: crucially, our method relies only on model predictions. This approach is compatible with a wide range of machine learning models, including regular and Bayesian deep learning models and non-differentiable models such as random forests. This allows us to extend a wide range of existing state-of-the-art white-box batch active learning methods (BADGE, BAIT, LCMD) to black-box models. We evaluate our approach through extensive experimental evaluations on regression datasets, achieving surprisingly strong performance compared to white-box approaches for deep learning models.
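
A minimal sketch of the black-box constraint: the acquisition score is computed purely from model predictions, here by training a bootstrap ensemble of an arbitrary regressor and ranking the unlabeled pool by predictive variance. This variance heuristic is a simplified stand-in for the paper's prediction-based extensions of BADGE, BAIT, and LCMD.

```python
import numpy as np
from sklearn.base import clone

def black_box_batch_select(base_model, X_train, y_train, X_pool,
                           batch_size=10, n_members=10, seed=0):
    """Rank pool points purely from model predictions (black-box): train a
    bootstrap ensemble of the base model and acquire the points with the
    highest predictive variance."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap resample
        member = clone(base_model).fit(X_train[idx], y_train[idx])
        preds.append(member.predict(X_pool))
    variance = np.var(np.stack(preds), axis=0)
    return np.argsort(-variance)[:batch_size]               # indices of the acquired batch
```

Because only `fit` and `predict` are called on the base model, the same code works for random forests, gradient-boosted trees, or a wrapper around a deep network.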

Video link to be added.


The corrupting influence of AI as a boss or counterparty

Authors: Hal Ashton*, Matija Franklin*

Abstract: In a recent article, Kobis et al. (2021) propose a framework identifying four primary roles in which Artificial Intelligence (AI) causes unethical or corrupt human behaviour, namely role model, delegate, partner, and advisor. In this article we propose two further roles: AI as boss and AI as counterparty. We argue that the AI boss exerts coercive power over its employees, whilst the different perceptual abilities of an AI counterparty provide an opportunity for humans to behave differently towards them than they would towards human analogues. Unethical behaviour towards the AI counterparty is rationalised because it is not human. In both roles, the human will typically not have any choice about their participation in the interaction.

Video link to be added.


LayerDiffusion: Layered Controlled Image Editing with Diffusion Models

Authors: Pengzhi Li, Qinxuan Huang, Yikang Ding, Zhiheng Li 

Abstract: Text-guided image editing has recently experienced rapid development. However, simultaneously performing multiple editing actions on a single image, such as background replacement and specific subject attribute changes, while maintaining consistency between the subject and the background remains challenging. In this paper, we propose LayerDiffusion, a semantic-based layered controlled image editing method. Our method enables non-rigid editing and attribute modification of specific subjects while preserving their unique characteristics and seamlessly integrating them into new backgrounds. We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy combined with layered diffusion training. During the diffusion process, an iterative guidance strategy is used to generate a final image that aligns with the textual description. Experimental results demonstrate the effectiveness of our method in generating highly coherent images that closely align with the given textual description. The edited images maintain a high similarity to the features of the input image and surpass the performance of current leading image editing methods. LayerDiffusion opens up new possibilities for controllable image editing.

Video: https://youtu.be/cFK0n6htzXo 


ConvGenVisMo: Evaluation of conversational generative vision models

Authors: Narjes Nikzad Khasmakhi, Meysam Asgari-chenaghlu, Nabiha Asghar, Philipp Schaer, Dietlind Zühlke 

Abstract: Conversational generative vision models (CGVMs) like Visual ChatGPT (Wu et al., 2023) have recently emerged from the synthesis of computer vision and natural language processing techniques. These models enable more natural and interactive communication between humans and machines, because they can understand verbal inputs from users and generate responses in natural language along with visual outputs. To make informed decisions about the usage and deployment of these models, it is important to analyze their performance through a suitable evaluation framework on realistic datasets. In this paper, we present ConvGenVisMo, a framework for the novel task of evaluating CGVMs. ConvGenVisMo introduces a new benchmark evaluation dataset for this task, and also provides a suite of existing and new automated evaluation metrics to evaluate the outputs. All ConvGenVisMo assets, including the dataset and the evaluation code, will be made available publicly on GitHub.

Video: https://www.youtube.com/watch?v=P6P551Y57Qw


Give Weight to Human Reactions: Optimizing Complementary AI in Practical Human-AI Teams

Authors: Syed Hasan Amin Mahmood*, Zhuoran Lu*, Ming Yin 

Abstract: With the rapid development of decision aids that are driven by AI models, the practice of human-AI joint decision making has become increasingly prevalent. To improve human-AI team performance in decision making, earlier studies mostly focus on enhancing humans' capability to better utilize a given AI-driven decision aid. In this paper, we tackle this challenge through a complementary approach: we aim to adjust the design of the AI model underlying the decision aid by taking humans' reactions to AI into consideration. In particular, as humans are observed to accept AI advice more when their confidence in their own decision is low, we propose to train AI models with a human-confidence-based instance weighting strategy, instead of solving the standard empirical risk minimization problem. Under an assumed, threshold-based model characterizing when humans will adopt the AI advice, we first derive the optimal instance weighting strategy for training AI models. We then validate the efficacy of our proposed method in improving human-AI joint decision making performance through systematic experimentation on both synthetic and real-world datasets.
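
The training change described above can be sketched as weighted empirical risk minimization: instances where the human's reported confidence is low get larger weights, since those are the cases in which the AI's advice is most likely to be adopted. The specific weights and threshold below are illustrative assumptions; the paper derives the optimal weighting under its threshold-based adoption model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_confidence_weighted_model(X, y, human_confidence, threshold=0.6):
    """Weighted ERM sketch: upweight training instances where the human's
    reported confidence is below `threshold`. The 1.0/0.1 weights and the
    threshold are illustrative assumptions, not the paper's derived optimum."""
    weights = np.where(human_confidence < threshold, 1.0, 0.1)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=weights)
    return model
```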

Video: https://drive.google.com/file/d/1Xsh8z64Z8_P-bKGljEGPSZ8yBT2aJchs/view?usp=sharing 


Unsupervised Learning of Distributional Properties can Supplement Human Labeling and Increase Active Learning Efficiency in Anomaly Detection

Authors: Jaturong Kongmanee, Mark Chignell, Khilan Jerath, Abhay Raman 

Abstract: Exfiltration of data via email is a serious cybersecurity threat for many organizations. Detecting data exfiltration (anomaly) patterns typically requires labeling, most often done by a human annotator, to reduce the high number of false alarms. Active Learning (AL) is a promising approach for labeling data efficiently, but it needs to choose an efficient order in which cases are to be labeled, and there are uncertainties as to what scoring procedure should be used to prioritize cases for labeling, especially when detecting rare cases of interest is crucial. We propose an adaptive AL sampling strategy that leverages the underlying prior data distribution, as well as model uncertainty, to produce batches of cases to be labeled that contain instances of rare anomalies. We show that (1) the classifier benefits from a batch of representative and informative instances of both normal and anomalous examples, and (2) unsupervised anomaly detection plays a useful role in building the classifier in the early stages of training, when relatively little labeling has been done. Our approach to AL for anomaly detection outperformed existing AL approaches on three highly unbalanced UCI benchmarks and on one real-world redacted email dataset.
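
One way the blending could look in code, assuming an isolation forest supplies the unsupervised anomaly score and least-confidence supplies the model uncertainty; the linear schedule that shifts trust from the anomaly score to uncertainty as labels accumulate is an assumption, not the paper's exact strategy.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def select_batch(X_pool, clf, n_labeled, batch_size=20, ramp=200):
    """Blend an unsupervised anomaly score with classifier uncertainty:
    rely on the anomaly score while few labels exist, then shift toward
    uncertainty as labeling progresses."""
    iso = IsolationForest(random_state=0).fit(X_pool)
    anomaly = -iso.score_samples(X_pool)                       # higher = more anomalous
    anomaly = (anomaly - anomaly.min()) / (anomaly.max() - anomaly.min() + 1e-9)
    uncertainty = 1.0 - clf.predict_proba(X_pool).max(axis=1)  # least-confidence score
    w = min(1.0, n_labeled / ramp)                             # trust uncertainty more over time
    score = (1 - w) * anomaly + w * uncertainty
    return np.argsort(-score)[:batch_size]
```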

Video link to be added.


Exploring Open Domain Image Super-Resolution through Text

Authors: Kanchana Vaishnavi Gandikota*, Paramanand Chandramouli*

Abstract: In this work, we propose for the first time a zero-shot approach for flexible open-domain extreme super-resolution of images which allows users to interactively explore plausible solutions by using language prompts. Our approach exploits a recent diffusion-based text-to-image (T2I) generative model. We modify the generative process of the T2I diffusion model to analytically enforce data consistency of the solution and explore diverse contents of the null-space using text guidance. Our approach results in diverse solutions which are simultaneously consistent with the input text and the low-resolution images.
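
The data-consistency step the abstract refers to can be written as a range-/null-space decomposition: reconstruct the component pinned down by the low-resolution observation with the pseudo-inverse of the degradation operator, and take only the null-space component from the text-guided generator's sample. A minimal NumPy sketch, assuming the degradation is average pooling by a factor s (so its pseudo-inverse is nearest-neighbour replication) and that the projection is applied to intermediate estimates during diffusion sampling:

```python
import numpy as np

def block_downsample(x, s):
    """Degradation operator A: average-pool a single-channel image by factor s
    (assumes the image dimensions are divisible by s)."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def upsample(y, s):
    """Pseudo-inverse of A for average pooling: nearest-neighbour replication."""
    return np.kron(y, np.ones((s, s)))

def enforce_data_consistency(x_gen, y_low, s):
    """Keep the range-space component determined by the low-resolution image
    and only the null-space component from the generator's sample, so that
    downsampling the result reproduces y_low exactly."""
    return upsample(y_low, s) + x_gen - upsample(block_downsample(x_gen, s), s)
```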

Paper link.

Poster link.

Video: https://drive.google.com/file/d/1Il7Agt2ke-8GBE1MtABSNWy1kPTfXLOK/view?usp=sharing