The 1st Workshop on Robust Machine Learning for Distribution Shifts (RobustMLDS’24)
Held in conjunction with IEEE BigData 2024
Dec 17th, 2024, Washington D.C., USA
Introduction
Despite well-documented success in numerous applications across academia and industry, a visual recognition system trained with modern machine learning techniques can fail catastrophically when presented with out-of-distribution data at inference time. Such failures are particularly pernicious in safety-critical applications. They reveal that state-of-the-art methods are vulnerable to variation between source and target domains caused by distribution shifts such as covariate shift, label shift, and concept shift. These shifts characterize the extent to which the marginal distributions over data samples and their labels, as well as the sample-conditional distributions over labels, differ between training and testing domains. Addressing these domain challenges in real-world applications therefore requires bringing together experts and researchers to discuss how to develop practical and trustworthy models. This workshop provides a platform for researchers and practitioners from different backgrounds to exchange ideas on opportunities, cutting-edge techniques, and future directions for tackling domain challenges.
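For concreteness, these three shifts admit compact formal definitions. The following is a standard formulation from the literature (not tied to any particular paper at this workshop), writing p_S for the source (training) distribution and p_T for the target (testing) distribution over samples x and labels y:

```latex
\begin{align*}
\text{Covariate shift:} \quad & p_S(x) \neq p_T(x), && \text{while } p_S(y \mid x) = p_T(y \mid x), \\
\text{Label shift:}     \quad & p_S(y) \neq p_T(y), && \text{while } p_S(x \mid y) = p_T(x \mid y), \\
\text{Concept shift:}   \quad & p_S(y \mid x) \neq p_T(y \mid x).
\end{align*}
```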
Call for Papers
Important Dates:
Following are the important dates for the workshop. All deadlines are at 11:59 pm U.S. Eastern Standard Time.
Paper submission: November 16, 2024 (extended from October 1)
Notification of decision: November 19, 2024 (extended from November 4)
Camera-ready due: November 23, 2024 (extended from November 17)
Topics of Interest:
We encourage submissions at various stages of progress, such as new results, visions, techniques, innovative application papers, and progress reports, on topics that include, but are not limited to, the following broad categories:
Domain adaptation and generalization
Distributionally robust optimization
Causal inference
Transfer learning
Data augmentation and generalization
Disentangled representation learning
Invariant learning
Out-of-distribution detection, novelty detection, anomaly detection
Open set recognition
Uncertainty quantification
We also welcome submissions on the particular challenges and applications of domain shifts in (but not limited to) the following areas:
Time series data
Computer vision (e.g., domain generalization in biomedical imaging)
Natural language processing (e.g., cross-lingual learning)
Reinforcement learning (e.g., environment generalization, offline reinforcement learning)
Large language models (LLMs)
Submission Guidelines:
Submissions are limited to a total of 5 pages, including all content and references. There will be no page limit for supplemental materials. All submissions must be in PDF format and formatted to IEEE Computer Society Proceedings Manuscript Formatting Guidelines (two-column format).
Template guidelines are here: https://www.ieee.org/conferences/publishing/templates.html.
Following the IEEE BigData conference submission policy, reviews are single-blind. Submitted papers will be assessed based on their novelty, technical quality, potential impact, and clarity of writing. For papers that rely heavily on empirical evaluations, the experimental methods and results should be clear, well-executed, and repeatable. Authors are strongly encouraged to make data and code publicly available whenever possible. The accepted papers will be posted on the workshop website but will not be included in the IEEE BigData proceedings.
Submit your papers through the online submission website.
Upon notification, we ask that authors of accepted works make any final changes and then submit a camera-ready version to the submission site. The workshop website will then be updated with links to accepted papers. As noted above, accepted works will be posted on the workshop website rather than included in the formal IEEE BigData proceedings.
Any questions regarding submissions can be directed to chen_zhao@baylor.edu.
Accepted Papers
Automated Synthesis of Distributed Code from Sequential Snippets Using Deep Learning
Arun Sanjel, Bikram Khanal, Pablo Rivas, Greg Speegle
Abstract: Processing big data poses a significant challenge when transitioning from sequential to distributed code, primarily due to the extensive scale at which data is handled. This complexity arises from both syntactic and semantic differences between the two kinds of code. Current methods are inefficient, and more effective automated solutions are needed. To address this problem, we utilized the Transformer-based BERT model because of its exceptional capability to understand and capture deep contextual relationships in large datasets. Our method involved creating two comprehensive datasets containing 10k and 100k sequential code snippets paired with their corresponding PySpark API calls. We fine-tuned the BERT model to predict distributed API calls for previously unseen sequential snippets. On the 10k dataset, the model demonstrated robustness, achieving a training accuracy of 99.36%, a test accuracy of 99.7%, a balanced accuracy of 99.78%, and an F1 score of 0.997. The 100k dataset metrics were similarly strong, with a training accuracy of 89.48%, a test accuracy of 99.99%, a balanced accuracy of 99.98%, and an F1 score of 0.999. Our work demonstrates the feasibility of automating the sequential-to-distributed code transition with notable precision.
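To make the described setup concrete, below is a minimal sketch of fine-tuning a BERT classifier to map code snippets to API-call labels using the HuggingFace transformers library. The label set, training pairs, and hyperparameters are hypothetical stand-ins, not the authors' actual data or pipeline.

```python
# Minimal sketch: fine-tune BERT to map a sequential code snippet to a
# distributed (PySpark) API label. Labels and data here are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["rdd.map", "rdd.filter", "rdd.reduceByKey"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

# Hypothetical training pairs: (sequential snippet, index of PySpark API).
pairs = [("for x in data: out.append(f(x))", 0),
         ("result = [x for x in data if pred(x)]", 1)]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for snippet, label in pairs:  # one pass over the toy data
    batch = tokenizer(snippet, return_tensors="pt", truncation=True)
    out = model(**batch, labels=torch.tensor([label]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference on an unseen snippet.
model.eval()
with torch.no_grad():
    enc = tokenizer("total = sum(x * 2 for x in data)", return_tensors="pt")
    pred = model(**enc).logits.argmax(dim=-1).item()
print(LABELS[pred])
```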
Assessing Membership Inference Attacks under Distribution Shifts
Yichuan Shi, Viktor Reshniak, Olivera Kotevska, Amir Sadovnik
Abstract: Membership inference attacks (MIAs) present a serious privacy threat to machine learning models by inferring whether a data point was part of the model’s training set, even with limited black-box access. State-of-the-art MIAs typically depend on the attacker accurately approximating the target model’s training distribution. This study explores how distributional shifts between the target model and the attacker’s model affect the success of MIAs. By evaluating five types of distribution shifts at varying intensities, we reveal that these shifts do not uniformly impact MIA effectiveness, highlighting the nuanced relationship between distributional differences and attack success.
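For background, the simplest family of MIAs thresholds the target model's per-example loss: points seen during training tend to incur lower loss. The sketch below illustrates that generic baseline with scikit-learn on synthetic data; it is an illustration of the attack class, not this paper's evaluation protocol.

```python
# Minimal loss-threshold membership inference baseline: points with unusually
# low loss under the target model are guessed to be training members.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1000], y[:1000]   # members
X_out, y_out = X[1000:], y[1000:]       # non-members

target = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def per_example_loss(model, X, y):
    # Cross-entropy of the true label under the model's predicted probability.
    probs = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(probs, 1e-12, None))

# Higher score (lower loss) => more likely a member.
scores = np.concatenate([-per_example_loss(target, X_train, y_train),
                         -per_example_loss(target, X_out, y_out)])
membership = np.concatenate([np.ones(1000), np.zeros(1000)])

# AUC above 0.5 indicates the loss leaks membership information.
print("attack AUC:", roc_auc_score(membership, scores))
```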
Clustering of Students' Behavior Using the GPT-2 Module Based on the Macroscopic Attention Model
Wanghu Chen, Qi Fan, Siqi Zeng, Jing Li
Abstract: A reasonable group classification of students is beneficial for the process management of university students on campus. To this end, a Macroscopic Attention (MA) model is developed to characterize individual students. Existing clustering methods find it difficult to account for both the temporal characteristics of the data and the relationships between different dimensions. Consequently, a time series clustering method, GPTK, is proposed to optimize the clustering effect by capturing the complex temporal dependencies and interactions between the MA quality features in the time series data through the GPT-2 module. Experiments indicate that the method has a clear advantage over other clustering methods such as AE-K and CNNK in terms of the Silhouette Coefficient, Davies-Bouldin Index, and Calinski-Harabasz Index. Analyzing the resulting student group classifications, we found that the probability of students in one group receiving scholarships is significantly higher than that of students in another group, indicating a significant gap in academic performance between the two groups. Based on this research, early intervention and guidance can be provided to students in a timely manner.
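GPTK is the paper's own method; as a rough, hypothetical illustration of the general recipe (encoding per-step feature vectors with a GPT-2 backbone and clustering the pooled representations), the sketch below projects made-up feature sequences into GPT-2's hidden size via inputs_embeds, mean-pools, and applies k-means with a silhouette check. All dimensions and data are invented.

```python
# Rough sketch of GPT-2-based time-series clustering: project per-step feature
# vectors into the GPT-2 hidden size, encode, pool, then cluster. Hypothetical.
import torch
from transformers import GPT2Model
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

n_students, seq_len, n_feats = 64, 30, 8      # made-up dimensions
series = torch.randn(n_students, seq_len, n_feats)

gpt2 = GPT2Model.from_pretrained("gpt2")
proj = torch.nn.Linear(n_feats, gpt2.config.n_embd)  # features -> hidden size

with torch.no_grad():
    hidden = gpt2(inputs_embeds=proj(series)).last_hidden_state
    reps = hidden.mean(dim=1).numpy()          # one vector per student

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reps)
print("silhouette:", silhouette_score(reps, labels))
```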
A Streamlining Deployment of Machine Learning Models with Docker and Jenkins: An In-depth Analysis of Containerization Strategies
Grandhi Guna Sai Hari Krishna Grandhi, Vinai Kumar Mandala, Akhil Bhagyesh Tunuguntla, Manjunadh Challa
Abstract: Containerization can now simplify the deployment of machine learning (ML) models; this leading-edge technology provides flexible and efficient solutions in computing systems. To optimize the deployment pipeline, this article focuses on integrating Docker and Jenkins for a thorough study of containerizing machine learning activities. It first emphasizes the importance of containerization in today's ML deployment environment, focusing on how it improves reproducibility, scalability, and resource efficiency. It then presents Docker as a leading containerization technology, outlining its main characteristics and the advantages of encapsulating dependencies and ML environments. The essential features of a Jenkins-managed containerized ML pipeline are covered, including deployment techniques, automated testing, and version-control integration, along with the benefits of packaging ML models, libraries, and dependencies in Docker containers, which enables smooth deployment across development, test, and production environments. In addition, the paper draws on research findings and industry examples to establish practical use cases and best practices for integrating a Dockerized ML pipeline with Jenkins, addressing scalability, performance, and security concerns in containerized machine learning installations and offering practical tips for mitigating risks. Jenkins automation has had a revolutionary impact by accelerating containerization in the deployment of machine learning models. By adopting these technologies and techniques, organizations can gain agility, reliability, and scalability in their ML development and deployment processes, enabling data scientists and engineers to deliver value-added solutions in a competitive and agile environment.
Workshop Schedule
The workshop will be held online via the link provided below on December 17, 2024, at 8:00 am, U.S. Eastern Time.
https://baylor.zoom.us/j/2399260431?pwd=djEwUTRzOHFJVFJoK2VZRWdrZC9lQT09
Meeting ID: 239 926 0431, Passcode: 388007
8:00 am - 8:10 am
Opening Remarks
8:10 am - 8:55 am
Keynote Talk 1: Neural Consumer Choice Modeling
Kunpeng Zhang, University of Maryland, College Park
Abstract: Consumer choice modeling is a fundamental task in business research. The objective of choice models is to understand the mechanisms driving consumers’ purchase decisions and to forecast future choice behavior. Conventional choice models like the hierarchical multinomial logit model are widely used due to their simplicity, micro-foundations, and interpretability. However, these models are constrained by restrictive parametric assumptions, limiting their capability to capture complex temporal dynamics and non-linearity. In contrast, purely machine learning models, such as neural networks, exhibit superior predictive performance but lack micro-foundations and interpretability, thereby preventing them from deriving deeper business insights beyond mere predictions. We propose a neural consumer choice modeling framework that integrates the strengths of deep learning with the micro-foundations of consumer choice theories. Through extensive simulation studies, we demonstrate that this approach more effectively captures the temporal dynamics of parameters of interest (such as brand preference and price sensitivity) while maintaining accurate predictions of consumer choices. We further illustrate the practical value of our proposed method through a downstream task – price optimization. Additionally, the prediction performance of our framework is validated using real-world transaction data. This theory-driven, deep learning-based framework represents a methodological advancement by providing a more accurate and interpretable understanding of consumer choices.
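For readers unfamiliar with the conventional baseline the talk contrasts against, a multinomial logit model assigns each product a utility that is linear in its attributes and converts utilities into choice probabilities with a softmax. A toy illustration with made-up brand and price coefficients:

```python
# Minimal multinomial logit (MNL) choice model: utility is linear in product
# attributes; choice probabilities are a softmax over utilities. Toy numbers.
import numpy as np

brand_pref = np.array([1.2, 0.4, 0.0])   # hypothetical brand intercepts
price_sens = -0.8                        # hypothetical price coefficient
prices = np.array([3.0, 2.5, 2.0])

utility = brand_pref + price_sens * prices
probs = np.exp(utility) / np.exp(utility).sum()
print("choice probabilities:", probs.round(3))
```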
8:55 am - 9:40 am
Keynote Talk 2: Towards Continual Learning on Graphs
Dongjin Song, University of Connecticut
Abstract: Over the past years, deep learning on graphs has made significant progress in various areas, e.g., e-commerce, social networks, and healthcare. However, most existing graph learning tasks assume graphs are static, while real-world graphs may constantly grow or evolve. Therefore, it is crucial to study how to constantly adapt a graph learning model to new patterns/tasks over graphs without forgetting the previously learned knowledge. To this end, in this talk, I will introduce the newly emerging area of continual graph learning (CGL). Specifically, I will (1) introduce different continual graph learning settings and key challenges in the context of e-commerce/social networks, (2) present a general framework, i.e., Parameter Decoupled Graph Neural Networks (PDGNNs) with Topology-aware Embedding Memory (TEM), to perform continual learning over growing graphs, and (3) develop a Structural Knowledge Informed Continual Learning (SKI-CL) framework to perform multivariate time series forecasting under the continual learning setting, which leverages the structural knowledge to characterize the dynamic variable dependencies within each regime.
9:40 am - 10:25 am
Keynote Talk 3: Fortifying Federated Learning: Privacy Preservation and Resilience Against Poisoning Attacks
Runhua Xu, Beihang University
Abstract: Federated Learning (FL) has emerged as a revolutionary distributed machine learning paradigm that enables multiple parties to collaboratively train models without sharing raw data. However, this framework faces significant security and privacy challenges. This talk delves into two critical aspects of federated learning: privacy leakage risks and susceptibility to adversarial attacks, such as model poisoning. We will explore state-of-the-art privacy preservation mechanisms, along with defense strategies against model poisoning attacks launched by malicious participants. We will demonstrate mechanisms that can ensure model robustness and reliability while protecting privacy.
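As background on poisoning defenses of the kind the talk surveys, one classic idea is to replace FedAvg's plain mean, which a single malicious update can skew arbitrarily, with a robust aggregate such as the coordinate-wise median. A toy illustration, not the speaker's specific mechanism:

```python
# Coordinate-wise median aggregation: a classic robust alternative to FedAvg's
# mean, limiting the influence of poisoned client updates. Toy example.
import numpy as np

honest = [np.array([0.9, 1.1, 1.0]),
          np.array([1.0, 0.9, 1.1]),
          np.array([1.1, 1.0, 0.9])]
poisoned = np.array([100.0, -100.0, 100.0])  # one malicious client's update

updates = np.stack(honest + [poisoned])
print("mean   :", updates.mean(axis=0))        # dragged far off by the attacker
print("median :", np.median(updates, axis=0))  # stays near the honest updates
```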
10:25 am - 11:25 am
Accepted Paper Talks
Automated Synthesis of Distributed Code from Sequential Snippets Using Deep Learning
Arun Sanjel, Bikram Khanal, Pablo Rivas, Greg Speegle
Assessing Membership Inference Attacks under Distribution Shifts
Yichuan Shi, Viktor Reshniak, Olivera Kotevska, Amir Sadovnik
Clustering of Students' Behavior Using the GPT-2 Module Based on the Macroscopic Attention Model
Wanghu Chen, Qi Fan, Siqi Zeng, Jing Li
A Streamlining Deployment of Machine Learning Models with Docker and Jenkins: An In-depth Analysis of Containerization Strategies
Grandhi Guna Sai Hari Krishna Grandhi, Vinai Kumar Mandala, Akhil Bhagyesh Tunuguntla, Manjunadh Challa
11:25 am - 12:10 pm
Keynote Talk 4: Uncertainty Quantification and Reasoning for Large Language Models
Xujiang Zhao, NEC Laboratories America
Abstract: The rapid advancement of Large Language Models (LLMs) has greatly expanded their applications across diverse domains. However, managing and reasoning about uncertainty remains a crucial challenge for ensuring their reliability, interpretability, and robustness. This keynote, titled Uncertainty Quantification and Reasoning for Large Language Models, explores innovative methodologies to tackle these challenges. It focuses on two key aspects: Uncertainty Quantification and Decomposition for In-Context Learning of Large Language Models and Uncertainty Propagation for LLM Agents.
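As general background, a common recipe for quantifying and decomposing predictive uncertainty is to collect several sampled predictive distributions (e.g., from repeated generations) and split total entropy into an aleatoric part (expected entropy) and an epistemic part (their difference, the mutual information). A generic numpy sketch, not the speaker's exact method:

```python
# Entropy-based uncertainty decomposition over an ensemble of predictive
# distributions: total = aleatoric (expected entropy) + epistemic (mutual info).
import numpy as np

def entropy(p):
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)

# Hypothetical: 5 sampled predictive distributions over 3 answer options.
samples = np.array([[0.7, 0.2, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.2, 0.7, 0.1],
                    [0.8, 0.1, 0.1],
                    [0.3, 0.6, 0.1]])

total = entropy(samples.mean(axis=0))   # entropy of the averaged prediction
aleatoric = entropy(samples).mean()     # average entropy of each prediction
epistemic = total - aleatoric           # disagreement across samples
print(f"total={total:.3f} aleatoric={aleatoric:.3f} epistemic={epistemic:.3f}")
```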
12:10 pm - 12:55 pm
Keynote Talk 5: Robust LLM Driven Software Applications
Tarik Borogovac, Amazon Web Services
Abstract: Our group at AWS builds software systems and applications used by knowledge workers who are subject matter experts in managing complex sales accounts and relationships. The large language model (LLM), as a building block, represents a revolution in building software. Its emergent and surprising abilities allow it to do jobs inside applications that used to only be done by human operators. The LLM can apply its learned knowledge of the world, combine information from multiple variable and unstructured data sources, perform research, generate content, give feedback, follow directions, use tools and make decisions. However, the LLM also poses tremendous challenges for making the application robust and reliable. This talk will give examples of several of our experiences, with specific challenges, and how we overcame them.
12:55 pm - 1:00 pm
Closing
Invited Speakers
Kunpeng Zhang, University of Maryland, College Park
Short bio: Dr. Kunpeng 'KZ' Zhang is an Associate Professor of Information Systems at the University of Maryland, College Park. He received his PhD in computer science from Northwestern University. His research focuses on developing machine learning algorithms to analyze unstructured data for business decision support. In particular, he is interested in LLM-based representation learning for video understanding, network analysis, and financial document mining. He works closely with industry partners and is currently a research collaborator at Meta.
Dongjin Song, University of Connecticut
Short bio: Dr. Dongjin Song has been an Assistant Professor in the School of Computing at the University of Connecticut since Fall 2020. Previously, he was a Research Staff Member at NEC Labs America in Princeton, NJ. He earned his Ph.D. in Electrical and Computer Engineering (ECE) from the University of California, San Diego (UCSD) in 2016. His research interests include machine learning, data science, and their applications in time series data analysis and graph representation learning. His work has been published in top-tier data science and artificial intelligence venues, including NeurIPS, ICML, ICLR, KDD, ICDM, SDM, AAAI, IJCAI, CVPR, and ICCV. Three of his papers have been recognized as the most influential papers by paperdigest.org. He serves as an Associate Editor for Pattern Recognition and Neurocomputing, and has contributed as an Area Chair or Senior Program Committee Member for conferences such as AAAI, IJCAI, ICDM, and CIKM. He has also co-organized the AI for Time Series (AI4TS) Workshop at IJCAI, AAAI, ICDM, and SDM, as well as the MiLeTS workshops at KDD. He received the prestigious NSF CAREER Award and the Frontiers of Science Award (FSA) in 2024.
Runhua Xu, Beihang University
Short bio: Dr. Runhua Xu is a professor at the School of Computer Science and Engineering at Beihang University. Previously, he was a Research Staff Member at IBM Research - Almaden Lab. Dr. Xu earned his Ph.D. from the University of Pittsburgh. His research focuses on enhancing privacy and trustworthiness in various computing domains, specializing in AI security and privacy solutions. He received the ACM CCS 2023 Outstanding Paper Award and the IEEE CLOUD 2022 Best Paper Award. Additionally, he serves as an associate editor on the youth editorial board of the Chinese Journal of Electronics and as a guest editor for IET Blockchain.
Xujiang Zhao, NEC Laboratories America
Short bio: Dr. Xujiang Zhao is a research staff member at NEC Laboratories America. He received his Ph.D. from the Computer Science Department at The University of Texas at Dallas in 2022. Dr. Zhao has published his work in top-tier machine learning and data mining conferences, including NeurIPS, AAAI, ICDM, and EMNLP. He has also served on technical program committees for several high-impact venues, such as ICML, NeurIPS, ICLR, KDD, and AAAI.
Tarik Borogovac, Amazon Web Services
Short bio: Tarik Borogovac is a Science Manager at Amazon Web Services, where he works on developing AI-enabled software products, with a recent focus on using large language models (LLMs) as building blocks to perform various tasks within AWS systems and applications. Previously, he worked on AI and ML in the areas of web application security (HAProxy Technologies) and energy (General Electric, FirstFuel Software). Tarik holds a PhD in Systems Engineering from Boston University.
Organizers
Baylor University
NEC Laboratories America
Fudan University
Tianjin University
Amazon
Publicity Chair
NEC Laboratories America
Program Committee (Reviewers)
Sarat Chandra (GE Aerospace)
Shaif Chowdhury (Baylor University)
Saswata Paul (GE Aerospace)
Linlin Yu (University of Texas at Dallas)