ICLR 2024 Workshop on Navigating and Addressing

Data Problems for Foundation Models

(DPFM)

May 11, 2024, Saturday https://iclr.cc/virtual/2024/workshop/20585

Messe Wien Exhibition Congress Center, Vienna, Austria + Zoom | OpenReview

[Updated May-3: Check out the ICLR Event Page for the up-to-date program and schedule 🔗 ]

[Updated May-9: Check out our Workshop Schedule page 📅⏰ ]

[Room: Stolz 0 🍪☕] Looking forward!

Cover image generated with DALL·E. Prompt: "This creative portrayal shows an artist painting a canvas with streams of data, symbolizing a data scientist transforming raw data into meaningful patterns. The background is a modern laboratory, blending art, science, and technology in machine learning."

OVERVIEW

Foundation Models (FMs, e.g., GPT-3/4, LLaMA, DALL-E, Stable Diffusion, etc.) have demonstrated unprecedented performance across a wide range of downstream tasks. As researchers strive to keep pace with this rapid evolution and to understand the capabilities, limitations, and implications of FMs, attention is now shifting to the emerging notion of data-centric AI.

Curation of training data is crucially important for the performance and reliability of FMs, and a wealth of recent work demonstrates that data-perspective research offers a promising direction for addressing critical issues such as safety, alignment, efficiency, security, privacy, interpretability, etc.

To move forward, this workshop aims to discuss and build a better understanding of the new paradigm for research on data problems for foundation models.

We look forward to meeting communities and researchers working on data problems (e.g., data-centric AI, dataset/data curation, data market) and foundation models (alignment, safety/trustworthiness, fairness/ethics), practitioners of downstream applications, tech companies providing innovative solutions, and beyond! We strive to build a community around this essential topic and to provide a platform to connect, share ideas, seek consensus, and create collaboration opportunities.

Our technical agenda is composed of four modules.

TECHNICAL AGENDA

[Module A] Data Quality, Dataset Curation, and Data Generation

A. Data Quality, Dataset Curation, and Data Generation–Recent Achievements and Current Efforts
- [Data Quality] How can data quality be quantified in the context of FMs, and how can a good subset be selected?
  - What are the aspects to consider and what are the quantitative metrics?
  - How can this be performed given the scale and nature of data for FMs?
- [Data Influence] How can the influence of data be modeled throughout the lifecycle of FMs (pre-training, fine-tuning, deployment, etc.)? What is the impact of each stage, and how do data influences interact?
  - What are the data-perspective efforts in adapting FMs to target tasks/scenarios for deployment? Of particular interest are data curation/labeling for fine-tuning, the role of in-context learning, and adaptation in dynamic environments.
- [Data Generation] The capabilities of foundation models bring data generation to a new level. How can the generation process be controlled to produce high-quality or task-relevant data, and what can such data be used for?
  - Can generated data be used directly to train a model in the same way as natural data? What problems might this cause?
  - How can it help with alignment, improving fairness/safety, and/or adaptation in low-resource scenarios where labeled data is scarce?
- [Data Quality] [Scalability] For FMs pre-trained on massive and broad data, how should data acquisition/composition/quality/resource efficiency be considered at scale?
  - The resource requirements for training and deploying FMs may be out of reach for many. What does this mean for researchers and practitioners working on data problems, and how can they adapt to the new norm?
  - [Scaling Laws] Scaling laws help in many scenarios and enable research on large models with a small budget, but with the complication that some capabilities unique to very large models, such as chain-of-thought reasoning, emerge only at a sufficiently large scale. What complications will this cause?

[Module B] A Data Perspective to Efficiency, Interpretability, and Alignment

B. A Data Perspective to Efficiency, Interpretability, and Alignment–Latest Advancement and Breakthroughs
- [Data Efficiency] Beyond their impressive capability and generalizability, foundation models have reached unprecedented scales in model size and training data, making efficiency problems more pronounced in every respect.
  - This includes data efficiency in model training, such as the typically resource-intensive pre-training, or fine-tuning at deployment, which often has limited labeled data.
  - [Attribution at Scale] [Interpretability] It also includes the efficiency of inference methods for interpretability, explainability, fact tracing, etc. How do existing approaches scale up to foundation models, and what are the new solutions to these problems?
- [Data and Alignment] Alignment is one of the most active topics for FMs. How can data-perspective research best contribute to important issues such as data attribution/interpretability, harmlessness/truthfulness, AI safety (fake/harmful content), etc.?

[Module C] A Data Perspective to Safety and Ethics–Risks, Limitations, and Opportunities

- [Safety and Trustworthiness] The unprecedented size and capability of FMs pose unprecedented challenges for safety and trustworthiness (e.g., jailbreaking, security loopholes, harmful content, misuse, privacy violations, etc.). What are the risks and current limitations?
  - [Data and Safety] [Data and Ethics] How can data-perspective research help, and which issues benefit most from it?
- [Evaluation Techniques] How does data-perspective research contribute to the evaluation of FMs (e.g., fairness/ethics, defects of FMs/failure cases)?
  - [Data and Evaluation] How can these issues be addressed from the data side?

[Module D] Copyright, Legal Issues, and Data Economy–A Broader Landscape

- [Data Copyright] [Legal Challenges and Practical Risks] Copyright issues and privacy concerns are the sword of Damocles for the deployment of FMs. What are the current risks and limitations?
  - [Data Research and Technical, Economic, and Governance Solutions] How can data-perspective research contribute technical, economic, and governance solutions to this topic?
- [Data Economy] [Data Acquisition] What is the perspective of data economy? What are the potential market solutions for the acquisition of data?
  - [Data Valuation] [Data Exchange] What are the research opportunities for data problems associated with the data economy, such as quantifying the value of data and designing schemes for data exchange?