Data4SoftSec
The 1st Workshop on Datasets for Software Security
Datasets are fundamental to the development, evaluation, and benchmarking of techniques that drive progress across almost all domains of software security research. Despite ongoing open-science efforts to promote dataset sharing, software security researchers still encounter substantial obstacles in constructing, releasing, and reusing such datasets. The unique characteristics of security-related data, such as sensitivity, confidentiality, and ethical constraints, pose challenges to data collection, sharing, and reproducibility. Moreover, the scarcity and limited scale of available datasets hinder their use in data-driven techniques, particularly for training and testing AI-based security applications. The sustainability of many existing datasets also remains a critical concern, as they tend to become outdated rapidly as threats, vulnerabilities, and attacks evolve.
Data4SoftSec aims to bring together researchers from academia, industry, and government to explore new directions for improving software security research datasets. The workshop will provide a venue to report novel methodologies to enhance data quality for security research tasks, exchange experiences and lessons learned from constructing and using security research datasets, discuss best practices for long-term dataset maintenance and reproducibility, and identify new opportunities for advancing data-driven software security research.
Submission Site
https://hotcrp.data4softsec26.ieee-security.org
Important Dates (AoE)
Paper Submission Due: Feb 27, 2026 (extended from Feb 13, 2026)
Acceptance Notification: Mar 20, 2026
Publication-ready Submission Due: Apr 1, 2026
Workshop Date: May 21, 2026
Topics of Interest
We invite researchers from academia, industry, and government to submit papers introducing novel methods, insights, or tools for improving software security research datasets. Relevant topics include, but are not limited to:
Software Security Dataset Construction and Quality Enhancement
Novel methodologies for collecting and labeling high-quality software security datasets
Approaches to addressing data imbalance, bias, and noise, or to ensuring data accuracy, completeness, and representativeness
Synthetic or simulated data generation for data-scarce software security tasks using data augmentation, generative models, etc.
Dataset Sharing, Sustainability, and Ethics
Tools, frameworks, or infrastructures supporting long-term dataset maintenance, reproducibility, and accessibility
Strategies for dataset anonymization, sanitization, and ethical release
Ethical and responsible use of security datasets in research and industry
Data-Driven Software Security Techniques
Applications of AI/LLMs in software security supported by new or existing datasets
Evaluation benchmarks for assessing AI-based models on software security-related tasks
Cross-domain, transfer learning, and multimodal approaches leveraging heterogeneous datasets
Empirical Studies and Experience Reports
Empirical evaluations of dataset usability, reliability, and impact on existing research outcomes
Case studies and best practices from dataset creation, curation, maintenance, or deployment
Lessons learned from the public release of widely used software security datasets
Dataset needs and challenges in emerging software security areas (e.g., open-source software security, AI-assisted software security testing)
This list is not exhaustive. Topics in less closely related areas are also welcome if a clear connection to enhancing datasets for software security research is demonstrated.
In particular, we encourage submissions related to the core theme of this first edition: Enhancing Datasets for Software Vulnerability Detection and Patching.
Submission Guidelines
Regular Papers (up to 10 pages): We invite submissions of original research papers that have not been previously published and are not under review elsewhere. Regular research-track papers should be 6 to 10 pages, excluding bibliography and appendices, anonymized, and formatted in the standard two-column IEEE proceedings style. Submissions are expected to make a clear research contribution by addressing an important problem, proposing a compelling solution, and presenting an experimental evaluation of the relevant techniques or a discussion of their validity and real-world applicability.
Practical Experience or Tool Demonstration Papers (up to 6 pages): We invite submissions presenting practitioner experiences or empirical analyses of field data and the use of tools to address dataset-related challenges. Papers in this category should provide new insights or lessons that inform the research and practice of robust dataset construction and management. Submissions should be no longer than 6 pages (with a minimum of 2 pages), anonymized, and formatted in the standard two-column IEEE proceedings style.
Review Process
The review process will be double-blind, with each submission reviewed by at least three committee members or external reviewers with relevant expertise. Acceptance and rejection decisions will follow discussions moderated by the program chairs. Promising papers requiring minor revisions may be conditionally accepted and shepherded, with reviewers providing a list of items to be addressed in the final version. The TPC will also select 1–3 Distinguished Paper Awards based on the reviews and discussions.
Presentation Format
One author of each accepted paper is expected to present the work at the workshop. Each paper will be presented in a traditional conference-style format, and, depending on topic and content, selected papers will be followed by roundtable discussions to facilitate interactive feedback and brainstorming. All accepted papers will be treated with equal importance regardless of presentation format. After acceptance notifications, authors will receive additional details regarding the presentation schedule, speaking times, and other logistics.
Organizing Committee
Kun Sun, George Mason University
Baishakhi Ray, Columbia University
Xinda Wang, University of Texas at Dallas
Program Committee
Laurie Williams, North Carolina State University
Xiaojing Liao, University of Illinois Urbana-Champaign
Jacques Klein, University of Luxembourg
Yizheng Chen, University of Maryland
Saikat Dutta, Cornell University
Yibo Hu, Illinois Institute of Technology
Wenbo Guo, University of California, Santa Barbara
Zhen Li, Huazhong University of Science and Technology
Yangruibo Ding, University of California, Los Angeles
Joshua Garcia, University of California, Irvine
Christophe Hauser, Dartmouth College
Mukund Raghothaman, University of Southern California
Yu Nong, University at Buffalo
Zion Basque, Arizona State University
(Coming soon)