Data4SoftSec
The 1st Workshop on Datasets for Software Security
Datasets are fundamental to the development, evaluation, and benchmarking of techniques that drive progress across almost all domains of software security research. Despite ongoing open-science efforts to promote dataset sharing, software security researchers still encounter substantial obstacles in constructing, releasing, and reusing such datasets. The unique characteristics of security-related data, such as sensitivity, confidentiality, and ethical constraints, pose challenges to data collection, sharing, and reproducibility. Moreover, the scarcity and limited scale of available datasets hinder their use in data-driven techniques, particularly for training and testing AI-based security applications. The sustainability of many existing datasets also remains a critical concern, as they tend to become outdated rapidly as threats, vulnerabilities, and attacks evolve.
Data4SoftSec aims to bring together researchers from academia, industry, and government to explore new directions for improving software security research datasets. The workshop will provide a venue to report novel methodologies to enhance data quality for security research tasks, exchange experiences and lessons learned from constructing and using security research datasets, discuss best practices for long-term dataset maintenance and reproducibility, and identify new opportunities for advancing data-driven software security research.
Submission Site
https://hotcrp.data4softsec26.ieee-security.org
Important Dates (AoE)
Paper Submission Due: Feb 27, 2026 (extended from Feb 13, 2026)
Acceptance Notification: Mar 20, 2026
Publication-ready Submission Due: Apr 1, 2026
Workshop Date: May 21, 2026
Topics of Interest
We invite researchers from academia, industry, and government to submit papers introducing novel methods, insights, or tools for improving software security research datasets. Relevant topics include, but are not limited to:
Software Security Dataset Construction and Quality Enhancement
Novel methodologies for collecting and labeling high-quality software security datasets
Approaches to addressing data imbalance, bias, and noise, or to ensuring data accuracy, completeness, and representativeness
Synthetic or simulated data generation for data-scarce software security tasks using data augmentation, generative models, etc.
Dataset Sharing, Sustainability, and Ethics
Tools, frameworks, or infrastructures supporting long-term dataset maintenance, reproducibility, and accessibility
Strategies for dataset anonymization, sanitization, and ethical release
Ethical and responsible use of security datasets in research and industry
Data-Driven Software Security Techniques
Applications of AI/LLMs in software security supported by new or existing datasets
Evaluation benchmarks for assessing AI-based models on software security-related tasks
Cross-domain, transfer learning, and multimodal approaches leveraging heterogeneous datasets
Empirical Studies and Experience Reports
Empirical evaluations of dataset usability, reliability, and impact on existing research outcomes
Case studies and best practices from dataset creation, curation, maintenance, or deployment
Lessons learned from the public release of widely used software security datasets
Dataset needs and challenges in emerging software security areas (e.g., open-source software security, AI-assisted software security testing)
This list is not exhaustive. Topics in less closely related areas are also welcome if a clear connection to enhancing datasets for software security research is demonstrated.
In particular, we encourage submissions related to the core theme of this first edition: Enhancing Datasets for Software Vulnerability Detection and Patching.
Submission Guidelines
Regular Papers (up to 10 pages): We invite submissions of original research papers that have not been previously published and are not under review elsewhere. Regular research-track papers should be 6 to 10 pages, excluding bibliography and appendices, anonymized, and formatted in the standard two-column IEEE proceedings style. Submissions are expected to make a clear research contribution by addressing an important problem, proposing a compelling solution, and presenting an experimental evaluation of the relevant techniques or a discussion of their validity and real-world applicability.
Practical Experience or Tool Demonstration Papers (up to 6 pages): We invite submissions presenting practitioner experiences or empirical analyses of field data and the use of tools to address dataset-related challenges. Papers in this category should provide new insights or lessons that inform the research and practice of robust dataset construction and management. Submissions should be no longer than 6 pages (with a minimum of 2 pages), anonymized, and formatted in the standard two-column IEEE proceedings style.
Review Process
The review process will be double-blind, with each submission reviewed by at least three committee members or external reviewers with relevant expertise. Acceptance and rejection decisions will follow discussions moderated by the program chairs. Promising papers requiring minor revisions may be conditionally accepted and shepherded, with reviewers providing a list of items to be addressed in the final version. The TPC will also select 1–3 Distinguished Paper Awards based on the reviews and discussions.
Presentation Format
One author of each accepted paper is expected to present the work at the workshop. Each paper will be presented in a traditional conference-style format, and, depending on topic and content, selected papers will be followed by roundtable discussions to facilitate interactive feedback and brainstorming. All accepted papers will be treated with equal importance regardless of presentation format. After acceptance notifications, authors will receive additional details regarding the presentation schedule, speaking times, and other logistics.
Organizing Committee
Kun Sun, George Mason University
Baishakhi Ray, Columbia University
Xinda Wang, University of Texas at Dallas
Program Committee
Laurie Williams, North Carolina State University
Xiaojing Liao, University of Illinois Urbana-Champaign
Jacques Klein, University of Luxembourg
Yizheng Chen, University of Maryland
Saikat Dutta, Cornell University
Yibo Hu, Illinois Institute of Technology
Wenbo Guo, University of California, Santa Barbara
Zhen Li, Huazhong University of Science and Technology
Yangruibo Ding, University of California, Los Angeles
Joshua Garcia, University of California, Irvine
Christophe Hauser, Dartmouth College
Mukund Raghothaman, University of Southern California
Yu Nong, University at Buffalo
Zion Basque, Arizona State University
(Coming soon)