Introduction
In contemporary software development, the extensive adoption of third-party libraries has become standard practice to avoid reinventing the wheel. This trend is particularly evident in programming ecosystems that lack mature package managers, such as C/C++. In this context, Code Clone Detection (CCD) techniques have emerged as essential tools for identifying instances of code reuse, thereby mitigating security and legal concerns. Despite significant scholarly advances in the accuracy of CCD algorithms, a notable gap persists between the capabilities of academic tools and the demands of real-world applications. This disparity arises primarily because academic tools rely on experimental datasets composed of a limited number of indexed repositories, whereas real-world applications require a much broader scope of code sources.
To address this challenge, we propose a novel, systematic approach for selecting appropriate open-source repositories to construct a more effective CCD feature dataset. Our method, OSSScope, employs a greedy algorithm to analyze the diversity of fingerprinting functions across 2.55 million C repositories on GitHub. The objective is to maximize the inclusion of unique fingerprinting functions while minimizing the number of repositories required. OSSScope successfully compiles an optimal feature dataset containing 61 million fingerprinting functions from 190K repositories, resulting in a 22.65% improvement in recall over SOTA tools. Moreover, the constructed dataset covers over 95% of the features from other sources, demonstrating its diversity and coverage.
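The selection idea described above can be illustrated with a minimal sketch of a greedy, set-cover style selection. This is not the OSSScope implementation. It assumes each repository has already been reduced to a set of function fingerprints (plain strings here), and the names `greedy_select`, `coverage`, and `target_coverage` are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Set


def greedy_select(repos: Dict[str, Set[str]], target_coverage: float = 1.0) -> List[str]:
    """Greedily pick repositories by marginal gain in unique fingerprints.

    `repos` maps a repository name to its set of function fingerprints.
    Selection stops once `target_coverage` (a fraction of all distinct
    fingerprints) is reached or no remaining repository adds anything new.
    """
    universe: Set[str] = set().union(*repos.values())
    goal = target_coverage * len(universe)
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(repos)

    while remaining and len(covered) < goal:
        # Repository contributing the most not-yet-covered fingerprints;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # every remaining repository is fully redundant
        covered |= gain
        selected.append(best)
    return selected


def coverage(selected: List[str], repos: Dict[str, Set[str]], other: Set[str]) -> float:
    """Fraction of fingerprints from another source that the selection contains."""
    if not other:
        return 1.0
    have: Set[str] = set()
    for name in selected:
        have |= repos[name]
    return len(have & other) / len(other)


if __name__ == "__main__":
    # Toy data (hypothetical): "forkOfA" adds nothing beyond "libA".
    toy = {
        "libA": {"f1", "f2", "f3", "f4"},
        "libB": {"f3", "f4", "f5"},
        "forkOfA": {"f1", "f2", "f3"},
        "libC": {"f6"},
    }
    picked = greedy_select(toy)
    print(picked)
    print(coverage(picked, toy, {"f1", "f5", "f9"}))
```

In this toy input the redundant fork is never selected, which mirrors the stated goal of keeping as many unique fingerprinting functions as possible with as few repositories as possible. The `coverage` helper shows one simple way to measure how much of another data source a selected dataset covers.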
Furthermore, leveraging the repositories selected by OSSScope, we introduce RClassifier, a lightweight classification model that streamlines the construction process for migration to other language ecosystems. By capturing key metrics, RClassifier directly determines whether a repository should be included in the dataset. Our comprehensive evaluation of RClassifier within the Golang ecosystem validates its effectiveness, achieving recall performance comparable to the feature dataset produced by OSSScope.
Overview
Main Contributions
OSSScope: We propose OSSScope, a resource-intensive but highly effective approach for constructing a code clone feature database from open-source repositories. Using OSSScope, we constructed an effective-sufficient feature dataset for CCD tools in C. It contains 62 million unique functions from 190K repositories (80.2% of unique functions from only 7.57% of C repositories on GitHub) and achieves a 22.65% recall improvement over existing SOTA CCD tools.
RClassifier: We propose RClassifier, which selects repositories for CCD feature dataset construction using only repository metrics that are easy to obtain. Our evaluation in the Golang ecosystem shows that RClassifier effectively selects repositories for CCD feature dataset construction with only a slight loss in recall compared to OSSScope (3%). This demonstrates that RClassifier generalizes to constructing effective-sufficient feature databases for CCD in other language ecosystems.
Dataset: We have open-sourced the constructed feature datasets for C and Golang. To the best of our knowledge, they are so far the largest open-source feature databases for code clone detection in C and Golang.
Research Questions
RQ1: Effectiveness. How can OSSScope improve the performance of state-of-the-art SCA tools?
RQ2: Ablation Study. How does each step impact the final performance improvement?
RQ3: Characteristics. What are the characteristics of repositories included/excluded by OSSScope?
RQ4: Representativeness. How representative is the feature dataset generated by OSSScope compared to other data sources?
RQ5: Generalization. What is the effect of migrating OSSScope to other language ecosystems?
Validation of Tools
OSSScope: Exploratory Study in C ecosystem.
RClassifier: Migration to Golang ecosystem.
Supplement Study
Summary of repository selection criteria from recent Open-Source Software (OSS) research.
An exploratory study of other data sources.
Datasets & Resources of This Paper
1. Constructed Feature Dataset
List of selected Golang repositories (OSSScope or RClassifier) for Code Clone Detection (CCD).
Function feature datasets for C and Golang. (100,000 features are released here; the remaining features will be released after the paper is accepted.)
2. Tools and Models
3. Results of Experiments
Ground Truth dataset and the revised version
4. Other Resources and Data