Ground Truth
To address the challenge of identifying third-party libraries (TPLs) without a standardized package manager or a direct SBOM, we used a file-path mapping technique inspired by previous work. We filtered repositories for paths containing keywords such as "/third_party", "/3rdparty", and "/deps", which identified 3,386 TPLs. The ground truth dataset is linked in the Download section.
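A minimal sketch of this kind of path filtering is shown below. The function name `find_tpl_candidates`, the keyword set, and the rule that the directory directly under a marker directory names the library are illustrative assumptions, not the exact pipeline used to build the dataset.

```python
from pathlib import PurePosixPath

# Marker directory names taken from the keywords listed above.
# Assumption: a real run may use a longer keyword list.
TPL_DIR_KEYWORDS = {"third_party", "3rdparty", "deps"}


def find_tpl_candidates(file_paths):
    """Return the set of directory names found directly under a marker directory."""
    candidates = set()
    for raw in file_paths:
        parts = PurePosixPath(raw).parts
        # Inspect every directory component; the last component is the file name.
        for i, part in enumerate(parts[:-1]):
            is_marker = part.lower() in TPL_DIR_KEYWORDS
            has_child_dir = i + 1 < len(parts) - 1
            if is_marker and has_child_dir:
                # Assumption: the child directory name is the TPL name.
                candidates.add(parts[i + 1])
    return candidates
```

For example, `find_tpl_candidates(["src/third_party/zlib/inflate.c", "src/main.c"])` yields `{"zlib"}`. A TPL vendored under any other path is missed, which is the labeling gap discussed next.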
However, as the reviewer pointed out, TPLs located outside our target paths (e.g., /3rdparty, /deps) are left out of the ground truth labeling and are therefore counted as false positives when they are detected. We have since manually reviewed our false positive (FP) cases. Of the 649 FP cases, 5.7% (37 cases) were mislabeled: they were correctly detected by SCA tools but unlabeled in the ground truth. After correcting the ground truth, the precision and recall of OSSScope improved to 84.34% and 97.34%, respectively. The revised ground truth is linked in the Download section.
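The effect of the correction on the metrics can be sketched as follows. Relabeling a detection moves it from FP to TP, which raises precision and also raises recall, because the newly labeled TPL joins the ground truth as a detected item. The helper names are hypothetical, and the underlying TP and FN counts are not given in this document, so the functions take them as parameters.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw detection counts."""
    return tp / (tp + fp), tp / (tp + fn)


def relabel_false_positives(tp, fp, fn, relabeled):
    """Move `relabeled` detections from FP to TP after a ground-truth fix."""
    if not 0 <= relabeled <= fp:
        raise ValueError("relabeled must be between 0 and fp")
    # FN is unchanged: the relabeled TPLs were detected, not missed.
    return tp + relabeled, fp - relabeled, fn


# Share of reviewed false positives that were relabeled (counts from the text).
share = 37 / 649
print(f"{share:.1%}")  # 5.7%
```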
Reimplemented Feature Dataset of OSSFP
For evaluation, we reimplemented the OSSFP algorithm and compared its performance on both our dataset and the original OSSFP dataset (which uses an identical feature format, i.e., type-2 deduplicated function hashes). Because OSSFP is not open source, we consulted its authors to confirm the correctness of our setup and dataset construction. The OSSFP dataset used in the evaluation consists of 12,859 repositories with more than 100 stars. The reimplemented OSSFP dataset is linked in the Download section.
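To illustrate what a type-2 deduplicated function hash looks like, here is a simplified sketch. It is not the OSSFP implementation or our reimplementation: the tokenizer, the abbreviated keyword list, the `ID`/`LIT` placeholders, and the use of SHA-256 are all assumptions chosen for brevity. The idea is that functions differing only in identifier names, literals, comments, or whitespace hash to the same value, and storing hashes in a set removes duplicates.

```python
import hashlib
import re

# Abbreviated C keyword list (assumption: a real tool uses the full language grammar).
C_KEYWORDS = {
    "int", "char", "float", "double", "void", "long", "short", "unsigned",
    "if", "else", "for", "while", "do", "return", "break", "continue",
    "struct", "switch", "case", "default", "const", "static", "sizeof",
}

TOKEN_RE = re.compile(
    r'"(?:\\.|[^"\\])*"'     # string literal
    r"|'(?:\\.|[^'\\])*'"    # character literal
    r"|[A-Za-z_]\w*"         # identifier or keyword
    r"|\d[\w.]*"             # numeric literal
    r"|\S"                   # any other single character
)


def strip_comments(source):
    """Remove block and line comments (naive: ignores comment markers in strings)."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)
    return re.sub(r"//[^\n]*", " ", source)


def type2_hash(function_source):
    """Hash a function after replacing identifiers and literals with placeholders."""
    normalized = []
    for tok in TOKEN_RE.findall(strip_comments(function_source)):
        if tok[0] in "\"'" or tok[0].isdigit():
            normalized.append("LIT")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            normalized.append("ID")
        else:
            normalized.append(tok)
    return hashlib.sha256(" ".join(normalized).encode("utf-8")).hexdigest()


def deduplicated_hashes(functions):
    """Return the set of distinct type-2 hashes for a list of function sources."""
    return {type2_hash(src) for src in functions}
```

With this sketch, `int add(int a, int b) { return a + b; }` and `int sum(int x, int y) { return x + y; }` produce the same hash, while a function that subtracts instead of adds does not.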
RClassifier
Based on the results of OSSScope, we trained a classification model, named RClassifier, to simplify the repository collection process. It decides whether a repository should be collected based on these metrics: stars, forks, open issues, commits, contributors, and action period (the number of days from the creation date to the last update time). The original model and a demo showing how to use RClassifier are linked in the Download section.
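As a rough illustration of the input features, the sketch below computes the action period and assembles the six metrics into a vector. The dictionary keys, the feature order, and the helper names are hypothetical; the input format the released model actually expects is shown in the usage demo.

```python
from datetime import datetime

# Assumed feature order; check the usage demo for the order the model expects.
FEATURE_ORDER = [
    "stars", "forks", "open_issues", "commits", "contributors", "action_period",
]


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def action_period_days(created_at, updated_at):
    """Whole days between the creation date and the last update time."""
    return (parse_timestamp(updated_at) - parse_timestamp(created_at)).days


def build_feature_vector(repo):
    """Turn a repository metadata dict into a list ordered as FEATURE_ORDER."""
    features = dict(repo)
    features["action_period"] = action_period_days(
        repo["created_at"], repo["updated_at"]
    )
    return [features[name] for name in FEATURE_ORDER]
```

A vector built this way could then be passed to a trained classifier's prediction call, provided the order and scaling match what was used in training.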
Download
Original RClassifier model: https://drive.google.com/file/d/1KJRRfPIMbxYeyuQSjDxKVYl_g15C7tr5/view?usp=sharing
A usage demo of RClassifier: https://drive.google.com/file/d/1xBiph5Ov4GgRjDFB7CRrh4fmXyGlzRtc/view?usp=sharing
Ground Truth dataset: https://docs.google.com/spreadsheets/d/1wsUttG-dqfbMstsARLitfzYLa-fSokeQ/edit?usp=sharing&ouid=114153599277382692115&rtpof=true&sd=true
Revised Ground Truth dataset: https://docs.google.com/spreadsheets/d/15hBQyxTN-ziuolGaruxj0U9PtuAEvxlf/edit?usp=sharing&ouid=114153599277382692115&rtpof=true&sd=true
Reimplemented OSSFP feature dataset: https://drive.google.com/file/d/1fa9H-hydNAXy_iaPpEL3oiyEx611t97u/view?usp=sharing