Details of OSSScope
Exploratory Study on C Ecosystem & Migration to Golang
Step 1: Metric-based Filtering of Active Repositories
Given the significant computational resources required to directly compute and analyze features for each of the more than 2.5 million C repositories (as of 2023-06-01), we employed a strategic, resource-efficient approach. We first adopted a pre-selection process rooted in established practices, aiming to identify active repositories with a higher likelihood of being duplicated before engaging in the intensive computation of features and the construction of the feature dataset. To keep this pre-selection efficient and feasible, we prioritized metrics that are readily accessible from GitHub metadata. Consequently, we identified six metrics, grouped into three aspects, to serve as proxies for estimating the likelihood of a repository being reused.
Following the application of these filtering criteria, a total of 62,868 repositories were identified as the preliminary dataset for constructing the feature dataset. The active repo list can be downloaded here.
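As a minimal illustration of this kind of metadata-based pre-selection (the actual six metrics are described in the paper; the field names and thresholds below are placeholders, not the real criteria), a repository can be screened with a cheap metadata-only check such as the following sketch:

```python
from datetime import datetime, timezone

def is_active_candidate(repo: dict,
                        min_stars: int = 10,
                        min_forks: int = 5,
                        pushed_after: str = "2022-06-01") -> bool:
    """Cheap metadata-only check applied before any expensive feature extraction.

    The thresholds and fields here are illustrative placeholders, not the six
    metrics used in the paper."""
    cutoff = datetime.fromisoformat(pushed_after).replace(tzinfo=timezone.utc)
    pushed_at = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    return (repo["stargazers_count"] >= min_stars
            and repo["forks_count"] >= min_forks
            and pushed_at >= cutoff)

# Illustrative metadata records in the shape returned by the GitHub REST API.
repos = [
    {"full_name": "example/active-lib", "stargazers_count": 120, "forks_count": 30,
     "pushed_at": "2023-04-11T08:00:00Z"},
    {"full_name": "example/stale-demo", "stargazers_count": 2, "forks_count": 0,
     "pushed_at": "2019-01-01T00:00:00Z"},
]
print([r["full_name"] for r in repos if is_active_candidate(r)])  # ['example/active-lib']
```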
Step 2: Function-Level Feature Extraction
Generally, we first clone all the selected repositories via the GitHub APIs. For each repository, we identify the functions with adequate information and calculate features for them to construct the feature dataset. Specifically, we first collect all GitHub versions and tags as snapshots of each repository; for those with no versions or tags, we use the main branch instead. For each snapshot, we select only the C-related files. Since most files remain unchanged across version releases, for each snapshot we keep only the C-related files that first appear at that point in the version list, reducing duplicated effort.
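A minimal sketch of the snapshot collection step, assuming the repository has already been cloned locally and using git directly rather than the GitHub APIs: it lists the tags of a repository (falling back to the main branch when none exist) and keeps, for each snapshot, only the C-related files that have not appeared in an earlier snapshot.

```python
import subprocess

C_EXTENSIONS = (".c", ".h")

def git(repo_dir: str, *args: str) -> str:
    """Run a git command inside repo_dir and return its stdout."""
    return subprocess.run(["git", "-C", repo_dir, *args],
                          capture_output=True, text=True, check=True).stdout

def snapshots(repo_dir: str) -> list[str]:
    """All tags of the repository, falling back to the current branch if none exist."""
    tags = git(repo_dir, "tag", "--sort=creatordate").split()
    return tags if tags else ["HEAD"]

def first_appearing_c_files(repo_dir: str) -> dict[str, list[str]]:
    """For each snapshot, keep only C-related files not seen in earlier snapshots."""
    seen: set[str] = set()
    result: dict[str, list[str]] = {}
    for snap in snapshots(repo_dir):
        files = git(repo_dir, "ls-tree", "-r", "--name-only", snap).splitlines()
        c_files = [f for f in files if f.endswith(C_EXTENSIONS) and f not in seen]
        seen.update(c_files)
        result[snap] = c_files
    return result
```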
Next, we extract the functions from these files for each snapshot of the given repositories using ANTLR. For each function, we remove comments and blank lines and normalize function and variable names. We also filter out functions shorter than three lines, which are considered meaningless. We then generate an MD5 hash for each normalized function to identify it. After deduplicating functions by hash value, we obtained 41,806,568 unique functions from the 62,868 active repositories, which constitute the initial feature database.
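The function bodies themselves are extracted with ANTLR; the sketch below assumes they are already available as strings and illustrates only the normalization, three-line filter, and MD5-based deduplication (full identifier renaming requires the parser and is noted in a comment). The helper names are hypothetical.

```python
import hashlib
import re

def normalize(func_src: str) -> str:
    """Strip comments and blank lines; a full pipeline would also rename
    function and variable identifiers using the parse tree."""
    src = re.sub(r"/\*.*?\*/", "", func_src, flags=re.DOTALL)  # block comments
    src = re.sub(r"//[^\n]*", "", src)                          # line comments
    lines = [ln.strip() for ln in src.splitlines() if ln.strip()]
    return "\n".join(lines)

def function_id(func_src: str):
    """MD5 over the normalized body; functions shorter than three lines are dropped."""
    norm = normalize(func_src)
    if len(norm.splitlines()) < 3:
        return None
    return hashlib.md5(norm.encode("utf-8")).hexdigest()

def deduplicate(functions: list[str]) -> dict[str, str]:
    """Keep one representative function per hash value."""
    unique: dict[str, str] = {}
    for f in functions:
        h = function_id(f)
        if h is not None and h not in unique:
            unique[h] = f
    return unique
```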
Step 3: Greedy Supplementation of Excluded Repositories
Given that active repositories omitted in the initial filtering phase are deemed global anomalies due to their distinct characteristics, and that the distribution of inactive repositories is relatively more concentrated, the chosen detection algorithm must be particularly sensitive to local outliers. We therefore employ the Local Outlier Factor (LOF) algorithm, an unsupervised anomaly detection method tailored to identifying local outliers within a dataset. After applying LOF, we obtained 469,883 outlier repositories.
The list of outlier repos can be found here.
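A minimal sketch of the outlier detection step using scikit-learn's LocalOutlierFactor on a toy metric matrix; the actual feature set and parameters follow the paper rather than the placeholder values here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy matrix: one row per repository, one column per basic metric
# (placeholder values; the real features come from the metric dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))

lof = LocalOutlierFactor(n_neighbors=20)  # LOF scores points against their local neighborhood
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
outlier_idx = np.flatnonzero(labels == -1)
print(f"{len(outlier_idx)} local outliers flagged")
```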
After identifying the list of outlier repositories, we wanted to find a suitable point at which to stop the repository expansion. So, instead of incorporating them all at once, we added them to the feature database sequentially, in order of priority, and monitored the effect on the database.
Initially, the outlier repositories are organized into batches based on their average rankings derived from the basic metrics. As each batch is added, we track the number of newly added unique functions it contributes, plotting a change curve to visualize the evolution of feature diversity. To better understand and predict the growth trend, we fit the curve with various mathematical functions, including sigmoid, exponential, power-law, hyperbolic, and logarithmic functions. RMSE, MAE, MAPE, and RMSRE are adopted to measure goodness of fit.
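A sketch of the curve-fitting step using scipy.optimize.curve_fit. The batch data and the resulting estimate are placeholders, only two of the candidate function forms are shown, and only RMSE is computed for brevity.

```python
import numpy as np
from scipy.optimize import curve_fit

def logarithmic(x, a, b):
    return a * np.log(x) + b

def power_law(x, a, b):
    return a * np.power(x, b)

# Placeholder batch data: x = cumulative repositories added, y = cumulative unique functions.
x = np.array([1e4, 2e4, 4e4, 8e4, 1.6e5, 3.2e5])
y = np.array([2.1e7, 3.0e7, 3.9e7, 4.8e7, 5.6e7, 6.3e7])

candidates = {
    "logarithmic": (logarithmic, [1e7, -1e8]),
    "power law": (power_law, [1e6, 0.3]),
}
for name, (f, p0) in candidates.items():
    params, _ = curve_fit(f, x, y, p0=p0, maxfev=10_000)
    rmse = np.sqrt(np.mean((f(x, *params) - y) ** 2))
    print(f"{name}: RMSE = {rmse:.3g}")

# Extrapolate the chosen curve to the full 2.55 million C repositories
# (placeholder numbers, not the paper's 77,650,766 estimate).
params, _ = curve_fit(logarithmic, x, y, p0=[1e7, -1e8])
print("estimated total unique functions:", int(logarithmic(2.55e6, *params)))
```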
Given the vast landscape of 2.55 million C repositories hosted on GitHub, the fitted curve yields an estimate of approximately 77,650,766 unique functions across all these repositories, as depicted by the green line in Figure 2. Based on this estimate, the top 385,000 repositories contribute 95% of these unique functions, so we set the cutoff for our supplementary data collection at this threshold of 385,000 repositories.
The list of the 385,000 repositories in the expansion closure can be found here.
Step 4: Filtering of Low-Value Repositories
Although the 385,000 selected repositories cover all C language projects on GitHub from the perspective of function diversity, the value of the functions themselves is not yet considered: repositories with little value for code reuse are still included in the scope. To this end, we refine our feature database by introducing several assumptions to exclude unsuitable repositories:
Function Quantity Extremes: Repositories with either an excessively high or low number of functions, which often represent large-scale software or projects of negligible relevance, are inappropriate for use as Third-Party Libraries (TPLs).
Function Repetition Rate: Functions with no similar copies, or with too many similar copies across the ecosystem, have little value for improving CCD accuracy, as they are either unlikely to be reused or cannot indicate copy-paste reuse. Therefore, we exclude repositories with a high ratio of such functions.
Complexity of Functions: Repositories with functions that are either overly complex or overly simplistic are excluded. This criterion serves as an indirect measure of the repository’s code quality.
We use the Maintainability Index (MI) to measure function complexity. For each function, we extract the following features to calculate MI:
LOC (Lines of Code): the number of lines in the function, excluding blank lines.
Halstead Volume (HV): based on counting the function's operators and operands.
Cyclomatic Complexity (CC): based on the control flow graph (CFG) of the function.
Maintainability Index (MI): a composite measure of function complexity computed from the above indicators:
MI = 171 − 5.2 × ln(HV) − 0.23 × CC − 16.2 × ln(LOC)
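A direct implementation of this formula, assuming LOC, HV, and CC have already been computed for a function:

```python
import math

def maintainability_index(hv: float, cc: float, loc: int) -> float:
    """MI = 171 - 5.2*ln(HV) - 0.23*CC - 16.2*ln(LOC), per the formula above."""
    return 171 - 5.2 * math.log(hv) - 0.23 * cc - 16.2 * math.log(loc)

# Example: Halstead volume 250, cyclomatic complexity 4, 20 non-blank lines.
print(round(maintainability_index(hv=250, cc=4, loc=20), 1))  # ~92.8
```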
We addressed these assumptions by sorting repositories according to each criterion and applying a filtering threshold θ. For the first assumption, repositories were ordered by their total number of functions, and both the upper and lower extremes were removed. For the second, we sorted by the proportion of duplicate functions and similarly excluded the top and bottom segments. For the third, we sorted by the sum of the MI of each repository's functions, again removing the extremes. The threshold θ controls the extent of filtering: we evaluated θ values of 5%, 10%, 15%, 20%, and 25%, each corresponding to excluding the top and bottom θ of repositories under our predefined criteria.
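A minimal sketch of this trimming procedure; the per-repository field names are illustrative, not the exact scores used in the paper.

```python
def trim_extremes(repos: list[dict], key: str, theta: float) -> list[dict]:
    """Sort repositories by `key` and drop the top and bottom `theta` fraction."""
    ranked = sorted(repos, key=lambda r: r[key])
    k = int(len(ranked) * theta)
    return ranked[k:len(ranked) - k] if k > 0 else ranked

def filter_low_value(repos: list[dict], theta: float = 0.15) -> list[dict]:
    # Apply the three criteria in turn (illustrative field names).
    for key in ("num_functions", "duplicate_ratio", "total_mi"):
        repos = trim_extremes(repos, key, theta)
    return repos
```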
After careful consideration, a θ value of 15% was selected, achieving the optimal F1 score of 0.89 with an acceptable average time cost of 0.21 seconds. This θ value represents an effective compromise, maximizing the accuracy of our SCA experiments while maintaining efficiency. The final feature database consists of 61,746,893 unique functions from 193,026 repositories.
The final selected repo list can be downloaded here. The filtered repo list can be found here.
Step 5: Lightweight Model for Migration
Since the whole process incurs a large computational overhead, we trained a metric-based classification model to reduce this cost. To solve this binary classification problem, we fit both linear and nonlinear models and selected the most appropriate one by comparing performance indicators. The candidate models include Logistic Regression (LR), Support Vector Machines (SVM), Decision Trees (DT), Random Forest (RF), Gradient Boosted Machines (GBM), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA), all of which are widely used in previous work. We adopt 10-fold cross-validation for training: the dataset is randomly divided into ten folds, of which nine are used for training and one for evaluation. We adopt precision, recall, and F1 score to evaluate all the models. Random Forest performs best, so we name this model RClassifier and use it to migrate to other language ecosystems.
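A sketch of the evaluation for the Random Forest candidate, using scikit-learn with 10-fold cross-validation and precision/recall/F1 scoring; toy data stands in for the real metric features and labels produced by Steps 1-4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Toy data: one row of basic metrics per repository; the label marks whether the
# repository belongs to the final feature database (placeholder values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_validate(clf, X, y, cv=10, scoring=("precision", "recall", "f1"))
print({m: round(scores[f"test_{m}"].mean(), 3) for m in ("precision", "recall", "f1")})
```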
We applied RClassifier to the Golang ecosystem and obtained a list of 87,271 repositories. Based on these repositories, we constructed a function feature database containing 46,739,826 function features.