Our validation is based on the data collected by two criteria:
(1) Popularity: for each popularity metric (i.e., most stars, most forks, most downloaded in the past, most downloaded in past 3 years, most downloaded in last year), we select the top 2,000 libraries.
(2) Centrality: for each centrality metric (i.e., highest in-degree, highest out-degree), we also select the top 2,000 libraries and the top 20K versions, respectively. For libraries, we take the highest patch version for each minor version.
Finally, 103,609 versions (almost 1% of the entire NPM ecosystem) from 15,673 libraries are selected.
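The rule of keeping the highest patch version for each minor version can be sketched as follows. This is a minimal illustration, not the selection script used for the data set: the function name is hypothetical, and skipping pre-release and build-metadata versions is an assumption made here for simplicity.

```python
def highest_patch_per_minor(versions):
    """Return one version per (major, minor) pair: the one with the highest patch."""
    best = {}  # (major, minor) -> (patch, original version string)
    for version in versions:
        # Assumption: pre-release and build-metadata versions are skipped.
        if "-" in version or "+" in version:
            continue
        parts = version.split(".")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue  # not a plain MAJOR.MINOR.PATCH string
        major, minor, patch = (int(p) for p in parts)
        key = (major, minor)
        if key not in best or patch > best[key][0]:
            best[key] = (patch, version)
    # Order the result by (major, minor).
    return [best[key][1] for key in sorted(best)]


releases = ["1.0.0", "1.0.3", "1.1.0", "2.0.1", "2.0.10", "2.1.0-beta.1"]
selected = highest_patch_per_minor(releases)
print(selected)
```

Patch numbers are compared as integers, so 2.0.10 is ranked above 2.0.1, which a plain string comparison would get wrong.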
Based on the collected data, we first collect the installation dependency tree (Install Tree) of each version from a real installation (npm ls). After excluding versions with installation errors, Install Trees are successfully collected for 82,415 versions.
We also collect the dependency trees (Remote Tree) from npm-remote-ls. Note that npm-remote-ls applies exactly the technique of dependency reach to resolve dependency trees, but it misses most of the NPM-specific rules for dependency resolution during installation.
Moreover, to compare with real installation results, we update the graph after all Install Trees have been collected, so that all packages in the Install Trees are present in the graph. Based on the updated graph, we compute the dependency trees (Graph Tree) for all versions at their corresponding installation times.
(1) We first compare the dependency trees under a strict rule: two dependency trees match only if they are exactly the same. The results are as follows:
install error: 21,194
graph error: 1,832 (some packages in the dependency tree are not published on the NPM registry, some packages in the dependency tree have no release time, etc.) (2.22%)
existing relation mismatch: 7,760 (9.42%)
miss data: 11 (some newly released versions are not captured because the data pipeline can occasionally miss individual libraries due to network issues; they are picked up in later crawls) (0.01%)
has bundle dependencies: 2,429 (2.95%)
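As a quick consistency check, the percentages above can be reproduced from the counts. The sketch below assumes that each percentage is taken over the 82,415 successfully installed versions; this denominator is inferred from the reported figures and is not stated explicitly.

```python
selected_versions = 103_609   # versions selected for validation
installed_versions = 82_415   # versions with a successfully collected Install Tree

# Versions excluded because the real installation failed.
install_errors = selected_versions - installed_versions

mismatch_counts = {
    "graph error": 1_832,
    "existing relation mismatch": 7_760,
    "miss data": 11,
    "has bundle dependencies": 2_429,
}

# Assumption: shares are relative to the successfully installed versions.
shares = {
    name: round(100 * count / installed_versions, 2)
    for name, count in mismatch_counts.items()
}

print(install_errors)
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```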
There are 2 main reasons for the mismatches between InstallTree and GraphTree.
The installation may be incomplete. Certain dependencies can fail to install, e.g., because of missing environment support. In such cases the Graph Tree contains more nodes than the Install Tree, since DTResolver explores the dependency tree regardless of the real installation.
Since the dependency trees from npm ls are deduped (i.e., for a package that appears more than once in the dependency tree, its dependencies are listed only once), the location of the dependencies of deduped packages may differ, leading to false mismatches in the comparison. An example is given as follows:
The JSON on the left is part of the dependency tree of postcss-modules-scope@2.0.1, retrieved by "npm ls" after "npm install postcss-modules-scope@2.0.1". It contains 2 dependency paths:
postcss-modules-scope@2.0.1--->postcss@7.0.27--->chalk@2.4.2--->supports-color@5.5.0
postcss-modules-scope@2.0.1--->postcss@7.0.27--->supports-color@6.1.0
Both supports-color@6.1.0 and supports-color@5.5.0 depend on has-flag@3.0.0. In the installed dependency tree, "has-flag@3.0.0" only appears under "supports-color@6.1.0", while in the result of DTResolver it is resolved under "supports-color@5.5.0". This is because we place packages as close to the root package as possible.
Both location choices are correct in the context of dependency tree resolution. During the real installation, all dependencies are organized as a "physical tree", and both supports-color@6.1.0 and supports-color@5.5.0 depend on the same "has-flag@3.0.0".
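To illustrate why such a case is a false mismatch, the sketch below encodes a trimmed fragment of the two trees as nested dictionaries and compares them in two ways. The nested-dictionary encoding and the helper name are assumptions made for this illustration; they are not the format produced by npm ls or DTResolver, and the fragment omits the other dependencies of these packages.

```python
def collect_packages(tree):
    """Return the set of name@version keys that appear anywhere in a nested tree."""
    packages = set()
    for package, children in tree.items():
        packages.add(package)
        packages |= collect_packages(children)
    return packages


# Deduped tree as reported by npm ls: has-flag is listed under supports-color@6.1.0.
install_tree = {
    "postcss-modules-scope@2.0.1": {
        "postcss@7.0.27": {
            "chalk@2.4.2": {"supports-color@5.5.0": {}},
            "supports-color@6.1.0": {"has-flag@3.0.0": {}},
        }
    }
}

# Tree as resolved by DTResolver: has-flag is listed under supports-color@5.5.0.
graph_tree = {
    "postcss-modules-scope@2.0.1": {
        "postcss@7.0.27": {
            "chalk@2.4.2": {"supports-color@5.5.0": {"has-flag@3.0.0": {}}},
            "supports-color@6.1.0": {},
        }
    }
}

strict_match = install_tree == graph_tree
same_packages = collect_packages(install_tree) == collect_packages(graph_tree)
print(strict_match, same_packages)
```

The strict comparison reports a mismatch, although both trees contain the same set of resolved packages and differ only in where the deduped dependency is listed.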
We further extend the analysis of the results of our experiments to validate the accuracy of vulnerability detection and vulnerable path identification.
Since the InstallTree retrieved from real installation may be incomplete (i.e., some packages in dependencies are not installed due to environment issues), we evaluate the accuracy of vulnerability detection by examining how many vulnerabilities identified in InstallTree can also be found in GraphTree and RemoteTree.
In general, we find that 31,913 library versions in our test set contain at least one vulnerable dependency, and 208,129 vulnerable points are found in InstallTree. Both DTResolver and npm-remote-ls have high coverage of these vulnerable points (98.1% vs. 97.7%); only 3,781 and 4,638 vulnerable points are missed, respectively. The coverage is high because most dependency packages are resolved to the highest version that satisfies the corresponding dependency relation, which is the basic principle of NPM dependency resolution.
Moreover, we have also compared the vulnerable paths identified from these dependency trees. To be fair, we only take the vulnerable paths that lead to the vulnerable points detected in InstallTree (the 208,129 vulnerable points) as ground truth. In total, we identified 324,718 individual vulnerable paths as ground truth. We found that 300,691 of them are identified by DTResolver (92.60%), but only 254,298 of them are identified by npm-remote-ls (78.31%).
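The coverage figures for vulnerable points and vulnerable paths follow the same pattern: the share of InstallTree ground-truth items that also appear in another tree. A minimal sketch is given below. The function name and the toy identifiers are hypothetical, and the evaluation scripts themselves may differ.

```python
def coverage(ground_truth, candidate):
    """Percentage of ground-truth items that also appear in the candidate."""
    ground_truth = set(ground_truth)
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 100 * len(ground_truth & set(candidate)) / len(ground_truth)


# Toy example with hypothetical vulnerable points.
install_points = {"a@1.0.0", "b@2.0.0", "c@3.0.0", "d@4.0.0"}
graph_points = {"a@1.0.0", "b@2.0.0", "c@3.0.0", "e@5.0.0"}
toy_coverage = coverage(install_points, graph_points)

# The same ratio computed from the reported counts.
total_points = 208_129
point_coverage = {
    "DTResolver": round(100 * (total_points - 3_781) / total_points, 2),
    "npm-remote-ls": round(100 * (total_points - 4_638) / total_points, 2),
}
total_paths = 324_718
path_coverage = {
    "DTResolver": round(100 * 300_691 / total_paths, 2),
    "npm-remote-ls": round(100 * 254_298 / total_paths, 2),
}
print(toy_coverage, point_coverage, path_coverage)
```

Extra items in the candidate (such as e@5.0.0 in the toy example) do not affect the result, which matches the choice of taking only InstallTree items as ground truth.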
Our extended analysis shows that even though the existing dependency resolution tool for NPM packages misses the NPM-specific rules on version selection applied during installation, it can still identify most of the vulnerable dependencies, because most dependencies in the NPM ecosystem are still resolved to the highest satisfying version. However, the missing rules compromise vulnerable path identification, which directly affects the analysis of vulnerability propagation in dependency trees.
Note that the Excel file below only presents the number of corresponding elements for the packages (31,913) that contain at least one vulnerable point in their Install Tree. For more detailed data, please download the raw data.
The raw data is organized as one JSON file per selected package, with the following fields:
versionId: the library version of the package.
graph_vulnerable_points: the vulnerable points identified in the Graph Tree of this package.
remote_vulnerable_points: the vulnerable points identified in the Remote Tree of this package.
install_vulnerable_points: the vulnerable points identified in the Install Tree of this package.
graph_vul_paths: the vulnerable paths identified in the Graph Tree of this package.
remote_vul_paths: the vulnerable paths identified in the Remote Tree of this package.
install_vul_paths: the vulnerable paths identified in the Install Tree of this package.
graph_cves: the CVEs identified in the Graph Tree of this package.
remote_cves: the CVEs identified in the Remote Tree of this package.
install_cves: the CVEs identified in the Install Tree of this package.
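A raw data file with the fields listed above could be inspected along the following lines. This is only a sketch: the field names come from the list above, but the shape of each entry (strings for vulnerable points, lists of package identifiers for paths), the helper names, and the sample record are assumptions made for illustration.

```python
import json


def load_record(path):
    """Load the JSON file of one package (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def as_set(items):
    """Serialize each entry so that lists or dicts can be compared as set members."""
    return {json.dumps(item, sort_keys=True) for item in items}


def coverage_vs_install(record, suffix):
    """Coverage (%) of install_<suffix> entries by graph_<suffix> and remote_<suffix>."""
    truth = as_set(record.get(f"install_{suffix}", []))
    result = {}
    for source in ("graph", "remote"):
        found = as_set(record.get(f"{source}_{suffix}", []))
        result[source] = 100 * len(truth & found) / len(truth) if truth else None
    return result


# Hypothetical record; real files may use a different entry format.
record = {
    "versionId": "demo@1.0.0",
    "install_vulnerable_points": ["x@1.0.0", "y@2.0.0"],
    "graph_vulnerable_points": ["x@1.0.0", "y@2.0.0", "z@3.0.0"],
    "remote_vulnerable_points": ["x@1.0.0"],
    "install_vul_paths": [["demo@1.0.0", "x@1.0.0"], ["demo@1.0.0", "w@1.0.0", "y@2.0.0"]],
    "graph_vul_paths": [["demo@1.0.0", "x@1.0.0"]],
    "remote_vul_paths": [],
}

point_summary = coverage_vs_install(record, "vulnerable_points")
path_summary = coverage_vs_install(record, "vul_paths")
print(point_summary, path_summary)
```

For a downloaded file, `load_record` would be called with its path and the result passed to `coverage_vs_install` in the same way.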