For long-term maintenance, a framework comprising several data pipelines has been set up to collect and process NPM package metadata and CVEs with high coverage and accuracy.
The Metadata Pipeline is designed and implemented to crawl NPM package metadata from the NPM Registry. It subscribes to the change stream of the NPM CouchDB and crawls the updated metadata for the relevant packages. Then, a data cleaner takes over the newly crawled raw metadata and extracts the details we need. Finally, a robust dependency constraint parser processes the dependency constraints and saves them into our metadata database.
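A minimal sketch of this flow is shown below, assuming the public CouchDB change feed at replicate.npmjs.com and the registry API at registry.npmjs.org; the endpoint names, document layout, and field choices are assumptions about the public NPM infrastructure, not the paper's actual crawler.

```python
# Sketch of the Metadata Pipeline: change-feed subscription, crawl, clean.
# Endpoints and document layout are assumptions, not the paper's code.
import json
import requests

CHANGES_URL = "https://replicate.npmjs.com/_changes"  # public CouchDB change feed
REGISTRY_URL = "https://registry.npmjs.org"

def follow_changes(since="now"):
    """Subscribe to the CouchDB change stream and yield updated package names."""
    params = {"feed": "continuous", "since": since}
    with requests.get(CHANGES_URL, params=params, stream=True, timeout=None) as resp:
        for line in resp.iter_lines():
            if line:  # skip heartbeat blank lines
                yield json.loads(line)["id"]  # the package name

def crawl_metadata(package):
    """Fetch the full registry document for one package."""
    return requests.get(f"{REGISTRY_URL}/{package}", timeout=30).json()

def clean(doc):
    """Keep only the fields the pipeline needs: per-version dependency constraints."""
    return {
        "name": doc.get("name"),
        "versions": {
            ver: meta.get("dependencies", {})  # e.g. {"lodash": "^4.17.0"}
            for ver, meta in doc.get("versions", {}).items()
        },
    }
```

The constraint strings extracted here (e.g., ^4.17.0) are the input that the dependency constraint parser then resolves against each library's concrete version list.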
The CVE Pipeline collects CVE feeds from the NVD database. Since some information in CVE feeds, such as the exact affected libraries and versions, is usually given only in plain text, a CVE cleaner is designed to identify the programming languages of the affected libraries and the affected version ranges as an initial result.
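The first-pass extraction can be illustrated as below, assuming the current NVD 2.0 REST API (the pipeline may equally consume the older NVD JSON feeds); the keyword lists and the regular expression are illustrative heuristics, not the paper's exact cleaning rules.

```python
# Sketch of the CVE cleaner's first pass over NVD data.
# Keyword hints and the version-range regex are illustrative assumptions.
import re
import requests

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

# Illustrative ecosystem keywords used to guess the programming language.
LANG_HINTS = {
    "javascript": ("npm", "node.js", "nodejs"),
    "python": ("pypi", "pip install"),
    "java": ("maven", ".jar"),
}

VERSION_RANGE = re.compile(r"(?:before|prior to|through|up to)\s+([\w.\-]+)", re.I)

def fetch_recent(results=50):
    resp = requests.get(NVD_API, params={"resultsPerPage": results}, timeout=60)
    return resp.json().get("vulnerabilities", [])

def initial_triage(entry):
    """Guess language and affected version range from the plain-text description."""
    desc = next(
        (d["value"] for d in entry["cve"]["descriptions"] if d["lang"] == "en"), ""
    ).lower()
    language = next(
        (lang for lang, hints in LANG_HINTS.items() if any(h in desc for h in hints)),
        None,
    )
    match = VERSION_RANGE.search(desc)
    return {
        "id": entry["cve"]["id"],
        "language": language,
        "range_hint": match.group(1) if match else None,
    }
```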
The CVE Triage Pipeline is a semi-automated pipeline that helps experienced security specialists assess the newly crawled CVE data and validate or correct its key information, e.g., the language, the mappings to affected libraries, and the exact affected versions. A double check against existing well-known vulnerability databases (i.e., the Snyk and SourceClear vulnerability databases) is also applied here to ensure correctness.
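The double check itself reduces to a per-source set comparison, as in the sketch below; snyk_lookup and sourceclear_lookup are hypothetical stand-ins, since neither database's query interface is specified here, and the record layout is likewise an assumption.

```python
# Sketch of the triage double check; the lookup callables are hypothetical
# stand-ins for however the Snyk and SourceClear databases are queried.
def double_check(record, snyk_lookup, sourceclear_lookup):
    """Compare our triaged affected-version set against two external sources.

    If any source disagrees, the record is flagged so a security
    specialist resolves the conflict manually.
    """
    ours = set(record["affected_versions"])
    for lookup in (snyk_lookup, sourceclear_lookup):
        external = lookup(record["cve_id"], record["library"])
        if external is not None and set(external) != ours:
            record["needs_manual_review"] = True
            record.setdefault("disagreements", []).append(sorted(external))
    return record
```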
The Graph Pipeline adds the metadata and the mapped CVE data into DVGraph. Specifically, (1) it first inserts all newly added libraries, versions, and CVEs; (2) it creates has edges for the new versions and updates the existing upper and lower edges; (3) for the newly added versions, it adds their own dependencies, iterates over the existing depends edges that originally pointed to the upper or lower versions, alters them if necessary, and updates the satisfying version and default edges accordingly; (4) it adds affects edges for the new CVEs and checks whether the old ones need modification; (5) it updates libdepends and libaffects.
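A self-contained, simplified walk-through of these five steps follows. DVGraph itself lives in a graph database, so the plain-dict edge model, the naive resolve() helper, and the string-based version ordering are all illustrative assumptions rather than the paper's implementation.

```python
# Simplified sketch of the Graph Pipeline's incremental update order.
# The dict-based model and naive version handling are assumptions.
def resolve(constraint, versions):
    """Naive stand-in for the constraint parser: pick the highest version
    satisfying a '>=x' style constraint (string order is an assumption)."""
    floor = constraint.lstrip(">=")
    ok = sorted(v for v in versions if v >= floor)
    return ok[-1] if ok else None

def update_graph(g, new_versions, new_cve_affects):
    """g holds DVGraph's edge sets as plain dicts (an assumption):
       has:      lib -> set of versions
       bounds:   lib -> {"lower": v, "upper": v}
       depends:  (src_lib, src_ver, dep_lib) -> constraint string
       resolved: same key -> the version the constraint currently picks
       affects:  cve_id -> set of (lib, ver)
    """
    touched = {lib for lib, _ in new_versions}
    # (1)+(2) insert new versions, create `has` edges, refresh upper/lower
    for lib, ver in new_versions:
        g["has"].setdefault(lib, set()).add(ver)
    for lib in touched:
        vs = sorted(g["has"][lib])
        g["bounds"][lib] = {"lower": vs[0], "upper": vs[-1]}
    # (3) re-resolve existing `depends` edges aimed at a touched library,
    #     which updates the satisfying (default) version where it changed
    for key, constraint in g["depends"].items():
        if key[2] in touched:
            g["resolved"][key] = resolve(constraint, g["has"][key[2]])
    # (4) add `affects` edges for the newly triaged CVE mappings
    for cve, targets in new_cve_affects.items():
        g["affects"].setdefault(cve, set()).update(targets)
    # (5) recompute the library-level rollups
    g["libdepends"] = {(s, d) for (s, _, d) in g["depends"]}
    g["libaffects"] = {(c, lib) for c, ts in g["affects"].items() for lib, _ in ts}
    return g
```

Step (3) is the subtle one: adding a version can silently change which concrete version an existing range constraint resolves to, which is why the pipeline revisits depends edges rather than only appending new ones.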