We implement the ModuleGuard tool, which sheds light on the impact of module conflicts in the Python ecosystem. It comprises two main components. The first component, InstSimulator, acquires module configuration information from various configuration files and extracts the modules a package would contain after installation, using a simulation-based method that does not involve actual installation. The second component, EnvResolver, extracts direct and extra dependency data from multiple dependency configurations without installation, and supports local environment-related dependencies to obtain more accurate dependency graphs.
PyPI has several types of packages, and each type has its own configuration files and formats. However, there is no comprehensive documentation that specifies which files and parameters are related to module information and how to parse them.
The module path can differ before and after the package is installed. The following is an example.
This is the directory structure in the PyGetWindow-0.0.9.tar.gz package before pip installation.
This is the directory structure after pip installation.
This is part of the setup.py configuration script. The fields that control the installation behavior are packages and package_dir.
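The effect of these two fields can be illustrated with a small sketch. The values below mimic a src-layout package like PyGetWindow and are not the verbatim setup.py; `installed_path` is a hypothetical helper showing how `package_dir` remaps module paths at install time.

```python
# Illustrative values mimicking a src-layout package (not the verbatim setup.py):
config = {
    "package_dir": {"": "src"},     # the root package namespace lives under src/
    "packages": ["pygetwindow"],    # what find_packages(where="src") would find
}

def installed_path(source_path, package_dir):
    # e.g. "src/pygetwindow/__init__.py" installs as "pygetwindow/__init__.py"
    prefix = package_dir.get("", "")
    if prefix and source_path.startswith(prefix + "/"):
        return source_path[len(prefix) + 1:]
    return source_path

print(installed_path("src/pygetwindow/__init__.py", config["package_dir"]))
# → pygetwindow/__init__.py
```

This remapping is exactly why the module path differs before and after installation: the `src/` prefix present in the source archive disappears from the installed layout.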
Raw module data extraction: To solve the first challenge, we systematically study all module-related files and parse the raw module data from the different types of configuration and metadata files.
Installation-free module simulation: To solve the second challenge, we leverage a novel approach to simulate the installation process to obtain accurate module information without installing the packages.
InstSimulator implements a dedicated parser for each type of configuration and metadata file. Specifically, InstSimulator converts text-type files into lists, parses formatted configuration files with format-specific parsers, and, for the executable setup.py script, uses AST and data-flow analysis to extract module configuration parameters, building on the PyCD tool (a tool for parsing dependency information). These parsers extract the relevant information from the files and store it in a structured form for further analysis. In this way, InstSimulator avoids decompressing the package locally: it reads only a few configuration and metadata files from memory, which saves time.
For example, we parsed the setup.py file mentioned in the Challenges section into the following raw module data. Their semantic meanings are listed below:
packages: defines which modules are included; its value is the find_packages function, whose parameters are stored in package_arg.
package_dir: determines the path mapping of the included folders.
You can refer to our project code for details.
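A minimal sketch of this kind of extraction, assuming a literal setup() call (ModuleGuard's actual parser additionally performs data-flow analysis; `extract_setup_kwargs` and `SETUP_PY` are illustrative names):

```python
import ast

# Hand-written sample in the shape of a src-layout setup.py (not a real file).
SETUP_PY = '''
from setuptools import setup, find_packages
setup(
    name="PyGetWindow",
    package_dir={"": "src"},
    packages=find_packages(where="src"),
)
'''

def extract_setup_kwargs(source):
    """Collect the keyword arguments of the setup() call via the AST."""
    kwargs = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name == "setup":
                for kw in node.keywords:
                    try:
                        kwargs[kw.arg] = ast.literal_eval(kw.value)
                    except ValueError:
                        # non-literal values (e.g. find_packages(...)) are kept
                        # as source text for later data-flow analysis
                        kwargs[kw.arg] = ast.unparse(kw.value)
    return kwargs

print(extract_setup_kwargs(SETUP_PY))
```

Because the script is parsed rather than executed, no installation side effects can occur.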
InstSimulator first converts the file structure in the compressed package into a virtual file tree. The tree is constructed from the SOURCES.txt or RECORD metadata file if one is present; otherwise, it is constructed from the package's directory structure.
InstSimulator translates the semantics of each piece of raw module data into operations that add, delete, and search for nodes in the file tree. The conversion rules are listed below.
After parsing all raw module data, we employ a DFS traversal of the file tree to extract the module paths. Note that each path from the root node to a leaf node within this tree corresponds to a module path.
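The tree construction and DFS traversal can be sketched as follows (a simplified illustration, not ModuleGuard's implementation; each node is a plain dict of child names):

```python
# Build a virtual file tree from a list of file paths; each node is a dict
# mapping a path component to its children.
def build_tree(paths):
    root = {}
    for path in paths:
        node = root
        for part in path.split("/"):
            node = node.setdefault(part, {})
    return root

# DFS: every root-to-leaf walk ending in a .py file yields one module path.
def module_paths(node, prefix=""):
    out = []
    for name, child in node.items():
        full = f"{prefix}/{name}" if prefix else name
        if child:                      # inner node (directory): recurse
            out.extend(module_paths(child, full))
        elif name.endswith(".py"):     # leaf node: a module file
            out.append(full)
    return out

files = ["pygetwindow/__init__.py", "pygetwindow/_pygetwindow_win.py", "README.md"]
print(sorted(module_paths(build_tree(files))))
# → ['pygetwindow/__init__.py', 'pygetwindow/_pygetwindow_win.py']
```

Non-Python data files such as README.md are naturally filtered out by the leaf check.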
Common static resolution methods can obtain dependency graphs without installation, but they are not very accurate, because Python dependency graphs depend on local environment information, and the dependencies between packages are complicated by extra dependencies.
Previous work such as Watchman, deps.dev, and SmartPip either lacks up-to-date information or has lower accuracy. Furthermore, these tools do not consider local environment-related and extra dependencies, which are common in Python packages and can affect the dependency graphs. They often parse direct dependencies from only a single dimension, such as setup.py or requires.txt, resulting in incomplete information. On the other hand, obtaining accurate dependency graphs through pip's own parsing process requires actual installation, which is time-consuming.
We found that, out of 4.2 million packages, 199,200 use extra dependencies, 572,202 declare extra dependencies, and 769,189 declare local environment dependencies.
Multidimensional dependency information extraction.
Local environment information collection.
Dependency graph resolution.
EnvResolver adopts a multi-dimensional approach for extracting direct dependencies from three dimensions: PyPI API, dependencies in metadata files, and dependencies in configuration files.
GET https://pypi.org/pypi/{}/json: insert the project name in {}; the endpoint returns information about the latest version of the project, a list of all its versions, and information about the files of each version. For example, https://pypi.org/pypi/twine/json returns all of twine's version information.
GET https://pypi.org/pypi/{}/{}/json: Insert the name and the version of the project in {}, and it returns information about the specific version of the project. For example, https://pypi.org/pypi/requests/2.31.0/json returns the requests@2.31.0's metadata.
The dependency information is stored in the metadata['info']['requires_dist'] field. Note that not all packages expose dependency information through the PyPI API; we later measured that the API provides only about 50% of the dependency information. For example, the JSON returned by https://pypi.org/pypi/underline-turntable/0.3/json shows that underline-turntable@0.3 has no dependencies, yet when we analyze the configuration file in the underline.turntable-0.3.tar.gz source archive, we find that it declares a dependency on RPi.GPIO.
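Reading direct dependencies from the JSON response can be sketched as below. The `metadata` value is a hand-written sample in the API's shape (not a live response), and `direct_dependencies` is an illustrative helper, not EnvResolver's actual parser:

```python
# Extract distribution names from the requires_dist field of a PyPI JSON
# API response. requires_dist may be null, so fall back to an empty list.
def direct_dependencies(metadata):
    requires = metadata["info"].get("requires_dist") or []
    names = []
    for spec in requires:
        # drop the environment-marker part after ';', then keep the name token
        name = spec.split(";")[0].split(" ")[0]
        # strip version constraints such as ">=2" and extras such as "[socks]"
        for sep in "<>=!~([":
            name = name.split(sep)[0]
        names.append(name)
    return names

metadata = {"info": {"requires_dist": [
    "charset-normalizer (<4,>=2)",
    "idna (<4,>=2.5)",
    "PySocks (!=1.5.7,>=1.5.6) ; extra == 'socks'",
]}}
print(direct_dependencies(metadata))
# → ['charset-normalizer', 'idna', 'PySocks']
```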
Metadata files:
{project name}.egg-info/requires.txt
{project name}-{version}.dist-info/METADATA
Configuration files:
setup.py: install_requires, extras_require.
Formatted configuration files, including pyproject.toml and setup.cfg.
EnvResolver collects 11 types of environment information that may affect dependency graphs, such as python_version and os_name. These environment variables and their values are stored in a global dictionary during dependency resolution.
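Such environment information can be gathered with the standard library alone. The sketch below collects the 11 environment marker variables defined by PEP 508; that these are the exact 11 fields EnvResolver uses is our assumption:

```python
import os
import platform
import sys

def collect_environment():
    """Collect the PEP 508 environment marker variables from the local machine."""
    return {
        "os_name": os.name,
        "sys_platform": sys.platform,
        "platform_machine": platform.machine(),
        "platform_python_implementation": platform.python_implementation(),
        "platform_release": platform.release(),
        "platform_system": platform.system(),
        "platform_version": platform.version(),
        "python_version": ".".join(platform.python_version_tuple()[:2]),
        "python_full_version": platform.python_version(),
        "implementation_name": sys.implementation.name,
        # approximation: on CPython this matches the interpreter version
        "implementation_version": platform.python_version(),
    }

env = collect_environment()
print(env["python_version"], env["os_name"])
```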
The following is an example of using local environment dependency:
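For instance, a PEP 508 environment marker can restrict a dependency to one platform, e.g. `colorama>=0.4; sys_platform == "win32"`. The toy evaluator below (`marker_holds` is an illustrative helper, far simpler than a real marker grammar) shows how such a dependency is kept or dropped based on the collected environment:

```python
import sys

# A hypothetical requirement restricted to Windows via an environment marker.
requirement = ('colorama>=0.4', 'sys_platform == "win32"')

def marker_holds(marker, env):
    """Extremely simplified evaluator for a single `var == "value"` marker."""
    var, _, value = marker.partition("==")
    return env.get(var.strip()) == value.strip().strip('"')

env = {"sys_platform": sys.platform}
name, marker = requirement
# The dependency is included only when the marker holds in the local environment.
print(name, "included:", marker_holds(marker, env))
```

This is why two machines can legitimately resolve different dependency graphs for the same package.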
EnvResolver uses the resolvelib framework as the core backtracking resolution algorithm. To speed up and enhance the accuracy of dependency graph resolution, EnvResolver implements the following optimizations.
Local knowledge base. Like previous work such as Watchman, deps.dev, and SmartPip, EnvResolver employs a local knowledge base. However, we adopt multi-dimensional dependency information extraction, which obtains more comprehensive dependency information than previous work.
Reduce backtracking by reordering. Since the order of dependencies does not affect the resolution result but does influence resolution efficiency, EnvResolver sorts the dependencies to reduce the number of backtracks.
Extra/local environment dependency support. EnvResolver supports resolving extra dependencies and local environment dependencies. More specifically, during dependency resolution, an extra dependency adds an entry to the environment variable dictionary.
Local knowledge base. Using the multi-dimensional dependency information extraction method, we collected 4.2 million PyPI packages as a local knowledge base.
Data source: Link
Reduce backtracking by reordering. Reordering rules:
dependencies with a pinned version are resolved first;
dependencies with a range constraint come next;
dependencies with no constraint come last.
In addition, dependencies close to the root node are always resolved before dependencies far from it. This reduces the number of backtracks and noticeably improves resolution efficiency.
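The rules above amount to a sort key over requirement strings; a minimal sketch (the exact key EnvResolver uses is our guess):

```python
# Pinned versions first, range constraints next, unconstrained last.
def order_key(requirement):
    if "==" in requirement:
        return 0          # pinned: only one candidate, resolve first
    if any(op in requirement for op in ("<", ">", "~=", "!=")):
        return 1          # range constraint: limited candidates
    return 2              # unconstrained: resolve last

deps = ["urllib3", "idna>=2.5,<4", "certifi==2023.5.7"]
print(sorted(deps, key=order_key))
# → ['certifi==2023.5.7', 'idna>=2.5,<4', 'urllib3']
```

Resolving the most constrained requirements first prunes the candidate space early, which is what cuts down backtracking.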
Extra/local environment dependency supporting.
The global variable dictionary at the beginning of the dependency resolution is shown in the figure.
When EnvResolver resolves the direct dependencies of package A, it finds that A depends on requests[socks] and records this in the global dictionary.
Later, when resolving the direct dependencies of requests, EnvResolver checks the global dictionary, finds that the key requests has the value socks, and therefore counts PySocks as a direct dependency of requests while removing chardet.
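This bookkeeping can be sketched as follows. The dependency table is illustrative (not pulled from PyPI), and `note_extra`/`direct_deps` are hypothetical helper names:

```python
extras = {}  # global dictionary: package name -> set of requested extras

def note_extra(requirement):
    # "requests[socks]" -> record that the 'socks' extra of requests is wanted
    if "[" in requirement:
        name, extra = requirement.rstrip("]").split("[")
        extras.setdefault(name, set()).add(extra)

def direct_deps(name, dep_table):
    # base dependencies plus those enabled by any recorded extras
    deps = list(dep_table[name]["base"])
    for extra in extras.get(name, set()):
        deps.extend(dep_table[name]["extras"].get(extra, []))
    return deps

# Illustrative dependency table, not real PyPI data.
dep_table = {"requests": {"base": ["urllib3", "idna"],
                          "extras": {"socks": ["PySocks"]}}}
note_extra("requests[socks]")
print(direct_deps("requests", dep_table))
# → ['urllib3', 'idna', 'PySocks']
```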
We represent the dependency graph in terms of nodes and edges. The node level records all the dependencies and their versions, and the edge level records the connections between dependencies. Here is the dependency graph for requests==2.28.1.
Node level:
'certifi': '2023.5.7',
'charset-normalizer': '2.1.1',
'idna': '3.4',
None: '',
'requests': '2.28.1',
'urllib3': '1.26.16'
Edge level (set() means the dependency has no outgoing edges, i.e. no dependencies of its own; None represents a flag node pointing to the root node):
None: {'requests'},
'requests': {'certifi', 'charset-normalizer', 'idna', 'urllib3'},
'charset-normalizer': set(),
'idna': set(),
'urllib3': set(),
'certifi': set()
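The edge-level dict can be walked directly; the sketch below reproduces the graph shown above and lists every package reachable from the None flag node:

```python
from collections import deque

# The edge-level representation for requests==2.28.1 shown above.
edges = {
    None: {"requests"},
    "requests": {"certifi", "charset-normalizer", "idna", "urllib3"},
    "charset-normalizer": set(),
    "idna": set(),
    "urllib3": set(),
    "certifi": set(),
}

def all_packages(edges):
    """BFS from the None flag node, collecting every reachable package."""
    seen, queue = set(), deque([None])
    while queue:
        node = queue.popleft()
        for dep in edges.get(node, set()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)

print(all_packages(edges))
# → ['certifi', 'charset-normalizer', 'idna', 'requests', 'urllib3']
```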
Due to the lack of Python module- and dependency-related benchmarks, we collected module information and dependency information from two dimensions and carefully constructed benchmarks for testing.
Data 1. We selected the top 3,000 projects from Libraries.io and from the PyPI Downloads Table for the six months from August 2022 to February 2023, respectively, and deduplicated the two sets to reduce bias.
Data 2. We randomly selected 5,000 projects from the full list of PyPI packages.
Correct (A==B). The modules or dependency graphs resolved by EnvResolver are totally consistent with the ground truth.
Miss (A<B). Some modules or some elements in dependency graphs of the ground truth do not exist in our results.
Excess (A>B). Some modules or some elements in dependency graphs resolved by EnvResolver do not exist in the ground truth.
Error (others). Other cases.
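The four categories above reduce to set comparisons between our result A and the ground truth B; a minimal sketch (`classify` is an illustrative name):

```python
# Classify a result set A against a ground-truth set B.
def classify(a, b):
    if a == b:
        return "Correct"   # A == B: exactly consistent
    if a < b:
        return "Miss"      # A is a strict subset of B: some elements missing
    if a > b:
        return "Excess"    # A is a strict superset of B: extra elements
    return "Error"         # partially overlapping or disjoint

print(classify({"idna"}, {"idna"}))             # → Correct
print(classify({"idna"}, {"idna", "certifi"}))  # → Miss
```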
To obtain the module paths after installation, we used pip install XXX -t target --no-deps to install each package. This command installs the latest version of the package into the target folder without installing its dependencies, which lets us collect all module paths of the installed package. Moreover, we only considered modules with the .py extension, so data files (e.g. pictures, tables, documentation) included in the packages are ignored.
To obtain the ground-truth dependency graphs, we also used pip installation, with the following settings to obtain the exact dependency graph. First, the pip installation process depends on the state of the remote repository, which is updated in real time, whereas our local knowledge base is updated daily. To close this gap, we mirrored approximately 13 TB of PyPI packages locally with the bandersnatch tool and pointed pip at this local mirror during installation (e.g. pip install XXX -i localhost --ignore-installed). Second, to obtain pip's dependency resolution results, we hook pip's dependency resolution function and write the dependency graph to files without proceeding with the rest of the installation process.
Finally, we successfully installed 4,232 and 3,989 packages for the two benchmarks, respectively. There are two main reasons for installation failure:
First, the local environment is not compatible with the package, e.g. a Python 2 package cannot be installed in a Python 3 environment;
Second, an error occurred while running the installation script, causing the installation process to exit.
The results show that InstSimulator extracts module information with 95.58% and 96.11% accuracy on the two datasets, respectively.
The accuracy of EnvResolver ranges from 95.18% to 98.70%.
Benchmark for module.
Benchmark for dependency.