Since the MC problem has not been studied systematically for Python in existing work, we empirically study MC issues from GitHub and Stack Overflow and classify them into three patterns. We then use ModuleGuard to evaluate all 4.2 million PyPI packages and 3,711 high-star projects collected from GitHub for the presence of MCs and their potential impacts. In summary, we pose the following research questions:
RQ1 (Issue Study). What are the common types of module conflict issues? What potential threats might they have?
RQ2 (PyPI Packages). How many of all PyPI packages have MC effects?
RQ3 (GitHub Projects). How many popular projects on GitHub are affected by MC, and what are their characteristics?
We detected three types of MC patterns for the 4.2 million PyPI packages using ModuleGuard.
For module-to-Lib conflict, we first collect 199 standard library module names from the official Python documentation. Then we analyze the module names used by all the packages in the ecosystem. Note that we cannot know the order of sys.path or the standard library available in a user's environment. Therefore, we assume that users have all 199 standard library modules available locally and that all of them are loaded into the cache.
For module-in-Dep conflict, we consider every released version of each project and resolve its dependency graph with EnvResolver. For the nodes in each resolved dependency graph, we check whether their modules conflict.
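The node-level check can be sketched as follows. This is a minimal illustration, not ModuleGuard's implementation: it assumes the dependency graph has already been resolved (EnvResolver is not reproduced here), and the function name `conflicts_in_graph` and the simplified module lists are hypothetical.

```python
from itertools import combinations

def conflicts_in_graph(nodes, modules_of):
    """Return (pkg_a, pkg_b, shared_paths) for each pair of nodes sharing a module path.

    nodes: package names in one resolved dependency graph.
    modules_of: dict of package name -> iterable of module paths it installs.
    """
    found = []
    for a, b in combinations(sorted(nodes), 2):
        shared = set(modules_of[a]) & set(modules_of[b])
        if shared:
            found.append((a, b, sorted(shared)))
    return found

# Simplified, illustrative module lists (not extracted from the real packages).
nodes = ["aniposelib", "opencv-python", "opencv-contrib-python"]
modules_of = {
    "aniposelib": ["aniposelib/__init__.py"],
    "opencv-python": ["cv2/__init__.py"],
    "opencv-contrib-python": ["cv2/__init__.py"],
}
graph_conflicts = conflicts_in_graph(nodes, modules_of)
```

The pairwise comparison is quadratic in the number of nodes, which is acceptable for a single graph; at ecosystem scale an inverted index from module path to packages would likely be preferable.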
We extract 177,216,363 modules and 27,678,668 direct dependencies for 4,223,950 packages from PyPI as of March 2023 and resolve 4,223,950 dependency graphs. This includes 424,823 latest version packages with 5,419,306 modules.
Data source.
We take each module as a primary key and count the packages that contain it. If the number of packages that contain the module is more than 2, we consider it a conflicting module that may pose a potential threat.
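The counting step above amounts to building an inverted index from module path to packages. The sketch below is illustrative only: the function name, the toy package names, and the `min_packages` parameter are assumptions. The threshold is left as an explicit argument so that it can be set to match the rule stated above; the value 2 is used here only so that the toy data produces a hit.

```python
from collections import defaultdict

def find_conflicting_modules(package_modules, min_packages):
    """Map each module path to the packages shipping it and keep the shared ones.

    package_modules: dict of package name -> iterable of module paths.
    min_packages: minimum number of distinct packages for a path to be flagged.
    """
    owners = defaultdict(set)
    for package, modules in package_modules.items():
        for module_path in modules:
            owners[module_path].add(package)
    return {
        path: sorted(pkgs)
        for path, pkgs in owners.items()
        if len(pkgs) >= min_packages
    }

# Toy data with made-up package names.
toy_index = {
    "pkg-a": ["src/__init__.py", "pkg_a/core.py"],
    "pkg-b": ["src/__init__.py", "pkg_b/core.py"],
    "pkg-c": ["tests/__init__.py"],
}
conflicts = find_conflicting_modules(toy_index, min_packages=2)
```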
Results.
We use the latest version packages of 424,408 projects as of March 2023 to study module-to-TPL conflicts. We find that 91,134 (21.45%) packages have module-to-TPL conflicts, affecting 386,595 (7.13% of 5,419,306) module paths. These packages may face module overwriting or import confusion threats, depending on whether they are installed in the same or different locations. Moreover, 27,851 (6.56%) packages may have an overwriting impact in a Windows environment, involving 3,517 module paths.
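One plausible source of Windows-specific overwriting, which this section does not spell out, is that the default Windows file system treats paths case-insensitively, so two paths that differ only in letter case land on the same file. Under that assumption, such paths could be found with a sketch like the following (function and package names are hypothetical):

```python
from collections import defaultdict

def case_only_collisions(paths_by_package):
    """Return case-folded paths that appear under more than one spelling.

    paths_by_package: dict of package name -> iterable of module paths.
    """
    spellings = defaultdict(dict)  # folded path -> {original spelling: packages}
    for package, paths in paths_by_package.items():
        for path in paths:
            spellings[path.casefold()].setdefault(path, set()).add(package)
    return {
        folded: sorted(variants)
        for folded, variants in spellings.items()
        if len(variants) > 1
    }

# Made-up packages whose module paths differ only in letter case.
collisions = case_only_collisions({
    "pkg-a": ["Crypto/__init__.py"],
    "pkg-b": ["crypto/__init__.py"],
    "pkg-c": ["other/__init__.py"],
})
```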
Findings.
Finding 1: We observe that developers often package redundant modules that are not needed at runtime, such as testing modules (e.g., 41,095 packages have test(s)/__init__.py) and example modules (e.g., 14,877 packages have example(s)/__init__.py). These modules serve only the development process and are more error-prone and confusing. They not only increase the storage pressure on the PyPI server, but also reduce the efficiency of pip resolution due to the backtracking algorithm.
Finding 2: We identify the top 10 most common module paths in software packages, as shown in the table. Over 1,000 packages include src/__init__.py and __init__.py, which result from misconfiguring src-layout and flat-layout packages, respectively. These two modules are stored in the project root directory and carry no meaning or functionality.
Finding 3: Additionally, we find that packages with conflicting modules often have similar names, which reflect their functionality. For example, out of the 404 packages that have the distributions/__init__.py module, 290 contain the substring 'distribution' in their project name. This means that conflicting packages are more likely to be installed together, because they most likely belong to the same domain or have the same functionality.
Finding 4: In addition, by exploring related GitHub issues, we find that maintainers of projects that share a module name are often reluctant to change their own module name. Changing the module name not only breaks backward compatibility, but also entails a large workload and increases the learning cost for users. Maintainers usually recommend that users rename module files or change the name bound in the importing namespace with the import a as b syntax.
Data source.
We take all the standard libraries as primary keys and analyze the number of packages that contain the same module names as standard libraries.
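A minimal sketch of this name comparison is shown below. It makes two assumptions that are not taken from the study: only the top-level name of each module path is compared, and the standard library names come from `sys.stdlib_module_names` (available on Python 3.10 and later) rather than from the 199-name list collected from the documentation. The helper names are hypothetical.

```python
import sys

# sys.stdlib_module_names exists on Python 3.10+; the fallback is a tiny
# illustrative subset, not the 199-name list used in the study.
STDLIB_NAMES = frozenset(
    getattr(sys, "stdlib_module_names", {"types", "io", "logging"})
)

def top_level_name(module_path):
    """Return the importable top-level name of a path such as 'io/__init__.py'."""
    first = module_path.replace("\\", "/").split("/", 1)[0]
    return first[:-3] if first.endswith(".py") else first

def stdlib_conflicts(module_paths, stdlib_names=STDLIB_NAMES):
    """Return the sorted standard library names shadowed by the given paths."""
    return sorted({top_level_name(p) for p in module_paths} & set(stdlib_names))

# 'mypkg/logging.py' is nested, so it does not shadow the top-level 'logging'.
shadowed = stdlib_conflicts(["types.py", "io/__init__.py", "mypkg/logging.py"])
```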
Results.
We analyzed the entire ecosystem of 4.2 million packages and found that 345,068 (8.17%) packages have module-to-Lib conflicts, which may cause import errors at runtime. Moreover, we discovered that 182 (91.96%) out of 199 standard library modules are affected by these conflicts. The most frequently used standard library module names that conflict with third-party packages are types, io, and logging, which are used by 69,940, 47,214, and 35,694 packages, respectively. These results suggest that developers should be careful when choosing module names for their packages and avoid using names that already exist in the standard library.
Findings.
Finding 5: We also observed a gap between the local development and the deployment environments that can lead to import confusion issues. When a program is developed locally, the current working directory has a higher priority than the standard library modules in sys.path. However, when the program is packaged and installed by others from PyPI, the site-packages directory has a lower priority than the standard library modules in sys.path. This gap can result in unexpected runtime errors due to importing the wrong modules. To address this problem, developers have two options: changing the module name or using relative imports. However, both solutions may break backward compatibility and reduce the readability or portability of the code.
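The priority gap can be illustrated with a toy model of first-match lookup along a search path. The sketch deliberately ignores the module cache and built-in modules, and the directory labels are hypothetical stand-ins for real sys.path entries.

```python
def resolve(module_name, search_path, contents):
    """Return the first directory on search_path that provides module_name.

    contents maps a directory label to the set of module names it holds.
    Returns None when no directory provides the module.
    """
    for directory in search_path:
        if module_name in contents.get(directory, set()):
            return directory
    return None

contents = {
    "cwd": {"logging"},            # the project's own logging.py during development
    "stdlib": {"logging", "io"},
    "site-packages": {"logging"},  # the same module after installation from PyPI
}

# Development: the working directory precedes the standard library.
dev_hit = resolve("logging", ["cwd", "stdlib", "site-packages"], contents)
# Deployment: the user's working directory lacks the module, so it is omitted.
installed_hit = resolve("logging", ["stdlib", "site-packages"], contents)
```

In this model the developer's own module is found during development, whereas the installed copy in site-packages is shadowed by the standard library module of the same name.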
Finding 6: In addition, we notice that the number of Module-to-Lib conflict packages increased each year. To further illustrate this trend, we plot the number and percentage of Module-to-Lib conflict packages for each year from 2005 to 2022. As shown in the figure, both the number and percentage increased steadily over the years, indicating that this threat became more prevalent and severe as the Python ecosystem grew and the standard library was extended. This suggests that developers do not pay enough attention to potential conflicts with standard library modules when naming their modules, or that they are unaware of existing or newly added standard library modules that might conflict with their own.
Data source.
Results.
We conducted an empirical study on the entire ecosystem of 4.2 million packages and detected 129,840 (3.07%) packages with module-in-Dep conflicts, involving 11,666 projects. We also find 38,371 packages, involving 4,516 projects, whose conflicting modules have the same paths but different file contents, which may cause functionality errors. Moreover, we noticed that some conflicting modules may change their contents after package updates, which could introduce new problems in the future. Although these conflicting modules may not be invoked at runtime, they still cause module overwriting, which compromises the integrity of packages. There is also no guarantee that these modules will not be called and used in a future version.
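Separating same-path conflicts with identical contents from those with different contents could be done by comparing content digests, as in the sketch below. The use of SHA-256 and the function names are assumptions; the text does not state how file contents were compared.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()

def differing_conflicts(files_a, files_b):
    """Return shared module paths whose contents differ between two packages.

    files_a / files_b: dict of module path -> file contents as bytes.
    """
    shared = files_a.keys() & files_b.keys()
    return sorted(p for p in shared if digest(files_a[p]) != digest(files_b[p]))

# Toy contents for two hypothetical packages.
files_a = {"utils/__init__.py": b"", "utils/io.py": b"def load(): return 1\n"}
files_b = {"utils/__init__.py": b"", "utils/io.py": b"def load(): return 2\n"}
differing = differing_conflicts(files_a, files_b)
```

Only utils/io.py is reported here: utils/__init__.py is also shared, but its contents are identical, so overwriting it would not change behavior.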
Findings.
Finding 7: We also observed that these conflicts often occurred either in consecutive older versions (2,342 out of 4,516) or in all versions (1,819 out of 4,516) of a project. This indicates that some developers or users discovered and resolved the conflicts when they encountered functionality issues, while others did not notice them or did not update their dependencies. This implies that module-in-Dep conflicts are persistent and hard to notice, which may affect the reliability of Python applications. For example, the project aniposelib depended on both opencv-python and opencv-contrib-python prior to version 0.3.7; the maintainers fixed this in a later version after bugs caused by the Module-in-Dep conflict were raised in an issue. Moreover, such conflicts persist for 6.5 versions on average. Such a long time span can affect the functionality and maintainability of the project. Therefore, we argue that it is important to detect and prevent module-in-Dep conflicts in Python packages to ensure correct functionality.
Finding 8: Over time, the number of packages with Module-in-Dep conflicts also gradually increases, as shown in the figure. Many older packages that previously had no conflicts become conflicting as their dependencies are migrated.
Some issues raised by users in which the maintainers fixed the module-in-Dep conflict.
......