Issues That Are Common to Many Packages

1. C Compiler Issue for Installing Packages Using Cython (an Issue for Users)

Python packages that include Cython files require a C compiler at install time. In our group, packages such as py_stringmatching, py_stringsimjoin, and py_entitymatching include Cython files. The details of detecting and installing a C compiler vary by operating system. In this section, for each operating system, we describe how to detect whether a C compiler exists, and then the steps to install one if it does not.
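
As a rough complement to the per-OS checks in this section, the lookup can also be scripted. The sketch below is only an illustration: the function name find_c_compiler and the table of candidate executable names are assumptions, not part of any of our packages. It only checks whether an executable with a likely name is on the PATH, so treat a hit as a hint rather than proof that the compiler works. On Windows in particular, cl is often not on the PATH outside a Visual Studio developer prompt, and on a Mac a clang entry may exist before the developer tools are actually installed.

```python
import platform
import shutil

# Assumed executable names per operating system (illustrative only).
CANDIDATE_COMPILERS = {
    "Linux": ["gcc", "cc"],
    "Darwin": ["clang", "cc"],
    "Windows": ["cl"],
}


def find_c_compiler(system=None):
    """Return the path of the first candidate compiler found on PATH, or None."""
    system = system or platform.system()
    for name in CANDIDATE_COMPILERS.get(system, ["cc"]):
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    compiler = find_c_compiler()
    if compiler:
        print("Found a C compiler candidate at:", compiler)
    else:
        print("No C compiler found on PATH; see the steps for your OS.")
```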

Linux

    1. Check whether gcc (the default C compiler on most Linux distributions) is installed by running the following command in a terminal:

$ gcc

If the command prints a message such as gcc: fatal error: no input files (older versions print gcc: no input files), then gcc is installed. Otherwise, go to step 2.

    2. Install gcc using the package manager for your Linux distribution. For example, on Ubuntu you can install gcc using the following command:

$ sudo apt-get install build-essential

MacOS

    1. Check whether clang (the default C compiler on Mac) is installed by running the following command in a terminal:

$ clang

If the command prints clang: error: no input files, then clang is installed. Otherwise, go to step 2.

    2. Install Xcode, which bundles clang. You can install Xcode from https://developer.apple.com.

Windows

    1. Check whether the Visual C++ compiler is present by installing a package that includes Cython files. If the installation exits with the following error, the C compiler is not installed.

error: Unable to find vcvarsall.bat

    2. Download and install the compiler that matches your Python version; different Python versions require different versions of the Visual C++ compiler. The table below lists the downloads for each Python version.

References

  • Linux

    • https://www.crybit.com/how-to-install-gcc-gnu-c-c-compiler-unixlinux/

  • MacOS

    • https://www.quora.com/How-do-I-successfully-set-up-LLVM-clang-on-Mac-OS-X-El-Capitan

  • Windows

    • https://github.com/cython/cython/wiki/CythonExtensionsOnWindows

    • https://blogs.msdn.microsoft.com/pythonengineering/2016/04/11/unable-to-find-vcvarsall-bat/

2. Avoid Using OpenMP for Multithreading (for Developers)

We recommend that developers avoid using OpenMP (a compiler extension for C and C++ that supports writing multithreaded code) to add parallelism to Cython commands. The commands in py_stringsimjoin were initially implemented using OpenMP, but we observed that this caused many installation issues on Windows (for example, users had to install a patch for the Visual C++ compiler or use a different compiler such as MinGW). As a workaround, we decided to use multiprocessing instead. Conceptually, we did the following:

  1. Write Cython code as if it runs on a single core.

  2. Write a Python wrapper that performs the following steps:

    1. Split the input data into multiple chunks.

    2. Call the Cython code for each chunk in parallel. Use multiprocessing libraries such as Joblib to implement this step.

    3. Aggregate the results produced from each chunk and return this aggregated result to the user.
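
The split, parallel call, and aggregate steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual py_stringsimjoin wrapper: process_chunk is a hypothetical pure-Python stand-in for the single-core Cython routine, the names split_into_chunks and run_in_parallel are made up for this example, and the standard-library ProcessPoolExecutor is used in place of Joblib so that the sketch has no third-party dependencies.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, nearly equal chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Hypothetical stand-in for the single-core Cython routine."""
    return [len(text) for text in chunk]


def run_in_parallel(items, n_jobs=2):
    """Split the input, process each chunk, and aggregate the results."""
    chunks = split_into_chunks(items, n_jobs)
    if n_jobs == 1:
        # Serial path: no worker processes are started.
        per_chunk = [process_chunk(chunk) for chunk in chunks]
    else:
        with ProcessPoolExecutor(max_workers=n_jobs) as pool:
            # map() returns results in the same order as the input chunks.
            per_chunk = list(pool.map(process_chunk, chunks))
    # Aggregate: flatten the per-chunk results into a single list.
    return [value for result in per_chunk for value in result]
```

When calling run_in_parallel with n_jobs greater than 1 from a script, place the call under an if __name__ == "__main__": guard, since worker processes may re-import the main module on some platforms. With Joblib, the parallel step would typically be written with its Parallel and delayed helpers instead of the executor.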

In general, we recommend that developers do the following:

  1. Implement commands using Python.

  2. If a command is not fast enough, then identify the bottlenecks and implement them in Cython (if required).

  3. Avoid using multithreading or advanced features in Cython.
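
For step 2 above, one possible way to identify bottlenecks before moving any code to Cython is the standard-library profiler. The sketch below is only an illustration: jaccard_all_pairs is a hypothetical pure-Python command invented for this example, and profile_call is an assumed helper name, not part of any of our packages.

```python
import cProfile
import io
import pstats


def tokenize(text):
    """Split a string into a set of whitespace-delimited tokens."""
    return set(text.split())


def jaccard_all_pairs(left, right):
    """Hypothetical pure-Python command: Jaccard score for every pair."""
    scores = []
    for left_text in left:
        for right_text in right:
            a, b = tokenize(left_text), tokenize(right_text)
            union = a | b
            scores.append(len(a & b) / len(union) if union else 0.0)
    return scores


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Show the `top` entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(top)
    return result, report.getvalue()


if __name__ == "__main__":
    names = ["acme corp", "acme corporation", "apex inc"] * 50
    _, report = profile_call(jaccard_all_pairs, names, names)
    print(report)
```

Functions that dominate the cumulative-time column of the report are the natural candidates for a Cython rewrite; the rest of the command can stay in Python.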

3. Installing Packages Using Conda (an Issue for Users)

We observed that users prefer conda for installing packages: conda makes some packages (such as PyQt4) easier to install than pip does, and it can also install non-Python library dependencies (such as HDF5 and MKL). Currently, some of our packages can be installed with conda from the conda-forge channel. Specifically, py_stringmatching, py_stringsimjoin, and py_entitymatching can be installed this way. For example, the py_stringmatching package can be installed as follows:

conda install -c conda-forge py_stringmatching
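
After installing, a quick way to confirm that the packages are visible to the active Python environment is to look up their import names. This is a minimal sketch; the helper name installed_status is an assumption made for this example, and it only checks that each module can be located, not that it imports cleanly.

```python
import importlib.util

# Import names of the packages that can be installed from conda-forge.
PACKAGES = ["py_stringmatching", "py_stringsimjoin", "py_entitymatching"]


def installed_status(module_names):
    """Map each top-level module name to True if it can be located."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    for name, found in installed_status(PACKAGES).items():
        print(name, "found" if found else "missing")
```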

For the py_entitymatching package, conda installs all of its dependencies except OpenRefine, pandastable, and XGBoost, because these packages are not available from Anaconda Cloud. They are useful for exploring and cleaning pandas data frames and for predicting matches, and they can be installed by following the instructions given below.

  • To install OpenRefine, follow the instructions on this page.

  • To install pandastable, follow the instructions on this page.

  • To install XGBoost, follow the instructions on this page.