driver.read_parquet("car.parquet", engine="vroom")

4/16/21

Self-teaching the different engines for parquet IO in Python.

What is a parquet file?

A parquet file is a columnar file format for storing tabular data (think spreadsheet: rows and columns; a 2D array). The official website can be found here: https://parquet.apache.org/.

What is an engine?

An engine is the term used to define which "back-end" should be used when reading the parquet file. The term is found within higher-level DataFrame libraries such as pandas or dask.

What is pyarrow?

The Arrow Python bindings, based on the C++ implementation of Arrow. The docs can be found here: https://arrow.apache.org/docs/python/
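
A minimal sketch of using it directly through pyarrow.parquet (the file name here is just an example, not from the notes above):

import pyarrow as pa
import pyarrow.parquet as pq

# build an in-memory Arrow table and round-trip it through parquet
table = pa.table({"col1": ["a", "b"], "col2": [1, 2]})
pq.write_table(table, "arrow_table.parquet")
print(pq.read_table("arrow_table.parquet").to_pandas())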

What is pyarrow-dataset?

An API for efficiently working with tabular data. The docs can be found here: https://arrow.apache.org/docs/python/dataset.html
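
A minimal sketch of the dataset API (file name is just an example; the file is written first so the snippet is self-contained):

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# write a small parquet file, then point a Dataset at it and materialise it as a table
pq.write_table(pa.table({"col1": ["a", "b"], "col2": [1, 2]}), "ds_example.parquet")
print(ds.dataset("ds_example.parquet", format="parquet").to_table().to_pandas())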

What is fastparquet?

A pure Python interface to parquet files. The docs can be found here: https://fastparquet.readthedocs.io/en/latest/
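
A minimal sketch of using it directly (file name is just an example):

import pandas as pd
from fastparquet import ParquetFile, write

# write a pandas dataframe with fastparquet and read it back
write("fp_df.parquet", pd.DataFrame({"col1": ["a", "b"], "col2": [1, 2]}))
print(ParquetFile("fp_df.parquet").to_pandas())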

Create environment

$ mamba create -n test_env python=3.9 --yes
$ conda activate test_env
$ mamba install -c conda-forge dask fastparquet pandas pyarrow --yes
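
A quick sanity check that everything imports (not part of the original install steps, just a check):

import dask
import fastparquet
import pandas
import pyarrow

print(dask.__version__, fastparquet.__version__, pandas.__version__, pyarrow.__version__)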

pandas

Create a dataframe:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2], 'col3': [1.5, 2.5]})

Use .to_parquet(). For the engine arg the docstring says:

engine{‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’

Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.
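
The io.parquet.engine option mentioned in the docstring lives in pandas' options system, so it can be inspected or overridden. A sketch (forcing fastparquet here is just for illustration):

import pandas as pd

print(pd.get_option("io.parquet.engine"))  # 'auto' by default
pd.set_option("io.parquet.engine", "fastparquet")  # make 'auto' resolve to fastparquet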

The pandas parquet IO file is reasonably digestible: https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py

As far as I can tell engine="auto" is a placeholder for pandas to first check whether pyarrow is installed, and then fastparquet. If you have pyarrow installed, that is what gets used by default. If you don't have pyarrow installed but do have fastparquet, then fastparquet becomes the default. PyArrowImpl (and FastParquetImpl) extend BaseImpl in the code.
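
If you want to peek at which implementation "auto" resolves to, the get_engine helper can be called directly (it's an internal API, so subject to change; this is just to look under the hood):

from pandas.io.parquet import get_engine

# returns a PyArrowImpl if pyarrow is importable, otherwise a FastParquetImpl
print(type(get_engine("auto")))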

I trust pandas' judgement in choosing pyarrow over fastparquet, so let's use it implicitly, without specifying the engine:

df.to_parquet("pd_df.parquet")
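
If you want to pin the writer explicitly and compare the two, something like this should work (file names are just examples):

df.to_parquet("pd_df_pyarrow.parquet", engine="pyarrow")
df.to_parquet("pd_df_fastparquet.parquet", engine="fastparquet")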

Check that we can read it back in using .read_parquet():

df = pd.read_parquet("pd_df.parquet")
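
A quick round-trip check (a sketch; assert_frame_equal raises if the frames differ):

import pandas as pd

original = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2], 'col3': [1.5, 2.5]})
pd.testing.assert_frame_equal(original, pd.read_parquet("pd_df.parquet"))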

dask

The dask .to_parquet() docstring for engine says:

engine:{‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’

Parquet library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.

dask defaults to engine="fastparquet" (when both libraries are installed). Martin Durant mentioned this is historic: fastparquet is installed by default in the Anaconda installer and pyarrow is not.

Create a dask dataframe from the pandas dataframe and store it to a parquet using the default engine.

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)

ddf.to_parquet("dd_df.parquet")

If you take a look at dd_df.parquet you'll notice it's a folder which contains _common_metadata, _metadata and part.0.parquet. You can learn more about metadata here: https://docs.dask.org/en/latest/dataframe-design.html#metadata.
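
You can confirm that by listing the directory (a quick sketch):

import os

# expect the metadata files plus one part file per partition
print(sorted(os.listdir("dd_df.parquet")))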

Use .read_parquet() to read it in. For the engine arg the docstring says:

engine:str, default ‘auto’

Parquet reader library to use. Options include: ‘auto’, ‘fastparquet’, ‘pyarrow’, ‘pyarrow-dataset’, and ‘pyarrow-legacy’. Defaults to ‘auto’, which selects the FastParquetEngine if fastparquet is installed (and ArrowLegacyEngine otherwise). If ‘pyarrow-dataset’ is specified, the ArrowDatasetEngine (which leverages the pyarrow.dataset API) will be used for newer PyArrow versions (>=1.0.0). If ‘pyarrow’ or ‘pyarrow-legacy’ are specified, the ArrowLegacyEngine will be used (which leverages the pyarrow.parquet.ParquetDataset API). NOTE: ‘pyarrow-dataset’ enables row-wise filtering, but requires pyarrow>=1.0. The behavior of ‘pyarrow’ will most likely change to ArrowDatasetEngine in a future release, and the ‘pyarrow-legacy’ option will be deprecated once the ParquetDataset API is deprecated.

As far as I can tell, engine="auto" uses fastparquet (when it is installed). There is "pyarrow" and a couple of different flavors of it: "pyarrow-legacy" and "pyarrow-dataset". "pyarrow-legacy" and "pyarrow" refer to the ArrowLegacyEngine; "pyarrow-dataset" refers to the ArrowDatasetEngine. "pyarrow" is likely to use the ArrowDatasetEngine in a future release. This split allows dask to work with the newer pyarrow.dataset API, which has the advantage of enabling row-wise filtering.

I recently learned that reading in with engine="pyarrow" allows you to use multi-indexing: https://github.com/dask/dask/issues/7555

Demonstrate the functionality of the engines below:

dd.read_parquet("dd_df.parquet")

dd.read_parquet("dd_df.parquet", engine="pyarrow", index=["col1", "col2"])

dd.read_parquet("dd_df.parquet", engine="pyarrow-dataset", filters=[("col1", "==", "a")])