zarr vs parquet

10/12/21 & 11/10/22

Learnings on when to store data in one format over the other.

What is zarr?

"Zarr is a format for the storage of chunked, compressed, N-dimensional arrays" [1]

What is a parquet?

See my last blog: driver.read_parquet("car.parquet", engine="vroom")

Why not store everything in parquet?

A question like this was asked at the dask summit in the High-Performance Data Access for Dask session [2] and is the motivation for this blog.

Parquet is by far more popular than zarr. It's hard to quantify, but I would guess somewhere in the range of 100-1,000 times more popular. This ratio comes from a hunch about how many people work with tabular datasets (e.g. databases) versus array datasets (e.g. images).

Transformations

From here on, I'll use parquet and DataFrame interchangeably, and likewise zarr and tensor (multi-dimensional array).

A DataFrame can be thought of as a two-dimensional array. It can also be thought of as a sparse array [3], for example if time is an index in the DataFrame and isn't continuous.

A tensor can be converted to a DataFrame by reshaping it to two dimensions, e.g. (3 x 4 x 5) -> (3 x 20) or (1 x 60).
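As a toy example with plain numpy (separate from the sea surface temperature data used below):

import numpy as np

import pandas as pd


arr = np.arange(60).reshape(3, 4, 5)  # a small (3 x 4 x 5) tensor

pd.DataFrame(arr.reshape(3, 20))  # 3 rows x 20 columns

pd.DataFrame(arr.reshape(1, 60))  # 1 row x 60 columns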

These transformations are built into xarray and pandas. To demonstrate, I'll take sea surface temperature data. This is an interesting variable as it's often stored as an array with a missing value at land points. (The Python environment to run the code is given in the footnotes.)

import xarray as xr


da = xr.open_dataset("https://psl.noaa.gov/thredds/dodsC/Datasets/noaa.ersst.v5/sst.mnmean.nc")["sst"]

This is monthly data from 1854 to present. As of writing this blog the data has shape 2025 (time) x 89 (lat) x 180 (lon). You can reshape it to a single-column pandas.DataFrame using xarray.DataArray.to_dataframe():

df = da.to_dataframe()

This gives a DataFrame with time, lat and lon as a multi-index, a single sst column, and 32,440,500 records.

You can round-trip this using pandas.DataFrame.to_xarray(), although it won't have the metadata of the original da:

df.to_xarray()
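If you need that metadata, a minimal sketch is to copy the attributes back over by hand:

ds_roundtrip = df.to_xarray()

ds_roundtrip["sst"].attrs = da.attrs  # reattach the attributes lost in the round trip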

Storage

We can store these objects using the built-in tools of xarray and pandas. Here, we'll use the default arguments:

ds = da.to_dataset()


ds.to_zarr("sst.zarr")


df.to_parquet("sst.parquet")

We can inspect the size of these files:

!du -sh sst.zarr # 65 Mb

!du -sh sst.parquet # 89 Mb

From here on, I'll skip the du calls and put the size next to the store command in a comment.
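If you'd rather stay in Python than shell out to du, a small pathlib helper (a sketch; the name store_size_mb is my own) does the same job:

from pathlib import Path


def store_size_mb(path):
    """Total size of a file, or a directory of chunk files, in megabytes."""
    p = Path(path)
    files = [p] if p.is_file() else [f for f in p.rglob("*") if f.is_file()]
    return sum(f.stat().st_size for f in files) / 1e6


store_size_mb("sst.zarr")  # the zarr store is a directory of chunks

store_size_mb("sst.parquet")  # the parquet output is a single file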

You can save a bit of storage on the parquet by dropping the multi-index:

df2 = df.reset_index()

df2.to_parquet("sst_no_multiindex.parquet") # 80 Mb

We are storing lots of nans here for the land points. We can drop the nans to make the data smaller. This doesn't affect the transformation back to an array, as pandas will pad missing values with nan. Interestingly, it doesn't reduce the storage size either, although it will speed up future analytics.

df2.dropna().reset_index(drop=True).to_parquet("sst_no_land.parquet") # 80 Mb
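As a quick check, converting the nan-free frame back (using the multi-indexed df here rather than df2) rebuilds the dense grid with nan at the dropped points:

df.dropna().to_xarray()  # land points reappear as nan in the time/lat/lon grid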

When dropping nans in xarray there are a couple of options. One is to drop along a dimension where all values are nan (for example, drop Antarctica):

da.dropna(dim="lat", how="all").to_dataset().to_zarr("sst_no_antartica.zarr") # 64 Mb

The other is to create a multi-index and dropna like pandas, although this can't be stored as a zarr because of https://github.com/pydata/xarray/issues/1077

_ds = da.stack({'index': ['time', 'lat', "lon"]}).dropna(dim="index").to_dataset()

_ds.to_zarr("sst_drop_na.zarr") # NotImplementedError

There is, however, a workaround in cf-xarray (https://cf-xarray.readthedocs.io/en/latest/). This still keeps the dimensions, but "index" now holds a mapping to the other dimensions:

import cf_xarray


_ds2 = cf_xarray.encode_multi_index_as_compress(_ds)

_ds2.to_zarr("sst_drop_na.zarr") # 60 Mb
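Going the other way, cf-xarray also provides a decoder to rebuild the multi-index after reading the store back (a sketch, assuming the decode_compress_to_multi_index helper; check the cf-xarray docs for the exact name):

_ds3 = xr.open_zarr("sst_drop_na.zarr")

cf_xarray.decode_compress_to_multi_index(_ds3)  # restores the stacked "index" multi-index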

Compression

Compression in zarr is controlled by numcodecs. Xarray provides guidance here: http://xarray.pydata.org/en/stable/user-guide/io.html?highlight=compression#zarr-compressors-and-filters. zstd (Zstandard) is used in the example so we'll use that here. The docs say the compression level ranges from 0-9; a lower value means faster access to the data when reading. In this case I'll use the maximum compression of 9. Shuffle is given as 2 in the xarray example, which means Bitshuffle. Bitshuffle is a "filter" that rearranges the bits so they can be stored more efficiently [4]:

import zarr

from numcodecs import Blosc


compressor = zarr.Blosc(cname="zstd", clevel=9, shuffle=Blosc.BITSHUFFLE)

ds.to_zarr("sst_compressed.zarr", encoding={"sst": {"compressor": compressor}}) # 59 Mb

We can try bitinformation compression (https://www.nature.com/articles/s43588-021-00156-2), which has shown promising results for weather and climate data. Choosing the number of bits to keep is a bit subjective, but xbitinfo provides useful diagnostics showing how much real information is held at each bit level.

import xbitinfo as xb


bi = xb.get_bitinformation(ds) # bit info per dim

kb = xb.get_keepbits(bi)

xb.xr_bitround(ds, int(kb["sst"].max())).to_compressed_zarr("sst_compressed_bi.zarr") # 12 Mb

A factor of five compression is quite impressive.

Pandas uses snappy compression by default. Let's try other compression methods:

df2.to_parquet("sst_no_multiindex.parquet.gzip", compression="gzip") # 63 Mb

df2.to_parquet("sst_no_multiindex.parquet.brotli", compression="brotli") # 61 Mb

We can move to lower-level libraries that offer more flexible compression for tabular data:

import pyarrow as pa


table = pa.Table.from_pandas(df2)


import pyarrow.parquet as pq


pq.write_table(table, "sst_no_multiindex_pq.parquet") # 80 Mb

pq.write_table(table, "sst_no_multiindex_pq.parquet.zstd", compression="ZSTD", compression_level=9) # 63 Mb

This could be expanded into a more thorough compression benchmark study, such as the one done by Alistair Miles: http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html.
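A rough sketch of what a tiny version of such a benchmark could look like for the parquet side, reusing the store_size_mb helper sketched above (the codec list is just the ones pyarrow commonly ships with):

import time


for codec in ["NONE", "SNAPPY", "GZIP", "BROTLI", "ZSTD"]:
    path = f"sst_bench_{codec.lower()}.parquet"
    start = time.perf_counter()
    pq.write_table(table, path, compression=codec)
    elapsed = time.perf_counter() - start
    print(codec, round(store_size_mb(path)), "Mb", round(elapsed, 2), "s")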

So what should I use?

Zarr is a fairly new format and is a good choice if you are working with array-like data. This includes weather and climate data (https://pangeo.io/), image data (e.g. https://github.com/activeloopai/Hub is built on numcodecs) and genetics data (https://pystatgen.github.io/sgkit/latest/). It can also be used to store array data that has undergone feature-engineering transformations before the model-fitting step in machine learning. You can see how zarr is used in the wild by searching for it on Kaggle: https://www.kaggle.com/search?q=zarr. Tools such as kerchunk can help create views of zarr stores.

A zarr store can be made up of lots of little files (chunks). This can be troublesome in embarrassingly parallel workflows, as IO with zarr can use all the cores, although I believe this can be configured.
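For example, the Blosc codec used above is itself multi-threaded; numcodecs lets you pin it to a single thread so the outer scheduler owns the cores (a sketch, assuming Blosc-compressed chunks):

import numcodecs.blosc


numcodecs.blosc.set_nthreads(1)  # stop Blosc competing with your parallel framework for cores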

Xbitinfo offers great compression for physical data; I can't see myself storing weather data as zarr without compressing it this way first.

Parquet is well supported for tabular data and is a very good choice for machine learning. Most libraries offer column selection and row filtering to access only the bytes of interest. It can also be a good option for storing weather data with rotated grids and complex dimensions, such as a grib file, but this requires some work up front.
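For example, with pandas and the pyarrow engine you can read back just the columns and rows you care about from the nan-free file created earlier (a sketch; the filter predicate is only an illustration):

import pandas as pd


northern_sst = pd.read_parquet(
    "sst_no_land.parquet",
    columns=["time", "lat", "lon", "sst"],
    filters=[("lat", ">", 0)],  # only rows in the northern hemisphere
)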

The transformations above allow you to work with the data in a framework you are familiar with. These can be created as views using transformations in intake.

Footnotes

*1

$ conda remove --name test_env --all --y

$ mamba create -n test_env python=3.10 --y
$ conda activate test_env
$ mamba install -c conda-forge cf_xarray netCDF4 ipython pandas pyarrow xarray xbitinfo zarr --y

References

[1] https://zarr.readthedocs.io/en/stable/

[2] https://www.youtube.com/watch?v=m5acQ2WEAUI

[3] https://en.wikipedia.org/wiki/Sparse_matrix

[4] https://www.sciencedirect.com/science/article/abs/pii/S2213133715000694?via%3Dihub#!