clustering, partitioning, z-order

Liquid clustering, partitioning and Z-order

In databricks, the old way of optimizing data storage is using partitioning and z-order.

The partitioning is done on a low cardinality column, e.g. financial year

The Z-order is done on frequently visited and high cardinality column, e.g. transaction date

In this way, querying transactions from a period within a financial year can be answered with a small scan of the delta files.

A challenge with partitioning + z order is that the query pattern could change over time.

A partition column / z order column might not be a good one any more.

In those cases, normally a table needs to be recreated with better partition and z-order columns.

A new optimizaion technique in databricks is Liquid Clustering.

The only good thing seems to be chaning the clustering columns without re-creating a table.

Simply:

CREATE TABLE table1(col0 int, col1 string, col2 string) CLUSTER BY (col0, col1)

The cluster by enables liquid clusteirng on the specified columns.

Multiple columns can be used for clustering.

The when a cluster column becomes un-suitable for clustering, simply run an Alter Table statement to change cluster by.

ALTER TABLE <table_name>

CLUSTER BY (<clustering_columns>)

However, same as partition and z-order, it is recommended to run OPTIMIZE command regularly to re-order the cluster / partition / z-order.