Overwrite delta table partitions
Sometimes it is easier to overwrite partitions of a table, instead of doing an expensive full table merge.
A merge involves a full outer join between source and target tables so a lot data transferred, while overwriting only a few partitions may be simple, robust and minimize data transfer. This is especially useful when only small partitions of data, e.g. financial months, are updated while the older data is static.
A quick example is:
(df.write
.mode("overwrite")
.option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
.save("/tmp/delta/people10m")
)
A more detailed implementation in below.
Note for the first load, it can't have the ".option("replaceWhere" ... ". This is known as a spark bug currently.
So for the first data load, it simply writes all data. The partition overwrite is not applicable yet.
Make sure the data is partitioned when written into a delta table
(
df.write
.format("delta")
.partitionBy("financial_month") #make sure the target lakehouse is partitioned
.mode("overwrite")
.save("/tmp/delta/test_table")
)
For subsequent writes, it overwrites only the target partitions.
The source data frame doesn't require to be filtered by target partitions as the write later will filter by partition, but probably good to reduce the size early before being written.
df = spark.read.table("test.source_table")
df = df.filter("financial_month >= 202405") #select only the target partitions
(
df.write
.format("delta")
.partitionBy("financial_month")
.mode("overwrite")
.option('overwriteSchema', 'true')
.option("replaceWhere", "financial_month >= 202405") #only the involved paritions are overwritten
.save("/tmp/delta/test_table")
)
Note that ".option("replaceWhere", "financial_month >= 202405")" specifies the partitions to be overwritten.
Dynamic overwriting partitions
A preview feature as of 2024-05 is dynamically overwriting only partitions present in the source data frame df.
(df.write
.mode("overwrite")
.option("partitionOverwriteMode", "dynamic")
.saveAsTable("default.people10m")
)