When using Structured Streaming to load data, if any window aggregates are used, Spark automatically takes care of late-arriving data. Every aggregate window is like a bucket (time window). As soon as a new timestamp arrives, the engine opens a new bucket and starts calculating the aggregates (e.g. count / avg) for records that fall into that bucket. The buckets stay open, so even if a record arrives 5 hours late, it can still update the old bucket.
The problem with keeping old buckets open is that they must stay open forever, occupying memory and computation. The solution is watermarking: it tells the engine how late data can be, so the engine can drop the old state of expired buckets. A watermark specifies a trailing gap of time, i.e. how late data is allowed to be. If the trailing gap is 1 hour, no data more than 1 hour older than the latest event time seen is expected, so the engine can safely drop all bucket state older than that. Only data that arrives late within the gap is still aggregated.
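In Spark the gap is declared with `withWatermark("eventTime", "1 hour")` before the windowed aggregation. The bucket/eviction mechanics above can be illustrated with a small pure-Python simulation (a toy stand-in, not Spark's actual implementation; `process_events` and its parameters are hypothetical names for illustration):

```python
from collections import defaultdict

def process_events(events, window_size, max_lateness):
    """Toy simulation of tumbling-window counts with a watermark.

    events: event-time integers (e.g. seconds), in arrival order.
    window_size: width of each tumbling window (bucket).
    max_lateness: the watermark's trailing gap; bucket state entirely
    behind (max event time seen - max_lateness) is dropped.
    """
    buckets = defaultdict(int)   # window start -> count (open state)
    dropped = []                 # events too late to be aggregated
    max_seen = float("-inf")

    for t in events:
        max_seen = max(max_seen, t)
        watermark = max_seen - max_lateness
        start = (t // window_size) * window_size
        # A late event is still aggregated as long as its bucket has
        # not been evicted, i.e. the bucket ends after the watermark.
        if start + window_size > watermark:
            buckets[start] += 1
        else:
            dropped.append(t)
        # Evict state for buckets entirely behind the watermark.
        # (Real Spark would emit their final result at this point.)
        for s in [s for s in buckets if s + window_size <= watermark]:
            del buckets[s]

    return dict(buckets), dropped
```

With `window_size=10` and `max_lateness=15`, an event at time 12 arriving after one at time 40 falls behind the watermark (25) and is dropped, while earlier late events still update their open buckets.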
NOTE: For streaming queries that have no stateful operation (e.g. no aggregation or reliance on previous records), and for batch queries, the watermark is ignored.
# Structured Streaming doesn't handle updates / deletes in a source table by default; if any change is detected, it simply throws an exception
# With ignoreChanges, Structured Streaming re-processes the changed files. Note that unchanged rows in the same file may still be emitted, so your downstream consumers must be able to handle duplicates
.option("ignoreChanges", "true")
.option("startingTimestamp", "2018-10-18")
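One common way for a downstream consumer to tolerate the duplicates that ignoreChanges can produce is an idempotent, keyed upsert: re-delivered rows just overwrite the same key. A minimal pure-Python sketch (the `upsert_batch` helper and the dict-as-store are hypothetical stand-ins for e.g. a database table with a primary key):

```python
def upsert_batch(store, rows):
    """Idempotent sink: merge a (possibly duplicated) micro-batch by key.

    A re-emitted, unchanged row overwrites the same key with the same
    value, so duplicates from ignoreChanges are harmless.
    """
    for row in rows:
        store[row["id"]] = row
    return store

store = {}
upsert_batch(store, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
# Re-processing the changed file re-emits row 1 unchanged
# alongside the real update to row 2.
upsert_batch(store, [{"id": 1, "v": "a"}, {"id": 2, "v": "b2"}])
```

After both batches the store holds exactly one row per key, with row 2 reflecting the update, which is the behavior a duplicate-tolerant consumer needs.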