Structured Streaming automatically detects changes in the source and streams them to the sink.
It can watch a data lake location and, whenever a new file lands there, emit the new records down the stream.
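For example, a file-based source can be set up with readStream. A minimal sketch, assuming a SparkSession named spark is available and using a made-up path and schema:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Illustrative schema; file sources require the schema to be declared up front.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
])

# Each new file that lands in the source path becomes new rows in the stream.
events = (
    spark.readStream
         .format("json")                   # csv, parquet, delta, ... also work
         .schema(schema)
         .load("/mnt/datalake/landing/")   # placeholder path
)
```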
There are three output modes for streaming data to a sink:
Append. Only new records are added to the sink. This is ideal for insert-only data sources.
Complete. The entire updated Result Table is written to the external storage. It is up to the storage connector to decide how to handle writing the entire table.
Update. Only the rows that were updated in the Result Table since the last trigger are written to the external storage.
Note that this is different from Complete mode in that it outputs only the changed rows instead of the whole table. If the query doesn't contain aggregations, it is equivalent to Append mode.
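The mode is chosen on the writer. A minimal sketch, reusing the events stream from above (the sink format, paths, and mode here are assumptions):

```python
# The file sink shown here supports append; complete and update are typically
# used with aggregating queries and sinks that accept updates (console, memory, etc.).
query = (
    events.writeStream
          .format("parquet")
          .outputMode("append")
          .option("checkpointLocation", "/mnt/datalake/checkpoints/events")  # placeholder
          .start("/mnt/datalake/output/events")                              # placeholder
)
```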
Note: if the query contains an aggregation, the stream handles late-arriving data by recomputing the aggregates for the affected time buckets.
Below is a good example of calculating a count using a sliding window, a watermark, and Update mode. In this way, updated counts within the watermark window are continuously emitted from the stream, and it is up to the downstream consumer to handle the updates.
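A sketch of such a query, assuming the event_time column from the events stream above; the window and watermark intervals and the console sink are illustrative choices:

```python
from pyspark.sql.functions import window, col

# Count events per 10-minute window, sliding every 5 minutes, and tolerate
# data arriving up to 15 minutes late.
windowed_counts = (
    events
        .withWatermark("event_time", "15 minutes")
        .groupBy(window(col("event_time"), "10 minutes", "5 minutes"))
        .count()
)

# Update mode emits only the windows whose counts changed in each trigger,
# including recounts caused by late data that arrives within the watermark.
query = (
    windowed_counts.writeStream
        .outputMode("update")
        .format("console")   # any sink that can handle updates
        .start()
)
```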
Spark supports three types of time windows: tumbling (fixed), sliding, and session. Fixed-size (tumbling) windows do not overlap. Sliding windows can overlap. Session windows have a variable size (refer to the documentation for details).
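To illustrate the difference, assuming the same event_time column (the intervals are made up):

```python
from pyspark.sql.functions import window, session_window

# Tumbling (fixed) window: 10-minute buckets, no overlap.
tumbling = events.groupBy(window("event_time", "10 minutes")).count()

# Sliding window: 10-minute buckets starting every 5 minutes, so they overlap.
sliding = events.groupBy(window("event_time", "10 minutes", "5 minutes")).count()

# Session window: closes after a 5-minute gap of inactivity, so its size varies.
sessions = events.groupBy(session_window("event_time", "5 minutes")).count()
```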
Once streaming starts, the Spark job initializes the stream, which means it collects all of the current files from the source path in one go. The job status shows as 'stream initializing...'. If you list the target directory (ls) at this point, you will see many files arriving quickly.
Once all current files have been streamed, the job enters a steady state in which it only runs when new files land in the source path. From that point, you will see new files arriving in the target directory incrementally.
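You can also observe this from the query handle returned by start(). A small sketch, reusing the query variable from above:

```python
# While the initial backlog is being processed, the status message reflects
# that work; once caught up, the stream sits idle between new files.
print(query.status)         # {'message': ..., 'isDataAvailable': ..., 'isTriggerActive': ...}
print(query.lastProgress)   # metrics for the most recent micro-batch

# Stop the stream when finished.
# query.stop()
```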