Spark: infer and update schema

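For CSV sources, inferSchema is off by default and every column is read as a string; set it to true to have Spark infer column types from the data.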

However, when inference runs on a random sample of the rows (samplingRatio below 1.0), the inferred schema can be wrong because the sample may miss the rows that matter, e.g. the max-length string in a column.
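
To make the failure mode concrete, here is a toy sketch in plain Python. It is not Spark's inference algorithm: infer_type is a hypothetical helper, and a fixed prefix of the column stands in for the random sample so the result is reproducible. It only shows how a sample that misses one row yields a narrower type than the full column.

```python
def infer_type(values):
    """Toy type inference: the narrowest of int / double / string that fits every value."""
    def all_parse(cast):
        for value in values:
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    return "string"


# One raw CSV column; the last row is not numeric.
column = ["10", "20", "30", "40", "n/a"]

# Stand-in for a sample that happens to miss the last row.
sampled = column[:3]

sampled_type = infer_type(sampled)  # inferred from the sample only
full_type = infer_type(column)      # inferred from every row

print(sampled_type, full_type)
```

The sample suggests an integer column, while the full data only fits a string column, so a schema inferred from the sample would not match the data.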

You can force inference to use 100% of the data. samplingRatio takes a value greater than 0 and at most 1:

  .option("inferSchema", "true").options(samplingRatio=1.0).csv(...)

To change a column's data type, the Delta table currently has to be overwritten:


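  # col is pyspark.sql.functions.col
  (spark.read.table(...)
    .withColumn("birthDate", col("birthDate").cast("date"))
    .write.mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(...))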



To add new columns, use mergeSchema: it appends the new columns and their data types to the table schema.

Columns that are present in the DataFrame but missing from the table are automatically added as part of the write transaction when write or writeStream has .option("mergeSchema", "true").


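  # col is pyspark.sql.functions.col; append mode is assumed here
  (spark.read.table(...)
    .withColumn("birthDate", col("birthDate").cast("date"))
    .write.mode("append")
    .option("mergeSchema", "true")
    .saveAsTable(...))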

