Spark: infer and update schema

For CSV, inferSchema is off by default (every column is read as a string); enable it with .option("inferSchema", "true")

However, when the inferred schema is based on a random sample of the data, it can be wrong, e.g. the sample may miss the longest string in a column or a value that changes a column's type

You can force Spark to scan 100% of the data during inference by setting samplingRatio to 1.0 (it accepts a fraction between 0 and 1).

spark.read.option("inferSchema", "true").option("samplingRatio", 1.0).csv(...)
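
To see how a too-small sample can go wrong, here is a minimal sketch (the file path, column name, and SparkSession setup are assumptions, not part of the original note): a column that looks numeric for thousands of rows but contains one string near the end. Scanning every row during inference classifies it correctly as a string.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical file: 10,000 numeric-looking rows followed by one string row.
with open("/tmp/sampling_demo.csv", "w") as f:
    f.write("value\n")
    f.write("\n".join(str(i) for i in range(10000)))
    f.write("\nnot-a-number\n")

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("samplingRatio", 1.0)  # scan every row while inferring
      .csv("/tmp/sampling_demo.csv"))

df.printSchema()  # value: string, because the full scan saw the "not-a-number" row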



To change a column's data type, you currently have to overwrite the table. For Delta tables this also requires overwriteSchema, which lets the write replace the existing table schema:

from pyspark.sql.functions import col

(spark.read.table(...)
  .withColumn("birthDate", col("birthDate").cast("date"))  # cast the column to its new type
  .write
  .mode("overwrite")                  # rewrite the table data
  .option("overwriteSchema", "true")  # and replace the table schema
  .saveAsTable(...)
)
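
A quick way to confirm the cast took effect is to read the table back and print its schema. The table name below is a hypothetical stand-in for whatever was passed to saveAsTable:

spark.read.table("people").printSchema()  # birthDate should now show as: date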


To add new columns, use mergeSchema, which appends new columns (with their data types) to the table schema.

Columns that are present in the DataFrame but missing from the table are automatically added as part of the write transaction when write or writeStream sets .option("mergeSchema", "true"):


from pyspark.sql.functions import col, year

(spark.read.table(...)
  .withColumn("birthYear", year(col("birthDate")))  # "birthYear" is an illustrative new column not yet in the table
  .write
  .mode("append")                  # mergeSchema does not work with the default errorifexists mode
  .option("mergeSchema", "true")   # new columns are added to the table schema
  .saveAsTable(...)
)
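
The same option works for streaming writes, since mergeSchema applies to writeStream as well. A hedged sketch, where the source table people_updates, the target table people, and the checkpoint path are all hypothetical names:

(spark.readStream
  .table("people_updates")
  .writeStream
  .format("delta")
  .option("mergeSchema", "true")                     # new source columns are added to the target schema
  .option("checkpointLocation", "/tmp/people_ckpt")  # streaming writes require a checkpoint location
  .toTable("people"))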