Spark: infer and update schema
For CSV, inferSchema is off by default (every column is read as a string); for JSON, schema inference is on by default.
When inference is enabled, Spark infers column types from a sample of the rows, so the inferred schema can be wrong if the sample misses atypical values in a column (e.g. the longest string, or a lone non-numeric value in an otherwise numeric column).
You can force Spark to use 100% of the data to infer the schema via samplingRatio, a fraction between 0 and 1:
spark.read.option("inferSchema", "true").option("samplingRatio", 1.0).csv(...)
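To see why sampling can mis-infer a type, here is a plain-Python sketch (an analogy, not Spark's actual inference code): the helper `infer_type` and the sample data are made up for illustration.

```python
def infer_type(values):
    """Infer 'int' if every sampled value parses as an integer, else 'string'."""
    for v in values:
        try:
            int(v)
        except ValueError:
            return "string"
    return "int"

# The last row breaks the numeric pattern, like a stray "N/A" in a CSV column.
column = ["1", "2", "3", "4", "N/A"]

sampled = infer_type(column[:4])  # a partial sample misses the bad row
full = infer_type(column)         # scanning all rows catches it

print(sampled)  # int
print(full)     # string
```

With a partial sample the column looks like an int, so reads of the full data would fail or corrupt; scanning everything (samplingRatio=1.0) pays a performance cost but infers the safe type.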
To change an existing column's data type, you currently have to overwrite the table:
from pyspark.sql.functions import col

(spark.read.table(...)
    .withColumn("birthDate", col("birthDate").cast("date"))
    .write
    .mode("overwrite")
    .option("overwriteSchema", "true")  # Delta Lake: allow the schema to change on overwrite
    .saveAsTable(...)
)
To add new columns, mergeSchema can append new columns and their data types to the table schema.
Columns that are present in the DataFrame but missing from the table are automatically added as part of the write transaction when write or writeStream has .option("mergeSchema", "true"):
from pyspark.sql.functions import current_timestamp

(spark.read.table(...)
    .withColumn("ingestedAt", current_timestamp())  # illustrative new column not in the target table
    .write
    .mode("append")
    .option("mergeSchema", "true")  # Delta Lake: append the new column to the table schema
    .saveAsTable(...)
)