Spark: infer and update schema

For CSV, inferSchema is off by default (every column is read as a string); enable it with .option("inferSchema", "true")

However, when the inferred schema is based on a random sample of the data, it can be wrong, e.g. the sample may miss the longest string in a column or a value that changes a column's type

You can force Spark to scan 100% of the data during inference by setting samplingRatio to 1.0 (it accepts a fraction between 0 and 1).

spark.read.option("inferSchema", "true").option("samplingRatio", 1.0).csv(...)
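
To see how a too-small sample can go wrong, here is a minimal sketch (the file path, column name, and SparkSession setup are assumptions, not part of the original note): a column that looks numeric for thousands of rows but contains one string near the end. Scanning every row during inference classifies it correctly as a string.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical file: 10,000 numeric-looking rows followed by one string row.
with open("/tmp/sampling_demo.csv", "w") as f:
    f.write("value\n")
    f.write("\n".join(str(i) for i in range(10000)))
    f.write("\nnot-a-number\n")

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("samplingRatio", 1.0)  # scan every row while inferring
      .csv("/tmp/sampling_demo.csv"))

df.printSchema()  # value: string, because the full scan saw the "not-a-number" row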



To change a column's data type, you currently have to overwrite the table. For Delta tables this also requires overwriteSchema, which lets the write replace the existing table schema:

from pyspark.sql.functions import col

(spark.read.table(...)
  .withColumn("birthDate", col("birthDate").cast("date"))  # cast the column to its new type
  .write
  .mode("overwrite")                  # rewrite the table data
  .option("overwriteSchema", "true")  # and replace the table schema
  .saveAsTable(...)
)
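
A quick way to confirm the cast took effect is to read the table back and print its schema. The table name below is a hypothetical stand-in for whatever was passed to saveAsTable:

spark.read.table("people").printSchema()  # birthDate should now show as: date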


To add new columns, use mergeSchema, which appends new columns (with their data types) to the table schema.

Columns that are present in the DataFrame but missing from the table are automatically added as part of the write transaction when write or writeStream sets .option("mergeSchema", "true"):


from pyspark.sql.functions import col, year

(spark.read.table(...)
  .withColumn("birthYear", year(col("birthDate")))  # "birthYear" is an illustrative new column not yet in the table
  .write
  .mode("append")                  # mergeSchema does not work with the default errorifexists mode
  .option("mergeSchema", "true")   # new columns are added to the table schema
  .saveAsTable(...)
)
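
The same option works for streaming writes, since mergeSchema applies to writeStream as well. A hedged sketch, where the source table people_updates, the target table people, and the checkpoint path are all hypothetical names:

(spark.readStream
  .table("people_updates")
  .writeStream
  .format("delta")
  .option("mergeSchema", "true")                     # new source columns are added to the target schema
  .option("checkpointLocation", "/tmp/people_ckpt")  # streaming writes require a checkpoint location
  .toTable("people"))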