Read data files from multiple sub-folders

Spark supports reading data files, e.g. parquet files, from a folder:

df = spark.read.parquet('/datafolder/')
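
For a self-contained run outside a notebook or pyspark shell, a minimal sketch looks like this (the app name and path are illustrative):

from pyspark.sql import SparkSession

# Build or reuse a SparkSession; local mode here purely for illustration
spark = SparkSession.builder.appName('read-parquet-demo').getOrCreate()

# Reads every parquet file sitting directly inside '/datafolder/'
df = spark.read.parquet('/datafolder/')
df.show(5)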


When there are multiple sub folders within the base folder, e.g.:

datafolder
├── sub folder1
│   └── sub sub folder a
└── sub folder2

In this case, running df = spark.read.parquet('/datafolder/') will fail with "Unable to infer schema for Parquet. It must be specified manually." And even if you specify the schema:

df = spark.read.format('parquet').schema(schema).load('/datafolder/')

this won't raise an error, but it still returns an empty data frame, because there are no data files directly inside the base folder 'datafolder'.
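
To make the schema variant concrete, here is a hedged sketch; the column names and types are made up for illustration:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema, just for illustration; use your real columns
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('name', StringType(), True),
])

df = spark.read.format('parquet').schema(schema).load('/datafolder/')
df.count()  # 0, since no parquet files sit directly under '/datafolder/'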


To get Spark to read through all sub folders, sub sub folders, and so on, simply use the wildcard *:

df = spark.read.parquet('/datafolder/*/*')

Each * matches exactly one directory level, though, so you need to know the maximum sub folder depth to determine how many wildcards the path needs.
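
For example, under the layout above, files inside 'sub sub folder a' sit three path segments below the base folder, so reaching them takes three wildcards (a hedged sketch; df_deep is an illustrative name):

# Files in the sub sub folders are three segments below the base
# folder, so the glob needs three '*' wildcards
df_deep = spark.read.schema(schema).parquet('/datafolder/*/*/*')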


This seems to work for Structured Streaming as well:

spark.readStream.schema(schema).parquet('/datafolder/*/*')
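
A slightly fuller hedged sketch, reusing the schema from above and writing to the console sink purely to verify that files are picked up:

# Structured Streaming file sources need the schema up front
stream_df = spark.readStream.schema(schema).parquet('/datafolder/*/*')

# The console sink is just an easy way to watch the stream locally
query = (stream_df.writeStream
         .format('console')
         .outputMode('append')
         .start())
query.awaitTermination()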