Spark/Databricks
Get the file path and file name when reading multiple files
When reading multiple files or a directory of files into a DataFrame, you can use the input_file_name() function to automatically record the source file path of each row. This is much better than looping through the files and adding the file name one by one.
input_file_name() returns the full file path, so you also need to split the column on '/'. Unlike a Python list, Spark columns do not support the [-1] index to access the last element, so use the element_at() function with -1 in a separate step to get the last item, which is the file name.
This associates the file name with each row read from the source files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = (
spark.read
.csv(['/tmp/abc.csv', '/tmp/123.csv'])
.withColumn('file_path', F.input_file_name()) # '/tmp/abc.csv'
.withColumn('file_name', F.split(F.col('file_path'), '/')) # ['', 'tmp', 'abc.csv']
.withColumn('file_name', F.element_at(F.col('file_name'), -1)) # 'abc.csv'
)
display(df)
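As a side note, element_at(col, -1) does what a plain [-1] index would do in Python. A quick sketch of the same split logic on an ordinary Python string (using a hypothetical path) shows the intermediate values, including the leading empty string produced by the root '/':

```python
# What the split / element_at pair computes, shown on a plain Python string.
path = "/tmp/abc.csv"      # a value like input_file_name() would return
parts = path.split("/")    # ['', 'tmp', 'abc.csv'] -- note the leading '' from the root '/'
file_name = parts[-1]      # 'abc.csv' -- Spark needs element_at(col, -1) for this step
print(file_name)
```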