Spark/Databricks
Get the file path and file name when reading multiple files
When reading multiple files or a directory of files into a DataFrame, you can use the input_file_name() function to automatically record the source file path of each row. This is much better than looping through the files and adding the file name one by one.
input_file_name() returns the full file path, so you also need to split the column on '/'. Unlike a Python list, Spark columns do not support the [-1] index to access the last element, so use the element_at() function with -1 in a separate step to get the last item, which is the file name.
This associates the file name with each row read from the source files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = (
spark.read
.csv(['/tmp/abc.csv', '/tmp/123.csv'])
.withColumn('file_path', F.input_file_name()) # '/tmp/abc.csv'
.withColumn('file_name', F.split(F.col('file_path'), '/')) # ['', 'tmp', 'abc.csv']
.withColumn('file_name', F.element_at(F.col('file_name'), -1)) # 'abc.csv'
)
display(df)
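As a side note, element_at(col, -1) does what a plain [-1] index would do in Python. A quick sketch of the same split logic on an ordinary Python string (using a hypothetical path) shows the intermediate values, including the leading empty string produced by the root '/':

```python
# What the split / element_at pair computes, shown on a plain Python string.
path = "/tmp/abc.csv"      # a value like input_file_name() would return
parts = path.split("/")    # ['', 'tmp', 'abc.csv'] -- note the leading '' from the root '/'
file_name = parts[-1]      # 'abc.csv' -- Spark needs element_at(col, -1) for this step
print(file_name)
```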