The original SUN397 dataset contains 108,753 images with 397 distinct labels. The original images are resized to have at most 120,000 pixels and compressed at 72% JPEG quality. To fit a VGG model, all images need to be converted to shape (224, 224, 3), so we built a simple Spark pipeline for image resizing and normalization. We first load the text file containing the directory paths of all images as an RDD, then apply a user-defined function load_transform_data to transform the RDD into (Image_array, Label) tuples. Each image array now has the fixed shape (224, 224, 3), and we convert the arrays back to images. This pipeline yields a new image dataset with a unified shape, ready for model fitting. The preprocessed dataset was uploaded to the S3 bucket s3://sun397-new.
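A minimal sketch of this resizing pipeline is shown below, assuming one image path per line in the text file and that the class name is encoded in the image's parent directory; the helper name load_transform_data follows the description above, but the path-parsing and file names are illustrative assumptions.

```python
import numpy as np
from PIL import Image
from pyspark.sql import SparkSession

TARGET_SIZE = (224, 224)  # VGG input size; with 3 RGB channels -> (224, 224, 3)

def load_transform_data(path):
    """Load one image, resize it to 224x224 RGB, and pair it with its label."""
    label = path.split("/")[-2]  # assumed: class name is the parent directory
    with Image.open(path) as img:
        img = img.convert("RGB").resize(TARGET_SIZE, Image.BILINEAR)
    return np.asarray(img, dtype=np.uint8), label  # array shape (224, 224, 3)

spark = SparkSession.builder.appName("sun397-resize").getOrCreate()

# One image path per line; each record becomes an (Image_array, Label) tuple.
paths = spark.sparkContext.textFile("image_paths.txt")  # assumed file name
resized = paths.map(load_transform_data)
```

Each resized array can then be re-encoded as an image (e.g., via Image.fromarray) and written out, producing the unified dataset uploaded to s3://sun397-new.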
To load the dataset, we utilized Spark's built-in image data source (ImageSchema) to pull the images directly from S3 into a Spark DataFrame, where the width, height, number of channels, content, and label information of each image are stored. The ImageSchema API provides a standard representation to code against and abstracts away the details of any particular image format, which allows us to easily parse the images and transform the data into (Content, Label) pairs for training. However, this API is not full-fledged and lacks a mature pipeline for general-purpose image handling and preprocessing, such as resizing images of varying dimensions to a target shape. Therefore, resizing the images with our Spark pipeline is an essential step before loading the data for model training.
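The fragment below sketches this loading step, assuming the image data source available in Spark 2.4+ (which exposes the ImageSchema representation); the bucket path comes from the section above, while deriving the label from the file path is an illustrative assumption.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sun397-load").getOrCreate()

# Each row holds an `image` struct with fields:
# origin (path), height, width, nChannels, mode, and the raw bytes in `data`.
df = spark.read.format("image").load("s3://sun397-new/")

# Flatten into (Content, Label) pairs; the label is assumed to be the
# second-to-last path component of `origin` (i.e., the class directory).
pairs = df.select(
    F.col("image.data").alias("content"),
    F.element_at(F.split(F.col("image.origin"), "/"), -2).alias("label"),
)
pairs.show(5)
```

Because the data source provides no resizing hooks, this step only works cleanly here because every image in s3://sun397-new already has the uniform (224, 224, 3) shape produced by the pipeline above.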