Scene classification is one of the fundamental problems in computer vision. However, the datasets commonly studied in this area cover a limited range of categories and do not provide diverse scenes. Despite the massive size of the datasets currently in use, the largest of them contains fewer than 20 scene categories. Thus, in this project we are interested in training a Convolutional Neural Network on a large-scale dataset with a wide variety of categories (397 classes).
Convolutional Neural Networks (CNNs) are widely used in image analysis. A CNN is usually computationally demanding to train, and even more so when the training dataset is large and hyper-parameter tuning is required. In this project, we aim to explore parallel CNN training on a huge image dataset, find a feasible solution, and compare runtime and performance under different settings. More specifically, we will train a CNN for label classification on a huge image dataset using Spark and the parallelization frameworks available on AWS.
The dataset we use is SUN397 [1], an image dataset for scene understanding.
There are 108,753 images in total, labeled with 397 classes.
Each class contains at least 100 images.
Each image is a JPEG file with at most 120,000 pixels, and image sizes vary.
The total dataset is about 36 GB, which is an important reason for us to parallelize.
Some example images in SUN397 and their labels
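To make the data handling concrete, the sketch below shows one way the SUN397 images could be loaded into a Spark DataFrame for distributed preprocessing. The S3 path is hypothetical, and the label-extraction rule only assumes the dataset's letter/category/file.jpg directory layout; this is not our final pipeline.

```python
# Minimal sketch: load SUN397 JPEGs into a Spark DataFrame. The bucket name is
# hypothetical; SUN397 stores images as <first-letter>/<category>/<file>.jpg.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("sun397-load").getOrCreate()

images = (spark.read.format("binaryFile")          # built-in source, Spark >= 3.0
          .option("pathGlobFilter", "*.jpg")
          .option("recursiveFileLookup", "true")
          .load("s3://our-bucket/SUN397/"))        # hypothetical path

# Derive the class label from the directory part of the file path.
labeled = images.withColumn(
    "label", regexp_extract("path", r"SUN397/[a-z]/(.+)/[^/]+\.jpg$", 1))

# Quick check of the statistics above: ~397 labels, each with >= 100 images.
labeled.groupBy("label").count().orderBy("count").show(5, truncate=False)
```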
One popular CNN architecture is the VGG model, a deep convolutional neural network developed by the Visual Geometry Group at Oxford.
A VGG model has well over a hundred million parameters to learn: VGG-16 has about 138 million parameters, and VGG-19 about 144 million. Since VGG is widely used, we use existing VGG benchmarks to estimate the training time on our dataset. Training VGG-16 without GPU acceleration takes about 8.5 s/batch × (108,753 / 16) batches/epoch, which is approximately 16 hours per epoch. Even with GPU acceleration, it still takes about 15 to 20 minutes per epoch.
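As a sanity check on these figures, the short sketch below (assuming PyTorch/torchvision are available) counts the VGG parameters and redoes the per-epoch estimate from the benchmark in [2].

```python
# Count VGG parameters and recompute the back-of-the-envelope epoch time.
# Models are built with random weights; nothing is downloaded or trained.
import torchvision.models as models

for name, ctor in [("VGG-16", models.vgg16), ("VGG-19", models.vgg19)]:
    n_params = sum(p.numel() for p in ctor().parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
# VGG-16 -> ~138.4M, VGG-19 -> ~143.7M

# CPU training time per epoch, using the 8.5 s/batch figure from [2]:
n_images, batch_size, sec_per_batch = 108_753, 16, 8.5
print(f"~{n_images / batch_size * sec_per_batch / 3600:.1f} hours per epoch")  # ~16 hours
```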
This is acceptable if we train only once, but we usually need to tune the model by trying different hyper-parameters, possibly thousands of combinations. We are therefore motivated to parallelize training on a cluster of nodes with GPU acceleration.
Benchmark for VGG-16 [2]
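Given this motivation, the sketch below illustrates one simple way to spread hyper-parameter combinations over a Spark cluster; `train_and_evaluate` is a hypothetical placeholder for training one CNN configuration on a (GPU-equipped) worker, and the grid values are illustrative only.

```python
# Minimal sketch: distribute hyper-parameter combinations over Spark workers.
from itertools import product
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cnn-hyperparam-search").getOrCreate()
sc = spark.sparkContext

learning_rates = [1e-2, 1e-3, 1e-4]
batch_sizes = [16, 32, 64]
weight_decays = [0.0, 1e-4, 5e-4]
configs = list(product(learning_rates, batch_sizes, weight_decays))

def train_and_evaluate(config):
    lr, batch_size, weight_decay = config
    accuracy = 0.0   # placeholder: train and validate a CNN with these settings
    return config, accuracy

# One partition per configuration so each worker trains independently.
results = (sc.parallelize(configs, numSlices=len(configs))
             .map(train_and_evaluate)
             .collect())

best_config, best_accuracy = max(results, key=lambda r: r[1])
print("best configuration:", best_config, "accuracy:", best_accuracy)
```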