Unlike non-distributed neural network training, where a data generator can produce batches of images for each iteration, SparkModel requires the entire dataset to be stored in a Spark DataFrame, which in theory takes at least 16 gigabytes of memory. Elephas expects all data to be prepared eagerly and offers no lazy alternative. In addition, the memory overhead of data transformation and Spark serialization is high, and our small cluster cannot satisfy such a large memory requirement. Beyond the dataset itself, the model is also huge, with over 100 million weights, and the updated weights and gradients must be kept in memory as well. We therefore had to configure Spark carefully so that tasks can return very large results. We did not find a practical way to resolve the memory issue when training on the whole dataset, so we trained on 10% of it instead. For future analysis, we may deploy a cluster with more memory and build a more memory-efficient pipeline.
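For example, the Spark properties that control driver memory and the maximum task result size can be raised when the session is created. The values below are illustrative assumptions, not the exact settings we used:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the right limits depend on cluster and model size.
# (In practice driver memory is usually set via spark-submit before the JVM starts.)
spark = (
    SparkSession.builder
    .appName("elephas-training")
    .config("spark.driver.memory", "24g")         # driver aggregates the collected weights
    .config("spark.executor.memory", "16g")       # executors hold data partitions plus a model copy
    .config("spark.driver.maxResultSize", "8g")   # allow tasks to return very large results (weight updates)
    .config("spark.rpc.message.maxSize", "1024")  # max RPC message size in MB for shipping weights
    .getOrCreate()
)
```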
Elephas has a very outdated dependency list, so we had to create an Amazon Machine Image (AMI) to deploy the environment across the whole cluster. Moreover, the image must use the same operating system as the EMR nodes.
We also had difficulty converting images stored on S3 or HDFS into NumPy arrays that can be fed into the network, because common Python image libraries only read images from the local file system. Although Spark ships with a built-in image data source for loading images into a DataFrame, it lacks a comprehensive pipeline for efficient image preprocessing.
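One possible workaround is to bypass local-file APIs entirely by loading raw bytes with Spark's binaryFile data source and decoding them inside a UDF. The sketch below assumes a placeholder bucket path and a fixed 224x224 target size:

```python
import io
import numpy as np
from PIL import Image
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("image-preprocess").getOrCreate()

def decode_image(content):
    """Decode raw JPEG bytes into a flattened, resized float32 array."""
    img = Image.open(io.BytesIO(content)).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32).flatten().tolist()

decode_udf = udf(decode_image, ArrayType(FloatType()))

# "s3://my-bucket/images/" is a placeholder path.
images = spark.read.format("binaryFile").load("s3://my-bucket/images/*.jpg")
features = images.select(decode_udf("content").alias("features"))
```

Because the bytes are decoded on the executors, PIL never needs a local path, and the resulting array column can be handed to downstream preprocessing steps.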
Apply GPU acceleration
Use an EMR cluster with G-tier (GPU) instances
Build a more memory-efficient pipeline
Since Elephas has many limitations, try building our own distributed neural network pipeline using Spark MLlib
Create a hyperparameter tuning pipeline with Spark (sketched below)
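A rough sketch of the tuning idea (the grid values are hypothetical and the scoring function is a stand-in for actual Keras training) is to parallelize candidate configurations so that each executor trains and scores one of them:

```python
from itertools import product
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hp-tuning").getOrCreate()
sc = spark.sparkContext

# Hypothetical search grid; real ranges would depend on the network.
grid = [{"lr": lr, "batch_size": bs}
        for lr, bs in product([1e-2, 1e-3, 1e-4], [32, 64])]

def evaluate(params):
    """Stand-in for training one candidate model on an executor.
    A real implementation would build and fit the Keras model here
    and return its validation accuracy."""
    score = 1.0 / (1.0 + abs(params["lr"] - 1e-3))  # dummy score for illustration
    return params, score

# One partition per candidate so each executor handles one configuration.
results = sc.parallelize(grid, numSlices=len(grid)).map(evaluate).collect()
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```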