Unlike non-distributed neural network training, where a data generator can produce batches of images for each iteration, SparkModel requires the entire dataset to be stored in a Spark DataFrame, which in theory takes at least 16 gigabytes of memory. Elephas expects all data to be prepared eagerly and offers no lazy alternative. In addition, the memory overhead of data transformation and Spark serialization is high, and our small cluster cannot satisfy such a large memory requirement. Beyond the dataset itself, the model is also huge, with over 100 million weights, and the updated weights and gradients must be kept in memory as well. We therefore had to configure Spark carefully so that tasks can return very large results. We did not find a practical way to resolve the memory issue when training on the whole dataset, so we trained on 10% of it instead. For future analysis, we may deploy a cluster with more memory and build a more memory-efficient pipeline.
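For example, the Spark properties that control driver memory and the maximum task result size can be raised when the session is created. The values below are illustrative assumptions, not the exact settings we used:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the right limits depend on cluster and model size.
# (In practice driver memory is usually set via spark-submit before the JVM starts.)
spark = (
    SparkSession.builder
    .appName("elephas-training")
    .config("spark.driver.memory", "24g")         # driver aggregates the collected weights
    .config("spark.executor.memory", "16g")       # executors hold data partitions plus a model copy
    .config("spark.driver.maxResultSize", "8g")   # allow tasks to return very large results (weight updates)
    .config("spark.rpc.message.maxSize", "1024")  # max RPC message size in MB for shipping weights
    .getOrCreate()
)
```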
Elephas has a very outdated dependency list, so we had to create an Amazon Machine Image (AMI) to deploy the environment across the whole cluster. Moreover, the image must use the same operating system as the EMR nodes.
We also had difficulty converting images stored on S3 or HDFS into NumPy arrays that can be fed into the network, because common Python image libraries only read images from the local file system. Although Spark ships with a built-in image data source for loading images into a DataFrame, it lacks a comprehensive pipeline for efficient image preprocessing.
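One possible workaround is to bypass local-file APIs entirely by loading raw bytes with Spark's binaryFile data source and decoding them inside a UDF. The sketch below assumes a placeholder bucket path and a fixed 224x224 target size:

```python
import io
import numpy as np
from PIL import Image
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("image-preprocess").getOrCreate()

def decode_image(content):
    """Decode raw JPEG bytes into a flattened, resized float32 array."""
    img = Image.open(io.BytesIO(content)).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32).flatten().tolist()

decode_udf = udf(decode_image, ArrayType(FloatType()))

# "s3://my-bucket/images/" is a placeholder path.
images = spark.read.format("binaryFile").load("s3://my-bucket/images/*.jpg")
features = images.select(decode_udf("content").alias("features"))
```

Because the bytes are decoded on the executors, PIL never needs a local path, and the resulting array column can be handed to downstream preprocessing steps.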
Apply GPU acceleration
Use an EMR cluster with G-tier (GPU) instances
Build a more memory-efficient pipeline
Since Elephas has many limitations, try building our own distributed neural network pipeline using Spark MLlib
Create a hyperparameter tuning pipeline with Spark (sketched below)
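A rough sketch of the tuning idea (the grid values are hypothetical and the scoring function is a stand-in for actual Keras training) is to parallelize candidate configurations so that each executor trains and scores one of them:

```python
from itertools import product
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hp-tuning").getOrCreate()
sc = spark.sparkContext

# Hypothetical search grid; real ranges would depend on the network.
grid = [{"lr": lr, "batch_size": bs}
        for lr, bs in product([1e-2, 1e-3, 1e-4], [32, 64])]

def evaluate(params):
    """Stand-in for training one candidate model on an executor.
    A real implementation would build and fit the Keras model here
    and return its validation accuracy."""
    score = 1.0 / (1.0 + abs(params["lr"] - 1e-3))  # dummy score for illustration
    return params, score

# One partition per candidate so each executor handles one configuration.
results = sc.parallelize(grid, numSlices=len(grid)).map(evaluate).collect()
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```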