For the infrastructure, we use an Amazon EMR cluster, which provides Hadoop and Spark out of the box. In our experiments, we use m4.xlarge instances for both the master and the worker nodes. Because Elephas has fragile software dependencies, especially on its core machine learning libraries TensorFlow and Keras, setting up each node manually is very time consuming. To solve this, we use an Amazon Machine Image (AMI): we first install the required packages on a single EC2 instance and create an AMI from it. We then launch the entire EMR cluster from that image so that every node has an identical environment. We dynamically resize the number of worker nodes for our different experiments.
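The setup above can be sketched with boto3, which supports launching an EMR cluster from a custom AMI. This is a minimal illustration, not our exact launch script: the AMI id, cluster name, EMR release label, and IAM role names below are placeholders, and the actual call to AWS is left commented out since it requires credentials.

```python
# Hypothetical sketch: build an EMR run_job_flow request in which every node
# boots from a pre-built custom AMI, so Elephas/TensorFlow/Keras are installed
# once on the image instead of on each node. All ids and names are placeholders.

def emr_cluster_config(ami_id, num_workers, instance_type="m4.xlarge"):
    """One master plus num_workers core nodes, all from the custom AMI.
    Varying num_workers per experiment mirrors the dynamic resizing above."""
    return {
        "Name": "elephas-experiments",            # placeholder cluster name
        "ReleaseLabel": "emr-5.29.0",             # placeholder EMR release
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "CustomAmiId": ami_id,                    # AMI with the ML stack baked in
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": instance_type,
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": instance_type,
                 "InstanceCount": num_workers},
            ],
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",     # default EMR IAM roles
        "ServiceRole": "EMR_DefaultRole",
    }

config = emr_cluster_config("ami-0123456789abcdef0", num_workers=4)
# import boto3
# boto3.client("emr").run_job_flow(**config)  # actual launch; needs credentials
```

Baking the environment into the AMI keeps cluster launches fast and reproducible, since no per-node package installation happens at boot time.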
Core software versions:
PySpark 2.4.4
Elephas 0.4.3
TensorFlow 1.14
Keras 2.2.4
Hadoop 2.8
Python 3.7
AWS m4.xlarge specification:
4 vCPUs, two threads per physical core
Clock rate 2.4 GHz, 16 GB main memory
L1 cache 64 KB, L2 cache 256 KB
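The per-node figures above bound how much memory a Spark executor can use. As a rough illustration (not our actual Spark configuration), the usable executor heap on one 16 GB node can be estimated by reserving some RAM for the OS and Hadoop daemons and accounting for Spark's off-heap memory overhead; the 1 GB reserve and 10% overhead fraction below are common rules of thumb, not values from our setup.

```python
# Back-of-envelope Spark executor sizing for an m4.xlarge worker
# (4 vCPUs, 16 GB RAM), assuming one executor per node.
# The reserve and overhead values are illustrative rules of thumb.

def executor_memory_gb(node_ram_gb=16, os_reserved_gb=1, overhead_frac=0.10):
    """Approximate executor heap: RAM minus OS reserve, minus the
    per-executor memory overhead Spark adds on top of the heap."""
    usable = node_ram_gb - os_reserved_gb      # RAM left for the executor
    return usable / (1 + overhead_frac)        # heap such that heap*(1+ovh) fits

print(round(executor_memory_gb(), 1))  # → 13.6
```

In other words, asking for much more than ~13-14 GB of executor heap on these nodes would risk the containers being killed for exceeding physical memory.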