We decided to use a classic convolutional neural network architecture, VGG16, to train our image classifier. The architecture is depicted below. To enable distributed training with VGG16, we use the Elephas package, which is built on top of Keras and lets us run distributed deep learning at scale with Spark.
[Figure: VGG16 network architecture]
Elephas implements a class of data-parallel algorithms on top of Keras, using Spark's RDDs and DataFrames. The general idea is that a Keras model is initialized on the driver, then serialized and shipped to the worker nodes along with the data and broadcast model parameters. The Spark workers then deserialize the model, train on their chunk of the data using stochastic gradient descent, and send their gradients back to the driver. The "master" model on the driver is updated by an optimizer, which receives the gradients either synchronously or asynchronously. (Image from https://github.com/maxpumperla/elephas)
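To make this concrete, below is a minimal sketch of what a training run with Elephas looks like. The 10-class head on top of VGG16 and the randomly generated `x_train`/`y_train` arrays are our assumptions for illustration, and exact argument names may vary across Elephas and Keras versions:

```python
import numpy as np
from pyspark import SparkConf, SparkContext
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd

sc = SparkContext(conf=SparkConf().setAppName('vgg16-elephas'))

# Build the VGG16-based classifier on the driver (10 classes assumed here).
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
outputs = Dense(10, activation='softmax')(Flatten()(base.output))
model = Model(base.input, outputs)
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

# Dummy data standing in for the real training set.
x_train = np.random.rand(256, 224, 224, 3).astype('float32')
y_train = to_categorical(np.random.randint(0, 10, size=256), 10)

# Elephas serializes the model and ships it to the workers
# together with the partitions of this RDD.
rdd = to_simple_rdd(sc, x_train, y_train)

spark_model = SparkModel(model, mode='asynchronous', frequency='epoch')
spark_model.fit(rdd, epochs=5, batch_size=32, verbose=1, validation_split=0.1)
```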
We highlight two training features that we explored in more detail, since they affect the overall training time. The first is the communication style, which can be synchronous or asynchronous. In synchronous mode, the master node does not update the model parameters until all worker nodes have finished, so there is extra wait time compared to asynchronous mode, where the master node updates the parameters each time a worker node finishes training.
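In Elephas, the communication style is selected via the `mode` argument of `SparkModel`; a sketch, reusing the `model` defined above:

```python
# Master waits for all workers before applying updates (extra wait time).
sync_model = SparkModel(model, mode='synchronous')

# Master applies each worker's update as soon as it arrives.
async_model = SparkModel(model, mode='asynchronous')
```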
The second training feature is the communication frequency. When the frequency of parameter updates is 'on batch', a worker node communicates with the master node each time it finishes training on one batch. With 'on epoch', a worker node communicates only after finishing a full epoch over all of its batches. This affects the training time because the 'on batch' method updates more frequently and therefore introduces additional synchronization overhead.
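The update frequency is likewise a `SparkModel` argument; another sketch with the same assumed `model`:

```python
# Workers push updates to the master after every mini-batch
# (more frequent communication, higher synchronization overhead).
per_batch_model = SparkModel(model, mode='asynchronous', frequency='batch')

# Workers push updates only after a full epoch over their partition.
per_epoch_model = SparkModel(model, mode='asynchronous', frequency='epoch')
```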