Number of serial tasks: N
Time to compute the gradient in each task: C
Total runtime: N * C
Assume no communication overhead and K << b (i.e., the number of machines K is much smaller than the batch size b):
Total runtime ≈ N * C/K
Linear speed-up in the number of machines K
With communication overhead S per iteration:
Total runtime ≈ N * (C/K + S)
Speed-up: C / (C/K + S)
We expect S to grow as the number of machines in the cluster increases
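As a quick sanity check, the formula above can be tabulated for a few cluster sizes; this is a minimal sketch in which C and S are illustrative placeholders, not measured values.

```python
# Idealized speed-up from the formula above: speedup(K) = C / (C/K + S).
# C and S are illustrative placeholders, not measured values.
C = 10.0   # seconds to compute one gradient step serially (assumed)
S = 0.5    # per-iteration communication overhead in seconds (assumed)

for K in (1, 2, 4, 8, 16):
    speedup = C / (C / K + S)
    print(f"K={K:2d}  speed-up={speedup:.2f}")   # S caps the achievable speed-up
```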
For epoch mode (the orange line), there are two line segments with a downward trend in speed-up. In both cases the configurations use the same total number of cores, for instance (1 worker & 4 cores) versus (2 workers & 2 cores), but a different number of workers. We initially assumed they would perform similarly because the total core count is the same, but the configuration with fewer workers actually performs better. Our conjecture is that every additional worker has to set up its own Spark context and import the necessary libraries, which increases the communication overhead.
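One way to probe this conjecture would be to time the setup work on each worker directly. The sketch below is hypothetical: it assumes a live SparkContext `sc` and a placeholder `num_workers`, and it only captures the import portion of the setup, so it gives a rough estimate at best.

```python
import time

def setup_cost(_):
    # Time the imports a worker has to perform before it can start training.
    start = time.time()
    import numpy       # noqa: F401  stand-ins; substitute whatever libraries
    import tensorflow  # noqa: F401  the training workers actually load
    yield time.time() - start

num_workers = 4  # placeholder
# One partition per worker (assuming Spark spreads the partitions across executors);
# this is only a rough estimate, since a reused executor may have cached the imports.
costs = sc.parallelize(range(num_workers), num_workers).mapPartitions(setup_cost).collect()
print("per-worker setup cost (s):", costs)
```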
Our model reaches a training accuracy of 70%, but only 10% test accuracy. The reason is that, because of the memory issues elaborated on in a later section, we had to drop a few layers from our network, including dropout layers, to reduce the total number of parameters in the model. Since the master node records all the parameters and gradient updates, an overly complicated model takes up too much memory and causes the program to halt. Thus, our simplified model, with fewer dropout layers, cannot effectively prevent overfitting, which produces the gap between training and test performance. This also makes it hard to compare the model trained with distributed learning against the one trained without it.
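To make the memory argument concrete, a model's parameter count can be translated into the bytes the master must hold. The sketch below assumes a Keras-style model; the architecture and the float32 (4 bytes per parameter) figure are stand-ins, not our actual network.

```python
from tensorflow.keras import layers, models

# Stand-in CNN, not our actual architecture.
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

n_params = model.count_params()
# The master keeps the parameters plus gradient updates of the same shape, so the
# footprint is roughly at least double this (float32 = 4 bytes per value).
print(f"{n_params} parameters ≈ {2 * n_params * 4 / 1e6:.1f} MB on the master")
```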
Due to memory issues in the Spark cluster, we reduced the training data to 10 percent of its original size. For all our experiments, we trained for 5 epochs with a batch size of 32; our baseline model takes about 8 hours to train locally on one core. Here is the speed-up plot for the different infrastructures. We can observe that synchronization on batches gives a very poor speed-up, while synchronization by epochs performs much better. We think the primary reason is communication overhead: batch synchronization needs to communicate with the master node far more frequently. Furthermore, since we use synchronous training, for every batch all nodes have to wait for each other to finish before the synchronization can take place.
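A rough accounting of synchronization rounds illustrates why the frequency matters. In the sketch below, the batch size and epoch count come from the setup above, while n_samples and the per-sync cost S are assumed placeholders.

```python
import math

n_samples = 5000   # placeholder training-set size (assumed, not our actual count)
batch_size = 32    # from the setup above
epochs = 5         # from the setup above
S = 0.5            # assumed cost of one synchronization round, in seconds

syncs_batch_mode = epochs * math.ceil(n_samples / batch_size)  # sync after every batch
syncs_epoch_mode = epochs                                      # sync once per epoch

print("batch-mode sync overhead ≈", syncs_batch_mode * S, "s")
print("epoch-mode sync overhead ≈", syncs_epoch_mode * S, "s")
```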