*Note: The FLOPS value is calculated as:
8 SP FLOPs/cycle × 1.2 GHz × 4 cores = 38.4 GFLOPS
**Note: FLOPS is dependent on the specific CPU model and microarchitecture
Ref: http://www.overclock.net/t/947312/how-many-gflops-does-your-processor-have
https://www.pugetsystems.com/labs/hpc/Linpack-performance-Haswell-E-Core-i7-5960X-and-5930K-594/
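For concreteness, the footnote's peak-throughput estimate can be reproduced as a short calculation (a sketch; as the second note states, the per-cycle FLOP count varies with the CPU microarchitecture):

# Worked form of the footnote's peak-FLOPS estimate.
flops_per_cycle = 8      # single-precision FLOPs per cycle per core
clock_ghz = 1.2          # core clock in GHz
cores = 4
peak_gflops = flops_per_cycle * clock_ghz * cores
print(peak_gflops)       # 38.4 GFLOPS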
The parameters of AlexNet take up 233 MB, and running AlexNet inference requires more than 1 GB of memory, while the Raspberry Pi only has about 500 MB of free memory available when running its Linux system. Running the network directly on the Pi therefore causes an out-of-memory error. We solve this problem by adding 2 GB of swap memory to the Raspberry Pi. However, this virtual memory comes from the Pi's SD card, whose access and transfer speed is quite low, so the graph computation speed of AlexNet can be greatly reduced.
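As a rough sanity check of the 233 MB figure, AlexNet's commonly cited parameter count of about 61 million, stored as 4-byte single-precision floats, comes out to the same size (the exact count here is an assumption, not taken from this report):

# Back-of-envelope check of AlexNet's parameter memory
# (assumes ~61 M parameters stored as float32).
n_params = 61_000_000             # approximate AlexNet parameter count
bytes_per_param = 4               # float32
size_mib = n_params * bytes_per_param / 2**20
print(round(size_mib))            # ~233 MiB, matching the figure above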
In our distributed system, we run the main program on the host machine and the worker programs on the other devices in the cluster; in this case, the workers have no access to the network model ahead of time. When the distributed network is created and run, both the network parameters and the feature-map data are therefore transmitted across the devices, and since AlexNet has a great number of parameters, the computation speed is noticeably affected.
(In our experiment, running AlexNet takes about 70-80 s on a single Pi, while the same network takes 600-900 s on our distributed cluster.)
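The slowdown is consistent with the host/worker structure just described. The sketch below illustrates that structure using TensorFlow 1.x's distributed graph API; the framework choice, addresses, and port are assumptions for illustration, not necessarily the exact setup used here, and the session call only succeeds once the worker servers are actually running:

# A minimal sketch of the host/worker split described above
# (assumption: TensorFlow 1.x distributed graph API; addresses are illustrative).
import tensorflow as tf

# The cluster layout: the main program runs on the host, workers on the Pis.
cluster = tf.train.ClusterSpec({
    "host":   ["192.168.1.10:2222"],
    "worker": ["192.168.1.11:2222", "192.168.1.12:2222"],
})

# On each Pi, only a generic server runs; the workers never see the model
# definition ahead of time, exactly as noted above:
#     server = tf.train.Server(cluster, job_name="worker", task_index=0)
#     server.join()

# On the host, the graph (including the weights) is built locally and each
# op is pinned to a worker, so both parameters and feature maps must be
# shipped over the network when the session runs.
with tf.device("/job:worker/task:0"):
    weights = tf.Variable(tf.random_normal([9216, 4096]))  # AlexNet's fc6 layer
    features = tf.placeholder(tf.float32, [None, 9216])    # incoming feature map
    fc6 = tf.nn.relu(tf.matmul(features, weights))

with tf.Session("grpc://192.168.1.10:2222") as sess:
    sess.run(tf.global_variables_initializer())
    # Each run() call re-sends feature-map tensors across the cluster,
    # which is where the slowdown relative to a single Pi comes from:
    # sess.run(fc6, feed_dict={features: ...})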