Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training
Shuai Zhao, Liguang Zhou, Wenxiao Wang, Deng Cai, Tin Lun Lam, Yangsheng Xu
Overview
The width of a neural network matters: increasing the width increases the model's capacity. However, performance does not improve linearly with width and soon saturates. We therefore argue that increasing the number of networks (an ensemble) can achieve better accuracy-efficiency trade-offs than purely increasing the width. To demonstrate this, one large network is divided into several small ones with respect to its parameters and regularization components, so that each small network holds a fraction of the original network's parameters. We then train these small networks together and make them see different views of the same data to increase their diversity. During this co-training process, the networks can also learn from each other. As a result, the small networks achieve better ensemble performance than the large one with few or no extra parameters or FLOPs. They can also achieve faster inference than the large network by running concurrently on different devices. We validate our argument with 8 different neural architectures on common benchmarks through extensive experiments.
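To make the recipe concrete, below is a minimal PyTorch sketch of the idea, not the paper's actual implementation: S small networks are trained jointly, each on its own augmented view of the same batch, with a per-network cross-entropy term plus a temperature-scaled KL term that pulls each network toward the ensemble so they can learn from each other. The names make_small_net, augment, and co_training_step, the hyper-parameters S, temperature, and alpha, the use of resnet18 as a stand-in backbone, and the ~1/sqrt(S) width scaling mentioned in the comments are all illustrative assumptions.

# A minimal sketch of divide-and-co-training, assuming S = 2 small networks.
import math
import torch
import torch.nn.functional as F
import torchvision.models as models

S = 2  # number of small networks the large model is divided into

def make_small_net(num_classes=100):
    # Stand-in backbone. In the actual method each small net's channel
    # widths would be reduced (e.g. by ~1/sqrt(S), since parameter count
    # grows roughly quadratically with width) so that the S small nets
    # together stay within the large net's parameter budget.
    return models.resnet18(num_classes=num_classes)

nets = [make_small_net() for _ in range(S)]
optimizer = torch.optim.SGD(
    [p for net in nets for p in net.parameters()],
    lr=0.1, momentum=0.9, weight_decay=5e-4,
)

def co_training_step(x, y, augment, temperature=4.0, alpha=0.5):
    """One training step: each net sees its own augmented view of the same
    batch (diversity), and every net is pulled toward the soft ensemble
    prediction so the nets can learn from each other (co-training)."""
    logits = [net(augment(x)) for net in nets]
    with torch.no_grad():                          # detach the ensemble target
        ensemble = torch.stack(logits).mean(dim=0)
    loss = 0.0
    for z in logits:
        ce = F.cross_entropy(z, y)                 # per-net supervised loss
        kl = F.kl_div(                             # pull toward the ensemble
            F.log_softmax(z / temperature, dim=1),
            F.softmax(ensemble / temperature, dim=1),
            reduction="batchmean",
        ) * temperature ** 2
        loss = loss + ce + alpha * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def ensemble_predict(x):
    # Average of softmax outputs. At deployment, the small nets are
    # independent and can run concurrently on separate devices, so
    # latency stays close to that of a single small network.
    return torch.stack([F.softmax(net(x), dim=1) for net in nets]).mean(dim=0)

At test time, averaging the softmax outputs gives the ensemble prediction; because the small networks are independent, placing them on different devices and running them in parallel keeps inference latency close to that of one small network.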
Results
CIFAR-100
CIFAR-10
ImageNet
Latency of networks
BibTeX
@misc{2020_SplitNet,
  author = {Shuai Zhao and Liguang Zhou and Wenxiao Wang and Deng Cai and Tin Lun Lam and Yangsheng Xu},
  title = {Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training},
  howpublished = {arXiv},
  year = {2020}
}
Acknowledgments
This work was supported by funding 2019-INT007 from the Shenzhen Institute of Artificial Intelligence and Robotics for Society, and in part by the National Natural Science Foundation of China (Grant Nos. 62036009 and 61936006). Thanks to Yuejin Li for his support in using the GPU clusters.