HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition
Zhicheng Yan+, Hao Zhang*, Robinson Piramuthu^, Vignesh Jagadeesh^,
Dennis DeCoste^, Wei Di^, Yizhou Yu°
University of Illinois at Urbana-Champaign+
Carnegie Mellon University*
eBay Research Lab^
The University of Hong Kong°

Figure: Hierarchical Deep Convolutional Neural Network (HD-CNN) architecture.

Figure: A two-level category hierarchy. The categories are taken from the ImageNet dataset.

In image classification, visual separability between different object categories is highly uneven, and some categories are more difficult to distinguish than others. Such difficult categories demand more dedicated classifiers. However, existing deep convolutional neural networks (CNNs) are trained as flat N-way classifiers, and few efforts have been made to leverage the hierarchical structure of categories. In this paper, we introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. During HD-CNN training, component-wise pretraining is followed by global finetuning with a multinomial logistic loss regularized by a coarse category consistency term. In addition, conditional execution of fine category classifiers and layer parameter compression make HD-CNNs scalable for large-scale visual recognition. We achieve state-of-the-art results on both the CIFAR100 and the large-scale ImageNet 1000-class benchmark datasets. In our experiments, we build three different HD-CNNs, which lower the top-1 error of the corresponding standard CNNs by 2.65%, 3.1% and 1.1%, respectively.
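The abstract does not spell out how the coarse and fine classifiers are combined at test time, so the following is only a minimal sketch under stated assumptions. It assumes that the final prediction is an average of the fine classifiers' outputs weighted by the coarse category probabilities, and that conditional execution means skipping fine classifiers whose coarse probability falls below a cutoff. The function name `hd_cnn_predict`, the `threshold` value of 0.1, and the toy classifiers are all hypothetical and do not come from the page.

```python
def hd_cnn_predict(coarse_probs, fine_classifiers, image, threshold=0.1):
    """Combine fine category predictions, weighted by coarse probabilities.

    coarse_probs: list of K probabilities from the coarse category classifier.
    fine_classifiers: list of K callables; each maps an image to a list of
        probabilities over all fine categories.
    threshold: hypothetical cutoff (an assumption); fine classifiers whose
        coarse probability is below it are never executed.
    """
    # Conditional execution: keep only coarse categories that are likely enough.
    active = [k for k, b in enumerate(coarse_probs) if b >= threshold]
    if not active:
        # Fall back to the single most likely coarse category.
        active = [max(range(len(coarse_probs)), key=coarse_probs.__getitem__)]

    # Renormalize the weights over the classifiers that actually run.
    total = sum(coarse_probs[k] for k in active)
    combined = None
    for k in active:
        weight = coarse_probs[k] / total
        fine = fine_classifiers[k](image)
        if combined is None:
            combined = [0.0] * len(fine)
        for j, p in enumerate(fine):
            combined[j] += weight * p
    return combined, active


# Toy usage: three coarse categories, four fine categories.
calls = []  # records which fine classifiers were executed


def make_fine(k, probs):
    def classify(image):
        calls.append(k)
        return probs
    return classify


fine_classifiers = [
    make_fine(0, [0.7, 0.3, 0.0, 0.0]),
    make_fine(1, [0.0, 0.0, 0.6, 0.4]),
    make_fine(2, [0.25, 0.25, 0.25, 0.25]),
]
probs, active = hd_cnn_predict([0.75, 0.2, 0.05], fine_classifiers, image=None)
print("executed fine classifiers:", active)
print("predicted fine category:", probs.index(max(probs)))
```

In this toy run the third fine classifier is never called because its coarse probability is below the assumed cutoff, which is the kind of saving that conditional execution is meant to provide.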
Method                                   Error (%)
Model averaging (2 CIFAR100-NIN nets)    35.13
DSN                                      34.68
CIFAR100-NIN-double                      34.26
dasNet                                   33.78
Base: CIFAR100-NIN                       35.27
HD-CNN, no finetuning                    33.33
HD-CNN, finetuning w/o CCC               33.21
HD-CNN, finetuning w/ CCC                32.62
Table: 10-view testing errors on the CIFAR100 dataset. Notation: CCC = coarse category consistency.

Method                           Top-1 error (%)   Top-5 error (%)
Base: ImageNet-NIN               39.76             17.71
Model averaging (3 base nets)    38.54             17.11
HD-CNN, disjoint CC              38.44             17.03
HD-CNN                           36.66             15.80
Table: Comparison of 10-view testing errors between ImageNet-NIN and HD-CNN. Notation: CC = coarse category.

Method                                  Top-1 error (%)   Top-5 error (%)
GoogLeNet, multi-crop                   N/A               7.9
VGG-19-layer, dense                     24.8              7.5
VGG-16-layer + VGG-19-layer, dense      24.0              7.1
Base: ImageNet-VGG-16-layer, dense      24.79             7.50
HD-CNN, dense                           23.69             6.76
Table: Errors on the ImageNet validation set. HD-CNN uses VGG-16-layer as its building block net.
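The top-1 improvements quoted in the abstract (2.65%, 3.1% and 1.1%) can be checked against the tables above. The short sketch below copies the baseline and HD-CNN top-1 errors from the three tables and subtracts them. It assumes the quoted figures are absolute differences in percentage points, which is what the subtraction reproduces; the dictionary keys are only labels chosen for this example.

```python
# (baseline top-1 error, HD-CNN top-1 error) in percent, copied from the tables.
results = {
    "CIFAR100-NIN": (35.27, 32.62),
    "ImageNet-NIN": (39.76, 36.66),
    "ImageNet-VGG-16-layer": (24.79, 23.69),
}

# Absolute reduction in top-1 error, rounded to two decimals to hide
# floating-point noise from the subtraction.
reductions = {
    name: round(base - hd_cnn, 2) for name, (base, hd_cnn) in results.items()
}

for name, delta in reductions.items():
    print(f"{name}: top-1 error lower by {delta:.2f} percentage points")
```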