Visual Saliency Based on Multiscale Deep Features

Guanbin Li       Yizhou Yu
The University of Hong Kong 


Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this CVPR 2015 paper, we discover that a high-quality visual saliency model can be trained with multiscale features extracted using a popular deep learning architecture, convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for extracting features at three different scales. We then propose a refinement method to enhance the spatial coherence of our saliency results. Finally, aggregating multiple saliency maps computed for different levels of image segmentation can further boost the performance, yielding saliency maps better than those generated from a single segmentation. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on these two datasets.
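The final aggregation step described above fuses saliency maps computed at different segmentation levels into one map. In the paper the combination weights are learned from data; the sketch below only illustrates the fusion step itself with a simple (optionally weighted) pixelwise average, so the function name and the uniform default weights are illustrative assumptions, not the paper's learned scheme.

```python
import numpy as np

def aggregate_saliency(maps, weights=None):
    """Fuse per-segmentation-level saliency maps (each in [0, 1]) into one map.

    NOTE: the paper learns the combination weights; a plain weighted
    average is used here purely as an illustration of the fusion step.
    """
    stacked = np.stack([m.astype(float) for m in maps])  # (levels, H, W)
    if weights is None:
        weights = np.full(len(maps), 1.0 / len(maps))    # uniform fallback
    weights = np.asarray(weights, dtype=float)
    # Contract the level axis: fused[h, w] = sum_k weights[k] * maps[k][h, w]
    fused = np.tensordot(weights, stacked, axes=1)
    return np.clip(fused, 0.0, 1.0)
```

With two maps of constant values 0.2 and 0.8 and uniform weights, the fused map is 0.5 everywhere.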


Guanbin Li and Yizhou Yu, Visual Saliency Based on Multiscale Deep Features, CVPR 2015. [PDF]


Figure 1: Visual comparison of saliency maps generated from 10 different methods, including ours (MDF). The ground truth (GT) is shown in the last column. MDF consistently produces saliency maps closest to the ground truth. We compare MDF against spectral residual (SR [18]), frequency-tuned saliency (FT [1]), saliency filters (SF [29]), geodesic saliency (GS [35]), hierarchical saliency (HS [37]), region-based contrast (RC [8]), manifold ranking (MR [38]), optimized weighted contrast (wCtr [40]), and discriminative regional feature integration (DRFI [22]).

Quantitative Comparison

Figure 2: Quantitative comparison of saliency maps generated from 10 different methods on 4 datasets. From left to right: (a) the MSRA-B dataset, (b) the SOD dataset, (c) the iCoSeg dataset, and (d) our own dataset. From top to bottom: (1st row) the precision-recall curves of different methods, (2nd row) the precision, recall and F-measure using an adaptive threshold, and (3rd row) the mean absolute error.


1. The HKU-IS dataset can be downloaded from Google Drive or Baidu Yun. Please cite our paper if you use this dataset.
2. Saliency maps produced by our approach on 9 benchmark datasets can be downloaded here, including MSRA-B (test part), HKU-IS (test part), iCoSeg, SED1, SED2, SOD, PASCAL-S, DUT-OMRON, and ECSSD.
3. Our trained model and test code have been updated! Please download them here.