Deep GrabCut for Object Selection

Ning Xu, Brian Price, Scott Cohen, Jimei Yang, Thomas Huang

University of Illinois at Urbana-Champaign

Adobe Research



Most previous bounding-box-based segmentation methods assume that the bounding box tightly covers the object of interest. In practice, however, a rectangle input is often too large or too small. In this paper, we propose a novel segmentation approach that uses a rectangle as a soft constraint by transforming it into a Euclidean distance map. A convolutional encoder-decoder network is trained end-to-end, taking images concatenated with these distance maps as inputs and predicting the object masks as outputs. Our approach produces accurate segmentation results from sloppy rectangles while remaining general enough for both interactive segmentation and instance segmentation. We show that our network extends to curve-based input without retraining. We further apply our network to instance-level semantic segmentation and resolve any overlap using a conditional random field. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approaches.


Failure cases of previous methods on rectangle inputs. The rectangle inputs are outlined in green and the segmentation results are outlined in cyan. The first two results are produced by interactive methods and the last two by instance segmentation methods. These methods fail to produce good results when the rectangles are too large or too small.


We propose a method for computing a segmentation from a rectangle, which is useful for both interactive segmentation and instance segmentation. In our approach, a rectangle is first transformed into a Euclidean distance map of the same size as the input image. The distance map is then concatenated with the image along the channel dimension to form the input to a convolutional encoder-decoder network (CEDN). The network predicts the object mask.
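The rectangle-to-distance-map step can be illustrated with a minimal sketch. It assumes an unsigned distance to the rectangle outline, clipped at 255; the function names, the clipping value, and the use of plain Python lists are illustrative choices, not details taken from the paper. An unsigned distance does not distinguish inside from outside, so the actual encoding may differ.

```python
import math


def rect_distance_map(height, width, box, clip=255.0):
    """Return a height x width map of distances to the rectangle outline.

    box is (x0, y0, x1, y1) in pixel coordinates with x0 <= x1 and y0 <= y1.
    Clipping at `clip` is an assumption of this sketch.
    """
    x0, y0, x1, y1 = box
    dist = []
    for y in range(height):
        row = []
        for x in range(width):
            # How far the pixel lies beyond the box along each axis (0 if within).
            dx = max(x0 - x, 0, x - x1)
            dy = max(y0 - y, 0, y - y1)
            if dx > 0 or dy > 0:
                # Outside the box: distance to the nearest outline point.
                d = math.hypot(dx, dy)
            else:
                # Inside or on the box: distance to the nearest side.
                d = float(min(x - x0, x1 - x, y - y0, y1 - y))
            row.append(min(d, clip))
        dist.append(row)
    return dist


def concat_channels(image, dist):
    """Append the distance value as a fourth channel to each RGB pixel."""
    return [
        [list(pixel) + [d] for pixel, d in zip(img_row, dist_row)]
        for img_row, dist_row in zip(image, dist)
    ]


# Toy 6 x 8 black image with a rectangle from (2, 1) to (5, 4).
image = [[(0, 0, 0) for _ in range(8)] for _ in range(6)]
dist = rect_distance_map(6, 8, box=(2, 1, 5, 4))
net_input = concat_channels(image, dist)  # shape: 6 x 8 x 4
```

In a real pipeline the same computation would normally be done with an array library and a distance transform; the nested lists here only keep the sketch dependency-free.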

We extend our segmentation method to convert detection boxes into an instance-level semantic segmentation. In doing so, we not only compute a segmentation for each detection box independently but also resolve any overlap between segments to generate a pixel labeling. (a) Input image and many detection results produced by an arbitrary prior method. (b) Non-max suppression is applied to the detection results, and our segmentation model is run on each remaining box independently. (c) A dense conditional random field produces the per-pixel instance-level labeling. (d) Ground truth.
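The non-max suppression in step (b) can be sketched as the standard greedy procedure shown here. The IoU threshold of 0.5 and the function names are assumptions for illustration; the source does not state which variant or threshold is used, and the dense CRF of step (c) is not covered by this sketch.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(box_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Each kept box would then be converted to a distance map and segmented independently before the overlap between the resulting masks is resolved.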


We first compare with interactive segmentation methods on the GrabCut dataset with varied rectangle sizes. The quantitative results are shown below. Our method performs consistently well whether the rectangles are too large or too small.

Next, we compare with instance segmentation methods on the SDS dataset. The results demonstrate that our method achieves the best segmentation results given the same detection rectangles.

Some visual results are shown below.

The results of our pixel-level labeling are shown below.

Video demos

Our method generalizes well to arbitrary closed curves even though it is trained only on rectangles. This makes our model more flexible and provides a more natural user interface.
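One plausible way to feed a closed curve to the same network is to compute the distance map from the curve instead of from a rectangle. The sketch below assumes the curve is given as a list of polygon vertices and again uses an unsigned distance clipped at 255; the function names and these choices are hypothetical and not taken from the paper.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to the segment (a, b)."""
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    if denom == 0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its endpoints.
    t = ((px - ax) * abx + (py - ay) * aby) / denom
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))


def curve_distance_map(height, width, curve, clip=255.0):
    """Distance from every pixel to a closed polyline given as (x, y) vertices."""
    # Pair each vertex with the next one, wrapping around to close the curve.
    segments = list(zip(curve, curve[1:] + curve[:1]))
    return [
        [
            min(
                clip,
                min(
                    point_segment_distance(x, y, ax, ay, bx, by)
                    for (ax, ay), (bx, by) in segments
                ),
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


# A triangle drawn by the user on a 6 x 8 image.
triangle = [(1, 1), (5, 1), (3, 4)]
tri_map = curve_distance_map(6, 8, triangle)
```

Because the output has the same shape and meaning as a rectangle-derived map, it can be concatenated with the image in the same way, which is consistent with the claim that no retraining is needed.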



Deep GrabCut for Object Selection. [paper] [supplementary materials] [models] [Figure 4 data]

Ning Xu, Brian Price, Scott Cohen, Jimei Yang, Thomas Huang

2017 British Machine Vision Conference (BMVC).