6.870: Contextual priming for object detection

Recipes for computing 'gist' features

Papers evaluated
  • A. Torralba. Contextual priming for object detection. IJCV 2003. (code)
  • D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. ICCV 2005. (code)


'Gist' features (Torralba 03) have been shown to provide a succinct description of scene structure. The representation comprises statistics of oriented structures within the image. The paper also demonstrates that gist features can be used to predict the location, size, and presence of an object in a scene, which is useful for priming object detectors. In this evaluation, we consider an alternative representation of the gist descriptor. Hoiem et al. 05 show that the geometric structure of a scene, such as surface orientations, can be extracted automatically. Here we show that the spatial distribution of surface orientations and surface categories (sky, ground plane, and vertical) also provides a practical description of scene structure. We compare the two descriptions in their ability to predict pedestrian locations in street scenes.


We used LabelMe to extract 760 color (RGB) images containing pedestrians (labels: person, pedestrian). We cropped the images to 480x640 pixels. Of these images, 250 were used for training and 250 for testing. For each training image, we extracted the y-position of every pedestrian in the image.

Structural gist features

We used the gist features described in Torralba 03 (code) to derive the structural gist features. The 960-dimensional features were reduced to 50 dimensions using PCA.
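The PCA step can be sketched as follows. This is a minimal illustration, not the code used in the evaluation: the function name `pca_reduce` is hypothetical, and random data stands in for the real 960-dimensional gist descriptors.

```python
import numpy as np

def pca_reduce(features, n_components=50):
    """Project row-vector features onto their top principal components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

# Stand-in data (assumption): 250 training images, 960-dimensional descriptors.
rng = np.random.default_rng(0)
train_gist = rng.standard_normal((250, 960))
reduced, mean, components = pca_reduce(train_gist, n_components=50)
print(reduced.shape)  # (250, 50)
```

A test descriptor `x` would be projected with the training statistics, that is `(x - mean) @ components.T`, so that training and test features live in the same 50-dimensional space.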

Geometric gist features

We used the geometric-context features of Hoiem et al. 05 to derive the geometric gist features. For each image, 8 confidence maps are generated: (1) ground plane, (2) vertical plane, (3) vertical (porous object), (4) vertical (solid object), (5) vertical (left-facing), (6) vertical (right-facing), (7) vertical (front-facing), and (8) sky. We downsampled each map to 12x16, resulting in a 12x16x8 = 1536-dimensional feature, and reduced it to 50 dimensions using PCA.
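The downsample-and-stack step might look like the sketch below. It assumes the maps are 480 rows by 640 columns and that downsampling is done by averaging non-overlapping 40x40 blocks; the write-up does not say which downsampling method was used, and `geometric_gist` is a hypothetical name.

```python
import numpy as np

def geometric_gist(conf_maps, out_h=12, out_w=16):
    """Block-average each confidence map and flatten all maps into one vector.

    conf_maps: array of shape (n_maps, H, W); H and W must be divisible
    by out_h and out_w respectively.
    """
    n_maps, h, w = conf_maps.shape
    bh, bw = h // out_h, w // out_w
    blocks = conf_maps.reshape(n_maps, out_h, bh, out_w, bw)
    small = blocks.mean(axis=(2, 4))  # shape (n_maps, out_h, out_w)
    return small.reshape(-1)

# Stand-in data (assumption): 8 random confidence maps for one image.
maps = np.random.default_rng(0).random((8, 480, 640))
feat = geometric_gist(maps)
print(feat.shape)  # (1536,)
```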

[Figure: original image and its geometric-context confidence maps. Panels: Original, Ground plane, Vertical surface, Sky, Vertical (front-facing), Vertical (left-facing), Vertical (right-facing), Vertical (solid), Vertical (porous).]


For each of the two gist representations, we trained a mixture of 5 regressors on the gist features to predict the y-location of the pedestrian.
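To make the idea concrete, here is a simplified stand-in for a mixture of regressors. It is not the training procedure used in this evaluation, which the text does not specify: it uses hard k-means assignments and one ridge-regularized linear regressor per cluster, where a full mixture model would typically use soft assignments fitted with EM. All function names and the synthetic data are assumptions.

```python
import numpy as np

def _nearest(X, centers):
    """Index of the closest center for every row of X."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def fit_hard_mixture(X, y, k=5, n_iter=20, ridge=1e-3, seed=0):
    """k-means on the features, then one linear regressor per cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        assign = _nearest(X, centers)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    assign = _nearest(X, centers)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    weights = np.zeros((k, Xb.shape[1]))
    for j in range(k):
        m = assign == j
        if not np.any(m):
            m = np.ones(len(X), dtype=bool)  # empty cluster: fit on all data
        # Ridge-regularized least squares (the bias term is regularized too).
        A = Xb[m].T @ Xb[m] + ridge * np.eye(Xb.shape[1])
        weights[j] = np.linalg.solve(A, Xb[m].T @ y[m])
    return centers, weights

def predict_hard_mixture(X, centers, weights):
    """Predict with the regressor of the nearest cluster center."""
    assign = _nearest(X, centers)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb * weights[assign]).sum(axis=1)

# Stand-in data (assumption): 50-dimensional features, y-positions in pixels.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((250, 50))
y_train = rng.uniform(0, 480, size=250)
centers, weights = fit_hard_mixture(X_train, y_train, k=5)
y_pred = predict_hard_mixture(X_train[:3], centers, weights)
print(y_pred.shape)  # (3,)
```

In this simplified version, the rows of `centers` play the role of the cluster centers visualized later in this write-up.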

Both correct:


Geometric context is correct:


Structural gist is correct:



Cluster centers

Here we visualize the cluster centers of the regressors learned by both models. The cluster centers indicate the canonical views on which the predictions of the mixture are based.

Structural gist:

Geometric gist:




We used 250 of the 500 images as test images; these contained a total of 564 pedestrian instances. For each image, we used the gist features to generate a probability map of pedestrian location. We then thresholded the map to retain the top 5%, 10%, ..., 100% of the image area. For each threshold, we measured the percentage of pedestrian instances whose centers lie within the thresholded region.
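The measurement described above can be sketched as follows. For each pedestrian center, the code computes the smallest fraction of the image area that must be retained before that pixel is included; recall at a given threshold is then the share of centers whose required fraction does not exceed it. The function names and the toy 2x2 map are illustrative assumptions.

```python
import numpy as np

def area_needed(prob_map, centers):
    """Smallest retained-area fraction at which each center is included.

    centers: list of (row, col) pixel coordinates.
    """
    flat = prob_map.ravel()
    order = np.argsort(-flat, kind="stable")  # most probable pixel first
    rank = np.empty(flat.size, dtype=np.int64)
    rank[order] = np.arange(flat.size)
    frac = (rank.reshape(prob_map.shape) + 1) / flat.size
    return np.array([frac[r, c] for r, c in centers])

def recall_curve(needed, fractions):
    """Share of pedestrian centers covered at each retained-area fraction."""
    return [float((needed <= f).mean()) for f in fractions]

# Toy example (assumption): one 2x2 probability map with two pedestrians.
prob = np.array([[0.4, 0.1],
                 [0.3, 0.2]])
needed = area_needed(prob, [(0, 0), (1, 1)])
print(recall_curve(needed, [0.25, 0.5, 0.75, 1.0]))  # [0.5, 0.5, 1.0, 1.0]
```

To aggregate over a test set, concatenate the `needed` arrays from all images before calling `recall_curve`, for example with `fractions = np.arange(1, 21) / 20` for the 5% steps used here.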

Recall curve


Thanks to Thomas Serre for providing the original inspiration for the alternate gist representation.