Track 1

Team | Additional Details | Aircraft | Birds | Cars | Dogs | Shoes | Overall
Inria-Xerox |  | 81.4581 | 71.6931 | 87.7876 | 52.9 | 91.517 | 77.0712
CafeNet* | Results on all 5 verticals using bounding boxes at test time. | 78.8479 | 73.0085 | 79.5797 | 57.5333 | 90.1198 | 75.8178
Inria-Xerox |  | 75.8776 | 66.285 | 84.7034 | 50.4167 | 88.6228 | 73.1811
VisionMetric* | Distance metric learning with combined features | 75.4875 | 63.8977 | 74.3316 | 55.8667 | 89.022 | 71.7211
Symbiotic |  | 75.8476 | 69.0621 | 81.0347 | 44.8917 | 87.3253 | 71.6323
Inria-Xerox |  | 80.5881 | 58.5384 | 84.6661 | 35.6167 | 90.9182 | 70.0655
CognitiveVision* | Using the ILSVRC-2012 training set (~1.3M images) to pre-train the DNN. | 67.4167 | 72.7893 | 64.395 | 60.5583 | 84.8303 | 69.9979
DPD_Berkeley* | Track 1 results on test data using the method described in the abstract. | 68.4668 | 69.5737 | 67.4046 | 50.8417 | 89.521 | 69.1615
VisionMetric | Distance metric learning with LLC features | 73.9274 | 51.352 | 69.3073 | 38.6333 | 87.3253 | 64.1091
CognitiveVision | Results by the method described in the abstract. | 58.8059 | 51.6931 | 52.3691 | 47.3667 | 78.1437 | 57.6757
MPG | SIFT, RGB-SIFT, Opponent-SIFT, C-SIFT; Fisher Vectors with 256 Gaussians, 8 regions; logistic regression classifiers. | 9.45095 | 54.5676 | 69.27 | 42.9167 | 88.4232 | 52.9257
MPG | Different configurations for the dogs and cars data (single scale) | 9.45095 | 56.4677 | 63.7732 | 0.975 | 88.4232 | 43.818
Infor_FG* | DCNN | 30.393 | 9.06212 | 4.45218 | 0.816667 | 35.2295 | 15.9907
InterfAIce | IGBA v1.2, 4 cycles, 4h20min | 5.79058 | 2.55786 | 1.11926 | 6.95833 | 5.98802 | 4.48281
* Indicates use of features learned on outside data (e.g. ILSVRC2012)

Each number is the accuracy within a particular domain. Overall is the mean across domains.
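The Overall column can be reproduced directly from the per-domain accuracies; for example, for the winning Track 1 row (Inria-Xerox):

```python
# Verify that Overall is the unweighted mean of the five per-domain
# accuracies, using the top Track 1 row (Inria-Xerox).
domain_scores = [81.4581, 71.6931, 87.7876, 52.9, 91.517]
overall = sum(domain_scores) / len(domain_scores)
print(round(overall, 4))  # 77.0712, matching the Overall column
```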

Track 2

Team | Additional Details | Aircraft | Birds | Cars | Dogs | Shoes | Overall
Inria-Xerox |  | 80.7381 | 49.8173 | 82.7136 | 45.7083 | 88.1238 | 69.4202
Symbiotic |  | 72.4872 | 46.0171 | 77.9878 | 37.1417 | 89.1218 | 64.5511
Inria-Xerox |  | 66.3966 | 44.5067 | 76.3462 | 43.9583 | 86.3273 | 63.507
Inria-Xerox |  | 80.7381 | 34.4458 | 76.8934 | 24.4 | 87.3253 | 60.7605
DPD_Berkeley* | Track 2 results on test data without bounding boxes | 45.5146 | 42.704 | 43.3777 | 41.9083 | 59.98 | 46.6969
Infor_FG* |  | 9.66097 | 5.74909 | 3.70601 | 32.7083 | 4.69062 | 11.303
InterfAIce | IGBA v1.2, 4 cycles, 4h20min | 5.43054 | 2.58222 | 1.16901 | 6.94167 | 5.28942 | 4.28257
* Indicates use of features learned on outside data (e.g. ILSVRC2012)


Team Name Members Abstract
CafeNet Paul Kemp (CafeNet)
Ana Ramirez (CafeNet)
This fine-grained object recognition system is based on our own implementation of the deep convolutional neural network proposed by Krizhevsky et al., which won the ImageNet Classification Challenge in 2012. We pre-trained the network on the publicly available ImageNet 2012 data and fine-tuned it on the provided Fine-Grained Challenge training data. Pre-training the lower layers of the network on a large collection of ImageNet images lets it learn generic visual features at several levels of abstraction. For fine-tuning, we remove the two top trained layers, the classifier and the fully connected hidden layer, and replace them with a much smaller hidden layer and a task-specific classifier. Keeping the new hidden layer quite small is important to avoid overfitting on datasets of this size.
CognitiveVision Kuiyuan Yang, Microsoft Research
Yalong Bai, Harbin Institute of Technology
Yong Rui, Microsoft Research
Cognitive-psychology-inspired image classification using a Deep Neural Network (DNN). In analogy to human learning, the DNN first learns to classify the five basic-level categories (aircraft, bird, car, dog and shoe), then learns to classify the categories at the subordinate level for fine-grained object recognition. With this approach, promising results are achieved from a relatively small training set (about 50K training images in total).
DPD_Berkeley Ning Zhang UC Berkeley
Ryan Farrell Brigham Young University
Forrest Iandola UC Berkeley
Jeff Donahue UC Berkeley
Yangqing Jia UC Berkeley
Ross Girshick UC Berkeley
Trevor Darrell UC Berkeley/ICSI
Track 1: Fine-Grained Classification Our fine-grained classification strategy is called Deformable Part Descriptors (DPD) [2]. Specifically, we use Deformable Parts Models (DPMs) to estimate the pose and localize parts. We extract deep convolutional neural network features (DeCAF) [1] on the DPM parts and ground truth bounding box. Next, we pool the DeCAF part descriptors into a single feature vector using semantic weights across DPM components, as described in [2]. Finally, we do fine-grained classification on the DPD feature vectors using linear SVMs.
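The semantic-weighted pooling step can be sketched roughly as follows; the part names, weights, and toy feature values below are illustrative assumptions, not values from [2]:

```python
# Illustrative sketch (not the authors' code): pool per-part DeCAF
# descriptors into one vector, scaling each part's features by a
# semantic weight tied to the DPM component that fired for the image.
def pool_dpd(part_feats, weights):
    """part_feats: dict part_name -> feature list;
    weights: dict part_name -> semantic weight for the active component.
    Returns the concatenation of all weighted part features."""
    pooled = []
    for name, feat in part_feats.items():
        w = weights.get(name, 0.0)  # unmatched parts contribute zeros
        pooled.extend(w * x for x in feat)
    return pooled

# Toy example: two hypothetical parts with 2-dim features each.
parts = {"head": [1.0, 0.5], "body": [0.2, 0.8]}
w = {"head": 1.0, "body": 0.5}
print(pool_dpd(parts, w))  # [1.0, 0.5, 0.1, 0.4]
```

The pooled vector would then be classified with a linear SVM, as described above.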

Track 2: Fine-Grained Classification without bounding boxes We use DPM detections instead of ground truth bounding boxes. Other than that, we perform the same DPD technique discussed above.

For both tracks, note that our deep convolutional neural network (DeCAF) is trained on ImageNet (ILSVRC2012).

[1] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." arXiv 2013.
[2] Ning Zhang, Ryan Farrell, Forrest Iandola, and Trevor Darrell. "Deformable Part Descriptors for Fine-grained Recognition and Attribute Prediction". ICCV 2013.
Infor_FG Zhenzhong Lan
Zexi Mao
ChenQiang Gao
We simply use a deep convolutional network similar to the one described in [1]. For domain 4, we use web data queried using the label names.

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25. 2012.
Inria-Xerox Philippe-Henri Gosselin (Inria, Ensea)
Naila Murray (Xerox)
Hervé Jégou (Inria)
Florent Perronnin (Xerox)
For both tracks, we compute visual features based on dense SIFT and RGB descriptors, spatial-coordinate coding, and Fisher Vectors. "One-versus-all" SVM classifiers are then run to predict the category of each image. The setup for these methods is very close to the one presented in the papers that introduced them. For Track 1, we extract the box given in the label files (train and test images) and resize the extracted region to 100k pixels. For Track 2, we never use the box in the label files and compute visual features on full images resized to 100k pixels.
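The "resize to 100k pixels" step can be sketched as follows (the function name and the rounding choice are our own; the idea is simply to scale the region so its area is about 100,000 pixels while preserving aspect ratio):

```python
import math

# Scale (w, h) so that the resulting area is ~target_area pixels,
# keeping the aspect ratio. Both dimensions shrink (or grow) by the
# same factor sqrt(target_area / current_area).
def resize_dims(w, h, target_area=100_000):
    scale = math.sqrt(target_area / (w * h))
    return round(w * scale), round(h * scale)

print(resize_dims(640, 480))  # (365, 274); 365 * 274 = 100010 pixels
```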
InterfAIce Maxime Pierson (InterfAIce)
Gaëlle Bachmann (InterfAIce)
Dan Grünstein (InterfAIce)
Based on a new hypothesis about information structure, we have designed and implemented a new type of algorithm that aims to build a deep understanding of any environment. The foundations of the theory claim that the very essence of any environment revolves around a set of behaviors that imply a particular coherence and structure in the information the environment itself emits.

The algorithm is built around the goal of representing information through a structure that is constantly optimized. To fulfill this objective, the architecture of the algorithm consists of two main modules:

1) Using fragments of graph theory, the algorithm provides a “container” in which the information of an environment is projected and structured through smart nodes and links ;

2) Through complex adaptive systems, the “content” is assembled and exploited with recursive non-predefined mechanisms.

In this information graph building algorithm (IGBA), intelligent behaviors emerge from graph structures, establishing a framework that can support any classification task and potentially extending concepts already found in technologies such as support vector machines or deep learning. The aim is to understand how an environment is structured given the information it projects, and to study the consequences of actions made on that environment with regard to a set of goals.

The system is built on a graph structure that is continually optimized by non-predefined mechanisms which are themselves subjected to continuous improvements. The IGBA was tuned not to be restricted in the way it can learn and to choose which information to gather for achieving a specific goal.

To gain scientific credibility for the implemented technology, we decided in August 2013 to adapt the IGBA for the first time to a practical application. Computer vision is one of the thorniest challenges due to the complexity of visual information, so we chose to connect the IGBA to that task, whereby the content of images represents the processed information.

Due to the limited resources available, we concentrated our efforts on connecting the IGBA to the fine-grained classification challenge. The results presented here are the first report on our technology.

Unlike previous ILSVRC2012 entrants, and in a complete break with the current state of the art, we did not pre-process images before processing, in order to test how well the algorithm could adapt itself to complete the tasks. The IGBA was tuned to choose how and where to look in an image to gather relevant information.

To monitor the complex behaviors of this new algorithm, we kept 20% of the training dataset for validation. Consequently, the IGBA was trained on 80% of the provided dataset on a standard machine (3.7 GHz, 8 GB RAM) over 4 hours and 20 minutes.

To the best of our knowledge, we propose here a totally new paradigm for processing information and building intelligence around it. Efforts are now under way to improve the IGBA itself as well as the performance of its first application in the field of visual recognition.
MPG Hideki Nakayama (The University of Tokyo)
Masaya Okamoto (The University of Tokyo)
Tomoya Tsuda (The University of Tokyo)
Daiki Miyatani (The University of Tokyo)
Kohei Yamamoto (The University of Tokyo)
Our system is based on Fisher Vectors of multiple descriptors. We densely extracted SIFT, RGB-SIFT, Opponent-SIFT, and C-SIFT descriptors from each image, using a dense grid with a three-pixel spacing and three different scales. The descriptors were first compressed to 64 dimensions via PCA. We then computed Fisher Vectors with 256 Gaussians for each descriptor type, extracted from 3x1 and 2x2 regions as well as the entire image. These eight vectors constitute the final image signature.

We fitted a logistic regression classifier independently to the Fisher Vector of each descriptor type. The final prediction is made by late fusion of the multiple classifiers.
Symbiotic Yuning Chai
Victor Lempitsky
Andrew Zisserman
The method is based on "Symbiotic Segmentation and Part Localization for Fine-Grained Categorization", ICCV 2013. The following operations are applied to each domain independently; results from the domains are merged at the very end.

We train a domain-specific joint parts-detection and foreground-segmentation model using only the training images and their bounding boxes. The model is applied to all images, generating one foreground segmentation and a set of part-detection windows for each image. Fisher-encoded SIFT and color histograms are extracted from the foreground area and from each detected part. All features are concatenated into the final high-dimensional representation, which is fed into a linear SVM for classification. For Track 1, 5-fold bagging is used in the linear SVM stage. Vertically mirrored training images are added to the original training set for all models (apart from the classification model for dogs).
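The mirroring augmentation can be sketched on toy data (assuming "mirrored" means a left-right flip; images are nested lists standing in for pixel arrays):

```python
# Double the training set by appending a left-right flipped copy of
# each image. Each image is a list of rows; flipping reverses rows.
def mirror(image):
    return [row[::-1] for row in image]

train = [[[1, 2], [3, 4]]]              # one toy 2x2 "image"
train += [mirror(img) for img in train]  # append mirrored copies
print(train[1])  # [[2, 1], [4, 3]]
```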
VisionMetric Qi Qian (Michigan State Univ.)
Shenghuo Zhu (NEC)
Rong Jin (Michigan State Univ.)
Xiaoyu Wang (NEC)
Yuanqing Lin (NEC)
Features are dense HOG with LLC coding, combined with features from an existing CNN model [1]. A distance metric learning framework produces a low-dimensional embedding, and the classifier is a smoothed k-NN.

[1] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." arXiv:1310.1531, 2013.
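One common reading of "smoothed k-NN" (an assumption on our part; the abstract does not specify the smoothing) has each neighbor vote with a weight that decays with its distance in the learned embedding, e.g. exp(-d), instead of an equal vote:

```python
import math
from collections import defaultdict

# Distance-weighted (smoothed) k-NN vote: closer neighbors count more.
# This is an illustrative rule, not necessarily VisionMetric's exact one.
def smoothed_knn(neighbors):
    """neighbors: list of (distance, label) for the k nearest points.
    Returns the label with the largest total exp(-distance) weight."""
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += math.exp(-d)
    return max(votes, key=votes.get)

# Two moderately close "dog" neighbors outvote one very close "cat".
print(smoothed_knn([(0.1, "dog"), (0.2, "dog"), (0.05, "cat")]))  # dog
```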