||Results of all 5 verticals using bounding boxes at test time.
||Distance metric learning with combo features
||Using the ILSVRC-2012 training set (~1.3M images) to pre-train the DNN.
||Track 1 results on test data using the method described above.
||Distance metric learning with LLC features
||Results by the method described in the abstract.
||SIFT, RGB-SIFT, Opponent-SIFT, C-SIFT. Fisher Vector with 256 Gaussians, 8 regions. Logistic regression classifiers.
||Different configurations for dogs and cars data (single scale)
||IGBA v1.2, 4 cycles, 4h20min
* Indicates using features learned on outside data (e.g. ILSVRC2012)
Each number is the accuracy within a particular domain. Overall is the mean across domains.
||Track 2 results on test data without bounding box
||IGBA v1.2, 4 cycles, 4h20min
* Indicates using features learned on outside data (e.g. ILSVRC2012)
||Paul Kemp (CafeNet)
Ana Ramirez (CafeNet)
|This fine-grained object recognition system is based on our own implementation of the deep convolutional neural network proposed by Krizhevsky et al. that won the ImageNet classification challenge in 2012. We pre-trained the network with the publicly available ImageNet 2012 data and fine-tuned it with the provided Fine-Grained Challenge training data. We pre-train the lower levels of the network on a large collection of images from ImageNet so that they learn generic visual features at different levels. At fine-tuning time, we remove the two top trained layers, the classifier and the fully connected hidden layer, and replace them with a much smaller hidden layer and a classifier for the specific task. Keeping the hidden layer quite small is important to avoid overfitting on datasets of this size.
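As a rough illustration of the fine-tuning step, the sketch below uses plain NumPy with made-up layer sizes (the real network is convolutional and much larger): it drops the top two layers of a pre-trained stack and attaches a smaller hidden layer plus a task-specific classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-trained network as a list of (weight, bias) layers.
# All dimensions are illustrative, not the actual configuration.
def make_layer(n_in, n_out):
    return (rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out))

pretrained = [make_layer(256, 4096),   # generic lower layers (kept)
              make_layer(4096, 4096),  # fully connected hidden layer (dropped)
              make_layer(4096, 1000)]  # 1000-way ImageNet classifier (dropped)

# Fine-tuning: remove the two top layers and attach a much smaller
# hidden layer plus a task-specific classifier to limit overfitting.
n_classes = 196          # e.g. one fine-grained vertical
small_hidden = 256       # deliberately small hidden layer
finetuned = pretrained[:-2] + [make_layer(4096, small_hidden),
                               make_layer(small_hidden, n_classes)]

def forward(layers, x):
    for w, b in layers[:-1]:
        x = np.maximum(x @ w + b, 0.0)  # ReLU on hidden layers
    w, b = layers[-1]
    return x @ w + b                    # raw class scores

scores = forward(finetuned, rng.standard_normal(256))
print(scores.shape)  # one score per fine-grained class
```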
||Kuiyuan Yang, Microsoft Research
Yalong Bai, Harbin Institute of Technology
Yong Rui, Microsoft Research
|Cognitive-psychology-inspired image classification using a Deep Neural Network (DNN). By analogy to human learning, the DNN first learns to classify the five basic-level categories (aircraft, bird, car, dog and shoe) and then learns to classify the categories at the subordinate level for fine-grained object recognition. With this approach, promising results are achieved from a relatively small training set (about 50K training images in total).
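The two-stage idea can be sketched as follows; the classifiers here are random linear stand-ins with invented dimensions, not the actual DNN.

```python
import numpy as np

rng = np.random.default_rng(4)

BASIC = ["aircraft", "bird", "car", "dog", "shoe"]

# Toy stand-ins for the two stages: one basic-level classifier and one
# fine-grained classifier per vertical (all weights are random here).
feat_dim, n_sub = 32, 20
W_basic = rng.standard_normal((len(BASIC), feat_dim))
W_fine = {v: rng.standard_normal((n_sub, feat_dim)) for v in BASIC}

def classify(x):
    """Stage 1: pick the basic-level category; stage 2: classify at
    the subordinate level within that category."""
    vertical = BASIC[int((W_basic @ x).argmax())]
    subclass = int((W_fine[vertical] @ x).argmax())
    return vertical, subclass

vertical, subclass = classify(rng.standard_normal(feat_dim))
print(vertical, subclass)
```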
||Ning Zhang UC Berkeley
Ryan Farrell Brigham Young University
Forrest Iandola UC Berkeley
Jeff Donahue UC Berkeley
Yangqing Jia UC Berkeley
Ross Girshick UC Berkeley
Trevor Darrell UC Berkeley/ICSI
|Track 1: Fine-Grained Classification
Our fine-grained classification strategy is called Deformable Part Descriptors (DPD) [2]. Specifically, we use Deformable Part Models (DPMs) to estimate the pose and localize parts. We extract deep convolutional neural network features (DeCAF) [1] on the DPM parts and the ground-truth bounding box. Next, we pool the DeCAF part descriptors into a single feature vector using semantic weights across DPM components, as described in [2]. Finally, we perform fine-grained classification on the DPD feature vectors using linear SVMs.
Track 2: Fine-Grained Classification without bounding boxes
We use DPM detections instead of ground truth bounding boxes. Other than that, we perform the same DPD technique discussed above.
For both tracks, note that our deep convolutional neural network (DeCAF) is trained on ImageNet (ILSVRC2012).
[1] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." arXiv 2013.
[2] Ning Zhang, Ryan Farrell, Forrest Iandola, and Trevor Darrell. "Deformable Part Descriptors for Fine-grained Recognition and Attribute Prediction." ICCV 2013.
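The semantic-weight pooling step might be sketched like this; the dimensions, weights, and number of semantic slots are toy values, and the actual weighting scheme is the one defined in the DPD paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for DeCAF activations: one descriptor per DPM part
# plus one for the bounding box. Dimensions are illustrative only.
n_parts, feat_dim = 8, 64
part_feats = rng.standard_normal((n_parts, feat_dim))
bbox_feat = rng.standard_normal(feat_dim)

# Hypothetical semantic weights: how strongly each detected part
# contributes to each of a few pooled "semantic part" slots.
n_semantic = 4
weights = rng.random((n_semantic, n_parts))
weights /= weights.sum(axis=1, keepdims=True)   # normalize per slot

# Weighted pooling of part descriptors into semantic slots,
# then concatenation with the bounding-box descriptor.
pooled = weights @ part_feats                   # (n_semantic, feat_dim)
dpd_vector = np.concatenate([pooled.ravel(), bbox_feat])
print(dpd_vector.shape)  # this vector is fed to a linear SVM
```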
|We simply use a deep convolutional network similar to the one described in the reference below. For domain 4, we use web data queried using the label names.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25, 2012.
||Philippe-Henri Gosselin (Inria, Ensea)
Naila Murray (Xerox)
Hervé Jégou (Inria)
Florent Perronnin (Xerox)
|For both tracks, we compute visual features based on dense SIFT and RGB descriptors, spatial-coordinate coding, and Fisher Vectors. Then "one-versus-all" SVM classifiers are run to predict the category of each image. The setup for these methods closely follows the papers that introduced them. For track 1, we extract the box given in the label files (train and test images), then resize the extracted region to 100k pixels. For track 2, we never use the box in the label files, and compute visual features on full images resized to 100k pixels.
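The 100k-pixel resizing can be illustrated as follows, assuming (our reading, not stated explicitly) that the aspect ratio is preserved while the area is scaled to the target.

```python
import math

def resize_to_pixels(width, height, target=100_000):
    """Scale (width, height) so the area is about `target` pixels,
    preserving aspect ratio (assumed behaviour, for illustration)."""
    scale = math.sqrt(target / (width * height))
    return max(1, round(width * scale)), max(1, round(height * scale))

# Track 1: first crop the labelled bounding box, then resize it.
box_w, box_h = 640, 480   # hypothetical bounding-box size
w, h = resize_to_pixels(box_w, box_h)
print(w, h, w * h)        # area close to 100k pixels
```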
||Maxime Pierson (InterfAIce)
Gaëlle Bachmann (InterfAIce)
Dan Grünstein (InterfAIce)
|Based on a new hypothesis about information structure, we have designed and implemented a new type of algorithm that aims to build a deep understanding of any environment. The foundations of the theory claim that the very essence of any environment lies in a set of behaviors that imply a particular coherence and structure in the information emitted by the environment itself.
The algorithm is built around the goal of representing information through a structure that is constantly optimized. To fulfill this objective, the architecture of the algorithm consists of two main modules:
1) Using elements of graph theory, the algorithm provides a “container” in which the information of an environment is projected and structured through smart nodes and links;
2) Through complex adaptive systems, the “content” is assembled and exploited with recursive, non-predefined mechanisms.
In this information graph building algorithm (IGBA), intelligent behaviors emerge from graph structures by establishing a framework that can support any classification task, potentially extending concepts already proposed in technologies such as support vector machines or deep learning. The aim is to understand how an environment is structured given the information it projects, and to study the consequences of actions taken on that environment with regard to a set of goals.
The system is built on a graph structure that is continually optimized by non-predefined mechanisms which are themselves subject to continuous improvement. The IGBA was tuned not to be restricted in how it can learn, and to choose which information to gather to achieve a specific goal.
To gain scientific credibility for the implemented technology, in August 2013 we decided to adapt the IGBA to a practical application for the first time. Computer vision is one of the thorniest challenges due to the complexity of visual information, so we chose to connect the IGBA to that task, with the content of images as the processed information.
Due to the limited resources available, we concentrated our efforts on connecting the IGBA to the fine-grained classification challenge. The results presented here are the first report on our technology.
Unlike previous ILSVRC2012 entrants, and in a complete break with the current state of the art, we did not pre-process the images, in order to test how our algorithm could adapt itself to complete the tasks. The IGBA was tuned to choose how and where to look for relevant information in an image.
To monitor the complex behaviors of this new algorithm, we kept 20% of the training dataset for validation. The IGBA was therefore trained on 80% of the provided dataset on a standard machine (3.7 GHz, 8 GB RAM) over 4 hours and 20 minutes.
To the best of our knowledge, we propose here an entirely new paradigm for processing information and building intelligence around it. Efforts are now under way to improve both the IGBA itself and the performance of this first application in visual recognition.
||Hideki Nakayama (The University of Tokyo)
Masaya Okamoto (The University of Tokyo)
Tomoya Tsuda (The University of Tokyo)
Daiki Miyatani (The University of Tokyo)
Kohei Yamamoto (The University of Tokyo)
|Our system is based on Fisher Vectors of multiple descriptors.
We densely extracted SIFT, RGB-SIFT, Opponent-SIFT, and C-SIFT descriptors from each image.
We used a dense grid with a spacing of three pixels and three different scales for feature extraction.
The descriptors were first compressed to 64 dimensions via PCA.
Then we computed Fisher Vectors with 256 Gaussians from each descriptor type.
The Fisher Vectors were extracted from 3x1 and 2x2 spatial regions as well as from the entire image.
These eight vectors per descriptor constitute the final image signature.
We fitted a logistic regression classifier independently to the Fisher Vector of each descriptor.
Final prediction is conducted through the late fusion of multiple classifiers.
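The late-fusion step could look like the following sketch; simple averaging of per-classifier probabilities is an assumption, since the exact fusion rule is not specified, and all numbers are random toys.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy probability outputs from four per-descriptor classifiers
# (SIFT, RGB-SIFT, Opponent-SIFT, C-SIFT): 5 test images, 10 classes.
n_images, n_classes = 5, 10
probs = [rng.dirichlet(np.ones(n_classes), size=n_images)
         for _ in range(4)]

# Late fusion: average the per-classifier probabilities,
# then predict the argmax class for each image.
fused = np.mean(probs, axis=0)
pred = fused.argmax(axis=1)
print(pred)
```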
|The method is based on "Symbiotic Segmentation and Part Localization for Fine-Grained Categorization" (ICCV 2013). The following operations are applied to each domain independently; results from all domains are merged at the very end.
We train a domain-specific joint part-detection and foreground-segmentation model using only the training images and their bounding boxes. The model is applied to all images, generating one foreground segmentation and a set of part-detection windows per image. Fisher-encoded SIFT and color histograms are extracted from the foreground area and from each detected part. All features are concatenated into a final high-dimensional representation, which is fed into a linear SVM for classification. In the linear-SVM stage, 5-fold bagging is used for track 1. Vertically mirrored training images are added to the original training set for all models (apart from the classification model for dogs).
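The 5-fold bagging in the classification stage can be sketched as below; a least-squares linear model stands in for the linear SVM, and the data are random toys of far lower dimension than the real features.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy concatenated features (Fisher-encoded SIFT + colour histograms
# would be far higher-dimensional in practice).
X = rng.standard_normal((50, 8))
y = rng.integers(0, 2, size=50)

def fit_linear(X, y):
    """Least-squares linear model on +/-1 targets (stand-in for an SVM)."""
    Xb = np.c_[X, np.ones(len(X))]                 # add bias column
    w, *_ = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)
    return w

# 5-fold bagging: train one model per 4/5 split of the training data,
# then average the decision scores at test time.
folds = np.array_split(rng.permutation(len(X)), 5)
models = []
for i in range(5):
    keep = np.concatenate([f for j, f in enumerate(folds) if j != i])
    models.append(fit_linear(X[keep], y[keep]))

x_test = np.r_[rng.standard_normal(8), 1.0]
score = np.mean([w @ x_test for w in models])      # averaged vote
print(int(score > 0))
```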
||Qi Qian (Michigan State Univ.)
Shenghuo Zhu (NEC)
Rong Jin (Michigan State Univ.)
Xiaoyu Wang (NEC)
Yuanqing Lin (NEC)
|Features are dense HOG with LLC coding, combined with features from an existing CNN model (DeCAF, see below). A distance metric learning framework obtains a low-dimensional embedding. The classifier is a smoothed k-NN.
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. arXiv:1310.1531.
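The smoothed k-NN classifier is not specified in detail; one plausible reading, with distance-weighted soft votes over toy low-dimensional embeddings, is sketched here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy low-dimensional embeddings, standing in for features projected
# by the learned metric (the metric-learning step itself is omitted).
X_train = rng.standard_normal((100, 16))
y_train = rng.integers(0, 5, size=100)

def smoothed_knn(x, X, y, k=5, n_classes=5):
    """k-NN with distance-based soft votes instead of hard counts
    (one plausible reading of a 'smoothed' k-NN classifier)."""
    d = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(d)[:k]
    votes = np.zeros(n_classes)
    for i in nn:
        votes[y[i]] += np.exp(-d[i])   # closer neighbours vote more
    return int(votes.argmax())

pred = smoothed_knn(rng.standard_normal(16), X_train, y_train)
print(pred)
```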