Tutorial Overview

in conjunction with the Int'l Conference on Image Analysis and Processing (ICIAP), Naples, Italy - www.iciap2013-naples.org 

Organizers
        Lamberto Ballan - University of Florence, Italy
        Lorenzo Seidenari - University of Florence, Italy

Abstract
Automatic image annotation is an important task, in which the goal is to determine the relevance of annotation terms for images. Several efforts have been made in recent years to design and develop effective and efficient algorithms for visual recognition and retrieval. To this end, a common and successful approach is to quantize local visual features (e.g. SIFT) following the well-known bag-of-visual-words paradigm. Then, a classifier (e.g. SVM) can be learned from a collection of images manually labeled as belonging to an object category or not. The goal of this tutorial is to give participants basic practical experience with image classification. The participants will be guided to implement a system in Matlab based on the bag-of-visual-words image representation and will apply it to image classification. The emphasis of the tutorial will be on the important general concepts rather than on in-depth coverage of contemporary papers.

Intended audience and expected knowledge to be transferred
  • This is an introductory/intermediate tutorial on visual classification. It is intended for first- or second-year PhD candidates in computer vision, and for experts from other areas of computer science and pattern recognition who want in-depth knowledge of what is currently the standard architecture of state-of-the-art visual classification systems.
  • The attendees will get a full overview of a bag-of-visual-words recognition pipeline: from the feature computation to the learning of the statistical model of visual concepts. The approach will be decomposed into several steps and each step will be inspected in detail. Attendees will get the tools to debug each step of a visual recognition system.

Downloads (code, images and features)
Matlab Code: download code (note: download images into the 'img' directory)

Tutorial Outline (September 9, 2013 - Villa Doria D'Angri)

14:30 - 15:50 - Part I      Download slides (part 1)
  • Introduction (20 minutes - slides)
    • define the problem of image categorization;
    • introduce the basic idea of bag-of-visual-words models [1,2];
    • main drawbacks and solutions: effective codebooks [4,5], feature coding and pooling, spatial pyramids [3].
Objectives/materials: In this first part of the tutorial we will introduce the basic ideas of BoW models for image categorization, some practical issues related to the implementation of an effective visual recognition system, and we will describe some algorithms and techniques to improve a standard BoW pipeline.
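As a rough illustration of the spatial pyramid idea [3], the sketch below concatenates per-cell visual-word histograms over grids of increasing resolution. It is written in plain Python rather than the Matlab used in the tutorial materials, and it is not the tutorial's code: the function name `spatial_pyramid` and its parameters are hypothetical, keypoint positions are assumed to be normalised to [0, 1), and the per-level weighting used in [3] is left out for brevity.

```python
def spatial_pyramid(points, words, vocab_size, levels=2):
    """Concatenate per-cell word histograms over a pyramid of grids.

    points: (x, y) keypoint positions, assumed normalised to [0, 1).
    words:  codeword index assigned to each keypoint.
    Level l splits the image into 2**l x 2**l cells.
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level
        hists = [[0] * vocab_size for _ in range(cells * cells)]
        for (x, y), w in zip(points, words):
            # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            hists[row * cells + col][w] += 1
        for h in hists:
            feature.extend(h)
    return feature
```

With a vocabulary of K words and levels 0..L, the resulting vector has K * (1 + 4 + ... + 4**L) entries, so the codebook size and the pyramid depth need to be balanced against each other.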
  • Session I - "Standard" BoW pipeline (60 minutes - practical session)
    • local feature sampling (detectors, multi-scale dense sampling);
    • creation of the codebook and feature quantization (i.e. k-means clustering and hard assignment);
    • statistical models: NN, linear SVM, kernelized SVM.
Objectives/materials: In this part of the tutorial we will help the participants implement a full BoW pipeline for image classification; we will provide the Matlab code and the data (a subset of the Caltech-101 dataset) for each of these steps.
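The codebook and quantization steps listed above can be sketched in a few lines. The example below is a minimal, standard-library Python illustration (the tutorial itself provides Matlab code, which is not reproduced here): a plain Lloyd's k-means builds the codebook, and hard assignment turns the descriptors of one image into an L1-normalised histogram. All function names are hypothetical, and descriptors are assumed to be short numeric tuples rather than real 128-dimensional SIFT vectors.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(descriptor, codebook):
    """Index of the codeword closest to the descriptor (hard assignment)."""
    return min(range(len(codebook)),
               key=lambda k: sq_dist(descriptor, codebook[k]))

def kmeans(descriptors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k codewords (the visual vocabulary)."""
    rng = random.Random(seed)
    # Initialise with k distinct training descriptors.
    codebook = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(d, codebook)].append(d)
        for j, members in enumerate(clusters):
            if members:  # keep the old codeword if a cluster is empty
                codebook[j] = [sum(col) / len(members)
                               for col in zip(*members)]
    return codebook

def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of codeword occurrences for one image."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

The resulting histograms are the fixed-length image representation that a nearest-neighbour or SVM classifier would then consume. In practice the codebook is usually learned on a random sample of descriptors from the training images only.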

16:20 - 18:30 - Part II     Download slides (part 2)     Download slides (part 3)
  • Session II - Advanced BoW models for visual recognition (110 minutes - practical session)
    • feature fusion: local, global and multiple cues (early and late approaches);
    • alternative codebooks: MoG (Fisher Vectors) [6];
    • feature quantization: modern reconstruction-based approaches (LLC) [7];
    • improved spatial-pooling, max/average pooling;
    • applications to different domains: bag-of-X (e.g. action recognition in video) [4].
Objectives/materials: In this part of the tutorial we will show a sample of more recent techniques and extensions to the bag-of-words approach.
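To make the difference between average and max pooling concrete, the sketch below first codes each descriptor with a Gaussian soft assignment (in the spirit of kernel codebooks [5]) and then pools the per-descriptor codes into a single image-level vector. This is an illustrative Python sketch under stated assumptions, not the tutorial's Matlab code: the names `soft_assign` and `pool` and the `sigma` parameter are hypothetical, and the more elaborate LLC [7] and Fisher Vector [6] encodings are not shown.

```python
import math

def soft_assign(descriptor, codebook, sigma=1.0):
    """Gaussian-weighted membership of one descriptor to every codeword."""
    weights = []
    for word in codebook:
        d2 = sum((x - y) ** 2 for x, y in zip(descriptor, word))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)  # assumes the descriptor is not far from all codewords
    return [w / total for w in weights]

def pool(codes, mode="average"):
    """Collapse per-descriptor codes into one image-level vector."""
    columns = list(zip(*codes))
    if mode == "max":
        return [max(c) for c in columns]
    if mode == "average":
        return [sum(c) / len(c) for c in columns]
    raise ValueError("mode must be 'max' or 'average'")
```

Average pooling of hard-assignment codes reproduces the standard BoW histogram, while max pooling keeps only the strongest response per codeword.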
  • Discussion and Conclusions (20 minutes - slides)
    • implementation and practical details;
    • reference to other relevant works not covered by this tutorial;
    • open problems.
References
[1] G. Csurka, C. Dance, L. Fan, J. Willamowski and C. Bray, "Visual Categorization with Bags of Keypoints", ECCV SLVC Workshop, 2004
[2] J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos", ICCV, 2003
[3] S. Lazebnik, C. Schmid and J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", CVPR, 2006
[4] L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari and G. Serra, "Effective Codebooks for Human Action Representation and Classification in Unconstrained Videos", IEEE TMM, 2012
[5] J. C. van Gemert, J. M. Geusebroek, C. J. Veenman and A. W. M. Smeulders, "Kernel codebooks for scene categorization", ECCV, 2008
[6] F. Perronnin, J. Sánchez and T. Mensink, "Improving the Fisher kernel for large-scale image classification", ECCV, 2010
[7] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang and Y. Gong, "Locality-constrained linear coding for image classification", CVPR, 2010