Summary
Object recognition plays an important role in robotics, since objects and tools first have to be identified in the scene before they can be manipulated or used. The performance of object recognition largely depends on the training dataset. Usually such training sets are gathered manually by a human operator, a tedious procedure which ultimately limits the size of the dataset. One reason for manual selection of samples is that results returned by search engines often contain irrelevant images, mainly due to the problem of homographs (words spelled the same but with different meanings).
We developed two algorithms (Kulvicius et al., 2014; Schoeler et al., 2014a,b) for automated and unsupervised generation of “clean” image databases that can cope with the problem of homographs. For example, the word “nut” can refer to a piece of hardware or to a fruit. For disambiguation we make use of image searches (e.g., Google), text searches and language translations.
In the first approach (called SIMSEA; Kulvicius et al., 2014) we use additional linguistic cues to demarcate the intended meaning of a word and combine this linguistic refinement with the image search in the following way. We conduct several different image subsearches, pairing the basic search term with an additional linguistic cue. For example, if we are interested in the category “nut”, we search for “bolt nut”, “metal nut”, “plastic nut”, etc., depending on the context of interest. The expectation is that images retrieved by more than one of these subsearches are more likely to be relevant than those retrieved only once.

In the second approach (called TRANSCLEAN; Schoeler et al., 2014a,b) we address the problem of homographs by automatically (without human supervision) generating task-relevant training sets for object recognition from the information contained in a language-based command such as “tighten the nut”. The approach rests on two observations: 1) a word is rarely a homograph in several languages at the same time, and 2) the context (action) provided by the command can be used to discard ambiguous and non-task-relevant translations.

We evaluated the performance of both methods on an image classification task (10/15 ambiguous classes) and obtained, on average, a 17% and 20% improvement in object recognition (SIMSEA and TRANSCLEAN, respectively) compared to a standard Google search.
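To make the cross-subsearch idea behind SIMSEA concrete, the following is a minimal sketch rather than the published implementation: it assumes each cue-refined subsearch returns a set of image identifiers and simply keeps images that appear in at least two subsearches.

```python
from collections import Counter

def simsea_filter(subsearch_results, min_hits=2):
    """Keep images returned by at least `min_hits` cue-refined subsearches.

    subsearch_results: list of sets of image identifiers (e.g. URLs or
    perceptual hashes), one set per subsearch such as "bolt nut",
    "metal nut", "plastic nut".
    """
    counts = Counter()
    for results in subsearch_results:
        counts.update(results)
    return {img for img, n in counts.items() if n >= min_hits}

# Hypothetical identifiers returned by three subsearches for the category "nut".
bolt_nut    = {"img_01", "img_02", "img_07"}
metal_nut   = {"img_02", "img_03", "img_07"}
plastic_nut = {"img_02", "img_05"}

print(simsea_filter([bolt_nut, metal_nut, plastic_nut]))
# -> {'img_02', 'img_07'}: images retrieved by more than one subsearch
```

Likewise, a minimal sketch of the translation-based disambiguation behind TRANSCLEAN. The translation table and co-occurrence counts below are toy placeholders standing in for a real translation service and for text-search statistics; names and numbers are illustrative only.

```python
# Toy placeholders: candidate translations of "nut" and of the action "tighten",
# plus stub co-occurrence counts that a real system would obtain from text searches.
TRANSLATIONS = {
    "de": ["Mutter", "Nuss"],
    "fr": ["écrou", "noix"],
}
ACTION_TRANSLATIONS = {
    "de": "anziehen",
    "fr": "serrer",
}
COOCCURRENCE = {
    ("Mutter", "anziehen"): 120, ("Nuss", "anziehen"): 2,
    ("écrou", "serrer"): 95,     ("noix", "serrer"): 1,
}

def transclean_queries(noun_candidates, action, cooccurrence):
    """For each language, keep the noun sense that best fits the commanded action
    and build a language-specific image-search query from it."""
    queries = {}
    for lang, candidates in noun_candidates.items():
        act = action[lang]
        best = max(candidates, key=lambda c: cooccurrence.get((c, act), 0))
        queries[lang] = f"{best} {act}"
    return queries

print(transclean_queries(TRANSLATIONS, ACTION_TRANSLATIONS, COOCCURRENCE))
# -> {'de': 'Mutter anziehen', 'fr': 'écrou serrer'}: the fruit sense is discarded
```

Because the hardware and fruit senses of “nut” translate to different words in German and French, pooling image results from these per-language queries yields a training set dominated by the task-relevant sense.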
Publications
Video