When you hear about searching, chances are the first image that pops into your mind is a web browser and a web search engine. From the ancient AltaVista to Google, querying capabilities built on web crawlers and indexers have shaped the way we pursue and retrieve information: “If you don’t know something, you Google it.” As of 2017, this simple action comes with a caveat: there are billions of websites on the World Wide Web, totaling approximately half a zettabyte, or 5 × 10^20 bytes = 500,000,000,000,000,000,000 bytes. If you are looking for a popular item, such as the CD cover of your favorite band, you will often find it in seconds. However, if you are looking for a specific scientific image, it may feel like looking for a needle in a haystack. This is why teams at BIDS and CRIC decided to tackle this issue and create a tool tailored to scientific datasets that sit in databases not readily discoverable on the web.
pyCBIR is a new Python tool for content-based image retrieval (CBIR) capable of searching for relevant items in large databases given unseen samples. While much work in CBIR has targeted ads and recommendation systems, pyCBIR allows general-purpose investigation across image domains and experiments. pyCBIR also offers several distance metrics and feature extraction techniques, including a convolutional neural network (CNN).
Problem: 500 Exabyte Haystack
Image capture has turned into a ubiquitous activity in our daily lives, but mechanisms to organize and retrieve images based on their content are available only to a few people or for very specific problems. With significant improvements in image-processing speeds and the availability of large storage systems, developing methods to query and retrieve images is fundamental to activities ranging from simple cataloguing to complex research, such as synthesizing materials. CBIR systems use computer vision techniques to describe images in terms of their properties, so that a search for similar samples takes an image as the query instead of keywords. For this reason, the system works independently of annotations, which can be time-consuming to produce and impossible in some scenarios (e.g., with high-throughput imaging instruments).
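To make "describing images in terms of their properties" concrete, here is a minimal sketch that turns an image into a numeric feature vector. It assumes NumPy and uses a plain intensity histogram as the descriptor; the function name `intensity_histogram`, the bin count, and the synthetic images are illustrative choices, not pyCBIR's own feature extractors.

```python
import numpy as np

def intensity_histogram(image, bins=16):
    """Describe a grayscale image (values in [0, 1]) by its normalized intensity histogram."""
    counts, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

# Two synthetic "images" standing in for real data: one dark, one bright.
rng = np.random.default_rng(0)
dark = rng.uniform(0.0, 0.3, size=(64, 64))
bright = rng.uniform(0.7, 1.0, size=(64, 64))

# Each image is now a 16-number vector that can be compared without any annotation.
dark_features = intensity_histogram(dark)
bright_features = intensity_histogram(bright)
```

The dark image concentrates its mass in the low-intensity bins and the bright image in the high-intensity bins, so the two vectors differ even though no keyword or label was ever attached to either image.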
Data Science to the Rescue
We developed a CBIR tool called pyCBIR, written in the Python programming language. The tool offers six feature-extraction methods and 10 distance metrics (see Figure 1). A search takes a single image (or a set of images) as the query, and pyCBIR then retrieves and ranks the most similar images according to user-selected parameters.
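The retrieve-and-rank step can be sketched in a few lines. This is a simplified illustration assuming NumPy and feature vectors that have already been extracted; the three distance functions, the `DISTANCES` table, and `rank_database` are hypothetical names chosen for this example and do not reflect pyCBIR's actual interface or its full set of 10 distances.

```python
import numpy as np

# Illustrative distance functions; pyCBIR's own set may differ.
def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))

def cosine(a, b):
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

DISTANCES = {"euclidean": euclidean, "manhattan": manhattan, "cosine": cosine}

def rank_database(query, database, distance="euclidean", top_k=3):
    """Return (index, distance) pairs for the top_k database vectors closest to the query."""
    dist_fn = DISTANCES[distance]
    scores = [(i, dist_fn(query, vec)) for i, vec in enumerate(database)]
    scores.sort(key=lambda pair: pair[1])  # smallest distance = most similar
    return scores[:top_k]

# Toy feature vectors standing in for descriptors of four database images.
database = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
    [0.8, 0.2, 0.0],
])
query = np.array([0.88, 0.12, 0.0])

ranking = rank_database(query, database, distance="euclidean", top_k=2)
```

Here the user-selected parameters are the distance name and `top_k`; swapping `"euclidean"` for `"cosine"` or `"manhattan"` changes how similarity is measured without touching the rest of the search.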