Abstract: Object detection is one of the fundamental problems in computer vision. It aims to detect and localize objects of a particular class, such as humans or cars, in an image. It forms the basis of many vision tasks such as surveillance, image captioning, and object tracking. With the development of various deep learning models, much research attention has been devoted to object detection, leading to several significant improvements in architecture and inference time. Most object detection research in the past few years has focused on natural images of real-life objects. The goal of this work is to study object detection for a very different class of images, namely computer-generated scientific plots. Scientific plots such as bar plots and line plots provide an efficient way of representing data visually when a table cannot adequately show the meaningful relationships or patterns between data points. Such plots are frequently found in textbooks, technical reports, and academic papers. Interpreting and understanding the underlying data encoded in these plots is considered a test of human aptitude. Hence, it is of interest to build systems that can understand and reason over scientific plots.
Scientific plots differ from natural images in three crucial ways. First, unlike natural images, they combine both text (e.g., axes, tick labels) and visual elements (e.g., bars and legends). Second, they exhibit significant variation in the scale and aspect ratio of objects (e.g., thin dot-lines and thick, long bars). Lastly, there are underlying structural relationships between objects (e.g., a tick label and the corresponding bar in a bar plot) which can be exploited for better understanding and reasoning. Further, localization accuracy is significantly more critical for plots than for natural images. This leads to the following question: “Are existing object detection methods adequate for detecting text and visual elements in scientific plots, which are arguably different from the objects found in natural images?” To answer this question, we train nine state-of-the-art object detection networks on the PlotQA dataset, which contains over 220,000 scientific plots, and compare their accuracy. At the standard IOU setting of 0.5, most networks perform well, with mAP scores higher than 80% in detecting the relatively simple objects in plots. However, performance drops drastically when evaluated at a stricter IOU of 0.9, with the best model giving an mAP of 35.70%. Such a stricter evaluation is essential when dealing with scientific plots, where even minor localization errors can lead to significant errors in downstream numerical inferences.
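To make the effect of the IOU threshold concrete, the following is a minimal sketch of the standard intersection-over-union computation for axis-aligned boxes. The box coordinates, the helper names iou and is_match, and the thin-bar example are illustrative assumptions, not values or code from PlotQA or from the networks compared in this work.

```python
from typing import Tuple

# A box is (x_min, y_min, x_max, y_max) in pixels. This layout is an assumption
# made for illustration only.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, gt: Box, threshold: float) -> bool:
    """A prediction counts as correct if its IOU meets the threshold."""
    return iou(pred, gt) >= threshold


# Hypothetical thin bar, 10 px wide and 200 px tall.
thin_bar_gt: Box = (100.0, 50.0, 110.0, 250.0)
# The same box predicted with a 1 px horizontal shift.
thin_bar_pred: Box = (101.0, 50.0, 111.0, 250.0)

if __name__ == "__main__":
    print(f"IOU = {iou(thin_bar_gt, thin_bar_pred):.3f}")  # IOU = 0.818
    print(is_match(thin_bar_pred, thin_bar_gt, 0.5))        # True
    print(is_match(thin_bar_pred, thin_bar_gt, 0.9))        # False
```

In this made-up case a one-pixel shift on a thin bar is accepted at an IOU of 0.5 but rejected at 0.9, which illustrates why thin plot elements are sensitive to the stricter setting.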
Given this poor performance, we propose minor modifications to existing models by combining ideas from different object detection networks. While this significantly improves performance, two main issues remain: (i) performance on textual objects, which are essential for reasoning, is abysmal, and (ii) inference time is unacceptably large considering the simplicity of plots. Based on these experiments and results, we identify the following considerations for improving object detection on plots: (a) lower inference time, (b) higher precision on textual objects, and (c) more accurate localization through a custom loss function with non-negligible loss values at high IOU (> 0.8). We propose a network, PlotNet, which meets all these considerations: it is 16x faster than the best performing competitor and significantly improves upon the accuracy of existing models, with an mAP of 93.44% at an IOU of 0.9.
Bio: Mitesh M. Khapra is an Assistant Professor in the Department of Computer Science and Engineering at IIT Madras. While at IIT Madras he plans to pursue his interests in the areas of Deep Learning, Multimodal Multilingual Processing, Dialog Systems and Question Answering. Before joining IIT Madras, he worked as a Researcher at IBM Research India. During the four and a half years that he spent at IBM, he worked on several interesting problems in the areas of Statistical Machine Translation, Cross Language Learning, Multimodal Learning, Argument Mining and Deep Learning. This work led to publications in top conferences in the areas of Computational Linguistics and Machine Learning. Prior to IBM, he completed his PhD and M.Tech at IIT Bombay in Jan 2012 and July 2008 respectively. His PhD thesis dealt with the important problem of reusing resources for multilingual computation. During his PhD he was a recipient of the IBM PhD Fellowship and the Microsoft Rising Star Award. He is also a recipient of the Google Faculty Research Award, 2018.