APEX-1M Dataset And APEX-Net Evaluation Results

Abstract

Automatic extraction of raw data from 2D line plot images is a problem of great importance having many real-world applications. Several algorithms have been proposed for solving this problem. However, these algorithms involve a significant amount of human intervention. To minimize this intervention, we propose APEX-Net, a deep learning based framework with novel loss functions for solving the plot extraction problem. We introduce APEX-1M, a new large scale dataset which contains both the plot images and the raw data. We demonstrate the performance of APEX-Net on the APEX-1M test set and show that it obtains impressive accuracy. We also show visual results of our network on unseen plot images and demonstrate that it extracts the shape of the plots to a great extent. Finally, we develop a GUI based software for plot extraction that can benefit the community at large. The dataset and code will be made publicly available.

APEX-Net Architecture

APEX-Net Results

APEX-Net Evaluation

Loss Functions

The Plot Loss computes the L2 error between each ground-truth plot and its closest predicted plot. Here, K denotes the number of plots in the input image and K̂ denotes the maximum number of predicted plots.
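A minimal NumPy sketch of the nearest-prediction matching described above may help make the idea concrete. It assumes each plot is stored as a fixed-length vector of D sampled values, uses the squared L2 distance, and sums over the ground-truth plots; the function name `plot_loss`, the array layout, and the choice of sum rather than mean are illustrative assumptions, not details taken from the APEX-Net code.

```python
import numpy as np

def plot_loss(gt_plots: np.ndarray, pred_plots: np.ndarray) -> float:
    """Sum, over ground-truth plots, of the squared L2 distance to the closest prediction.

    gt_plots:   shape (K, D), one row per ground-truth plot (assumed layout)
    pred_plots: shape (K_hat, D), one row per predicted plot (assumed layout)
    """
    # Pairwise differences via broadcasting: shape (K, K_hat, D)
    diff = gt_plots[:, None, :] - pred_plots[None, :, :]
    # Squared L2 distance for every (ground truth, prediction) pair: shape (K, K_hat)
    dist = np.sum(diff ** 2, axis=-1)
    # Keep only the closest prediction for each ground-truth plot, then sum
    return float(dist.min(axis=1).sum())
```

Because the minimum is taken per ground-truth plot, several ground-truth plots may in principle select the same prediction; whether the original loss prevents this is not stated here.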

The Score Loss computes the divergence between the actual probability of plots being present in an image, given by the characteristic function of A, and the predicted probability of plots given by the APEX-Net model. This loss forces the model to predict the correct number of plots present in a given image. At test time, we set the threshold for a correct plot prediction to 0.5. We also assume that the maximum number of plots present in an image is less than K̂; in our work, we fix K̂ to 10.
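The sketch below illustrates one way such a score loss could be computed. It assumes that A is the set of predicted-plot indices that were matched to a ground-truth plot and that the divergence is a binary cross-entropy; both are assumptions made for illustration, as are the names `score_loss` and `predicted_plot_count`. The 0.5 threshold is the test-time value stated above.

```python
import numpy as np

def score_loss(scores: np.ndarray, assigned, eps: float = 1e-7) -> float:
    """Binary cross-entropy between the indicator of A and the predicted scores.

    scores:   shape (K_hat,), predicted probability that each slot holds a real plot
    assigned: indices of predictions matched to a ground-truth plot (assumed meaning of A)
    """
    target = np.zeros_like(scores, dtype=float)
    target[np.asarray(assigned, dtype=int)] = 1.0  # characteristic function of A
    p = np.clip(scores, eps, 1.0 - eps)            # avoid log(0)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(bce.sum())

def predicted_plot_count(scores: np.ndarray, threshold: float = 0.5) -> int:
    """Number of predicted plots whose score exceeds the test-time threshold."""
    return int(np.sum(scores > threshold))
```

With K̂ fixed to 10, `scores` would have ten entries, one per prediction slot.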

To train the APEX-Net model, we add the Plot Loss and the Score Loss to obtain the Total Loss.
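Written out, the sum takes the form below. The symbol names are chosen here for illustration only; no weighting between the two terms is mentioned in the text, so none is shown.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{plot}} + \mathcal{L}_{\text{score}}
```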

The intuition behind using these loss functions is as follows. To each of the K ground-truth plots, we assign the closest among the K̂ predicted plots. To facilitate the extraction of accurate raw plot data, we minimise the distance between the resulting closest pairs. Further, if a predicted plot is assigned to a ground-truth plot, we prefer its score to be close to 1, and close to 0 otherwise.

Qualitative Results

Plot Loss APEX-1M Test Dataset

Below, we show some qualitative results. The first image (top-left) is the input image to the model. The second image (top-right) depicts the plots predicted by the model, with the total error indicated in the title. Each of the remaining images depicts one of the individual plots contained in the original input image, where the black plot indicates the ground truth and the coloured plot indicates the prediction. The error region between the ground truth and the prediction is shaded in light red. The individual error of each constituent plot is indicated in the title. Here, we have chosen cases where 5 plots were present in the image, because high errors are more likely in such images. The results shown range from low to high error values. Assuming Ki is the number of plots present in the i-th image, the total plot prediction error for the i-th image is computed from the constituent errors Ej of the individual plots in that image.

Performance of APEX-Net Model on APEX-1M Test Dataset

We use the L2 norm as the evaluation metric for how close a predicted plot is to its ground-truth plot, and the mean error as the evaluation metric for plot count.

Ēplot represents the mean of Eplot over the entire APEX-1M test set. For further illustration, please refer to the images under "Plot Loss APEX-1M Test Dataset" in the Qualitative Results section, where the difference between the predicted and ground-truth plots is shaded in red. Based on visual inspection, we conclude that a total error value of 6.82 indicates good performance by APEX-Net.

Here, N is the number of images in our test set, which is 2 × 10⁵.
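As a rough illustration of how a test-set average of this kind can be computed, the sketch below takes the per-plot errors of each image and averages the per-image totals over the N images. It assumes the total error of an image is the sum of its constituent plot errors; that aggregation, the function name `mean_plot_error`, and the input layout are assumptions for illustration rather than the definition used in the paper.

```python
import numpy as np

def mean_plot_error(per_image_plot_errors) -> float:
    """Average, over all images, of each image's total plot error.

    per_image_plot_errors: list of N sequences; entry i holds the errors
    of the K_i plots in image i (K_i may differ between images).
    """
    # Total error per image, assumed here to be the sum of its plot errors
    totals = [float(np.sum(errors)) for errors in per_image_plot_errors]
    # Mean over the N images in the test set
    return float(np.mean(totals))
```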

Ecount measures how accurately our model predicts the number of plots present in an image from the APEX-1M test set.
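One plausible reading of this count metric is sketched below: the predicted count for each image is the number of scores above the 0.5 test-time threshold, and the error is the mean absolute difference from the true count. The use of an absolute difference and the name `mean_count_error` are assumptions made for illustration.

```python
import numpy as np

def mean_count_error(scores, true_counts, threshold: float = 0.5) -> float:
    """Mean absolute difference between predicted and true plot counts.

    scores:      shape (N, K_hat), predicted scores for each image
    true_counts: shape (N,), number of plots actually present in each image
    """
    # Predicted count per image: number of slots whose score exceeds the threshold
    pred_counts = (np.asarray(scores) > threshold).sum(axis=1)
    return float(np.mean(np.abs(pred_counts - np.asarray(true_counts))))
```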

Sample Images From APEX-1M Dataset

APEX-2.5K Dataset [Subset of APEX-1M]

Train [~370 MB]: Contains 2000 training images (PNG format) along with the ground truth in NumPy format (.npy).

Test [~90 MB]: Contains 500 test images (PNG format) along with the ground truth in NumPy format (.npy).

Note: To download the entire dataset, please contact the authors.
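Since the ground truth is distributed as .npy files, a quick way to inspect a sample is to load it with NumPy and look at its shape and data type. The helper name `describe_ground_truth` and any file path passed to it are hypothetical; the internal layout of the arrays is not described here, so the sketch only reports what it finds.

```python
import numpy as np

def describe_ground_truth(npy_path: str):
    """Load one ground-truth file and return its shape and dtype for inspection."""
    # If a file turns out to store Python objects, np.load needs allow_pickle=True
    arr = np.load(npy_path)
    return arr.shape, arr.dtype
```

The accompanying PNG images can be read with any standard image library.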

APEX-Net GUI Application

Code and Citation

Project Code [ Link ]

Project Paper [ Link ]

@misc{gangopadhyay2021apexnet,
  title={APEX-Net: Automatic Plot Extractor Network},
  author={Aalok Gangopadhyay and Prajwal Singh and Shanmuganathan Raman},
  year={2021},
  eprint={2101.06217},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}