Project Objective: The main aim is to extract structured data from an unstructured format, i.e., to digitize invoice images and store the information extracted from them. The dataset I'm going to use is the RVL-CDIP collection: http://www.cs.cmu.edu/~aharley/rvl-cdip/. From this dataset I'm going to extract only the invoice data.
About Dataset:
This collection contains high-resolution images of scanned documents of different categories. For this project I'm going to load the labels folder from the dataset, collect the image paths listed there, read the actual images, and extract the data from the invoice images. The dataset contains many categories, but I'm extracting only the invoice images by using their category number.
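The filtering step above can be sketched as follows. This is a minimal sketch, assuming each line of an RVL-CDIP label file has the form `<image path> <class id>`, and that the invoice category id is 11 (check this against the dataset's own label map before relying on it):

```python
# Sketch: collect invoice image paths from an RVL-CDIP label file.
# Assumption: invoice is class id 11 in the dataset's label map.
INVOICE_CLASS_ID = 11

def invoice_paths(label_lines, target=INVOICE_CLASS_ID):
    """Return only the image paths whose class id matches the target category."""
    paths = []
    for line in label_lines:
        parts = line.split()
        if len(parts) == 2 and int(parts[1]) == target:
            paths.append(parts[0])
    return paths

# Usage: read e.g. labels/train.txt and keep only the invoice images.
# with open("labels/train.txt") as f:
#     invoices = invoice_paths(f)
```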
Methodology:
Create a dataset of the invoice images and train a model to locate the text regions in them.
Train using an R-CNN or a YOLO model.
Identify the bounding boxes and extract the text inside them using Google Tesseract.
Measure accuracy using IoU (Intersection over Union).
Use Optical Character Recognition (OCR) to recognize the text in the images.
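The bounding-box-to-OCR step above can be sketched like this. `box_to_pixels` is a hypothetical helper I'm introducing for illustration; the normalized `(ymin, xmin, ymax, xmax)` ordering is what TensorFlow detection models emit. The Tesseract call is left commented out so the sketch runs without Tesseract installed:

```python
# Sketch: map a detector's normalized box to pixel coordinates,
# then hand the cropped region to Tesseract for text recognition.

def box_to_pixels(box, width, height):
    """box = (ymin, xmin, ymax, xmax) in [0, 1]; returns (left, top, right, bottom) in pixels."""
    ymin, xmin, ymax, xmax = box
    return (int(xmin * width), int(ymin * height),
            int(xmax * width), int(ymax * height))

# from PIL import Image
# import pytesseract
# left, top, right, bottom = box_to_pixels((0.1, 0.2, 0.3, 0.8), 1000, 1400)
# crop = Image.open("invoice.tif").crop((left, top, right, bottom))
# text = pytesseract.image_to_string(crop)
```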
Literature Survey:
Nowadays, with fast-growing technology, users keep scanned invoices to track their expenses. From those scanned images we can extract the date, place, vendor, currency type, the total transaction amount, and the items purchased. We receive invoices for every purchase: cabs, hotels, hospitals, etc. Invoice processing is essentially the handling of invoices with an automated system that captures and scans each invoice and extracts its data in a timely and efficient manner. Extracting data manually has several drawbacks: higher costs, greater manpower, lost time, and a larger carbon footprint. Even for small business organizations, maintaining invoices helps them analyze their sales, and with that analysis they can adopt new techniques and increase their sales. By using deep learning and OCR we can automatically extract the tables and text from the images. Digitizing the information has many advantages: the data can be stored in a database and referred to later.
Phase:1
Phase:2
Previous Work:
Previous analyses were performed using CNN networks, and they extracted data from images, but not from invoice data. There is an existing algorithm to extract invoice data from PDF format using Python; the following link describes the library: https://pypi.org/project/invoice2data/
Data has also been extracted from images using machine learning.
My method is different in that I annotate the images using the LabelImg tool, then implement an object detection model, and then extract the invoice numbers from those images.
Phase:3
Initially, I researched how to approach my project and how to build an object detection model for it. I learned about multiple methods for performing object detection on my data.
At first, I thought of applying a pre-trained model to perform object detection. I used the ssd_mobilenet_v1_coco.config model and fine-tuned it on my images. Because the model was pre-trained, it directly identified the objects in an image by drawing bounding boxes around them. Here, I started from the COCO dataset labels and trained on my own images.
Object Detection Model:
Installed all the required libraries and created an environment for the model. By using TensorBoard in Colab we can follow the trends in the training data.
By training the model we can monitor the learning rate, loss, global step, and batch values.
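For reference, these training values come from the pipeline configuration file. Below is a trimmed fragment in the style of ssd_mobilenet_v1_coco.config; the numbers are illustrative defaults, not necessarily my exact settings:

```
train_config {
  batch_size: 24
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_factor: 0.95
        }
      }
    }
  }
  fine_tune_checkpoint: "ssd_mobilenet_v1_coco/model.ckpt"
  num_steps: 200000
}
```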
Challenges:
The main challenge is that we need to annotate the images, save the annotations in XML format, and then convert all those XML files into a single CSV file with columns such as filename, width, height, class, xmin, ymin, xmax, and ymax.
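The XML-to-CSV step can be sketched as follows. This is a minimal sketch assuming LabelImg's Pascal VOC output, where each XML file describes one image and each `object` element holds one labeled bounding box:

```python
# Sketch: flatten one LabelImg (Pascal VOC) XML annotation into CSV-ready rows
# of (filename, width, height, class, xmin, ymin, xmax, ymax).
import xml.etree.ElementTree as ET

def xml_to_rows(xml_text):
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append((filename, width, height, obj.findtext("name"),
                     int(box.findtext("xmin")), int(box.findtext("ymin")),
                     int(box.findtext("xmax")), int(box.findtext("ymax"))))
    return rows

# Usage: loop over the annotation folder and write all rows with csv.writer.
```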
Training the model for a long time to get better accuracy.
Instead of running the model for a long time we could use other techniques, but with this model my images were detected with about 90% accuracy on average.
Solutions:
Instead of training the model for a longer period, we can use Faster R-CNN for detecting objects in the images. We can also measure the accuracy using other techniques.
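One such accuracy measure is the IoU metric mentioned in the methodology. A minimal sketch, with boxes given as (xmin, ymin, xmax, ymax) pixel coordinates:

```python
# Sketch: Intersection over Union between a predicted box and a ground-truth box.
# Returns a value in [0, 1]; 1.0 means the boxes coincide exactly.

def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A common convention is to count a detection as correct when its IoU with the ground-truth box exceeds a threshold such as 0.5.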
Conclusion:
Using this model we can detect the labels in the images. We can specify many labels at a time and detect all of them, and we can get a confidence score for each label's bounding box. I used the model to find the invoice number in the images. In future, we can extract all the data inside the bounding boxes and also use multiple labels in a single image. We can get the detection boxes, classes, and masks. My model detected the bounding boxes with about 90% accuracy on average. All the test images are converted into NumPy arrays, and in the output we can visualize the detection results. We could also measure accuracy with other methods such as IoU, but since I obtained a detection score for the label in every image, I did not go further.
References:
http://www.cs.cmu.edu/~aharley/icdar15/harley_convnet_icdar15.pdf
http://www.cvisiontech.com/library/document-automation/forms-processing/extract-data-from-invoices.html
https://nanonets.com/blog/invoice-ocr/
https://github.com/tzutalin/labelImg
https://medium.com/analytics-vidhya/training-an-object-detection-model-with-tensorflow-api-using-google-colab-4f9a688d5e8b
https://www.tensorflow.org/lite/models/object_detection/overview
https://towardsdatascience.com/object-detection-tensorflow-854c7eb65fa
https://medium.com/swlh/tensorflow-2-object-detection-api-with-google-colab-b2af171e81cc