Research on Unsupervised Image Processing of VAT Invoices

Posted on: 2016-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Xie
Full Text: PDF
GTID: 2308330476952914
Subject: Software engineering
Abstract/Summary:
The Chinese domestic VAT invoice is an important accounting and billing document as well as a corporate tax certificate, and it is used widely in dealings among enterprises. Its format is strictly controlled by the State Administration of Taxation. The Financial Sharing Centers of large and medium-sized enterprises handle a large number of VAT invoices every day, but these invoices are often processed manually and inefficiently. Such enterprises need automated, unsupervised processing systems for VAT invoices to reduce costs and improve their financial management capability. Some projects of this kind have been built or proposed. Ongoing internal enterprise ERP programs provide a good infrastructure for them, and image processing technologies such as OCR are approaching commercial feasibility; with some extra effort, automated VAT invoice image processing can become a reality.

This thesis focuses on the technical aspects of realizing unsupervised, automated VAT invoice image processing. New techniques are introduced, and some existing ones are improved. A successful automation case in live operation at an enterprise is demonstrated, and the architecture that combines these techniques is described.

The automated invoice processing system discussed in this thesis obtains high-quality color images of invoices and their attachments in batches from high-speed scanners. After each image is segmented into multiple color components, thresholding algorithms are applied to each component, and also to the original image, to obtain binary images. Tabular and grid structures are retrieved from the binary images, and these structures are then used to determine the invoice type.
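As an illustration of the per-component thresholding step described above, the following is a minimal Python sketch, not the implementation used in the thesis. It assumes that the color components are simply the R, G and B channels, that a plain channel average can stand in for the original image, and that Otsu's global threshold is an acceptable placeholder for whichever thresholding algorithm is actually chosen. The names `otsu_threshold`, `binarize` and `binarize_components` are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B), each 0..255
ColorImage = List[List[Pixel]]        # rows of pixels
Plane = List[List[int]]               # rows of single values


def otsu_threshold(values: List[int]) -> int:
    """Return the threshold t (0..255) maximizing between-class variance.

    Pixels with value <= t form the dark class.
    """
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_dark, sum_dark = 0, 0
    for t in range(256):
        count_dark += hist[t]
        sum_dark += t * hist[t]
        count_light = total - count_dark
        if count_dark == 0:
            continue          # no dark pixels yet, nothing to separate
        if count_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        var_between = count_dark * count_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(plane: Plane) -> Plane:
    """Mark dark pixels (ink) as 1 and light pixels (paper) as 0."""
    t = otsu_threshold([v for row in plane for v in row])
    return [[1 if v <= t else 0 for v in row] for row in plane]


def binarize_components(image: ColorImage) -> Dict[str, Plane]:
    """Binarize each assumed color component plus a gray version of the image."""
    planes = {
        "r": [[p[0] for p in row] for row in image],
        "g": [[p[1] for p in row] for row in image],
        "b": [[p[2] for p in row] for row in image],
        # Assumption: a plain channel average stands in for the original image.
        "gray": [[sum(p) // 3 for p in row] for row in image],
    }
    return {name: binarize(plane) for name, plane in planes.items()}
```

Keeping one binary image per component means that later stages, such as grid retrieval, can pick the component in which a given feature is most visible.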
The invoice type determines the strategy for cropping sub-images from the binary images after they have been deskewed, and these sub-images are sent to the OCR engines. The OCR engines output the recognition results for printed text. Multiple heterogeneous OCR engines are used so that several independent recognition results are available for each text image. The system verifies the results by checking the consistency of the different copies or by checking them against business data from ERP systems. If the result for an invoice is identified as correct, it is exported to the ERP systems to trigger business flows. Otherwise, the system warns staff of the error and waits for manual editing and confirmation.

First, some basics of image processing are introduced. For printed character recognition, multiple heterogeneous OCR engines are proposed as a hybrid solution. This hybrid solution also integrates business data imported from other ERP systems, and it is capable of distinguishing recognition faults from correct results.

Next, a comparative experimental study of thresholding methods applied to invoice images is conducted. The study compares clustering methods and local adaptive methods. As a result, the Sauvola method is improved by aligning it with the Kittler method to avoid black backgrounds, and this improved method is recommended for invoice image thresholding.

A color registration and segmentation method is developed according to the color characteristics of invoices. The detection method for grid line segments based on the “directed simple component” is then improved. Based on this improved detection method, the tabular grids of invoices are retrieved.
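The verification idea described above, accepting a field when the OCR engines agree or when a reading matches ERP business data and otherwise asking for manual review, can be sketched as follows. This is a hedged illustration only: the function name `verify_field`, the status strings, and the exact decision order are assumptions, not details given in the thesis.

```python
from typing import Dict, Optional, Tuple


def verify_field(
    readings: Dict[str, str],
    erp_value: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Decide whether one recognized field can be trusted.

    readings maps a (hypothetical) engine name to the text it recognized.
    erp_value is the expected value from ERP business data, if available.
    Returns (accepted_value_or_None, status).
    """
    cleaned = {engine: text.strip() for engine, text in readings.items()}
    distinct = set(cleaned.values())

    # Case 1: at least two engines produced a reading and all of them agree.
    if len(cleaned) >= 2 and len(distinct) == 1:
        return distinct.pop(), "consistent"

    # Case 2: engines disagree, but one reading matches the ERP data.
    if erp_value is not None and erp_value.strip() in distinct:
        return erp_value.strip(), "matched_erp"

    # Case 3: nothing confirms the reading; leave it for manual editing.
    return None, "needs_review"
```

For example, `verify_field({"engine_a": "1100.00", "engine_b": "1l00.00"}, erp_value="1100.00")` accepts the value through the ERP check, while the same call without `erp_value` returns `(None, "needs_review")`.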
The retrieved grids provide important structural information for invoice type aliasing, more sophisticated image processing, and other steps. To assign the correct invoice type alias to images whose grids have been retrieved, the thesis attempts two approaches: one modeled on the proximity of blob distances, and the other on the similarity of directed graphs. The Scott algorithm and the Blondel algorithm are both confirmed to perform well on invoice images. A method for filtering layout points is proposed. A similarity estimation for grids is developed, which provides the mechanism for identifying an image's type alias.

In addition, to deskew images accurately before OCR, this thesis proposes two methods for document images: one based on the grid structure and the other based on text paragraphs.

Finally, a variety of software development techniques and image processing methods are applied and integrated into a project, which demonstrates a well-running system in practice.
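To make the grid-based deskew idea mentioned above more concrete, the sketch below estimates a skew angle from detected grid line segments. It is an assumption-laden illustration, not the method proposed in the thesis: it assumes the segments are already available as endpoint pairs, that the skew is smaller than 45 degrees, and that the median of the folded segment angles is a reasonable estimator. The name `estimate_skew_degrees` is hypothetical.

```python
import math
from statistics import median
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def estimate_skew_degrees(segments: List[Segment]) -> float:
    """Estimate page skew as the median folded angle of grid line segments.

    The angle is expressed in the coordinate system of the input points.
    """
    angles = []
    for (x1, y1), (x2, y2) in segments:
        if (x1, y1) == (x2, y2):
            continue  # a zero-length segment has no direction
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        # Fold into [-45, 45) so horizontal and vertical grid lines,
        # in either drawing direction, vote for the same skew.
        angles.append((angle + 45.0) % 90.0 - 45.0)
    if not angles:
        raise ValueError("no usable segments")
    return median(angles)
```

Rotating the image by the negative of the returned angle would then bring the grid lines back to the axes; the median is used here so that a few badly detected segments do not dominate the estimate.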
Keywords/Search Tags: Document image processing, Tabular structure matching, Image recognition