Research On Detection, Recognition And Semantic Extraction Of Forms Attached To Customs Documents

Posted on: 2019-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Wu
Full Text: PDF
GTID: 2428330545977788
Subject: Computer Science and Technology
Abstract/Summary:
With the growing prosperity of China's foreign trade, the customs authorities face many challenges in handling customs declarations, such as responding quickly and controlling tariff risks. The documents attached to a declaration are one of the important sources of data for customs inspection and risk control. At present, the Customs Department mainly reviews these accompanying documents manually, which is inefficient and leaves product information insufficiently correlated. It is therefore necessary to develop an efficient automated system for identifying and handling accompanying documents, in order to reduce inspection costs and lay the foundation for automated customs clearance.

Based on a project undertaken by the Customs Department, this paper systematically studies form identification and semantic extraction for the documents accompanying customs declarations. Because accompanying documents are mainly presented as scanned form images in a variety of layouts, the work consists of four parts: preprocessing, detection of the table area in the image, semantic analysis of the accompanying document tables, and Chinese character recognition, as follows.

1. To address problems such as stamp interference and changes in scanning direction in the accompanying documents, preprocessing steps are proposed, including stamp removal, tilt correction based on straight-line detection, and adjustment of image sharpness, in order to improve the quality of the document images.

2. To handle the variety of table styles and regions in the attached document images, a table area detection method based on text line similarity matching is proposed. The motivation is that, however the table is laid out, the table area must contain a number of product entries, and each product entry corresponds to a text line in the table. Inspired by this, the paper first determines the text line corresponding to the first product entry based on geometric features extracted from the accompanying documents. Next, that line is used as a text line model: its similarity to the other text lines is calculated, and all similar text lines are found. Since the interval between adjacent text lines in the table area is relatively fixed, the text lines belonging to the table area are further filtered according to this geometric characteristic, and the text lines that pass the filter constitute the table area to be analyzed in the accompanying documents.

3. The documents accompanying a customs declaration contain key information such as the country of origin, buyer, seller, and transaction method. This key information usually appears as key-value pairs of "attribute" and "attribute value". However, these key-value pairs are often irregularly distributed over the entire document image, the structure of the "attributes" and "attribute values" within a pair is not necessarily regular, and nested key-value pairs may even appear. To address the difficulty of representing the semantics of accompanying document forms, this paper proposes an information extraction method based on an interpreted description language. First, the "attribute" and "attribute value" areas of this information are marked through human interaction and then described in the description language; the description is checked against the compiler's lexical and grammar rules; finally, the parsing program parses the description to obtain the correspondence between "attributes" and "attribute values". The grammar rules are designed according to the characteristics of the information in the document image and are written in Backus-Naur Form (BNF); the compiler is built with the syntax analysis program Yacc and the lexical analysis program Lex.

4. Because the open-source character recognition tool Tesseract OCR is not accurate in recognizing Chinese characters, this paper collected 50 fonts and generated 700,000 training samples through affine transformation, background transformation, and similar operations, and then trained a convolutional neural network model on them. Chinese characters are recognized using this convolutional neural network model.

Based on the above techniques, this paper also implemented a prototype system for identifying and processing documents attached to customs declarations. Experimental results on datasets of actual customs-attached documents show that the proposed method can effectively identify important information in accompanying document images. Combined with other customs declaration information, these results can support forecasting and control of China's customs tariff risks.
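
The table area detection idea in item 2 (match text lines against the first product line, then filter by line spacing) can be illustrated with a minimal Python sketch. The abstract does not specify the similarity measure or the thresholds, so everything here is an assumption made for illustration: the `TextLine` representation, the column-alignment similarity in `layout_similarity`, the parameters `tol`, `sim_thresh` and `gap_tol`, and the choice to stop at the first irregular gap. It is not the thesis's implementation.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class TextLine:
    """Hypothetical text line: vertical centre plus left edges of its word boxes."""
    y: float
    col_x: List[float]


def layout_similarity(template: TextLine, other: TextLine, tol: float = 10.0) -> float:
    """Fraction of template columns with a word box within `tol` pixels in `other`.

    This column-alignment measure is an assumed stand-in for the similarity
    used in the thesis, which the abstract does not define.
    """
    if not template.col_x:
        return 0.0
    hits = sum(
        any(abs(x - ox) <= tol for ox in other.col_x) for x in template.col_x
    )
    return hits / len(template.col_x)


def detect_table_lines(
    lines: List[TextLine],
    template_idx: int,
    sim_thresh: float = 0.75,
    gap_tol: float = 0.5,
) -> List[TextLine]:
    """Return the lines judged to belong to the table area.

    `template_idx` points at the first product line, assumed already found.
    """
    template = lines[template_idx]

    # Step 1: keep lines at or below the template whose layout resembles it.
    candidates = sorted(
        (
            ln
            for ln in lines
            if ln.y >= template.y
            and layout_similarity(template, ln) >= sim_thresh
        ),
        key=lambda ln: ln.y,
    )
    if len(candidates) < 3:
        return candidates  # too few lines to estimate a typical spacing

    # Step 2: table rows are roughly evenly spaced, so compare each gap
    # with the median gap and stop at the first irregular one.
    gaps = [b.y - a.y for a, b in zip(candidates, candidates[1:])]
    typical = median(gaps)
    table = [candidates[0]]
    for ln in candidates[1:]:
        if abs((ln.y - table[-1].y) - typical) <= gap_tol * typical:
            table.append(ln)
        else:
            break
    return table
```

In this sketch a footer line that happens to share the product-line layout is still rejected, because its distance from the last table row is far from the median row spacing.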
Keywords/Search Tags: Customs documents, Recognition, Tilt correction, Table area extraction, Convolutional neural network