Design And Implementation Of Image Form Data Recognition System Based On OCR Technology

Posted on: 2021-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: S S Zhao
GTID: 2518306557489624
Subject: Software engineering
Abstract/Summary:
A large share of information resources still exists as paper documents. With the continuous development of information technology, using computers to digitize, store, and manage these documents is an inevitable trend. The form is the most common type of document. In practice, many fields, such as banks, post offices, and public security, procuratorial, and judicial institutions, produce large amounts of form data, and entering that data into database systems requires considerable human effort. Applying OCR (optical character recognition) to form data has therefore become an important step in office automation. Form recognition differs from general document recognition: forms come in many layouts, and it is difficult to find a single general method that recognizes any form. As a result, character recognition accuracy for tables is far lower than for plain documents. This thesis takes the file catalogue form produced by public procuratorial institutions as its research object and uses OCR technology to file these records electronically. The main work of this thesis is as follows:

(1) For the character segmentation module, this thesis segments handwritten digit strings by combining connected domain analysis with the drop-fall (dripping) algorithm. First, the string is divided into multiple blocks by connected domain analysis. Then, touching digits are separated by the drop-fall algorithm. The traditional drop-fall algorithm is improved so that touching digits in handwritten digit strings are cut effectively.

(2) According to the characteristics of the form, this thesis adopts two character recognition methods: Tesseract-OCR for the Chinese characters in the form and a CNN (convolutional neural network) for the handwritten digits. The two methods are optimized by training the character library and by improving the network model, respectively. Experimental comparison shows that the optimized recognition methods achieve a higher recognition rate.

(3) Design and implementation of the form data recognition system. The system is divided into an input module, an information extraction module, an information recognition module, and an output module. An image of the file catalogue form, obtained by scanning or photographing, is taken as input. It then passes through image preprocessing, table line detection and cell extraction, segmentation of the handwritten digit strings in the table, data recognition and editing, and confirmation of the recognition results. The system is tested in two respects, function and performance. The test results show that the proposed method is correct and feasible and that the system has practical value.
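The first segmentation step in (1), splitting a digit string into blocks by connected domain analysis, can be sketched as follows. This is a minimal illustration and not the thesis's implementation: it assumes the image is already binarized into rows of 0/1 values, uses 8-connectivity, and the function names `connected_blocks` and `flag_touching`, together with the width-to-height threshold `max_aspect`, are hypothetical. The drop-fall cut itself is not shown. The heuristic only marks blocks that may contain touching digits and would be passed on to such a cut.

```python
from collections import deque


def connected_blocks(image):
    """Find 8-connected foreground components in a binary image.

    `image` is a list of rows containing 0 (background) or 1 (ink).
    Returns bounding boxes (left, top, right, bottom), sorted left to right.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


def flag_touching(boxes, max_aspect=1.0):
    """Assumed heuristic: a block wider than max_aspect * height may hold
    several touching digits and is a candidate for a drop-fall cut."""
    flagged = []
    for left, top, right, bottom in boxes:
        block_width = right - left + 1
        block_height = bottom - top + 1
        flagged.append(((left, top, right, bottom),
                        block_width > max_aspect * block_height))
    return flagged
```

Blocks that are not flagged can be sent directly to the digit classifier, while flagged blocks would first be split further. A real pipeline would need to tune the threshold to the handwriting and scan resolution.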
Keywords/Search Tags: Form recognition, Information extraction, Character segmentation, Tesseract-OCR, CNN