
Study On Preprocessing And Text Extraction Algorithms For Complex Form Documents

Posted on: 2018-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: J Pan
GTID: 2348330512479379
Subject: Electronic Science and Technology
Abstract/Summary:
Document analysis and understanding has attracted considerable attention because of its significance for content analysis and recognition, content-based retrieval, and related fields. Obtaining information automatically from document images can greatly improve the efficiency of information processing and has important application value. Complex documents containing forms appear in all aspects of daily life, and the automatic extraction and recognition of their text has broad application prospects. Taking medical records and express images as the research objects, this thesis studies preprocessing, table detection, and text extraction for complex document images. The main work is as follows:

(1) Table area location and correction. A region-location algorithm based on the intersections of straight lines is presented. It separates the table area from the original image to improve the accuracy and efficiency of subsequent processing. A perspective transformation is then used to correct the table area and overcome the influence of image distortion on later processing. The experimental results show that the method is an effective way to locate and correct the table area in complex images.

(2) Table detection. First, a local adaptive binarization algorithm based on the edge image is improved to give better binarization of table images. Next, the Block Adjacency Graph (BAG) algorithm is improved to make it more effective. Finally, missing line segments are filled in by analyzing the line structure and the table features. Experiments show that the algorithm carries out table detection within the table area efficiently.

(3) Text extraction. After table detection is complete, an improved adjacency vector connection algorithm is used to repair characters with broken strokes, which preserves the integrity of the text information. Text paragraphs are then located according to the table lines. Finally, text line segmentation is achieved by analyzing the characteristics of connected components. Experiments show that this method can accomplish text extraction from table documents.

To test the effectiveness of the algorithms, experiments were run on 300 express documents and 40 medical records. The experimental results show the effectiveness of the proposed method.
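
As an illustration of step (1), the following is a minimal Python sketch of how table corners might be obtained from straight-line intersections and then mapped to an upright rectangle with a perspective transformation. It is not the thesis's implementation: the abstract does not say how the border lines are detected or parameterized, so the sketch assumes four border lines are already available in the form a*x + b*y = c, and the function names (intersect, solve, homography, warp_point) are hypothetical.

```python
def intersect(l1, l2):
    """Intersection of two lines given as (a, b, c), meaning a*x + b*y = c."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        return None  # parallel (or identical) lines have no single intersection
    # Cramer's rule for the 2x2 system
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def homography(src, dst):
    """Perspective transform (9 values, row-major, last fixed to 1) from
    four source points to four destination points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve(A, b) + [1.0]


def warp_point(h, x, y):
    """Apply the perspective transform h to a single point."""
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Assumed input: four border lines of a distorted table (made-up values).
top, bottom = (1, -19, -370), (2, 41, 5750)
left, right = (24, 1, 260), (12, -1, 2370)

# Corners in the order top-left, top-right, bottom-right, bottom-left.
corners = [intersect(top, left), intersect(top, right),
           intersect(bottom, right), intersect(bottom, left)]

# Map the distorted quadrilateral onto an upright 200 x 120 rectangle.
target = [(0, 0), (200, 0), (200, 120), (0, 120)]
H = homography(corners, target)
```

In practice the border lines would come from a line detector run on the binarized image, and the transform would be applied to every pixel rather than to single points. A library routine such as OpenCV's getPerspectiveTransform together with warpPerspective covers the same step.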
Keywords/Search Tags: Table Recognition, Binarization, Skew Correction, Character Extraction, Text Segmentation