Study On Preprocessing And Text Extraction Algorithms For Complex Form Documents

Posted on:2018-12-06

Degree:Master

Type:Thesis

Country:China

Candidate:J Pan

GTID:2348330512479379

Subject:Electronic Science and Technology

Abstract/Summary:

Document analysis and understanding have attracted wide attention because of their significance for content analysis and recognition, content-based retrieval, and related fields. Automatically obtaining information from document images can greatly improve the efficiency of information processing and has important practical value. Complex documents containing forms appear in many areas of daily life, and the automatic extraction and recognition of their text has broad application prospects. Taking images of medical records and express delivery documents as the research objects, this thesis studies image preprocessing, table detection, and text extraction for complex document images. The main work is as follows:

(1) Table area location and correction. A region-locating algorithm based on the intersections of straight lines is given; it separates the table area from the original image to improve the accuracy and efficiency of subsequent processing. A perspective transformation is then used to correct the table area and overcome the influence of image distortion on later processing. The experimental results show that the method is effective for locating and correcting the table area in complex images.

(2) Table detection. First, the local adaptive binarization algorithm based on the edge image is improved to obtain better binarization of the table image. Next, the Block Adjacency Graph (BAG) is improved to enhance the validity of the algorithm. Finally, missing line segments are supplemented by analyzing the line structure and the table features. Experiments show that the algorithm carries out table detection on the table area efficiently.

(3) Text extraction. After table detection, an improved adjacency vector connection algorithm is used to repair characters with broken strokes, ensuring the integrity of the text information. Text paragraphs are then located according to the table lines. Finally, text lines are segmented by analyzing the characteristics of connected components. Experiments show that this method can accomplish text extraction from table documents.

To test the effectiveness of the algorithms, experiments were run on 300 express delivery documents and 40 medical records. The experimental results show that the proposed method is effective.
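
As an illustration of the perspective transformation mentioned in step (1), the following is a minimal sketch and not the thesis's implementation. It assumes the four corners of the table area have already been found (for example from line intersections) and estimates the 3x3 perspective matrix that maps them onto an upright rectangle. The function names `solve_linear`, `homography`, and `warp_point`, and the sample corner coordinates, are hypothetical and chosen only for this example.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] without modifying the inputs.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 3x3 perspective matrix mapping four src points to four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), rearranged to be linear in h
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v = (h3*x + h4*y + h5) / (h6*x + h7*y + 1)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, x, y):
    """Apply the perspective matrix h to a single point (x, y)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical corners of a distorted table area (clockwise from top-left)
# and the upright 200x150 rectangle they should be mapped onto.
src = [(12, 20), (205, 8), (220, 160), (5, 150)]
dst = [(0, 0), (200, 0), (200, 150), (0, 150)]
H = homography(src, dst)
```

Correcting a whole image would apply the inverse mapping to every output pixel and sample the source image there; in practice an image library's warp routine would normally be used instead of this hand-written solver.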

Keywords/Search Tags:

Table Recognition, Binarization, Skew Correction, Character Extraction, Text Segmentation

Related items

1	Video Text Extraction Technology Research And Application
2	Research On Pre-processing And Character Extraction Of Form Document Recognition
3	Chinese Forum Punctuation Extraction And Recognition
4	Complex Layout Analysis And Digital Recognition In Medical Record
5	The Study Of License Plate Image Segmentation And Recognition Algorithm
6	The Research And Achievement Of Character Segmentation In Vehicle Plate Recognition System
7	Research On Text Detection And Recognition In Complex Natural Scene Image
8	Research On The Algorithm And Realization Of Bill Character Recognition System
9	Research On Spray Code Character Recognition Technique Of Plate
10	Research On Embed Text Extraction From Still Images