
Template Recognition And Extraction Of Complex Table Document Images

Posted on: 2020-06-06  Degree: Master  Type: Thesis
Country: China  Candidate: J M Yang  Full Text: PDF
GTID: 2428330572473665  Subject: Information and Communication Engineering
Abstract/Summary:
With the development of Internet information technology, more and more institutions have built information systems to make their business processes paperless. For collaborative business between institutions, however, confidentiality requirements and other constraints make a shared inter-institutional information system difficult to build, so such business still relies mainly on paper table documents as its carrier. After receiving business tables, an institution must enter the table information into its own information system. This entry work has been done manually, but with the continuous growth in business volume in recent years, manual entry can no longer meet the timeliness requirements, and automatic entry of paper table documents has become an important task. Automatic entry consists mainly of text recognition and layout extraction. Text recognition technology is mature, so the focus is on extracting the table layout. By layout, tables can be divided into framed and frameless tables. A table image is obtained by scanning the paper document, and the purpose of this thesis is to extract the table layout from that image. To do so, the thesis defines a table template and extracts it in order to identify the structure and content of the table automatically. Template extraction for framed table images has three steps: detecting table frame lines, restoring the table structure, and extracting title fields. The extracted template can be used to classify framed table images. Template extraction for frameless table images also has three steps: extracting table text blocks, labeling the training corpus, and training a word segmentation model. The extracted template can be used to verify the recognition results for frameless table images and to correct text block division errors.

This thesis designs and implements a template recognition and extraction system for complex table document images. First, it presents the research background and significance and gives the research content, main work, and chapter arrangement. Second, it surveys related technologies for table recognition and image similarity analysis. Then the system is analyzed and its overall design is carried out: by function, the system is divided into a template extraction and management subsystem and a table recognition and classification subsystem. A framework diagram is given for each subsystem, each is partitioned into modules, and both are then designed and implemented in detail. The algorithms for detecting table frame lines and restoring the table structure are improved, and a progressive projection method and an alignment feature search method are proposed for restoring table rows and columns from spatial position information. Finally, functional tests and results of the two subsystems are presented to verify that the overall system meets the design principles and achieves the expected results.
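The abstract names a projection method for restoring table rows and columns but gives no details of it. As a rough illustration of the general idea only, the sketch below applies a plain projection profile to a small binary image: foreground pixels are summed per row and per column, and runs of non-empty entries are taken as candidate row or column bands. The function names, the `min_count` threshold, and the toy image are assumptions made for this example; this is not the thesis's progressive projection method.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = foreground (ink), 0 = background


def horizontal_projection(image: Image) -> List[int]:
    """Count foreground pixels in each row."""
    return [sum(row) for row in image]


def vertical_projection(image: Image) -> List[int]:
    """Count foreground pixels in each column."""
    return [sum(col) for col in zip(*image)]


def find_bands(profile: List[int], min_count: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each run of
    consecutive profile entries that are >= min_count."""
    bands: List[Tuple[int, int]] = []
    start = None
    for i, count in enumerate(profile):
        if count >= min_count and start is None:
            start = i                      # a new band begins
        elif count < min_count and start is not None:
            bands.append((start, i))       # the current band ends
            start = None
    if start is not None:                  # band runs to the last entry
        bands.append((start, len(profile)))
    return bands


# Toy image (hypothetical data): two text rows and two text columns.
toy = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

row_bands = find_bands(horizontal_projection(toy))
col_bands = find_bands(vertical_projection(toy))
print(row_bands, col_bands)
```

On a real scan the image would first need binarization and deskewing, and `min_count` would have to be tuned so that noise does not merge or split bands; those steps are outside this sketch.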
Keywords/Search Tags: table recognition, template extraction, table frame line detection, perceptual hashing