
Research on Table Recognition Technology of Printed Documents

Posted on: 2019-10-07; Degree: Master; Type: Thesis
Country: China; Candidate: Y Zhang; Full Text: PDF
GTID: 2428330545457427; Subject: Control engineering
Abstract/Summary:
The recognition of printed document images such as books and periodicals is an important branch of pattern recognition. Character recognition in document images is relatively mature, and many commercial products are already on the market, but table recognition is not yet good enough for practical applications. In particular, commercial products are not robust to distorted table lines, so a robust table recognition method is worth studying. Based on an analysis of the state of document table recognition research and products in China and abroad, this thesis focuses on document image preprocessing, the recognition of closed square tables, and the recognition of printed Chinese characters. The main tasks are as follows.

Preprocessing of the document image is studied. The document image is binarized with the Sauvola algorithm. Because the document image contains table lines, the Hough transform is used to correct its skew. A bottom-up layout analysis method divides the document layout into regions such as text, tables, and drawings.

In this thesis, a directed single connected chain method is used to extract the table lines. Compared with the traditional method, the determination of the start and end points of each connected chain is revised, which improves the accuracy of connected chain extraction. Each single connected chain is fitted by the least squares method and abstracted into a table line segment. From the extracted table line segments, the feature point set of the table is obtained according to the relationships between the horizontal and vertical line segments. From the connections between the feature points and their coordinate relationships, the set of table cells is obtained, and the table is parsed into a description in the LaTeX typesetting system.

A correct description of the table information also requires recognition of the Chinese characters in each table cell. In this thesis, the text is first segmented after dilation (expansion) of the text, and is then segmented by vertical projection. A convolutional neural network is designed and trained on a printed Chinese character data set to obtain a Chinese character classifier.

Based on Qt and OpenCV, a software system for recognizing tables in printed documents is designed. Experiments show that the system recognizes high-quality tables very well. For distorted and blurred low-quality tables, the recognition rate reaches 74%, which is higher than that of some existing OCR software.
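
Two of the steps named in the abstract, least squares fitting of a connected chain to a line and segmentation by vertical projection, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the nested-list image format (1 for ink, 0 for background), and the use of plain Python instead of OpenCV are all assumptions made for illustration.

```python
from typing import List, Tuple


def fit_line_least_squares(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit y = a*x + b to the pixels of a near-horizontal chain (hypothetical helper)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        # All x are equal: the chain is vertical, so fit x as a function of y instead.
        raise ValueError("points are vertical; swap x and y before fitting")
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def split_by_vertical_projection(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, that contain ink (hypothetical helper)."""
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: number of ink pixels in each column.
    projection = [sum(row[c] for row in image) for c in range(width)]
    spans = []
    start = None
    for c, count in enumerate(projection):
        if count > 0 and start is None:
            start = c                      # a run of inked columns begins
        elif count == 0 and start is not None:
            spans.append((start, c))       # an empty column ends the run
            start = None
    if start is not None:
        spans.append((start, width))       # run reaches the right edge
    return spans


if __name__ == "__main__":
    chain = [(0, 1), (1, 3), (2, 5), (3, 7)]
    print(fit_line_least_squares(chain))
    text_line = [
        [0, 1, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 1, 1],
    ]
    print(split_by_vertical_projection(text_line))
```

In this sketch an empty column is treated as a gap between characters. Real scans of Chinese text often need a noise threshold and a rule for merging the parts of left-right structured characters, which the sketch does not attempt.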
Keywords/Search Tags: Table recognition, Layout analysis, Directed single connected chain, Chinese character recognition