Research On Table Extracting For Document Image

Posted on: 2021-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Zhang
Full Text: PDF
GTID: 2518306119969779
Subject: Control Engineering
Abstract/Summary:
As an important structured page element, tables appear more and more frequently in documents because they express information intuitively and concisely. By extracting and parsing the tables in a document image, readers can more accurately grasp the information the document contains and the relationships between the text fields in each table, so research on table extraction algorithms is particularly important. However, document types and table styles vary widely, and documents contain many interference items that hinder table recognition. As a result, table extraction algorithms struggle to extract the table structure accurately, and their robustness to interference items is poor. To address these problems, this thesis studies table recognition from two angles: robustness to interference items and the accuracy of the algorithm itself. The main contributions are as follows:

(1) Three strategies are proposed to deal with interference when extracting the lines of the table structure. First, for skewed document images, the skew angle is detected and the image is corrected by combining the features of the table structure lines with affine and perspective transforms. Second, to eliminate the interference that seals cause in table extraction, a seal removal algorithm based on the RGB color model is proposed; it removes red and blue seals from the document image by setting a gray-value range for each of the R, G, and B channel components and establishing constraint relationships among the channel components. Third, to prevent stained areas from interfering with table extraction, a stain removal algorithm based on morphological processing is proposed; it uses the difference between stains and valid structural features within a region to eliminate the interference of stains.

(2) For the problem that some short line segments need to be spliced into the real table lines, a line splicing algorithm based on the probability of belonging to the same line is proposed, and the concept of the probability of belonging to the same line (PBSL) is introduced for the first time. First, the Hough transform is used to extract the lines of the table structure. Then, the PBSL value and a reference value for the distance between two line segments are used to decide whether the segments can be spliced. Finally, duplicate lines are deleted to obtain the horizontal and vertical lines of the table structure.

(3) For table reconstruction, an interference line detection algorithm based on structural analysis is proposed; it classifies and detects interference lines so that they can be eliminated from the table structure more comprehensively. Based on the structural characteristics of the table, missing line segments are filled in as far as possible to obtain the complete table.

Experimental results on the dataset show that the proposed table extraction algorithm improves the accuracy of horizontal and vertical line detection by 19.16% and 13.14% respectively compared with LSD, and by 24.50% and 20.17% respectively compared with FLD. The accuracy of table extraction on clear images in this dataset is 96.81%; on images with distortion interference and seal interference, it is 92.94% and 87.58% respectively. These results show that the proposed algorithm has a stronger ability to detect table lines in document images and achieves a significant improvement in both robustness to interference and accuracy of table extraction.
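
The seal removal step in item (1) is described only at a high level, so the following is a minimal sketch of the general idea rather than the thesis's actual algorithm. It assumes an image held as rows of (R, G, B) tuples, and the names `MIN_DOMINANT`, `MIN_MARGIN`, `is_red_seal`, `is_blue_seal`, and `remove_seals` as well as the threshold values are hypothetical, chosen purely for illustration. A pixel is treated as seal ink when one channel lies in a bright range (the per-channel range) and exceeds the other two channels by a margin (the constraint among channels).

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each component in 0..255
Image = List[List[Pixel]]     # rows of pixels

WHITE: Pixel = (255, 255, 255)

# Hypothetical thresholds for illustration; the thesis does not state its values.
MIN_DOMINANT = 120  # the dominant channel must be at least this bright
MIN_MARGIN = 50     # and must exceed each other channel by at least this much


def is_red_seal(p: Pixel) -> bool:
    """Red ink: R in range, and R clearly larger than G and B."""
    r, g, b = p
    return r >= MIN_DOMINANT and r - g >= MIN_MARGIN and r - b >= MIN_MARGIN


def is_blue_seal(p: Pixel) -> bool:
    """Blue ink: B in range, and B clearly larger than R and G."""
    r, g, b = p
    return b >= MIN_DOMINANT and b - r >= MIN_MARGIN and b - g >= MIN_MARGIN


def remove_seals(image: Image) -> Image:
    """Return a copy of the image with red and blue seal pixels painted white."""
    return [
        [WHITE if is_red_seal(p) or is_blue_seal(p) else p for p in row]
        for row in image
    ]
```

Near-gray pixels such as black text or table lines have roughly equal channel values, so they fail the margin test and are left untouched; this is why a constraint among channels is used in addition to per-channel ranges. A practical version would likely tune the thresholds per dataset and operate on array data instead of nested lists.
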
Keywords/Search Tags: Skew Correction, Probability of Belonging to the Same Line (PBSL), Interference Line Detection, Table Recognition