| In recent years, the rapid development of deep learning has made office automation and scientific research automation increasingly common, making everyday work more efficient and intelligent. Tables are widely used because they present data concisely and in a structured form, and they appear in many kinds of documents to show how data change. As the number of documents grows, locating tables and extracting their content and structure quickly and efficiently has become an important problem. Earlier work on table processing in PDF documents relied mainly on the PDF stream and on image processing, both of which have limitations. More and more researchers are now working on table recognition as a subproblem of document analysis, and many effective deep learning algorithms have emerged for this task. Table recognition is mainly divided into table localization and table content extraction. With sufficient training, deep learning methods achieve good results, and this thesis aims to further improve their extraction accuracy. The main contributions of this thesis are as follows: (1) This thesis combines the structural information specific to PDF with the inherent structure of tables in paper documents, uses an image algorithm to determine the table position, and compares the result with the deep learning method layoutparser. For table content extraction, it first tries an image algorithm combined with OCR. The experiments show that the image algorithm has limitations: it has difficulty recognizing complex tables, and the choice of threshold directly affects the recognition results. (2) In the text detection stage, a DBNet method based on DOConv is adopted. DBNet is improved by combining it with the over-parameterized convolution DOConv, and a ResNeSt network is used as the backbone to improve detection accuracy and prediction speed. (3) The RARE model is used to convert the table structure into HTML, the table is reconstructed by parsing the HTML, and the result is combined with the character recognition output and written to Excel. Finally, following the idea of scientific research automation, the table content is associated with the body text, and the target reference is found through regular-expression matching and Levenshtein distance. |
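
The final step, linking table content to the body text through regular-expression matching and Levenshtein distance, can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the pattern `TABLE_REF` and the helpers `find_table_mentions` and `closest_term` are hypothetical names introduced only for illustration, and the sketch assumes that references in the text take the form "Table N".

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Assumed reference format: "Table 3", "table 12", ...
TABLE_REF = re.compile(r"\bTable\s+(\d+)\b", re.IGNORECASE)


def find_table_mentions(sentences):
    """Return (table_number, sentence) pairs for every explicit mention."""
    return [(int(m.group(1)), s)
            for s in sentences
            for m in TABLE_REF.finditer(s)]


def closest_term(query, candidates):
    """Pick the candidate with the smallest case-insensitive edit distance."""
    return min(candidates, key=lambda c: levenshtein(query.lower(), c.lower()))
```

In this sketch, the regular expression handles explicit references, while the edit distance tolerates small OCR errors when a recognized cell value or header is compared with terms in the text, for example `closest_term("Precison", ["Recall", "Precision", "F1"])`.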