| PDF is currently one of the most widely used file formats for carrying documents. Unlike word-processing formats such as doc and pages, PDF is a print-oriented format whose main goal is accurate, cross-platform output on printers. It therefore lacks the structural description of chapters, paragraphs, and other content that word-processing formats provide. Content analysis and mining of PDF documents presupposes that these typesetting structures can be identified effectively and that the document content within them can be extracted. Current research on PDF processing mainly comprises rule-based methods, machine-learning methods, and deep-learning methods. However, these are all single-modality methods, which makes it difficult to process PDF documents with different layouts accurately. To address the low accuracy of format recognition in digital PDF documents, this thesis designs multimodal algorithms based on characters, layouts, and images to recognize page layout and extract fine-grained document content. To address incomplete content extraction from scanned PDF documents, this thesis designs processing algorithms that combine image recognition with OCR to extract three modalities of document content: text, images, and tables. The specific research contents are as follows. (1) To address the low accuracy of format recognition in digital PDF documents, this thesis studies a multimodal PDF format analysis and content extraction algorithm based on three types of features: character, layout, and image. First, the text in the PDF document is obtained as character features, the text position coordinates and related information are obtained as layout features, and the PDF page images are obtained as image features. Second, to process the feature information of the three modalities, the end-to-end Multi M neural network module is designed to generate character embeddings, layout embeddings, and image embeddings, which are fused into a multimodal embedding for the text classification task. A post-processing module is designed to integrate the various types of text as well as graphical content. Finally, experiments illustrate the improvement in text recognition accuracy that the algorithm achieves by using multimodal features. (2) To address the low accuracy of content extraction from scanned PDF documents, this thesis studies a format analysis and content extraction algorithm based on image recognition and OCR. First, the whole image of each PDF page is obtained as the input for image recognition. Second, to locate the text, table, and image regions in the page image, the PEOD neural network module is designed to detect and crop the three types of targets. A character recognition module is designed to obtain the characters in text and table regions. A post-processing module is designed to organize the parts by order and structure to obtain the multimodal content of text, tables, and images. Finally, experiments illustrate the impact of combining image recognition with character recognition on extraction accuracy. By processing digital and scanned PDF documents with these two methods respectively, PDF documents can be analyzed accurately and their multimodal content of text, tables, and images can be extracted. |
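
The fusion step described in part (1) can be illustrated with a minimal sketch. It is not the Multi M module from the thesis: the hand-crafted features below only stand in for learned embeddings, fusion by concatenation is an assumption (the abstract does not say how the three embeddings are combined), and every name here (`TextBlock`, `fuse`, `classify`, the label set, the weights) is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

LABELS = ["title", "paragraph", "caption"]  # hypothetical label set


@dataclass
class TextBlock:
    text: str                # character modality
    bbox: Sequence[float]    # layout modality: (x0, y0, x1, y1) in page units
    pixels: Sequence[float]  # image modality: flattened grayscale crop in [0, 1]


def char_embedding(text: str) -> List[float]:
    """Toy character features: scaled length, uppercase ratio, digit ratio."""
    n = max(len(text), 1)
    return [
        min(len(text) / 100.0, 1.0),
        sum(c.isupper() for c in text) / n,
        sum(c.isdigit() for c in text) / n,
    ]


def layout_embedding(bbox: Sequence[float], page_w: float, page_h: float) -> List[float]:
    """Toy layout features: position and size of the box, normalized by page size."""
    x0, y0, x1, y1 = bbox
    return [x0 / page_w, y0 / page_h, (x1 - x0) / page_w, (y1 - y0) / page_h]


def image_embedding(pixels: Sequence[float]) -> List[float]:
    """Toy image features: mean intensity and intensity range of the crop."""
    if not pixels:
        return [0.0, 0.0]
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]


def fuse(block: TextBlock, page_w: float, page_h: float) -> List[float]:
    """Fuse the three modalities by concatenation (an assumption, see lead-in)."""
    return (
        char_embedding(block.text)
        + layout_embedding(block.bbox, page_w, page_h)
        + image_embedding(block.pixels)
    )


def classify(fused: Sequence[float], weights, biases) -> str:
    """Score each label with a linear layer and return the highest-scoring one."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return LABELS[scores.index(max(scores))]


# Usage with hand-set (not trained) weights, purely for illustration.
block = TextBlock("INTRODUCTION", (72, 80, 300, 100), [0.0, 1.0, 0.5, 0.5])
fused = fuse(block, page_w=612, page_h=792)
weights = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0],  # "title" keys on the uppercase ratio
    [1, 0, 0, 0, 0, 0, 0, 0, 0],  # "paragraph" keys on text length
    [0, 0, 1, 0, 0, 0, 0, 0, 0],  # "caption" keys on the digit ratio
]
label = classify(fused, weights, biases=[0.0, 0.0, 0.0])
```

In a learned model each toy function would be replaced by a trained encoder and the weights would come from training. The sketch only shows how one classifier can see character, layout, and image evidence at the same time.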