
Research On The Construction Method Of Streaming Document Corpus Oriented To Structure Understanding

Posted on: 2020-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: Q Liu
Full Text: PDF
GTID: 2438330572475908
Subject: Computer technology
Abstract/Summary:
In recent years, re-flowable documents have been widely used in social, media, office, and publishing fields. Given the massive number of re-flowable documents, enabling computers to understand documents accurately is the basis of research in many fields. Document structure understanding has great application value and practical significance: it not only lays the foundation for applications such as document checking and optimization, automatic typesetting, and structured information retrieval, but also contributes to high-level semantic research such as discourse structure analysis and article subject extraction. Because of the complexity of document formats and differences in typographic style, it is difficult for computers to understand the structure of re-flowable documents automatically. At present, document structure understanding relies largely on rule-based methods, but these methods are poorly portable, and establishing rules is time-consuming and labor-intensive. Methods based on machine learning can achieve better versatility and scalability, but because re-flowable document formats are complex and annotation is difficult, these methods suffer from the high cost of labeling data and the scarcity of corpora. In view of these problems, this paper focuses on the theory and methods for constructing a re-flowable document corpus for structure understanding, including the establishment of a logical structure annotation scheme and annotation method for re-flowable documents, the construction of a re-flowable document logical structure annotation corpus, and the evaluation of the corpus. The main research work and contributions are as follows:

(1) To address the lack of corpora and the complexity of structure annotation when machine learning is used to identify the structure of re-flowable documents, this paper draws on experience from natural language corpus construction, studies the theory and methods of constructing a re-flowable document corpus for document structure understanding, and completes the overall design of the corpus. It analyzes the requirements of document logical structure recognition research, clarifies the type, collection principles, labeling principles, and storage structure of the document logical structure annotation corpus, and proposes an overall construction framework for the corpus.

(2) Existing document information description frameworks are not suitable for document structure understanding, and document feature extraction in current research is not comprehensive enough. To address these problems, this paper proposes a logical structure annotation scheme for re-flowable documents based on their logical structure features and editing semantic features. To build this scheme, we first establish a more general and extensible document logical structure description architecture based on DocBook. Then, after in-depth analysis of the layout style and writing style of re-flowable documents, we select 22 editing semantic features, including document content features, style features, and object features, to form feature vectors. Finally, we propose a formal model of document logical structure annotation.

(3) The complexity of the re-flowable document annotation scheme creates a heavy manual workload. To solve this problem, we propose a three-stage semi-automatic method for annotating document logical structure. In the first stage, document metadata is annotated separately from the document with machine assistance; in the second stage, the logical structure of the document is reconstructed automatically based on XSLT; in the third stage, the feature vectors are produced automatically using the Word Object Model. In addition, we design a corpus annotation tool to assist with manual labeling.

(4) We construct the annotated corpus of re-flowable document logical structure, which includes collecting the documents, defining the annotation process, and analyzing the corpus statistically. Furthermore, we evaluate the corpus from three aspects: the efficiency of annotation, the validity of the corpus, and the size of the corpus. The experimental results show that 1) the semi-automatic annotation method saves labor cost and improves the accuracy of labeling results; 2) the re-flowable document logical structure annotation scheme contributes substantially to the accuracy and recall of the algorithm by extracting more effective features; 3) the size of the re-flowable document structure annotation corpus constructed in this paper meets the requirements of document structure recognition models based on machine learning algorithms.
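
As a rough illustration of what one annotated unit in such a corpus might look like, the sketch below pairs a DocBook-style logical label with a feature vector built from content, style, and object features. It is only a minimal sketch under stated assumptions: the feature names, the `AnnotatedParagraph` class, and the `annotate` helper are hypothetical and are not taken from the thesis, which selects 22 features that the abstract does not enumerate.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical feature names, grouped the way the abstract groups them
# (content, style, object). The thesis uses 22 features; these few are
# illustrative placeholders only.
CONTENT_FEATURES = ["char_count", "starts_with_number"]
STYLE_FEATURES = ["font_size_pt", "is_bold", "is_centered"]
OBJECT_FEATURES = ["has_image", "has_table"]
FEATURE_ORDER = CONTENT_FEATURES + STYLE_FEATURES + OBJECT_FEATURES


@dataclass
class AnnotatedParagraph:
    """One annotated unit: text, a logical label, and its features."""
    text: str
    label: str  # DocBook-style element name, e.g. "title" or "para"
    features: Dict[str, float] = field(default_factory=dict)

    def vector(self) -> List[float]:
        # Emit features in a fixed order; missing features default to 0.0.
        return [float(self.features.get(name, 0.0)) for name in FEATURE_ORDER]


def extract_content_features(text: str) -> Dict[str, float]:
    """Derive the content features directly from the paragraph text."""
    stripped = text.strip()
    return {
        "char_count": float(len(stripped)),
        "starts_with_number": float(stripped[:1].isdigit()),
    }


def annotate(text: str, label: str,
             style: Dict[str, float],
             objects: Dict[str, float]) -> AnnotatedParagraph:
    """Combine computed content features with supplied style/object features."""
    features = extract_content_features(text)
    features.update(style)
    features.update(objects)
    return AnnotatedParagraph(text=text, label=label, features=features)


heading = annotate(
    "1 Introduction",
    "title",
    style={"font_size_pt": 16, "is_bold": 1, "is_centered": 0},
    objects={},
)
print(heading.label, heading.vector())
```

In this sketch the style and object features are passed in by hand; in the method described above they would come from the automatic extraction stage, so the label is the only part a human annotator would need to confirm.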
Keywords/Search Tags: corpus construction, feature extraction, structure annotation, document structure recognition, machine learning