Design And Implementation Of Metadata Extraction Tool For Academic Paper Documents

Posted on: 2018-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Deng
Full Text: PDF
GTID: 2428330545961123
Subject: Software engineering
Abstract/Summary:
With the widespread application of computer technology in various fields, many enterprises and organizations have begun to recognize the significance of information management. In practice, the data being managed exists mainly as electronic documents, many of which follow the academic paper format. As the number of such documents grows and the accuracy requirements for document retrieval, classification, and statistics rise, it becomes highly necessary to improve the quality of metadata extraction from paper documents. This thesis proposes a hybrid model based on a BP neural network and a support vector machine (SVM) to extract metadata from the text content of Chinese academic paper documents. The hybrid model targets two weaknesses of existing metadata extraction methods: low accuracy and poor adaptability. Metadata extraction from a paper document is transformed into a text block classification problem. The feasibility of an approach based on a BP neural network and an SVM is established by analyzing and comparing several common classification methods. The text blocks to be classified are first preprocessed using feature rules of the text, and the abstract and keyword metadata are extracted by rule matching. For the preprocessed text, to improve the accuracy of the extraction model, a feature vector is constructed by combining the local features of each text block with the features of its context blocks. The BP neural network model classifies the feature vector of the input text block and thereby identifies the corresponding metadata type. Text blocks containing affiliation (unit) address metadata and author metadata are preprocessed using the separators between blocks. The feature vector of each resulting sub-text is constructed using common personal name and place name information obtained from a corpus, and the SVM model determines the metadata type of the sub-text. The metadata extraction tool based on the BP neural network and the SVM model is implemented in Java with the libsvm library. Experimental results show that this hybrid model achieves better performance in metadata extraction from academic paper documents.
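The first two stages described in the abstract, rule matching for abstract and keyword blocks and feature vector construction from a block plus its context blocks, can be illustrated with a minimal Java sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the class name `BlockFeatureSketch`, the marker patterns, the four local features, and the previous/current/next layout of the vector. The BP neural network and the libsvm classifier that would consume these vectors are not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: rules, features, and names are assumptions,
// not the design used in the thesis.
class BlockFeatureSketch {

    // Hypothetical marker rules. The escaped characters spell the Chinese
    // words for "abstract" and "keywords".
    private static final Pattern ABSTRACT_RULE =
        Pattern.compile("^\\s*(?:\u6458\\s*\u8981|abstract)", Pattern.CASE_INSENSITIVE);
    private static final Pattern KEYWORD_RULE =
        Pattern.compile("^\\s*(?:\u5173\u952e\u8bcd|key\\s?words?)", Pattern.CASE_INSENSITIVE);

    static final int LOCAL_DIM = 4; // number of local features per block

    // Rule matching: returns a label when a block opens with a known marker,
    // or null when the block must go on to the statistical classifier.
    static String matchRule(String block) {
        if (ABSTRACT_RULE.matcher(block).lookingAt()) return "ABSTRACT";
        if (KEYWORD_RULE.matcher(block).lookingAt()) return "KEYWORDS";
        return null;
    }

    // Local features of one block (all four are assumed examples).
    static double[] localFeatures(String block, int index, int total) {
        int digits = 0;
        for (int i = 0; i < block.length(); i++) {
            if (Character.isDigit(block.charAt(i))) digits++;
        }
        double len = block.length();
        return new double[] {
            Math.min(len, 200.0) / 200.0,                    // capped, normalized length
            len == 0 ? 0.0 : digits / len,                   // share of digit characters
            block.indexOf('@') >= 0 ? 1.0 : 0.0,             // e-mail hint
            total <= 1 ? 0.0 : (double) index / (total - 1)  // relative position in document
        };
    }

    // Final vector = [previous block | current block | next block].
    // A missing neighbour (first or last block) is left as zeros.
    static double[] featureVector(List<String> blocks, int index) {
        double[] v = new double[3 * LOCAL_DIM];
        for (int offset = -1; offset <= 1; offset++) {
            int j = index + offset;
            if (j < 0 || j >= blocks.size()) continue;
            double[] f = localFeatures(blocks.get(j), j, blocks.size());
            System.arraycopy(f, 0, v, (offset + 1) * LOCAL_DIM, LOCAL_DIM);
        }
        return v;
    }

    public static void main(String[] args) {
        // Made-up blocks standing in for the first page of a paper.
        List<String> blocks = Arrays.asList(
            "An Example Paper Title",
            "First Author, Second Author",
            "Abstract: this block is caught by a rule.",
            "Keywords: metadata extraction; feature vector");
        for (int i = 0; i < blocks.size(); i++) {
            String label = matchRule(blocks.get(i));
            if (label != null) {
                System.out.println(i + " -> " + label + " (rule)");
            } else {
                System.out.println(i + " -> " + Arrays.toString(featureVector(blocks, i)));
            }
        }
    }
}
```

In a pipeline like the one the abstract describes, blocks that no rule catches would have their vectors passed to the trained classifier. Concatenating the neighbours' features is one simple way to let a classifier use context, for example to learn that an author block tends to follow a title block.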
Keywords/Search Tags: metadata extraction, feature vector, BP neural network, support vector machine