
Metadata Extraction Based On Third-order Conditional Random Fields

Posted on: 2014-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: H M Yu
Full Text: PDF
GTID: 2268330392964512
Subject: Computer application technology
Abstract/Summary:
With the rapid development of digital library technology and open access journals, digital academic resources have become easier to access. As the number of academic papers grows and retrieval requirements rise, metadata must be extracted from papers with increasingly high quality. Metadata extraction allows academic papers to be organized, accessed, and statistically analyzed in a rational way, and improves retrieval efficiency. Accurate, automatic metadata extraction has therefore become an active research topic in digital library construction. In this thesis we propose two approaches: metadata extraction from paper headers based on third-order Conditional Random Fields (CRFs), and metadata extraction from paper references based on a hybrid statistical model that combines third-order CRFs with Support Vector Machines (SVMs). We analyze and verify the extraction performance of both.

First, to address the limited accuracy of existing metadata extraction methods and their failure to make full use of contextual information, we propose a header metadata extraction approach based on third-order CRFs, obtained by extending the state transitions of second-order CRFs. We segment paper headers into blocks by separators and extract features from each block using a feature set that includes local features, layout features, and lexicon features. We then introduce a new smoothing technique and employ the L-BFGS algorithm to estimate the model parameters. Finally, we extract the metadata efficiently and accurately using an improved Viterbi algorithm. Experimental results show that the proposed method outperforms other existing methods.

Second, in view of the characteristics of references and the shortcomings of single-model metadata extraction, we propose a hybrid statistical model based on third-order CRFs and SVMs. We segment paper references into blocks by separators and extract features from each block. We then employ both third-order CRFs and SVMs to classify the blocks, and use a sigmoid function to adjust the classification result. This method exploits the strengths of the two models and compensates for their weaknesses.

Finally, we analyze and validate the proposed extraction method based on third-order CRFs and the extraction method based on the hybrid model, compare them with other existing methods, and discuss directions for future research.
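
The reference-processing steps named in the abstract (segmenting by separators, extracting block features, and adjusting classifier output with a sigmoid function) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration, not the thesis's actual design: the separator set, the feature names, the sigmoid parameters `a` and `b`, and the weighted-average fusion rule in `fuse`. The sigmoid form 1 / (1 + exp(a*s + b)) is the one commonly used to turn SVM decision values into probability-like scores. The CRF probabilities and SVM scores are taken as given inputs, and the sample reference string is a made-up placeholder.

```python
import math
import re

# Hypothetical separator set; the abstract does not list the separators used.
SEPARATORS = re.compile(r"[.,;:]\s+")


def segment_blocks(reference: str) -> list[str]:
    """Split a reference string into blocks at separator characters."""
    parts = SEPARATORS.split(reference.strip())
    return [p.strip(" .") for p in parts if p.strip(" .")]


def block_features(block: str, index: int, total: int) -> dict:
    """Toy local and layout features for one block (illustrative only)."""
    tokens = block.split()
    return {
        "num_tokens": len(tokens),
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", block)),
        "capital_ratio": sum(t[0].isupper() for t in tokens) / max(len(tokens), 1),
        "rel_position": index / max(total - 1, 1),
    }


def sigmoid(score: float, a: float = -1.0, b: float = 0.0) -> float:
    """Map a raw SVM decision value into (0, 1) via 1 / (1 + exp(a*score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def fuse(crf_probs: dict, svm_scores: dict, weight: float = 0.5) -> str:
    """Assumed fusion rule: weighted average of the CRF probability and the
    sigmoid-scaled SVM score for each label; return the best-scoring label."""
    combined = {
        label: weight * crf_probs[label]
        + (1.0 - weight) * sigmoid(svm_scores[label])
        for label in crf_probs
    }
    return max(combined, key=combined.get)


if __name__ == "__main__":
    ref = "Author A, Writer B. A sample title. Journal of Examples, 2010."
    blocks = segment_blocks(ref)
    for i, blk in enumerate(blocks):
        print(blk, block_features(blk, i, len(blocks)))
    # Placeholder classifier outputs for a single block.
    print(fuse({"author": 0.6, "title": 0.4}, {"author": -2.0, "title": 2.0}))
```

In this sketch the sigmoid puts the SVM score on the same 0-to-1 scale as the CRF probability, so the two can be averaged directly. In practice `a` and `b` would be fitted on held-out data rather than fixed.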
Keywords/Search Tags: Metadata extraction, Conditional Random Fields, Support Vector Machines, hybrid model