Metadata Extraction Based On Third-order Conditional Random Fields

Posted on:2014-09-10

Degree:Master

Type:Thesis

Country:China

Candidate:H M Yu

Full Text:PDF

GTID:2268330392964512

Subject:Computer application technology

Abstract/Summary:

With the rapid development of digital libraries and open-access journals, digital academic resources have become much easier to reach. As the number of academic papers grows and retrieval requirements rise, the metadata of these papers must be extracted with increasingly high quality. Metadata extraction makes it possible to organize, access, and statistically analyze academic papers in a rational way, and it improves retrieval efficiency. Accurate, automatic metadata extraction has therefore become a hot research issue in digital library construction. In this thesis we propose an approach to extracting metadata from paper headers based on third-order Conditional Random Fields (CRFs), and an approach to extracting metadata from paper references based on a hybrid statistical model that combines third-order CRFs and Support Vector Machines (SVMs). We analyze and verify the metadata extraction performance of both approaches.

First, to address the problems that existing metadata extraction methods are not accurate enough and do not make full use of contextual information, we propose an approach to metadata extraction from papers based on third-order CRFs, obtained by extending the state transitions of second-order CRFs. We segment paper headers into blocks by separators and extract features from each block using a feature set that includes local features, layout features, and lexicon features. We then introduce a new smoothing technique and employ the L-BFGS algorithm to estimate the model parameters. Finally, we extract the metadata from papers efficiently and accurately using an improved Viterbi algorithm. Experimental results show that the proposed method outperforms other existing methods.

Second, in view of the characteristics of references and the shortcomings of extracting metadata with a single model, we propose a hybrid statistical model based on third-order CRFs and SVMs. We segment paper references into blocks by separators and extract features from each block. We then employ both third-order CRFs and SVMs to classify the blocks, and use a sigmoid function to adjust the classification result. This method exploits the strengths of the two models and compensates for their weaknesses.

Finally, we analyze and validate the proposed extraction method based on third-order CRFs and the extraction method based on the hybrid model, compare them with other existing methods, and discuss directions for future research.
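To illustrate what decoding with third-order label context can look like, the following is a minimal Python sketch of Viterbi search in which the transition score for a block's label depends on up to three preceding labels. It is not the improved Viterbi algorithm of the thesis, whose details the abstract does not give. The function names, the label set, and all scores are hypothetical and chosen only for illustration; in a trained CRF the scores would come from weighted feature functions.

```python
def viterbi_third_order(n_steps, labels, emit, trans):
    """Return the highest-scoring label path of length n_steps.

    emit(t, y)        -> score of label y for block t
    trans(history, y) -> score of label y given a tuple of up to three
                         preceding labels (oldest first)
    """
    if n_steps == 0:
        return []
    # best maps the last (up to three) labels to (score, path) for the
    # best path ending in that label history.
    best = {(y,): (emit(0, y), [y]) for y in labels}
    for t in range(1, n_steps):
        nxt = {}
        for hist, (score, path) in best.items():
            for y in labels:
                s = score + trans(hist, y) + emit(t, y)
                key = (hist + (y,))[-3:]  # keep only the last three labels
                if key not in nxt or s > nxt[key][0]:
                    nxt[key] = (s, path + [y])
        best = nxt
    return max(best.values(), key=lambda sp: sp[0])[1]


def path_score(path, emit, trans):
    """Total score of a label path under the same scoring functions."""
    total = 0.0
    for t, y in enumerate(path):
        total += emit(t, y)
        if t > 0:
            total += trans(tuple(path[max(0, t - 3):t]), y)
    return total


# Hypothetical label set and per-block emission scores
# (rows: header blocks, columns: same order as LABELS).
LABELS = ["title", "author", "affiliation"]
EMIT = [
    [2.0, 0.5, 0.1],
    [1.0, 0.9, 0.2],
    [0.2, 1.0, 0.8],
    [0.1, 0.9, 1.0],
    [0.1, 0.3, 1.5],
]


def emit(t, y):
    return EMIT[t][LABELS.index(y)]


def trans(history, y):
    # Hypothetical scores: staying on the same field is mildly rewarded;
    # returning to a field seen earlier in the three-label window is
    # penalised, which a first-order transition could not express.
    if y == history[-1]:
        return 0.5
    if y in history[:-1]:
        return -2.0
    return 0.0


decoded = viterbi_third_order(len(EMIT), LABELS, emit, trans)
print(decoded, path_score(decoded, emit, trans))
```

Tracking the last three labels as the search state keeps the recursion exact, at the cost of up to |labels|^3 states per block. That is workable for the small label sets typical of header and reference fields.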

Keywords/Search Tags:

Metadata extraction, Conditional Random Fields, Support Vector Machines, hybrid model

Related items

1	Research Of The Automatic Metadata Extraction Based On The Conditional Random Fields
2	SAR Image Change Detection Based On Conditional Random Fields
3	Information Recognition And Extraction From Chinese Periodical Papers Based On Conditional Random Fields
4	A Self-adaptive BLP Optimal Model Employing Conditional Random Fields
5	Research On Morpheme Analysis Based On Conditional Random Fields In Chinese Natural Language Understanding
6	Research On Personnel Resume Intelligent Extraction System Based On Conditional Random Fields
7	A Study On Chinese Location Names Recognition Based On Conditional Random Fields
8	Research On Online Detection Method Of Reputation Fraud Campaign Based On Conditional Random Fields
9	SAR Image Change Detection Based On Spatially Nonstationary Analysis
10	Web Information Extraction Research Based On Conditional Random Fields