
Research And Implementation On The Technique Of Citation Labeling Based On CRFs Model

Posted on: 2014-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: J X Zhou
Full Text: PDF
GTID: 2298330467977942
Subject: Computer technology
Abstract/Summary:
Citations play an important role in discovering relationships among research papers and in entity resolution for papers drawn from different data sources. Citation sequence labeling is an essential phase of citation entity resolution and of other applications that operate on citations, and many methods and models have been proposed for it. One family of methods is rule-based: researchers inspect citation sequences and summarize a set of rules that is comprehensive and free of contradictions, usually written as regular expressions. Because citation data is irregular and varied, methods based on statistical learning models require less manual effort and achieve higher accuracy than the alternatives. Among statistical learning models, conditional random fields (CRFs) are considered the best choice and have been studied and used extensively, because they combine the advantages of generative models such as the hidden Markov model (HMM) and discriminative models such as the maximum entropy Markov model (MEMM) while avoiding their inherent deficiencies.
The purpose of this thesis is to study and implement techniques for choosing the labeling granularity and for selecting features in CRF-based citation sequence labeling. The thesis discusses the use of text features in this setting, and of punctuation in particular. The words between two punctuation marks in a citation sequence belong to the same semantic item and therefore share the same label. To address the errors that can arise when a CRF labels citations at word granularity, we propose CRF-based citation sequence labeling at token granularity, so that words delimited by the same pair of punctuation marks are labeled together. We also provide a systematic method for feature selection by deriving feature formats for token-granularity CRFs.
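The difference between word granularity and token granularity can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the delimiter set, the function names `word_units` and `token_units`, and the sample citation are hypothetical, and the thesis's own procedure for deciding which punctuation marks act as delimiters is not reproduced here.

```python
import re

# Assumed delimiter set for illustration only. Real citations need more care,
# for example a period after an author initial should not split a token.
DELIMITERS = r"[,.;:]"

def word_units(citation):
    """Word granularity: every whitespace-separated word gets its own label."""
    return citation.split()

def token_units(citation):
    """Token granularity: each run of words between delimiters gets one label."""
    parts = re.split(DELIMITERS, citation)
    return [part.strip() for part in parts if part.strip()]

# Hypothetical citation string, not taken from the ACM data set.
citation = "A Author, B Writer. A sample title. Sample Journal, 2010"

print(word_units(citation))   # 10 units to label independently
print(token_units(citation))  # 5 units, each spanning one semantic item
```

With word granularity a model could assign different labels to "sample" and "title" even though both belong to the title; with token granularity that kind of inconsistency cannot occur inside a token.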
Feature selection is divided into three kinds, and we give a definition and a selection method for each. We also implement a CRF-based citation sequence labeling system and describe its algorithms in detail, including the forward-backward algorithm, the Viterbi algorithm, the delimiter (divide symbol) estimation algorithm, the state maintenance algorithm, the training set generation algorithm, and the feature estimation algorithm.
Experiments on a real citation data set from ACM show that, for citation sequence labeling, the token-granularity CRF outperforms the word-granularity CRF, and that labeling performance improves as more kinds of features are used. The latter result also confirms that the systematic feature selection method we provide is useful.
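As a rough illustration of the decoding step in a linear-chain CRF, the following Python sketch shows textbook Viterbi decoding over additive (log-space) scores. It is not the thesis's implementation: the function name `viterbi`, the two-label setup, and all score values are hypothetical, and in a trained CRF the scores would come from weighted feature functions rather than hand-written tables.

```python
def viterbi(emission, transition):
    """Return the highest-scoring label sequence.

    emission[t][k]   : score of label k at position t
    transition[i][j] : score of moving from label i to label j
    Scores are additive (log-space), so the best path maximizes their sum.
    """
    n_labels = len(transition)
    best = [list(emission[0])]  # best[t][k]: best score of a path ending in k at t
    back = []                   # back[t-1][k]: previous label on that best path
    for t in range(1, len(emission)):
        row, ptr = [], []
        for j in range(n_labels):
            prev = max(range(n_labels),
                       key=lambda i: best[-1][i] + transition[i][j])
            row.append(best[-1][prev] + transition[prev][j] + emission[t][j])
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final label.
    last = max(range(n_labels), key=lambda k: best[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical scores for three tokens and two labels (0 = author, 1 = title).
emission = [[2.0, 0.0], [0.5, 1.0], [0.0, 2.0]]
transition = [[0.5, 0.0], [-5.0, 0.5]]  # title -> author is strongly penalized

print(viterbi(emission, transition))
```

The forward-backward algorithm has the same dynamic-programming structure but sums over paths instead of maximizing, which is what training needs to compute marginals.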
Keywords/Search Tags:CRFs, citation sequence labeling, statistical learning model, web data extraction, feature selecting