Research on Attribute Relation Extraction from Chinese Online Encyclopedia

Posted on: 2015-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Yang
Full Text: PDF
GTID: 2268330428478925
Subject: Signal and Information Processing
Abstract/Summary:
Artificial intelligence technology has been widely applied in everyday life in recent years, but its application depends heavily on large-scale knowledge bases. Attribute relations are an important part of a knowledge base. As a kind of entity relation, an attribute relation reflects the relation category of entities and consists of a concept instance, a trigger word of the attribute relation, and an attribute value.

Chinese online encyclopedias contain numerous entry names, attributes, trigger words and attribute values, so they provide a rich data source for extracting attribute relations. At present, there are two main problems in extracting attribute relations from Chinese online encyclopedias. First, the attributes obtained by existing methods are not conducive to mapping attribute relations to the encyclopedia knowledge base, and most attributes lack effective trigger words. Second, training corpora based on Chinese online encyclopedias are very scarce, and attribute relation extraction relies heavily on named entity recognition.

To solve these problems, this thesis aims to build an effective, large-scale attribute relation knowledge base and proposes corresponding solutions. The main contents are as follows.

First, this thesis studies an approach for extracting attribute names. It makes full use of the features of structured or semi-structured information templates to extract candidate attribute names, and then selects the target attribute names.

Second, this thesis studies a meta-model-based algorithm for generating the trigger word set. Seed words are selected from the meta-model and expanded using an external dictionary, and a credibility evaluation method is studied. Based on the expanded seed words, candidate trigger words are extracted iteratively to build the trigger word set.

Third, this thesis studies a method for automatically acquiring a training corpus based on weak supervision. Attribute relation triples are established from entry names and information templates, and these triples are used to back-mark text clauses to build the training data. An algorithm based on naive Bayes classifiers and trigger words is then studied to optimize the training corpus. Because attribute values are marked with predefined symbols during back-marking and classification, this method is not limited to named entities.

Finally, this thesis uses a conditional random field (CRF) toolkit to generate an extraction model. By tagging entry names and trigger words, we study a method for automatically converting the training corpus, and then select corpus features and formulate feature templates to build the extraction model.

In this thesis, interactive encyclopedia entry text is used as the data set. We train on the "college" and "company" categories separately to generate extraction models for different attributes, and test attribute relation extraction on interactive encyclopedia entry text of the same category. Experimental results show that the proposed method not only achieves good extraction performance but is also highly portable.
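
The back-marking step of the weak supervision method might look roughly like the following minimal sketch. It is not the thesis's implementation: the `Triple` type, the `back_mark` function, the `<VALUE>` placeholder and the English sample clauses are all hypothetical, and a plain trigger-word check stands in for the naive Bayes classifier that the thesis uses to optimize the corpus.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Triple(NamedTuple):
    """Hypothetical attribute relation triple built from an entry's information template."""
    entry: str
    attribute: str
    value: str


# Assumed predefined symbol that replaces the attribute value in a clause.
VALUE_TAG = "<VALUE>"


def back_mark(clauses: Iterable[str], triple: Triple,
              trigger_words: Iterable[str]) -> List[Tuple[str, bool]]:
    """Back-mark clauses with a triple and give each a weak label.

    A clause becomes a candidate sample when it contains the attribute value.
    The value is replaced by VALUE_TAG, so no named entity recognizer is needed.
    The label is True when the marked clause also contains a trigger word
    (a simplification of the classifier-based filtering described above).
    """
    triggers = list(trigger_words)
    samples = []
    for clause in clauses:
        if triple.value not in clause:
            continue  # the clause does not mention the attribute value
        marked = clause.replace(triple.value, VALUE_TAG)
        label = any(word in marked for word in triggers)
        samples.append((marked, label))
    return samples


# Illustrative data only; not taken from the thesis corpus.
clauses = [
    "Example University was founded in 1921",
    "In 1921 the city built a bridge",
    "It has three campuses",
]
triple = Triple("Example University", "founding year", "1921")
samples = back_mark(clauses, triple, ["founded", "established"])
for marked, label in samples:
    print(label, marked)
```

In this sketch the second clause mentions the value but no trigger word, so it receives a negative label. This is the kind of noisy sample that the corpus optimization step is meant to filter out.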
Keywords/Search Tags: Online encyclopedia, Trigger words, Weak supervision, Naive Bayes classification, Conditional random field (CRF)