Research On Ontology Learning And Knowledge Acquisition From Chinese Online Encyclopedia

Posted on: 2015-04-27 Degree: Doctor Type: Dissertation
Country: China Candidate: Z Gu Full Text: PDF
GTID: 1228330461974394 Subject: Information security
Abstract/Summary:
Analyzing big data and turning it into information and knowledge has important scientific value and practical significance for knowledge engineering and network information security. An ontology knowledge base is the foundation of automatic question answering, decision support, semantic search engines, and similar applications. However, constructing an ontology knowledge base is time-consuming and costly. Chinese Online Encyclopedia (COE) is a large body of data created collaboratively by network users, a product of the wisdom of crowds. Its potential users account for about one quarter of the world's population. COE therefore provides an ideal resource for large-scale collaborative knowledge acquisition. The dissertation analyzes the characteristics of COE, studies knowledge acquisition methods suited to the big data of COE, and provides a theoretical basis and algorithms for extracting massive numbers of concepts and relations from COE. Its main content covers the following four aspects:
(1) Acquisition of taxonomic relations and instance-of relations: The dissertation acquires hyponymy relations between pairs of open categories based on word-occurrence analysis and semantic analysis, and uses these relations to generate conceptual hierarchy structures of open categories. To cope with the large number of conceptual hierarchy structures, a clustering method is proposed that groups them by semantic similarity. With the conceptual hierarchy structures, the similarity between two open categories can be calculated. On the basis of this similarity computation, a weight is obtained for each open category, and instance-of relations between an encyclopedia entry and its open categories are acquired. In this way, a massive classified glossary can be built.
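The similarity and weighting step of aspect (1) can be illustrated with a small Python sketch. The abstract does not state the formulas, so everything here is an assumption made for illustration: a Wu-Palmer-style similarity over a tree-shaped hierarchy, a category weight defined as the mean similarity to the entry's other categories, a hypothetical threshold, and a toy hierarchy. It is not the dissertation's actual method.

```python
def ancestors(parent, node):
    """Return the path from node up to its root, node first (assumes a tree)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def wu_palmer(parent, a, b):
    """Wu-Palmer-style similarity: 2*depth(lcs) / (depth(a) + depth(b)), root depth = 1."""
    path_a, path_b = ancestors(parent, a), ancestors(parent, b)
    in_b = set(path_b)
    # The lowest common subsumer is the first ancestor of a that is also an ancestor of b.
    lcs = next((n for n in path_a if n in in_b), None)
    if lcs is None:
        return 0.0  # the two categories sit in different hierarchies
    return 2.0 * len(ancestors(parent, lcs)) / (len(path_a) + len(path_b))


def category_weights(parent, categories):
    """Weight each open category of an entry by its mean similarity to the others."""
    weights = {}
    for c in categories:
        others = [o for o in categories if o != c]
        if others:
            weights[c] = sum(wu_palmer(parent, c, o) for o in others) / len(others)
        else:
            weights[c] = 1.0
    return weights


# Hypothetical toy hierarchy (child -> parent) and one entry's open categories.
parent = {"university": "school", "school": "organization",
          "company": "organization", "fruit": "food"}
entry_categories = ["university", "school", "fruit"]

weights = category_weights(parent, entry_categories)
THRESHOLD = 0.3  # assumed cut-off for accepting an instance-of relation
instance_of = [c for c in entry_categories if weights[c] >= THRESHOLD]
```

In this toy case the unrelated category "fruit" has no common ancestor with the other two, receives weight 0, and is dropped, while "university" and "school" support each other and are kept.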
Experiments on the Hudong encyclopedia dataset show that the proposed ontology learning method is superior to the typical method.
(2) Acquisition of attribute relations: The dissertation treats attribute values as named entities and extracts frequent k-patterns from encyclopedia texts. Candidate attribute words are obtained through association analysis of the frequent k-patterns. Semantic resources are used to filter duplicate candidate attribute words and to establish the list of attributes for each class. For each attribute of a class, a bootstrapping method is proposed to generate attribute trigger words, which are then used to mine attribute value extraction patterns from texts. A hierarchical clustering algorithm improves the quality of the patterns by filtering out low-frequency and noisy ones. Experiments on Hudong encyclopedia texts show that the class attribute acquisition method obtains more attributes reflecting the class features than the manually defined attribute set does, and that the attribute value acquisition method performs better than the typical method.
(3) Acquisition of entity relations: The dissertation automatically builds a training corpus from the structured information and texts of the encyclopedia. Relation words are extracted from the training corpus and from semantic resources. Training data that does not contain relation words is filtered out, which optimizes the training corpus. The dissertation uses n-pattern features to train a classifier that labels test data to obtain relation instances. Experimental results show that relation word filtering improves the quality of the training corpus, and that n-pattern features effectively relieve the data sparsity problem of traditional n-gram features and improve classifier performance. A weakly supervised relation extraction method named NF-Tri-training is proposed.
The method applies the Tri-training algorithm to train several classifiers iteratively, obtains new samples from unlabelled data, adds them to the initial training set, and at the same time uses a data editing technique to remove noise from both the initial training set and the new samples. Experimental results show that the presented method improves the classifier's generalization ability and the performance of weakly supervised relation extraction.
(4) Acquisition of part-whole relations: The dissertation extracts concept pairs and their context patterns from encyclopedia texts, and constructs a distributional semantic model of concept pairs and context patterns. A co-clustering algorithm is applied to group concept pairs that share the same semantic relation. An L1-regularized logistic regression model is trained to select clustering features and to obtain the context patterns that represent the semantic relation of each cluster. Clusters carrying the part-whole relation are identified from the context patterns of their concept pairs, and the part-whole concept pairs are thereby acquired. Experiments on university texts of the Hudong encyclopedia indicate that the proposed method is superior to the one-sided clustering method and the traditional pattern matching method.
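The feature selection step of aspect (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: cluster membership is taken as already given by co-clustering, the model is fitted with plain proximal gradient descent (a solver chosen here only for brevity), and the pattern names, data, and hyperparameters are hypothetical. Each row describes one concept pair, each column records whether a context pattern occurs with it, and the label says whether the pair belongs to the cluster of interest.

```python
import math


def soft_threshold(x, t):
    """Proximal operator of the L1 norm: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def l1_logreg(X, y, lam=0.1, lr=0.5, steps=500):
    """Fit L1-regularized logistic regression by proximal gradient descent.

    The intercept b is not penalized; weights driven exactly to zero
    correspond to context patterns that are discarded.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step on the smooth loss, then shrinkage for the L1 penalty.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Hypothetical context patterns and toy data: 1 = pattern occurs with the concept pair.
patterns = ["X is part of Y", "X was founded in Y"]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]  # 1 = concept pair belongs to the cluster of interest

w, b = l1_logreg(X, y)
# Patterns with a positive weight are taken to represent the cluster's relation.
selected = [p for p, wj in zip(patterns, w) if wj > 0]
```

In this toy data the second pattern occurs equally often inside and outside the cluster, so its weight stays at zero and only "X is part of Y" is selected as representative of the cluster.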
Keywords/Search Tags: Chinese Information Processing, Big Data, Knowledge Acquisition, Open Information Extraction, Online Encyclopedia