
The Integration Of Multiple Semantic Knowledge Bases

Posted on: 2012-07-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Z Guo
Full Text: PDF
GTID: 1118330362462060
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the World Wide Web, a flood of information has entered our lives, and people's needs for useful information grow daily. Simply returning related web pages for a user's query no longer satisfies these requirements: machines are expected to be more intelligent, to understand queries, to return exactly matched results (or knowledge), and thus to better support users' decision making. To meet these expectations, knowledge service has been proposed as the future trend of information retrieval, and semantic analysis will be an important building block for supporting it. As one of the key techniques for understanding users' information needs, a semantic knowledge base (SKB) describes the syntax and semantics of words or phrases and defines the relationships between them; SKBs have become one of the most important resources in natural language understanding. Many kinds of SKBs have been constructed in various ways, but, limited by their construction methods, existing SKBs are either too small in scale, lacking in dynamic semantics, or too simple in their knowledge representation and hence short of semantic annotation. Finding methods for constructing large-scale SKBs with rich semantic annotation therefore remains an open task for natural language processing researchers, and the automatic integration of multiple SKBs provides a flexible and effective way to address these issues.

Although there has been much research on multiple-SKB integration in recent years, there is still no effective way to integrate most existing SKB resources, especially online encyclopedia knowledge bases such as Wikipedia and Baidu Baike.

In this thesis, we first analyze the problems in multiple-SKB integration, especially knowledge selection and knowledge inconsistency. Baidu Baike, Wikipedia (Chinese version), and Hudong Baike are then chosen as integration sources, and a unified integration framework is proposed for them.
Next, by introducing the semantic dictionary HowNet, the thesis presents an approach to integrating Baidu Baike and HowNet. The unified "class-attribute-entity-attribute value" framework offers a solution to the consistency problem in multiple-SKB integration.

The construction of "class-attribute" templates is the core of the integration framework. To address this, a multi-filter-driven class attribute extraction approach is proposed, which extracts class attributes from online encyclopedias. The catalog labels in the instances of encyclopedia knowledge bases are used as the corpus for the first time. A series of filters is then applied to the raw candidate attribute set, so that noise and redundant information are removed and similar candidate attributes are merged. Finally, the selected candidate attributes are ranked by their dispersion statistics. Our experiments show that applying multiple filters to the candidate class attributes and ranking the selected attributes by dispersion statistics ensures high accuracy.

To improve the coverage of the extracted class attributes, a novel class attribute extraction approach based on semantic relatedness computing is presented. This approach mines potential attributes in the candidate set that have high semantic relatedness. First, the co-occurrence of every pair of candidate attributes is counted, and the tolerance class of each candidate attribute is defined from these co-occurrences following tolerance rough set theory; then the co-occurrence counts and the normalized Google distance (NGD) are combined as a constraint for obtaining an upper approximation of the attribute set of the target class. In the experiments we compare the tolerance-rough-set-based method with the method using semantic relatedness computing.
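The NGD used here is a standard corpus-based relatedness measure computed from document frequencies. As an illustration only (the thesis does not specify its exact implementation), a minimal computation from raw counts might look like this:

```python
import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance between two terms x and y.

    fx, fy -- number of documents containing x (resp. y)
    fxy    -- number of documents containing both x and y
    n      -- total number of indexed documents
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Terms that always co-occur have distance 0 (maximal relatedness);
# distance grows as co-occurrence drops relative to individual frequency.
print(ngd(1000, 1000, 1000, 1_000_000))  # 0.0
```

Smaller NGD values mean higher semantic relatedness, which is why it can serve as a threshold constraint when admitting candidate attributes into the upper approximation.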
The results show the necessity of introducing NGD into semantic relatedness computation. Moreover, compared with the dispersion-based method, the method based on semantic relatedness computing is shown to find potential low-ranked attributes that nevertheless have high semantic relatedness, so high extraction accuracy is achieved together with a higher coverage rate.

The last part of this thesis studies applications of semantic relatedness computing in search ranking. A web site factor is introduced when computing the semantic relatedness between anchor texts and web pages, and the transition probability matrix between web sites is adjusted accordingly. In addition, the update frequency of each web site is counted, and a novel site ranking algorithm using both semantic relatedness and update frequency is proposed. The semantic relatedness between a query string and a web site is computed from web site features, and a novel ranking algorithm based on web site feature identification is presented. Experiments show that introducing semantic relatedness is effective in these ranking applications.
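The abstract does not give the exact formulation of the site ranking algorithm, but a PageRank-style iteration whose transition matrix is reweighted by anchor-text relatedness and then blended with site update frequency could be sketched as follows. The function name, the blend parameter `alpha`, and the uniform handling of dangling sites are illustrative assumptions, not the thesis's definitions:

```python
import numpy as np

def site_rank(adj, relatedness, update_freq, d=0.85, alpha=0.5, iters=100):
    """Hypothetical sketch of relatedness-weighted site ranking.

    adj         -- n x n link matrix between sites (adj[i, j] > 0 if i links to j)
    relatedness -- n x n semantic relatedness of anchor text to the target page
    update_freq -- length-n vector of site update frequencies
    alpha       -- blend between link-based rank and update frequency
    """
    n = len(adj)
    w = adj * relatedness                        # weight each link by relatedness
    row_sums = w.sum(axis=1, keepdims=True)
    p = np.divide(w, row_sums,                   # row-stochastic transition matrix;
                  out=np.full_like(w, 1.0 / n),  # dangling sites jump uniformly
                  where=row_sums > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):                       # PageRank-style power iteration
        r = (1 - d) / n + d * (r @ p)
    f = update_freq / update_freq.sum()          # normalized "freshness" score
    return alpha * r + (1 - alpha) * f           # blended final site score
```

The key design point the abstract describes is that semantic relatedness modifies the transition probabilities before the iteration, rather than being applied as a post-hoc re-ranking step.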
Keywords/Search Tags: multiple semantic knowledge bases integration, class attribute extraction, semantic relatedness, tolerance rough set, website feature extraction