Research And Implementation On Computing Semantic Relatedness Using Chinese Wikipedia

Posted on: 2012-12-19  Degree: Master  Type: Thesis
Country: China  Candidate: X Wang  Full Text: PDF
GTID: 2218330362460221  Subject: Computer Science and Technology
Abstract/Summary:
Computing semantic relatedness is one of the most important problems in Natural Language Processing (NLP) and plays a critical role in many NLP applications, such as information retrieval, text classification, word sense disambiguation, and example-based machine translation. Because of the particular characteristics of Chinese, among other reasons, research on computing semantic relatedness in Chinese lags well behind the corresponding research for English. Research on computing semantic relatedness in Chinese is therefore of great value for improving related NLP technology.
This thesis mainly studies algorithms for computing semantic relatedness using Chinese Wikipedia links and taxonomy. First, it introduces the research background and related methods for computing semantic relatedness. Second, it applies to Chinese Wikipedia algorithms that were designed for tree-structured taxonomies such as WordNet. Because the Wikipedia taxonomy is a directed acyclic graph rather than a tree, we propose a multi-path semantic relatedness algorithm. Third, it applies the WLM (Wikipedia Link-based Measure) algorithm to Chinese Wikipedia and proposes the WLT (Wikipedia Links and Taxonomy based measure) algorithm, which uses both Wikipedia links and taxonomy. We also combine the taxonomy-based algorithms with WLM or WLT. The experimental results show that the combined algorithms perform better than algorithms based only on Wikipedia links or only on the Wikipedia taxonomy. Finally, the Wikipedia-based semantic relatedness algorithms are used in the YHPODS system in two ways: topic keyword association and semantic-based classification.
In addition, we build a manually evaluated test collection named Words-240 to evaluate the accuracy of semantic relatedness algorithms.
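The abstract does not give the formula used for WLM. As a point of reference only, the following is a minimal Python sketch of the link-based measure as it is commonly stated in the literature (a normalized-distance formula over the sets of articles that link to each of the two target articles). The function name, the argument names, and the handling of edge cases are assumptions made for illustration and are not taken from the thesis.

```python
import math


def wlm_relatedness(inlinks_a, inlinks_b, total_articles):
    """Link-based relatedness of two articles (illustrative sketch).

    inlinks_a, inlinks_b: iterables of ids of articles linking to a and b.
    total_articles: number of articles in the whole wiki (assumed > 0).
    Returns a score in [0, 1]; higher means more related.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        # No shared in-links: treat the pair as unrelated (an assumption).
        return 0.0

    larger = max(len(a), len(b))
    smaller = min(len(a), len(b))
    denom = math.log(total_articles) - math.log(smaller)
    if denom <= 0:
        # Both sets cover the whole wiki, so they are identical.
        return 1.0

    distance = (math.log(larger) - math.log(len(common))) / denom
    # Clamp so that very distant pairs do not yield a negative score.
    return max(0.0, 1.0 - distance)
```

For example, `wlm_relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, 100)` compares two articles that share half of their incoming links in a wiki of 100 articles.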
Because of the large amount of data in Wikipedia, we propose several methods to improve the efficiency of the algorithms, such as using a memory cache and a file cache, optimizing the database tables, and building a database connection pool. With these measures, the running time of the algorithms is reduced by a factor of several dozen.
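The abstract names a memory cache as one of the optimizations but gives no implementation details. The sketch below shows only the general idea in Python, using the standard library's `functools.lru_cache`. The function `fetch_inlinks_from_db`, the in-memory table that stands in for the database, and the cache size are all hypothetical and are not taken from the thesis.

```python
from functools import lru_cache

# Counts how often the (simulated) database is actually queried.
CALLS = {"count": 0}

# Hypothetical stand-in for a link table stored in a database.
_FAKE_LINK_TABLE = {
    "Beijing": (1, 2, 3),
    "Shanghai": (2, 3, 4),
}


def fetch_inlinks_from_db(title):
    """Hypothetical slow lookup of the articles that link to `title`."""
    CALLS["count"] += 1
    return frozenset(_FAKE_LINK_TABLE.get(title, ()))


@lru_cache(maxsize=100_000)  # cache size is an arbitrary illustrative value
def cached_inlinks(title):
    """Memory-cached wrapper: repeated lookups skip the database."""
    return fetch_inlinks_from_db(title)
```

Because relatedness computations tend to request the link sets of the same articles many times, a wrapper of this kind answers repeated requests from memory. Returning an immutable `frozenset` keeps callers from accidentally modifying a cached value.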
Keywords/Search Tags: Semantic Relatedness, Semantic Similarity, Wikipedia, Natural Language Processing