Research Of Semantic Relatedness Measure Based On Wikipedia Structure

Posted on: 2013-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: C C Sun
Full Text: PDF
GTID: 2298330467478167
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Web 2.0, a large amount of web information is being produced and disseminated. People want to obtain the information that matters to them from computers quickly, and they expect computers to mine information automatically and intelligently and to understand and process natural language well. Semantic relatedness between words and phrases is essential to these applications. As a fundamental research topic, semantic relatedness is widely used in information retrieval, spell checking, text classification, text clustering, artificial intelligence, and natural language processing applications such as word sense disambiguation, automatic summarization, intelligent question answering, and machine translation.

Judging the semantic relatedness between words is a difficult and complicated task for computers: it requires knowledge of many real-world concepts and the relationships between them, common sense, and domain-specific knowledge. Some researchers compute semantic relatedness through statistical analysis of large corpora, while others rely on knowledge bases and use lexical structures such as taxonomies and thesauri. However, both approaches are limited by their background knowledge: the former is poorly structured and imprecise, and the latter is limited in scalability and scope.

Wikipedia is an excellent semantic knowledge base. It consists of the article referenced network and the category tree, two network-like structures that hold a large amount of explicit, well-structured semantic knowledge. To compute the semantic relatedness between words or phrases, we first map the target words to wiki-concepts, which are defined in Chapter 3; we then compute the semantic relatedness between the wiki-concepts to obtain the semantic relatedness between the target words.
The main contributions and innovations of the thesis are as follows:

1) We introduce the background, current state, and shortcomings of research on semantic relatedness computation. We state the definition of semantic relatedness and its evaluation measures, and we review traditional semantic relatedness algorithms and analyze their advantages and disadvantages.

2) We propose a simple semantic relatedness algorithm named RelArtNetSimple, based on the Wikipedia article referenced network and the Jaccard coefficient. We then assign weights to wiki-concept nodes and links and divide the wiki-concepts into layers. Finally, we present a new semantic relatedness algorithm named RelArtNet, which is based on weighted, hierarchically divided wiki-concepts in the Wikipedia article referenced network.

3) We propose one semantic relatedness algorithm based on the content of the category tree and another based on the structure of the category tree. We then present a new semantic relatedness algorithm named RelCatTree, based on the Wikipedia category tree, which combines the advantages of the former two algorithms.

4) We evaluate semantic relatedness algorithms by the correlation between their results and human judgments, measured with the Spearman coefficient. Three popular test sets are used: Miller and Charles (1991, 30 pairs), Rubenstein and Goodenough (1965, 65 pairs), and WordSim-353 (Finkelstein et al., 2002, 353 pairs). The experimental results show that the proposed WSR algorithm has good complexity.
Keywords/Search Tags: semantic relatedness, wikipedia, article referenced network, category tree