Semantic Relevance Metric Algorithm Research Based On Wikipedia

Posted on: 2016-06-04  Degree: Master  Type: Thesis
Country: China  Candidate: Zhu  Full Text: PDF
GTID: 2308330461452120  Subject: Software engineering
Abstract/Summary:
Semantic relatedness between concepts is the degree to which two concepts are associated at the semantic level. Measuring it is a basic task in natural language processing research. In information retrieval, semantic relatedness measurement is considered one of the best ways to improve retrieval results. In text clustering and text categorization, using the semantic relatedness of terms can greatly improve classification or clustering quality. Semantic relatedness measures are also widely used in text disambiguation, text summarization, machine translation, and other fields.

First, this thesis gives an overview of Wikipedia, analyzes its characteristics and advantages, and describes in detail its page structure, link network structure, and category system. As background, it introduces semantic relatedness computation methods based on large-scale corpora and methods based on ontologies, and analyzes the principles, scope of application, and advantages and disadvantages of each. Beyond these traditional methods, it also analyzes several of the most commonly used methods that take Wikipedia as background knowledge.

Secondly, the thesis introduces SimRank, a relatedness algorithm based on graph structure, and describes its computation principles. It then identifies the defects that arise when SimRank is applied to computing concept relatedness on Wikipedia, such as treating all adjacent nodes as having equal effect and a high computational complexity. To address these defects, an improved algorithm is proposed.
First, the semantic influence of a concept node on a target node is defined, to distinguish how strongly different adjacent nodes affect it. Second, two methods are used to compute the semantic influence of concept nodes, one based on conditional probability and one based on category information content (IC). SimRank is improved accordingly, giving the definitions and computation methods of CPSIS and ICSIS. Analysis of CPSIS shows, however, that it performs better when a concept has many out-links and worse when it has few. ICSIS uses category tree information and can better compensate for the lack of information when there are few links. The two methods are therefore combined, each where it is applicable, to form a general algorithm, SIS. In addition, analysis shows that the complexity of the algorithm is too high and the amount of computation too large, so three methods based on the characteristics of the Wikipedia link graph are used to reduce the computational cost: (1) using the convergence properties of the iteration, compute precisely the number of iterations needed to meet the accuracy required in practice; (2) include only the necessary nodes in the computation, for example by excluding zero-degree nodes from the iteration; (3) set a threshold in each iteration and ignore adjacent nodes whose relatedness is too small.

Finally, experiments are carried out and their results analyzed. The proposed method performs much better than the traditional ontology-based and large-scale-corpus-based methods. Compared with the three Wikipedia-based methods WikiRelate, WLM, and ESA, it is generally better than the first two, and it is better than the last when computing relatedness for uncommon concepts. This indicates that the algorithm has good generality.
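
The three cost reductions listed in the abstract can be illustrated on plain SimRank. The sketch below is not the thesis's SIS, CPSIS, or ICSIS algorithm, whose formulas the abstract does not give; it is a minimal version of standard SimRank in which every in-neighbour counts equally. The function names, the `in_links` dictionary format, and the default values are hypothetical. The iteration count assumes the commonly cited bound that the error after k iterations is at most C^(k+1) for decay factor C, which may differ from the estimate used in the thesis. Reading "zero-degree" as "no in-links" and applying the threshold to pair scores are also assumptions.

```python
from itertools import combinations


def iterations_for_accuracy(decay, eps):
    """Smallest k with decay**(k + 1) <= eps (assumed error bound)."""
    k = 0
    while decay ** (k + 1) > eps:
        k += 1
    return k


def _lookup(sim, a, b):
    """Score of a pair from the sparse table; a node is fully similar to itself."""
    if a == b:
        return 1.0
    return sim.get((a, b) if a <= b else (b, a), 0.0)


def simrank(in_links, decay=0.8, eps=0.01, threshold=1e-4):
    """Plain SimRank over a dict mapping each node to its in-neighbours.

    Returns a sparse dict {(a, b): score} with a < b.
    """
    # (2) Nodes without in-links score 0 against every other node,
    #     so they are left out of the pairwise iteration.
    nodes = sorted(n for n, preds in in_links.items() if preds)
    sim = {}
    # (1) Iterate only as often as the requested accuracy needs.
    for _ in range(iterations_for_accuracy(decay, eps)):
        new_sim = {}
        for a, b in combinations(nodes, 2):
            total = sum(_lookup(sim, i, j)
                        for i in in_links[a] for j in in_links[b])
            score = decay * total / (len(in_links[a]) * len(in_links[b]))
            # (3) Drop pairs whose score is too small to matter.
            if score >= threshold:
                new_sim[(a, b)] = score
        sim = new_sim
    return sim


if __name__ == "__main__":
    # Pages "a" and "b" are both linked from "c" only.
    graph = {"a": ["c"], "b": ["c"], "c": []}
    print(simrank(graph))
```

In this toy graph the only stored pair is ("a", "b"), with score 0.8, because both pages share their single in-neighbour. Keeping the table sparse is what makes the threshold in step (3) save both time and memory on a large link graph.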
Keywords/Search Tags: Concept Relatedness Computation, Wikipedia, Link Network