An Algorithm Based On Correlation Coefficient To Find Scientific Communities

Posted on: 2009-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: X L Yang
Full Text: PDF
GTID: 2178360272970611
Subject: Software engineering
Abstract/Summary:
Because the World Wide Web (WWW) is prompt, fast, and far-reaching, more and more scientists are accustomed to publishing their papers on it so that other scientists can study, cite, and build on them. However, because the network is huge and disordered, finding groups of related papers, called scientific communities, has become an intractable problem. A closely connected subgraph of a citation graph is often regarded as a community of related papers, and citation analysis has become an important branch of link analysis.

This thesis analyzes the citation graph model and compares three algorithms: one based on times cited, PageRank, and a similarity-based algorithm. The latter two both outperform the algorithm based on times cited, and the similarity-based algorithm is the most accurate. Nevertheless, the results of all three often contain unrelated papers, which lowers the precision of the communities found.

Panupong et al. introduced a scientific community finding algorithm based on the similarity of two neighboring papers, using the random walk model, but that algorithm lacks a scientific explanation. Through a close study of citation relationships, this thesis analyzes the concepts and problems of the RWGC algorithm; the formula that RWGC uses lacks a scientific basis, and this affects its results. Building on RWGC, this thesis proposes an improved algorithm based on the correlation coefficient, RWGC-CC (the Improved Random Walk Graph Clustering Algorithm Based on Correlation Coefficient). It takes three layers of citation relationships into account, modeling documents as variables and the similarity of adjacent documents as a correlation coefficient in the sense of probability theory. This correlation coefficient reflects the degree of similarity, so it gives similarity a mathematical explanation.

This thesis also analyzes in depth, through experiments, the relationship between the similarity threshold and the size of a community. The experimental results show that RWGC-CC improves precision by 15% over RWGC. In addition, RWGC-CC removes the iterative process from the similarity calculation, which saves considerable time and improves efficiency.
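
The idea of treating papers as variables and the similarity of adjacent papers as a correlation coefficient can be illustrated with a small Python sketch. It is not the RWGC-CC formula, which the abstract does not give. As an assumption, each paper is represented by a binary indicator vector over all papers (1 where the two are linked by a citation in either direction, or are the same paper), similarity is the Pearson correlation of two such vectors, and a community is grown from a seed paper by following citation links whose similarity meets a threshold. The three layers of citation relationships mentioned above are not modeled, and the names `neighbors`, `correlation`, and `find_community` are hypothetical.

```python
from collections import deque
from math import sqrt


def neighbors(citations):
    """Build an undirected neighbor map from a {paper: set_of_cited_papers} dict."""
    nbrs = {}
    for paper, cited in citations.items():
        nbrs.setdefault(paper, set())
        for other in cited:
            nbrs.setdefault(other, set())
            nbrs[paper].add(other)
            nbrs[other].add(paper)
    return nbrs


def correlation(nbrs, a, b):
    """Pearson correlation of the binary neighbor-indicator vectors of a and b.

    Assumption: each paper counts as a member of its own neighborhood.
    """
    n = len(nbrs)
    na = nbrs[a] | {a}
    nb = nbrs[b] | {b}
    pa, pb = len(na) / n, len(nb) / n      # means of the two indicator variables
    pab = len(na & nb) / n                 # mean of their product
    denom = sqrt(pa * (1 - pa) * pb * (1 - pb))
    if denom == 0:
        return 0.0                         # a constant vector has no defined correlation
    return (pab - pa * pb) / denom


def find_community(citations, seed, threshold):
    """Grow a community from seed along links whose correlation >= threshold."""
    nbrs = neighbors(citations)
    community = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for other in nbrs[current]:
            if other not in community and correlation(nbrs, current, other) >= threshold:
                community.add(other)
                queue.append(other)
    return community


# Toy citation graph: two tightly knit groups joined by one citation (a4 -> b1).
citations = {
    "a1": {"a2", "a3", "a4"},
    "a2": {"a3", "a4"},
    "a3": {"a4"},
    "a4": {"b1"},
    "b1": {"b2", "b3", "b4"},
    "b2": {"b3", "b4"},
    "b3": {"b4"},
}

print(sorted(find_community(citations, "a1", 0.5)))  # ['a1', 'a2', 'a3', 'a4']
```

In this toy graph the bridging link between `a4` and `b1` has a negative correlation, so a threshold of 0.5 keeps the two groups apart, while lowering the threshold far enough merges them into one community. This mirrors, in a very simplified way, the dependence of community size on the similarity threshold. The correlation here is computed in closed form, without any iterative step.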
Keywords/Search Tags:Random Walk, Correlation Coefficient, Scientific Community, Citation Graph