The Research and Implementation of a Similar Web Pages Identification Algorithm

Posted on: 2012-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: F Duan
Full Text: PDF
GTID: 2178330335460206
Subject: Computer Science and Technology
Abstract/Summary:
The development of the Internet has profoundly changed people's lives and greatly promoted social progress. In particular, the Internet has become a very important platform for accessing information and for communication. It provides a wealth of information resources, which greatly improves the efficiency of information access. However, it also contains many duplicated and similar web pages that offer users nothing new and instead consume resources unnecessarily. How to identify such similar web pages is therefore a subject worthy of study.

This thesis describes the research background and the current state of domestic and foreign research on similar web page identification technologies. It studies the related identification algorithms in depth and analyzes and summarizes the advantages and disadvantages of several of them. Building on the classic DSC algorithm and the Simhash algorithm, it proposes the following improvements: word sequences and sequence weights replace the words and word weights used in the Simhash algorithm, and the calculation of sequence weights takes into account not only word frequency but also relative position information and the features of web pages. The algorithm thereby incorporates more comprehensive web page information, which helps improve its performance.

Finally, a simple test system based on the improved algorithm is implemented. Real web pages from the Internet are used to test the algorithm's effectiveness, with precision and recall defined as the criteria for comparison. The results of the improved algorithm are compared with those of the DSC algorithm and the Simhash algorithm, and a summary is given.
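
The idea of replacing single words with weighted word sequences in Simhash can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the weight formula, so the sequence length `n`, the MD5-based feature hash, and the weighting in `sequence_weights` (occurrence count plus a bonus for earlier positions) are all hypothetical choices, and web page features such as markup are left out.

```python
import hashlib
from collections import defaultdict


def word_sequences(text, n=2):
    """Split text into overlapping sequences of n words."""
    words = text.lower().split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def sequence_weights(sequences):
    """Hypothetical weighting (an assumption, not the thesis's formula).

    Each occurrence adds 1 (frequency) plus a relative-position bonus
    that falls linearly from 1 for the first sequence toward 0.
    """
    total = len(sequences)
    weights = defaultdict(float)
    for pos, seq in enumerate(sequences):
        weights[seq] += 1.0 + (total - pos) / total
    return dict(weights)


def simhash(text, n=2, bits=64):
    """Standard Simhash accumulation over weighted word sequences.

    bits must not exceed 128, the size of an MD5 digest.
    """
    weights = sequence_weights(word_sequences(text, n))
    vector = [0.0] * bits
    mask = (1 << bits) - 1
    for seq, weight in weights.items():
        digest = hashlib.md5(seq.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it otherwise.
            if (h >> i) & 1:
                vector[i] += weight
            else:
                vector[i] -= weight
    fingerprint = 0
    for i in range(bits):
        if vector[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would then be treated as similar when `hamming_distance(simhash(page_a), simhash(page_b))` falls below some threshold; the threshold value is another parameter the abstract does not specify.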
Keywords/Search Tags: Similar Web Pages, DSC Algorithm, Simhash, Identification