
Research On Near-Duplicates Detection Algorithm Of Search Engine

Posted on: 2009-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: P Zheng
Full Text: PDF
GTID: 2178360275471914
Subject: Computer application technology

Abstract/Summary:
There are a great number of duplicate or near-duplicate web pages on the Internet, and they cause many problems: they place a heavy burden on both crawlers and the network, produce duplicate indices, consume extra storage space, and significantly degrade the performance of search engines. Effective duplicate-page detection algorithms can discover and remove most duplicates, reducing the burden on search-engine crawlers and improving overall performance.

This paper reviews the development of search engines, analyzes their working principles, and then surveys the state of existing duplicate-detection algorithms. Based on an analysis of the advantages and disadvantages of current algorithms, it concludes that an effective duplicate-detection technique must satisfy two necessary conditions. Simhash fingerprinting and shingling are two classic duplicate-detection algorithms. After a thorough analysis of both, several improved solutions are proposed in accordance with the two necessary conditions. To capture more web-page features, the word weight takes word position into account, all shingles are used to generate the final fingerprint of a page, and the two algorithms are integrated, exploiting their complementary characteristics, to improve performance further. Because the number of shingles is huge, a new method is proposed that generates the fingerprint of a shingle from all the words it contains.

The proposed improvements are based on web-page feature selection and aim to improve the accuracy of the similarity computation between two near-duplicates. A prototype was then developed on the basis of Manku's algorithm, which detects duplicate web pages efficiently, to confirm the effectiveness and efficiency of the improved solutions. Finally, a crawler system incorporating the prototype as a subsystem was developed to detect duplicate web pages online.
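To make the simhash side concrete, the following is a minimal Python sketch of simhash fingerprinting. The abstract only says that the word weight takes word position into account, not how, so `position_weight` below is a hypothetical stand-in that weights earlier words slightly more; the voting scheme itself is the standard simhash construction.

```python
import hashlib

FINGERPRINT_BITS = 64

def word_hash(word: str) -> int:
    """Map a word to a stable 64-bit integer hash."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def position_weight(index: int, total: int) -> float:
    """Hypothetical position-aware weight: the thesis's exact
    weighting scheme is not given in the abstract, so this simply
    counts earlier words a little more."""
    return 1.0 + (total - index) / total

def simhash(words: list[str]) -> int:
    """Classic simhash: each word casts a weighted +/- vote on every
    bit position of its hash; the sign of each vote total gives one
    fingerprint bit. Repeated words vote once per occurrence, so
    term frequency is built in."""
    votes = [0.0] * FINGERPRINT_BITS
    total = len(words)
    for i, word in enumerate(words):
        weight = position_weight(i, total)
        h = word_hash(word)
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += weight if h & (1 << bit) else -weight
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint
```

Similar pages then map to fingerprints that differ in only a few bit positions, which is what makes simhash suitable for near-duplicate detection.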
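The shingling side can be sketched the same way. The abstract states that a shingle's fingerprint is generated from all the words it contains; the rotate-and-XOR combination below is an assumed construction (chosen so that word order matters), not necessarily the thesis's exact method, while the Jaccard coefficient over shingle-fingerprint sets is the standard shingling similarity measure.

```python
import hashlib

MASK64 = 0xFFFFFFFFFFFFFFFF

def word_hash(word: str) -> int:
    """Same 64-bit word hash as in the simhash sketch."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")

def shingles(words: list[str], k: int = 4) -> list[tuple[str, ...]]:
    """All contiguous k-word shingles of a token sequence."""
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def shingle_fingerprint(shingle: tuple[str, ...]) -> int:
    """Fingerprint a shingle from all of its words (assumed
    rotate-and-XOR combination; the rotation keeps word order
    significant)."""
    fp = 0
    for word in shingle:
        fp = (((fp << 7) | (fp >> 57)) & MASK64) ^ word_hash(word)
    return fp

def resemblance(words_a: list[str], words_b: list[str], k: int = 4) -> float:
    """Standard shingling similarity: Jaccard coefficient of the two
    pages' shingle-fingerprint sets."""
    a = {shingle_fingerprint(s) for s in shingles(words_a, k)}
    b = {shingle_fingerprint(s) for s in shingles(words_b, k)}
    return len(a & b) / len(a | b) if (a or b) else 1.0
```

Fingerprinting each shingle this way avoids storing the shingles themselves, which matters given how many shingles even a modest page produces.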
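Finally, the prototype is said to be based on Manku's algorithm, which finds stored fingerprints within a small Hamming distance of a query fingerprint. A naive linear-scan version of that test is sketched below; Manku's actual contribution is an index of permuted fingerprint tables that avoids this scan at web scale, which is omitted here for brevity.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bit positions between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(query: int, stored: list[int], k: int = 3) -> bool:
    """A page is flagged as a near-duplicate if any stored simhash
    fingerprint lies within Hamming distance k of its own."""
    return any(hamming_distance(query, fp) <= k for fp in stored)
```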
Keywords: Search Engine, Near-Duplicate, Similarity, Fingerprint