
Research On Near-Duplicates Detection Algorithm Of Search Engine

Posted on: 2009-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: P Zheng
Full Text: PDF
GTID: 2178360275471914
Subject: Computer application technology

Abstract/Summary:
There are a great number of duplicate or near-duplicate web pages on the Internet, and they cause many problems: they place a heavy burden on both crawlers and the network, produce duplicate indices, consume extra storage space, and significantly degrade the performance of search engines. Effective duplicate-page detection algorithms can discover and remove most duplicates, reducing the burden on search-engine crawlers and improving overall performance.

This paper reviews the development of search engines, analyzes their working principles, and then surveys the state of existing duplicate-detection algorithms. Based on an analysis of the advantages and disadvantages of current algorithms, it concludes that an effective duplicate-detection technique must satisfy two necessary conditions. Simhash fingerprinting and shingling are two classic duplicate-detection algorithms. After a thorough analysis of both, several improved solutions are proposed in accordance with the two necessary conditions. To capture more web-page features, the word weight takes word position into account, all shingles are used to generate the final fingerprint of a page, and the two algorithms are integrated, exploiting their complementary characteristics, to improve performance further. Because the number of shingles is huge, a new method is proposed that generates the fingerprint of a shingle from all the words it contains.

The proposed improvements are based on web-page feature selection and aim to improve the accuracy of the similarity computation between two near-duplicates. A prototype was then developed on the basis of Manku's algorithm, which detects duplicate web pages efficiently, to confirm the effectiveness and efficiency of the improved solutions. Finally, a crawler system incorporating the prototype as a subsystem was developed to detect duplicate web pages online.
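To make the simhash side concrete, the following is a minimal Python sketch of simhash fingerprinting. The abstract only says that the word weight takes word position into account, not how, so `position_weight` below is a hypothetical stand-in that weights earlier words slightly more; the voting scheme itself is the standard simhash construction.

```python
import hashlib

FINGERPRINT_BITS = 64

def word_hash(word: str) -> int:
    """Map a word to a stable 64-bit integer hash."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def position_weight(index: int, total: int) -> float:
    """Hypothetical position-aware weight: the thesis's exact
    weighting scheme is not given in the abstract, so this simply
    counts earlier words a little more."""
    return 1.0 + (total - index) / total

def simhash(words: list[str]) -> int:
    """Classic simhash: each word casts a weighted +/- vote on every
    bit position of its hash; the sign of each vote total gives one
    fingerprint bit. Repeated words vote once per occurrence, so
    term frequency is built in."""
    votes = [0.0] * FINGERPRINT_BITS
    total = len(words)
    for i, word in enumerate(words):
        weight = position_weight(i, total)
        h = word_hash(word)
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += weight if h & (1 << bit) else -weight
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint
```

Similar pages then map to fingerprints that differ in only a few bit positions, which is what makes simhash suitable for near-duplicate detection.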
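The shingling side can be sketched the same way. The abstract states that a shingle's fingerprint is generated from all the words it contains; the rotate-and-XOR combination below is an assumed construction (chosen so that word order matters), not necessarily the thesis's exact method, while the Jaccard coefficient over shingle-fingerprint sets is the standard shingling similarity measure.

```python
import hashlib

MASK64 = 0xFFFFFFFFFFFFFFFF

def word_hash(word: str) -> int:
    """Same 64-bit word hash as in the simhash sketch."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")

def shingles(words: list[str], k: int = 4) -> list[tuple[str, ...]]:
    """All contiguous k-word shingles of a token sequence."""
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def shingle_fingerprint(shingle: tuple[str, ...]) -> int:
    """Fingerprint a shingle from all of its words (assumed
    rotate-and-XOR combination; the rotation keeps word order
    significant)."""
    fp = 0
    for word in shingle:
        fp = (((fp << 7) | (fp >> 57)) & MASK64) ^ word_hash(word)
    return fp

def resemblance(words_a: list[str], words_b: list[str], k: int = 4) -> float:
    """Standard shingling similarity: Jaccard coefficient of the two
    pages' shingle-fingerprint sets."""
    a = {shingle_fingerprint(s) for s in shingles(words_a, k)}
    b = {shingle_fingerprint(s) for s in shingles(words_b, k)}
    return len(a & b) / len(a | b) if (a or b) else 1.0
```

Fingerprinting each shingle this way avoids storing the shingles themselves, which matters given how many shingles even a modest page produces.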
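Finally, the prototype is said to be based on Manku's algorithm, which finds stored fingerprints within a small Hamming distance of a query fingerprint. A naive linear-scan version of that test is sketched below; Manku's actual contribution is an index of permuted fingerprint tables that avoids this scan at web scale, which is omitted here for brevity.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bit positions between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(query: int, stored: list[int], k: int = 3) -> bool:
    """A page is flagged as a near-duplicate if any stored simhash
    fingerprint lies within Hamming distance k of its own."""
    return any(hamming_distance(query, fp) <= k for fp in stored)
```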
Keywords: Search Engine, Near-Duplicate, Similarity, Fingerprint