
Research And Implementation On The Parallelization Of The Near-duplicate Page Removal Algorithm

Posted on: 2010-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Ding
Full Text: PDF
GTID: 2178330338485118
Subject: Computer application technology
Abstract/Summary:
Because of Web mirrors and plagiarism, duplicate and near-duplicate web pages cause a series of problems for search engines: they inflate the page index with redundant storage and place an extra processing burden on the engine, while search results filled with duplicate content annoy users. Effective duplicate-page detection algorithms are therefore needed to discover and remove most of the duplicates, reduce the load on search engine crawlers, and improve overall performance. As search engine technology has advanced rapidly in recent years, near-duplicate removal has become crucial for Web data gathering.

The detection process works as follows. First, the page's topic content block is recognized, and noise such as advertisements is removed. The content is then tokenized against a lexicon, and the page's feature set is computed with the shingle algorithm. Finally, the Simhash algorithm is applied to compute a fingerprint of the page's feature set. This fingerprint represents the page's topic content, and if two pages have near-duplicate content, their fingerprints have a small Hamming distance. The thesis also proposes an improved parallel-processing method that extends the traditional shingle-based approach to near-duplicate page removal.

Finally, the thesis presents the architecture of a Web page gathering system and implements a near-duplicate page removal prototype. The prototype has two working modes: an on-line mode, in which a single fingerprint is compared against a large store of previously collected fingerprints, and a parallel-processing mode, in which a batch of fingerprints is compared against that store. In the parallel mode, the stored fingerprint data is split into fixed-size blocks kept on a distributed-computing platform, and the Hamming-distance computation between the batch of fingerprints and the stored data follows the Map/Reduce programming paradigm. Experiments show that the prototype system using this parallel-processing method solves the near-duplicate page removal problem effectively, with high efficiency and accuracy.
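The following is a minimal Python sketch of the pipeline described above: shingle features are hashed into a Simhash fingerprint, and new pages are compared by Hamming distance against fixed-size blocks of stored fingerprints, one block corresponding to one map task in the parallel mode. The 64-bit fingerprint width, the MD5 feature hash, the shingle size, the distance threshold, and the helper names are illustrative assumptions, not values or interfaces taken from the thesis.

    # Minimal sketch, assuming tokens of the page's topic content are already available.
    import hashlib

    FINGERPRINT_BITS = 64
    HAMMING_THRESHOLD = 3  # assumed: pages within this distance count as near-duplicates

    def shingles(tokens, k=4):
        # k-token shingles of a tokenized page (k=4 is an assumed value)
        return {' '.join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

    def simhash(features):
        # Simhash: each feature's hash casts a +1/-1 vote per bit; the sign of
        # each vote total becomes the corresponding fingerprint bit.
        votes = [0] * FINGERPRINT_BITS
        for f in features:
            h = int(hashlib.md5(f.encode('utf-8')).hexdigest(), 16)
            for b in range(FINGERPRINT_BITS):
                votes[b] += 1 if (h >> b) & 1 else -1
        fp = 0
        for b in range(FINGERPRINT_BITS):
            if votes[b] > 0:
                fp |= 1 << b
        return fp

    def hamming_distance(a, b):
        # Number of differing bits between two fingerprints.
        return bin(a ^ b).count('1')

    def near_duplicates(new_fps, stored_blocks):
        # Compare a batch of new fingerprints against fixed-size blocks of
        # previously stored fingerprints; in the parallel mode each block
        # would be handled by one map task.
        hits = []
        for block in stored_blocks:
            for new_fp in new_fps:
                for old_fp in block:
                    if hamming_distance(new_fp, old_fp) <= HAMMING_THRESHOLD:
                        hits.append((new_fp, old_fp))
        return hits

In the Map/Reduce version described in the abstract, the loop over blocks would be distributed as map tasks over the distributed-computing platform, with a reduce step collecting the matching fingerprint pairs.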
Keywords/Search Tags:Web gathering, Near-duplicate, Map/Reduce, Fingerprint, Hamming-distance