
Research And Implementation On The Parallelization Of The Near-duplicate Page Removal Algorithm

Posted on: 2010-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Ding
Full Text: PDF
GTID: 2178330338485118
Subject: Computer application technology
Abstract/Summary:
Because of Web mirrors and plagiarism, duplicate and near-duplicate web pages cause a series of problems for search engines: they inflate the page index with redundant storage and place an extra processing burden on the engine, while search results filled with duplicate content annoy users. Effective duplicate-page detection algorithms are therefore needed to discover and remove most of the duplicates, reduce the load on search engine crawlers, and improve overall performance. As search engine technology has advanced rapidly in recent years, near-duplicate removal has become crucial for Web data gathering.

The detection process works as follows. First, the page's topic content block is recognized, and noise such as advertisements is removed. The content is then tokenized against a lexicon, and the page's feature set is computed with the shingle algorithm. Finally, the Simhash algorithm is applied to compute a fingerprint of the page's feature set. This fingerprint represents the page's topic content, and if two pages have near-duplicate content, their fingerprints have a small Hamming distance. The thesis also proposes an improved parallel-processing method that extends the traditional shingle-based approach to near-duplicate page removal.

Finally, the thesis presents the architecture of a Web page gathering system and implements a near-duplicate page removal prototype. The prototype has two working modes: an on-line mode, in which a single fingerprint is compared against a large store of previously collected fingerprints, and a parallel-processing mode, in which a batch of fingerprints is compared against that store. In the parallel mode, the stored fingerprint data is split into fixed-size blocks kept on a distributed-computing platform, and the Hamming-distance computation between the batch of fingerprints and the stored data follows the Map/Reduce programming paradigm. Experiments show that the prototype system using this parallel-processing method solves the near-duplicate page removal problem effectively, with high efficiency and accuracy.
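The following is a minimal Python sketch of the pipeline described above: shingle features are hashed into a Simhash fingerprint, and new pages are compared by Hamming distance against fixed-size blocks of stored fingerprints, one block corresponding to one map task in the parallel mode. The 64-bit fingerprint width, the MD5 feature hash, the shingle size, the distance threshold, and the helper names are illustrative assumptions, not values or interfaces taken from the thesis.

    # Minimal sketch, assuming tokens of the page's topic content are already available.
    import hashlib

    FINGERPRINT_BITS = 64
    HAMMING_THRESHOLD = 3  # assumed: pages within this distance count as near-duplicates

    def shingles(tokens, k=4):
        # k-token shingles of a tokenized page (k=4 is an assumed value)
        return {' '.join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

    def simhash(features):
        # Simhash: each feature's hash casts a +1/-1 vote per bit; the sign of
        # each vote total becomes the corresponding fingerprint bit.
        votes = [0] * FINGERPRINT_BITS
        for f in features:
            h = int(hashlib.md5(f.encode('utf-8')).hexdigest(), 16)
            for b in range(FINGERPRINT_BITS):
                votes[b] += 1 if (h >> b) & 1 else -1
        fp = 0
        for b in range(FINGERPRINT_BITS):
            if votes[b] > 0:
                fp |= 1 << b
        return fp

    def hamming_distance(a, b):
        # Number of differing bits between two fingerprints.
        return bin(a ^ b).count('1')

    def near_duplicates(new_fps, stored_blocks):
        # Compare a batch of new fingerprints against fixed-size blocks of
        # previously stored fingerprints; in the parallel mode each block
        # would be handled by one map task.
        hits = []
        for block in stored_blocks:
            for new_fp in new_fps:
                for old_fp in block:
                    if hamming_distance(new_fp, old_fp) <= HAMMING_THRESHOLD:
                        hits.append((new_fp, old_fp))
        return hits

In the Map/Reduce version described in the abstract, the loop over blocks would be distributed as map tasks over the distributed-computing platform, with a reduce step collecting the matching fingerprint pairs.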
Keywords/Search Tags:Web gathering, Near-duplicate, Map/Reduce, Fingerprint, Hamming-distance